6.3.2
Spark NLP 6.3.2: Scala 2.13 Support, Layout-Aware Images, and Enhanced LightPipeline Tracking
Spark NLP 6.3.2 is a foundational release that introduces official support for Scala 2.13, alongside important improvements in document layout understanding and lightweight inference workflows.
This release improves long-term model portability through JSON-based serialization, enriches document image extraction with spatial metadata, and enhances LightPipeline with document ID tracking and output filtering.
Highlights
- Official Scala 2.13 support
- Layout-aware image extraction with spatial coordinates added to Reader2Image for HTML, DOCX, and PPTX documents.
- Enhanced LightPipeline with document ID propagation and output column filtering for better batch inference workflows.
New Features & Enhancements
Scala 2.13 Support
Spark NLP now supports Scala 2.13 with this release! This lets you run Spark NLP pipelines on Spark distributions built for Scala 2.13, such as those used by Databricks and Dataproc. See our Installation Instructions for Scala 2.13 for details on using it in your project.
There are a few things to consider when using the Scala 2.13 version:
- You need to change your dependency from spark-nlp_2.12 to spark-nlp_2.13.
- If you install PySpark from PyPI, the session will use Scala 2.12 by default. To start a Scala 2.13 session, set the SPARK_HOME environment variable to a Spark installation built for Scala 2.13, or install PySpark from the official Spark archives.
- If you want to load a DependencyParserModel or TextMatcherModel saved with Scala 2.12 into Scala 2.13, you will need to manually export it again with the latest version. See the notebook
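As a minimal sketch of the dependency change, the snippet below builds the package coordinate for a given Scala version. The helper spark_nlp_coordinate is hypothetical and not part of Spark NLP; only the spark-nlp_2.12 and spark-nlp_2.13 artifact names come from these notes.

```python
# Hypothetical helper (not a Spark NLP API): builds the Maven coordinate
# that is passed to `--packages` or `spark.jars.packages`.
GROUP_ID = "com.johnsnowlabs.nlp"
SUPPORTED_SCALA = ("2.12", "2.13")


def spark_nlp_coordinate(scala_version: str = "2.12", version: str = "6.3.2") -> str:
    """Return group:artifact:version for the CPU artifact of the given Scala version."""
    if scala_version not in SUPPORTED_SCALA:
        raise ValueError(f"Unsupported Scala version: {scala_version}")
    return f"{GROUP_ID}:spark-nlp_{scala_version}:{version}"


# Print the command line for a Scala 2.13 PySpark session.
print("pyspark --packages " + spark_nlp_coordinate("2.13"))
```

The coordinate only selects the matching JAR. The Spark installation that SPARK_HOME points to still has to be a Scala 2.13 build.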
Layout-Aware Image Metadata in Reader2Image
The Reader2Image annotator now extracts spatial image coordinates from rich document formats, adding layout awareness to image annotations.
- Supported formats:
  - HTML
  - Word (DOCX)
  - PowerPoint (PPTX)
- New metadata fields: x, y, width, height
- Coordinates are included alongside existing metadata such as image format, type, and DOM position
This enables:
- Layout-aware document and multimodal pipelines
- Visual reconstruction of documents
- More accurate association of images with surrounding text content
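One possible way to consume the new fields is sketched below: it parses x, y, width, and height from an annotation's metadata and sorts images into a rough reading order. It assumes the metadata is a string-to-string dictionary and that y grows downward, as in HTML layout. The names ImageBox, parse_box, and reading_order are illustrative, not part of Spark NLP.

```python
from typing import Dict, List, NamedTuple, Optional


class ImageBox(NamedTuple):
    x: float
    y: float
    width: float
    height: float


def parse_box(metadata: Dict[str, str]) -> Optional[ImageBox]:
    """Return an ImageBox, or None if a coordinate is missing or not numeric."""
    try:
        return ImageBox(*(float(metadata[k]) for k in ("x", "y", "width", "height")))
    except (KeyError, TypeError, ValueError):
        return None


def reading_order(metadatas: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort top-to-bottom, then left-to-right; entries without coordinates go last."""
    def sort_key(md: Dict[str, str]):
        box = parse_box(md)
        return (box is None, box.y if box else 0.0, box.x if box else 0.0)

    return sorted(metadatas, key=sort_key)


# Stand-in metadata dictionaries; real values would come from Reader2Image output.
sample = [
    {"x": "300", "y": "40", "width": "100", "height": "50"},
    {"x": "10", "y": "40", "width": "100", "height": "50"},
    {"format": "png"},  # no coordinates available
    {"x": "0", "y": "5", "width": "20", "height": "20"},
]
ordered = reading_order(sample)
```

Sorting by position is only a heuristic for associating images with nearby text. Multi-column layouts would need a more careful grouping.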
Document ID Support in LightPipeline
LightPipeline now supports passing document IDs together with text inputs, improving traceability in batch and production inference scenarios.
Key capabilities:
- New overloads: fullAnnotate(ids, texts) and annotate(ids, texts)
- Document IDs are propagated as annotation metadata (doc_id)
- New output_cols parameter to restrict returned annotation types
Benefits:
- Reliable document-to-result mapping
- Easier debugging and downstream integration
- Reduced memory usage through selective outputs
Existing LightPipeline usage remains unchanged and backward compatible.
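The sketch below shows one way the propagated doc_id could be used to map results back to their source documents. It assumes that fullAnnotate(ids, texts) returns one dictionary per input, mapping output columns to annotations that expose result and metadata attributes. The helper results_by_doc_id and the stand-in data are illustrative only.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def results_by_doc_id(results: List[Dict[str, List[Any]]]) -> Dict[str, Dict[str, List[str]]]:
    """Index annotation results as {doc_id: {column: [result, ...]}}."""
    indexed: Dict[str, Dict[str, List[str]]] = {}
    for row in results:
        for column, annotations in row.items():
            for ann in annotations:
                doc_id = ann.metadata.get("doc_id")
                if doc_id is None:
                    continue  # skip annotations that carry no document ID
                indexed.setdefault(doc_id, {}).setdefault(column, []).append(ann.result)
    return indexed


# Stand-in for: results = light_pipeline.fullAnnotate(ids, texts)
# (SimpleNamespace mimics an annotation with .result and .metadata).
fake_results = [
    {"token": [SimpleNamespace(result="Hello", metadata={"doc_id": "doc-1"}),
               SimpleNamespace(result="world", metadata={"doc_id": "doc-1"})]},
    {"token": [SimpleNamespace(result="Hi", metadata={"doc_id": "doc-2"})]},
]
by_id = results_by_doc_id(fake_results)
```

Keying by doc_id rather than by list position keeps the mapping valid even if results are later filtered or reordered downstream.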
Bug Fixes
- Fixed an out-of-memory error when copying large models to cloud storage
Community Support
- Slack: real-time discussion with the Spark NLP community and team
- GitHub: issue tracking, feature requests, and contributions
- Discussions: community ideas and showcases
- Medium: latest Spark NLP articles and tutorials
- YouTube: educational videos and demos
Installation
Python
pip install spark-nlp==6.3.2
Spark Packages
CPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.2
Apple Silicon
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.2
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.2
Maven
Supported on Apache Spark 3.x.
spark-nlp
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.3.2</version>
</dependency>
spark-nlp-gpu
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.3.2</version>
</dependency>
spark-nlp-silicon
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.3.2</version>
</dependency>
spark-nlp-aarch64
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.3.2</version>
</dependency>
FAT JARs
- CPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.3.2.jar
- GPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.3.2.jar
- Apple Silicon: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.3.2.jar
- AArch64: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.3.2.jar
What's Changed
- [SPARKNLP-1136] JSON Serialization for Features #14722 by @DevinTDHa
- [SPARKNLP-1329] Adding image coordinates to metadata for Reader2Image #14725 by @danilojsl
- [SPARKNLP-1333] Adding ids input for LightPipeline #14726 by @danilojsl
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.3.1...6.3.2