📢 Spark NLP 6.3.2: Scala 2.13 Support, Layout-Aware Images, and Enhanced LightPipeline Tracking

Spark NLP 6.3.2 is a foundational release that introduces official support for Scala 2.13, alongside important improvements in document layout understanding and lightweight inference workflows. This release improves long-term model portability through JSON-based serialization, enriches document image extraction with spatial metadata, and enhances LightPipeline with document ID tracking and output filtering.

🔥 Highlights

Official Scala 2.13 support
Layout-aware image extraction with spatial coordinates added to Reader2Image for HTML, DOCX, and PPTX documents.
Enhanced LightPipeline with document ID propagation and output column filtering for better batch inference workflows.

🚀 New Features & Enhancements

Scala 2.13 Support

Spark NLP now supports Scala 2.13 with this release! This lets you run your Spark NLP pipelines on Spark distributions built with Scala 2.13, such as those used by Databricks and Dataproc. See our Installation Instructions for Scala 2.13 for how to use it in your project.

There are a few things to consider when using the Scala 2.13 version:

You need to adjust your dependency from spark-nlp_2.12 to spark-nlp_2.13.
If you install PySpark from PyPI, the session will use Scala 2.12 by default. To start a Scala 2.13 session, set the SPARK_HOME environment variable to a Spark installation built with Scala 2.13, or install PySpark from the official Spark archives.
If you want to load DependencyParserModel or TextMatcherModel from Scala 2.12 into Scala 2.13, you will need to manually export them again with the latest version. See the notebook
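
As a minimal sketch of the dependency change described above, the snippet below builds the Maven coordinate for a given Scala version. The helper name `spark_nlp_coordinate` is hypothetical and only illustrates the `_2.12` to `_2.13` suffix change; it is not part of Spark NLP.

```python
def spark_nlp_coordinate(scala_version: str, spark_nlp_version: str = "6.3.2") -> str:
    """Return the Maven coordinate for the CPU artifact of Spark NLP.

    Hypothetical helper for illustration: only the Scala suffix of the
    artifact name changes between the 2.12 and 2.13 builds.
    """
    return f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{spark_nlp_version}"


coordinate_212 = spark_nlp_coordinate("2.12")
coordinate_213 = spark_nlp_coordinate("2.13")

print(coordinate_212)  # com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2
print(coordinate_213)  # com.johnsnowlabs.nlp:spark-nlp_2.13:6.3.2
```

The resulting string can be passed to `--packages` in the same way as the coordinates listed in the Installation section.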

Layout-Aware Image Metadata in `Reader2Image`

The Reader2Image annotator now extracts spatial image coordinates from rich document formats, adding layout awareness to image annotations.

Supported formats:
- HTML
- Word (DOCX)
- PowerPoint (PPTX)

New metadata fields:
- x, y, width, height

Coordinates are included alongside existing metadata such as image format, type, and DOM position.

This enables:

Layout-aware document and multimodal pipelines
Visual reconstruction of documents
More accurate association of images with surrounding text content
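
To illustrate how the new fields might be consumed downstream, here is a rough sketch that sorts image annotations into top-to-bottom, left-to-right order. It works on plain dictionaries standing in for annotation metadata. It assumes the x, y, width, and height values arrive as numeric strings with the origin at the top-left; the release notes do not specify units or origin, and the names `BoundingBox`, `parse_box`, and `reading_order` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Image position and size taken from the x, y, width, height metadata fields."""
    x: float
    y: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.x + self.width

    @property
    def bottom(self) -> float:
        return self.y + self.height


def parse_box(metadata: dict) -> BoundingBox:
    # Assumption: metadata values are strings that parse as numbers.
    return BoundingBox(*(float(metadata[key]) for key in ("x", "y", "width", "height")))


def reading_order(metadatas: list) -> list:
    """Sort image metadata top-to-bottom, then left-to-right."""
    def sort_key(metadata: dict):
        box = parse_box(metadata)
        return (box.y, box.x)

    return sorted(metadatas, key=sort_key)


# Stand-in metadata; real values would come from Reader2Image annotations.
sample = [
    {"x": "300", "y": "40", "width": "200", "height": "100"},
    {"x": "20", "y": "40", "width": "200", "height": "100"},
    {"x": "20", "y": "400", "width": "640", "height": "320"},
]

ordered = reading_order(sample)
print([(m["x"], m["y"]) for m in ordered])
```

The same bounding boxes could be compared against text positions to attach each image to its nearest paragraph, provided the text annotations carry comparable coordinates.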

Document ID Support in `LightPipeline`

LightPipeline now supports passing document IDs together with text inputs, improving traceability in batch and production inference scenarios.

Key capabilities:

New overloads:
- fullAnnotate(ids, texts)
- annotate(ids, texts)
Document IDs are propagated as annotation metadata (doc_id)
New output_cols parameter to restrict returned annotation types

Benefits:

Reliable document-to-result mapping
Easier debugging and downstream integration
Reduced memory usage through selective outputs

Existing LightPipeline usage remains unchanged and backward compatible.
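
The following sketch shows one way the propagated doc_id metadata could be used to map results back to their source documents. It does not call Spark NLP: the `Annotation` class is a simplified stand-in exposing only `result` and `metadata`, the sample data is invented for illustration, and the helper `group_by_doc_id` is hypothetical. The exact shape of the output returned by fullAnnotate(ids, texts) should be checked against the API documentation.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """Simplified stand-in for an annotation; only the fields used here."""
    result: str
    metadata: dict


def group_by_doc_id(results: list, column: str) -> dict:
    """Collect annotation results from one output column, keyed by doc_id metadata."""
    grouped = defaultdict(list)
    for row in results:
        for annotation in row.get(column, []):
            # Annotations without a doc_id are grouped under "unknown".
            doc_id = annotation.metadata.get("doc_id", "unknown")
            grouped[doc_id].append(annotation.result)
    return dict(grouped)


# Invented data shaped like a list of per-document annotation dictionaries.
results = [
    {"token": [Annotation("Spark", {"doc_id": "a1"}), Annotation("NLP", {"doc_id": "a1"})]},
    {"token": [Annotation("Hello", {"doc_id": "b2"})]},
]

tokens_by_doc = group_by_doc_id(results, "token")
print(tokens_by_doc)  # {'a1': ['Spark', 'NLP'], 'b2': ['Hello']}
```

Keying on doc_id rather than list position keeps the mapping stable even if results are later filtered, merged, or reordered.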

🐛 Bug Fixes

Fixed an out-of-memory error when copying large models to cloud storage.

❤️ Community Support

Slack – real-time discussion with the Spark NLP community and team
GitHub – issue tracking, feature requests, and contributions
Discussions – community ideas and showcases
Medium – latest Spark NLP articles and tutorials
YouTube – educational videos and demos

💻 Installation

Python

pip install spark-nlp==6.3.2

Spark Packages

CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.2

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.2

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.2

Maven

Supported on Apache Spark 3.x.

spark-nlp

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

spark-nlp-gpu

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

spark-nlp-silicon

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

spark-nlp-aarch64

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

FAT JARs

What's Changed

[SPARKNLP-1136] JSON Serialization for Features #14722 by @DevinTDHa
[SPARKNLP-1329] Adding image coordinates to metadata for Reader2Image #14725 by @danilojsl
[SPARKNLP-1333] Adding ids input for LightPipeline #14726 by @danilojsl

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.3.1...6.3.2
