6.2.1
📢 Spark NLP 6.2.1: Enhanced hierarchical document processing and training optimizations
Spark NLP 6.2.1 brings significant improvements to document ingestion with expanded hierarchical support, XML processing enhancements, and optimizations for NerDL training. This release builds on the foundation of 6.2.0, continuing to focus on structure-awareness, flexibility, and performance for production NLP pipelines.
🔥 Highlights
- Hierarchical Document Processing: Extended support for PDF, Word, and Markdown with parent-child element relationships
- NerDLApproach Training Optimizations: Reduced memory footprint and improved training performance with BERT-based embeddings
- Improved Document Output Format: Single document annotations by default for more intuitive behavior with large documents
- Enhanced XML Reading: Attribute extraction and improved tag handling in Reader2Doc
🚀 New Features & Enhancements
Hierarchical Support for Multiple Document Formats
Building on the HTMLReader hierarchical features introduced in 6.2.0, this release extends structured element tracking to additional document formats: