PDF Deidentification, MPNet Classifier and Pipeline Tracer in NLU 5.4.0
We are excited to announce NLU 5.4.0 has been released!
It comes with support for deidentifying PDFs leveraging a combination of OCR and Medical NLP models.
Additionally, you can leverage MPNet for sequence classification, and the Pipeline Tracer is now supported.
Introducing our healthcare deidentification model, deployable with a single line of code. It combines components such as ner_deid_subentity_augmented, ContextualParser, RegexMatcher, and TextMatcher with a streamlined deidentification stage. It masks sensitive entities such as names, locations, and medical records, supporting compliance and data security in medical texts. Using OCR, it also redacts the detected information in the PDF before saving the processed file to the specified location.
! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/deid2.pdf
! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/download.pdf
# `model` is the deidentification pipeline loaded beforehand
# Provide the input paths and the matching output paths
input_path, output_path = ['download.pdf', 'deid2.pdf'], ['download_deidentified.pdf', 'deid2_deidentified.pdf']
# Predict and save the deidentified PDFs
dfs = model.predict(input_path, output_path=output_path)
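The call above pairs each input PDF with an output path by position, so the two lists need the same length and order. As a minimal sketch, the output list can be derived from the input list instead of being typed by hand. The helper name `deidentified_paths` and the `_deidentified` suffix are illustrative assumptions, not part of NLU.

```python
from pathlib import Path


def deidentified_paths(input_paths, suffix="_deidentified"):
    """Return one output path per input path, e.g. a.pdf -> a_deidentified.pdf.

    Hypothetical helper for illustration; it only manipulates file names.
    """
    outputs = []
    for raw in input_paths:
        # Strip stray whitespace so ' deid2.pdf' and 'deid2.pdf' name the same file
        path = Path(raw.strip())
        outputs.append(str(path.with_name(path.stem + suffix + path.suffix)))
    return outputs


input_path = ["download.pdf", "deid2.pdf"]
output_path = deidentified_paths(input_path)
print(output_path)  # ['download_deidentified.pdf', 'deid2_deidentified.pdf']
```

The resulting lists can then be passed as `model.predict(input_path, output_path=output_path)`, exactly as in the snippet above.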
MPNetForSequenceClassification is a state-of-the-art annotator in Spark NLP, designed for sequence classification tasks. It uses the MPNet architecture, which combines the strengths of BERT and XLNet, addressing their limitations.
MPNet, or Masked and Permuted Pre-training for Language Understanding, models the dependency among predicted tokens through permuted language modeling and gives the model position information for the full sentence. This improves sentence structure comprehension and reduces the position discrepancy between pre-training and fine-tuning seen in XLNet.
The annotator excels in tasks like document classification and sentiment analysis, offering superior performance due to its innovative pre-training and fine-tuning on large datasets. Integrated into Spark NLP, it ensures scalable, efficient, and high-accuracy sequence classification.
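To make the "masked and permuted" idea concrete, here is a minimal pure-Python sketch of how a pre-training input could be laid out, following the general scheme described for MPNet: the tokens are permuted, the tail of the permutation is chosen for prediction, and mask placeholders carry the positions of the predicted tokens so the model sees position information for the whole sentence. The function name `mpnet_inputs`, the `[MASK]` string, and the 0-based positions are illustrative assumptions. This is not Spark NLP or NLU code and it does not train anything.

```python
import random

MASK = "[MASK]"  # placeholder token, assumed for illustration


def mpnet_inputs(tokens, num_predicted, seed=0):
    """Build a masked-and-permuted input layout for one sentence.

    Returns (input_tokens, input_positions, predicted_positions).
    """
    if not 0 <= num_predicted <= len(tokens):
        raise ValueError("num_predicted must be between 0 and len(tokens)")

    # Permute the token positions
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)

    # The head of the permutation is visible context, the tail is predicted
    split = len(tokens) - num_predicted
    context, predicted = order[:split], order[split:]

    # Layout: context tokens, then one mask per predicted token, then the predicted tokens
    input_tokens = (
        [tokens[i] for i in context]
        + [MASK] * len(predicted)
        + [tokens[i] for i in predicted]
    )
    # The masks reuse the positions of the predicted tokens, so every
    # position of the sentence is present in the input
    input_positions = context + predicted + predicted
    return input_tokens, input_positions, predicted


tokens = ["the", "task", "is", "sentence", "classification", "."]
input_tokens, input_positions, predicted = mpnet_inputs(tokens, num_predicted=2)
for token, position in zip(input_tokens, input_positions):
    print(position, token)
```

Because the mask placeholders cover the predicted positions, the set of positions in the input always spans the full sentence, which is the property that narrows the gap to fine-tuning, where the full sentence is visible.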
The PipelineTracer is now accessible on NLU pipelines. It is a versatile class designed to trace and analyze the stages of a pipeline, offering in-depth insights into entities, assertions, deidentification, classification, and relationships. It also facilitates the creation of parser dictionaries for building a PipelineOutputParser. Key functions include printing the pipeline schema, creating parser dictionaries, and retrieving possible assertions, relations, and entities. It also provides direct access to parser dictionaries and available pipeline schemas.
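Conceptually, tracing a pipeline means walking its stages and collecting, per kind of output, which columns and labels each stage can produce. The toy sketch below illustrates that idea with plain dictionaries. It is not the PipelineTracer API: the stage descriptions, the `trace` function, and the label values are all hypothetical and exist only to show the shape of the result.

```python
from collections import defaultdict

# Hypothetical stage descriptions; a real pipeline exposes far richer metadata
STAGES = [
    {"name": "document_assembler", "kind": "document", "output": "document"},
    {"name": "ner_converter", "kind": "entities", "output": "ner_chunk",
     "labels": ["NAME", "LOCATION"]},
    {"name": "assertion", "kind": "assertions", "output": "assertion",
     "labels": ["present", "absent"]},
]


def trace(stages):
    """Summarize a list of stage descriptions.

    Returns (schema, columns_by_kind, labels_by_kind).
    """
    # Schema: one (name, kind, output column) row per stage, in pipeline order
    schema = [(s["name"], s["kind"], s["output"]) for s in stages]

    columns = defaultdict(list)
    labels = defaultdict(set)
    for stage in stages:
        columns[stage["kind"]].append(stage["output"])
        labels[stage["kind"]].update(stage.get("labels", []))

    return schema, dict(columns), {kind: sorted(v) for kind, v in labels.items()}


schema, parser_dict, possible_labels = trace(STAGES)
for row in schema:
    print(row)
print(parser_dict)                  # output columns grouped by kind
print(possible_labels["entities"])  # ['LOCATION', 'NAME']
```

The grouped columns play the role of a parser dictionary in this toy: a downstream parser only needs to know which column holds entities, which holds assertions, and so on.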