three spacy tricks for faster text processing

source: kdnuggets: 3 spacy tricks for efficient text processing & entity recognition

level: technical

loading a full spacy model means every document passes through components you may not need. if you only want named entities, you can exclude the parser and tagger at load time, which cuts memory use and speeds up processing. you can also switch off components such as the lemmatizer temporarily at runtime with the nlp.select_pipes context manager. in a test on 1,000 documents, skipping unused components made the pipeline 1.6 times faster without hurting accuracy.
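a minimal sketch of both ideas, assuming spacy v3. the helper names load_lean and run_without are hypothetical, and the model is left as a parameter because the source does not say which one it used.

```python
import spacy


def load_lean(model, unused=("parser", "tagger")):
    """Load a pipeline, leaving the unused components out entirely."""
    # exclude= skips the listed components at load time, so they are
    # never loaded into memory and cannot be re-enabled later.
    return spacy.load(model, exclude=list(unused))


def run_without(nlp, texts, skip=("lemmatizer",)):
    """Process texts with some components switched off temporarily."""
    # Only pass names that are actually active in this pipeline.
    present = [name for name in skip if name in nlp.pipe_names]
    with nlp.select_pipes(disable=present):
        # The disabled components are restored when the with-block exits.
        return list(nlp.pipe(texts))
```

for instance, nlp = load_lean("en_core_web_sm") followed by docs = run_without(nlp, texts) would keep ner active while skipping the rest. that model name is only an example and has to be installed separately.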

processing texts one by one in a loop is slow and makes it hard to track metadata. using nlp.pipe with as_tuples=True lets you stream texts in batches and pass ids or other context along with each text. it also supports parallel processing across cpu cores through the n_process argument. the multiprocessing overhead can slow small jobs, but a test with 10,000 records showed a 2.4x speedup over a sequential loop. this approach keeps each result aligned with its metadata and scales well to large datasets.
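the batching pattern can be sketched as follows. the records, the "id" key and the batch size are made up for illustration, and a blank english pipeline stands in for a trained model so the snippet runs without a download.

```python
import spacy

# Assumption: a tokenizer-only pipeline; swap in a trained model in practice.
nlp = spacy.blank("en")

# Each record is a (text, context) pair; the context travels with the text.
records = [
    ("The printer on floor 3 is jammed again.", {"id": 101}),
    ("Please reset my VPN password.", {"id": 102}),
]

results = []
# as_tuples=True makes nlp.pipe yield (doc, context) pairs in input order.
for doc, context in nlp.pipe(records, as_tuples=True, batch_size=50):
    results.append({"id": context["id"], "n_tokens": len(doc)})

print(results)
```

passing n_process=2 or higher to nlp.pipe spreads the batches across worker processes. on platforms that spawn new processes, that call should sit under an if __name__ == "__main__": guard, and the startup cost only pays off on larger jobs.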

statistical ner models often miss domain-specific terms like ticket ids. instead of patching with regex after the fact, you can add an EntityRuler to the pipeline. it can run before or after the statistical ner component, so all entities end up in one list on doc.ents. for example, a rule for 'tkt-\d+' tags custom ticket numbers, while the model still catches standard entities. this avoids manual character-offset math and keeps the code simple.
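a minimal sketch of the ticket rule, assuming spacy v3. the TICKET label and the sample sentence are hypothetical, and a blank pipeline is used so the snippet runs without a trained model; with a real model the ruler is placed before the ner component.

```python
import spacy

# Assumption: blank pipeline for the demo; use spacy.load(...) for real NER.
nlp = spacy.blank("en")

# Put the ruler ahead of the statistical NER when the pipeline has one.
if "ner" in nlp.pipe_names:
    ruler = nlp.add_pipe("entity_ruler", before="ner")
else:
    ruler = nlp.add_pipe("entity_ruler")

# One token whose lowercase form matches tkt-<digits>, e.g. "TKT-4821".
ruler.add_patterns([
    {"label": "TICKET", "pattern": [{"LOWER": {"REGEX": r"^tkt-\d+$"}}]},
])

doc = nlp("Customer reopened TKT-4821 after the outage.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

because the rule is a token pattern, it relies on the tokenizer keeping the ticket id as a single token. ids written with spaces or other separators would need a multi-token pattern instead.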

why it matters: these tricks help data scientists and engineers build faster, more accurate text processing pipelines that handle custom entities and scale to millions of documents.
