PP-OCRv6 brings 50-language OCR to Hugging Face

Source: Hugging Face blog: PP-OCRv6 on Hugging Face: 50-language OCR from 1.5M to 34.5M parameters

Level: technical

PP-OCRv6 is the latest generation of PaddleOCR's universal OCR model family. It handles text detection and recognition across documents, screenshots, multilingual images, digital displays, industrial labels, and scene text. The family comes in three tiers (tiny, small, and medium) and scales from 1.5M to 34.5M parameters. The medium and small tiers support 50 languages, including Simplified Chinese, Traditional Chinese, English, Japanese, and 46 Latin-script languages. On PaddleOCR's in-house multi-scenario benchmarks, PP-OCRv6_medium reaches 86.2% detection Hmean and 83.2% recognition accuracy, improving over PP-OCRv5_server by 4.6 and 5.1 percentage points, respectively.
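For readers less familiar with the metric, detection Hmean is the harmonic mean of detection precision and recall, that is, the F1 score. The minimal sketch below computes it and backs out the PP-OCRv5_server scores implied by the reported gains, assuming those gains are plain differences in percentage points. The precision and recall inputs are made-up illustrative values, not benchmark figures, and the function names are hypothetical.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_baseline(new_score: float, gain_points: float) -> float:
    """Score the older model must have had, given a gain in percentage points."""
    return round(new_score - gain_points, 1)


# Illustrative precision/recall only (hypothetical, not from the benchmark).
example = hmean(0.90, 0.80)

# Figures reported for PP-OCRv6_medium versus PP-OCRv5_server, in percent.
det_baseline = implied_baseline(86.2, 4.6)  # detection Hmean
rec_baseline = implied_baseline(83.2, 5.1)  # recognition accuracy

print(f"example Hmean: {example:.4f}")
print(f"implied PP-OCRv5_server detection Hmean: {det_baseline}")
print(f"implied PP-OCRv5_server recognition accuracy: {rec_baseline}")
```

Because the harmonic mean is pulled toward the lower of its two inputs, a detector cannot reach a high Hmean by trading away recall for precision or the reverse.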

The model uses PPLCNetV4 as a unified backbone for both detection and recognition, which keeps the architecture consistent across tiers. For text detection, it employs RepLKFPN, a lightweight large-kernel feature pyramid network designed to handle multi-scale, rotated, and low-resolution text efficiently. Recognition uses EncoderWithLightSVTR, which combines local context modeling with global attention to improve accuracy on challenging crops such as multilingual text, screen text, and noisy regions. These design choices aim to keep the models small while improving real-world performance.

PP-OCRv6 is available on Hugging Face with multiple inference backends. Users can run it via Paddle Inference, ONNX Runtime, or a Transformers backend for PyTorch-based workflows. The release includes Safetensors weights, Paddle Inference models, and ONNX models. An online demo and a model collection are provided for quick evaluation. PaddleOCR 3.7 offers a unified interface in which the engine parameter selects the runtime. The OCR results are emitted as structured JSON, which can feed downstream tasks such as document parsing, search, retrieval-augmented generation, or analytics.
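As a rough illustration of the downstream step, the sketch below turns a structured OCR result into plain text that could be indexed for search or retrieval. It deliberately does not call PaddleOCR. The sample payload and its field names (rec_texts, rec_scores, rec_boxes) are assumptions made for this example and may not match the release's actual JSON schema, and to_plain_text is a hypothetical helper.

```python
import json
from typing import Any

# Hypothetical OCR result: parallel lists of texts, confidence scores, and
# [x_min, y_min, x_max, y_max] boxes. The field names are assumptions.
SAMPLE_JSON = """
{
  "input_path": "invoice.png",
  "rec_texts": ["Total: 42.00", "Invoice #1001", "~~", "Date: 2025-01-15"],
  "rec_scores": [0.97, 0.99, 0.31, 0.95],
  "rec_boxes": [[40, 300, 260, 330], [40, 20, 300, 60],
                [500, 22, 520, 40], [40, 80, 280, 110]]
}
"""


def to_plain_text(result: dict[str, Any], min_score: float = 0.5) -> str:
    """Drop low-confidence lines, then order the rest top-to-bottom, left-to-right."""
    lines = [
        (box[1], box[0], text)  # (y_min, x_min, text)
        for text, score, box in zip(
            result["rec_texts"], result["rec_scores"], result["rec_boxes"]
        )
        if score >= min_score
    ]
    # Sort on coordinates only, so ties never fall back to comparing text.
    lines.sort(key=lambda item: (item[0], item[1]))
    return "\n".join(text for _, _, text in lines)


result = json.loads(SAMPLE_JSON)
document_text = to_plain_text(result)
print(document_text)
```

Sorting by the top-left corner is a simplification that suits single-column layouts. Multi-column pages or rotated text would need a proper reading-order step before the text is passed to a retriever.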

Why it matters: small, accurate multilingual OCR models that run on edge devices or servers make text extraction from images more accessible for AI pipelines, document processing, and data science workflows.
