
OCR issue with Color format

Is there a reason why color causes issues with OCR? Even though the font is readable, OCR has trouble reading it. If I binarize the image, it works fine. However, there are times when the image has highlighting over the text: I can still read the text in the image, but binarizing causes problems. Why is text with color shading and a suitable font color readable to a human, but not to OCR?

Example of an original image value that OCR has trouble reading:

But if I binarize it, then OCR reads it:



The problem with binarizing is that highlighting over the text causes an issue.


Answers

  • GrooperGuru Posts: 463 admin
    Thought I should shed some light on this.

    Any time you pass a color or grayscale image to the OCR engine and your OCR's IP Profile does not perform binarization, the OCR engine applies "Simple" thresholding to the image behind the scenes. Simple thresholding is very likely to destroy the highlighted text, as you've observed in your second screenshot. The only way to prevent this is to perform Binarization in an IP Profile. Simple or Auto binarization will very likely have the same issue, but if you try Adaptive or Dynamic Thresholding, the black text within the yellow zone should come forward in a highly legible way.
    Matt Harrison
    Director of Strategy
    [email protected]
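
To illustrate why a single global cutoff can swallow text inside a shaded zone while a locally computed cutoff keeps it, here is a minimal pure-Python sketch. It is not Grooper's implementation: the function names, the grayscale values (110 for the shaded background, 40 for the ink), the cutoff of 128, and the window radius and offset are all assumptions chosen for illustration, and the "adaptive" variant shown is a generic local-mean threshold.

```python
# Generic illustration only; NOT Grooper's algorithms.
# All names and numeric values below are hypothetical.

SHADE = 110  # assumed grayscale value of the shaded/highlighted background
INK = 40     # assumed grayscale value of the text printed on top of it

# A 5x10 patch cropped from inside a shaded zone, with a short
# vertical text stroke in column 5 (rows 1 to 3).
patch = [[SHADE] * 10 for _ in range(5)]
for row_index in (1, 2, 3):
    patch[row_index][5] = INK


def simple_threshold(img, cutoff=128):
    """Global threshold: every pixel darker than one fixed cutoff becomes black."""
    return [[0 if p < cutoff else 255 for p in row] for row in img]


def adaptive_mean_threshold(img, radius=2, offset=10):
    """Local threshold: a pixel becomes black only if it is darker than
    the mean of its neighbourhood minus a small offset."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        out_row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [img[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            out_row.append(0 if img[y][x] < local_mean - offset else 255)
        out.append(out_row)
    return out


def render(img):
    """Draw black pixels as '#' and white pixels as '.'."""
    return "\n".join("".join("#" if p == 0 else "." for p in row) for row in img)


simple = simple_threshold(patch)
adaptive = adaptive_mean_threshold(patch)

print("Simple (global) threshold:")
print(render(simple))
print("Adaptive (local mean) threshold:")
print(render(adaptive))
```

With these assumed values, the global cutoff turns both the shading and the ink black, so the stroke disappears into a solid block, while the local-mean version leaves the shading white and keeps only the three ink pixels black. Real scans are noisier, and zone edges can produce artifacts, so window size and offset usually need tuning.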