OCR issue with Color format
Is there a reason why color causes issues with OCR? Even though the font is readable, OCR has trouble reading it. If I binarize the image, it works fine; however, there are times when an image has highlighting over the text. I can still read the text in the image, but binarizing causes problems. Why is it that the color shading and font color leave the text readable to a human, but OCR can't read it?
Example of an original image value that OCR has trouble reading:

If I binarize it, OCR reads it:

The problem with binarizing is that highlighting becomes an issue:


Answers
Any time you pass a color or grayscale image to the OCR engine and your OCR's IP Profile does not perform binarization, the OCR engine applies "Simple" thresholding to the image behind the scenes. Simple thresholding is very likely to destroy highlighted text, as you've observed in your second screenshot. The only way to prevent this is to perform Binarization in an IP Profile yourself. Simple or Auto binarization will very likely have issues as well, but if you try Adaptive or Dynamic Thresholding, the black text within the yellow zone should come through in a highly legible way.
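To see why a single global cutoff can wipe out highlighted text while a local one keeps it, here is a rough NumPy-only sketch. It is not Grooper's implementation and does not use any Grooper API; the function names, the 128 cutoff, the 15-pixel window, the offset of 10, and the synthetic pixel values are all assumptions picked for illustration. It just compares a fixed global threshold against a local-mean (adaptive-style) threshold on a fake page whose right half is shaded.

```python
import numpy as np


def simple_threshold(gray, level=128):
    """Global cutoff: a pixel counts as ink (True) if it is darker than `level`."""
    return gray < level


def adaptive_threshold(gray, window=15, offset=10):
    """Local-mean cutoff: ink if darker than the neighborhood mean minus `offset`.

    `window` is assumed to be odd. The mean is computed with an integral image.
    """
    pad = window // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    total = (
        integral[window:window + h, window:window + w]
        - integral[:h, window:window + w]
        - integral[window:window + h, :w]
        + integral[:h, :w]
    )
    local_mean = total / (window * window)
    return gray < local_mean - offset


def make_page():
    """Hypothetical test page: white left half, shaded right half, thin dark strokes."""
    page = np.full((40, 120), 255, dtype=np.uint8)  # white paper
    page[:, 60:] = 110                               # shaded / highlighted zone
    ink = np.zeros(page.shape, dtype=bool)
    ink[18:22, 10:50:4] = True                       # strokes on white
    ink[18:22, 70:110:4] = True                      # strokes on the shaded zone
    page[ink] = 30
    return page, ink


page, ink = make_page()
simple = simple_threshold(page)
adaptive = adaptive_threshold(page)

print("true ink pixels in shaded zone:   ", int(ink[:, 68:].sum()))
print("simple marks as ink in that zone: ", int(simple[:, 68:].sum()))
print("adaptive marks as ink in that zone:", int(adaptive[:, 68:].sum()))
```

With these made-up values the global cutoff turns the whole shaded zone black, so the strokes inside it disappear, while the local-mean version separates the strokes from the shading around them. A local-mean threshold can still leave a dark fringe right at the edge where the shading starts, so window size and offset usually need tuning against real scans.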
Director of Strategy