OCR Transposing Lines

Sdurbin · July 2020

I'm running into an issue where OCR'd lines of text sometimes transpose midway through, which makes it very difficult to write effective extractors. Granted, this typically happens when image quality is poor, as in the following horrible example, but I was hoping this was a known issue for which there might be some recommendations. Unfortunately, I had to obscure sensitive data in the example, but if necessary I can try to reproduce the same results with sterile data.

Sample from image:

Sample from OCR'd text:

Grooper version: 2.72.0022

The OCR profile we're using is one we were provided initially, named "Full Text - Accurate Global IP OCR3", if that means anything. If there is an easy way to extract the IP and OCR profile settings, please let me know.

strotelli · July 2020

@Sdurbin Create a new OCR profile with your own Image Processing Profile. In this profile, try tweaking the Line Removal command's properties, such as weight/orientation.

Note that it's important to make a new OCR profile and a new IP profile, so as not to affect other processes/content models that depend on the existing profile working the way it does right now.

GrooperGuru · July 2020

We refer to this process as Synthesis: determining where spaces should be placed between two characters on the same line, and which characters belong together on the same line to begin with. The synthesis logic inside of Grooper takes place after all other logic is performed to collect the characters from the image. In the example you posted above, it appears that there are specks and other noise on the image somewhere between two lines of text. This is likely confusing the synthesis engine. I have three possible suggestions.

1. Fine-tune the Image Processing Profile used during OCR. In this profile, take a close look at each command available under the Feature Removal category, in particular Speck Removal and Line Removal. If you can do a better job of removing the non-text artifacts from the image prior to OCR, the issue may disappear completely.
2. If using the Transym OCR Engine, enable Reject Questionable Lines and/or Reject Questionable Characters. These two settings use techniques to remove characters that are likely junk. There is also an OCR Profile setting called Eliminate Isolated Symbols that you can use with any profile. If these junk characters are removed, the synthesis will likely perform much better.
3. Disable synthesis in Grooper completely and accept whatever synthesis the OCR engine gives you. This will typically create undesirable effects but, in rare cases, can produce better results.
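
To make the failure mode concrete, here is a toy Python sketch of how a stray artifact sitting between two lines can make a naive line-grouping pass jump from one line to the other partway through. This is not Grooper's synthesis logic, which isn't described in this thread. The Glyph type, the overlap rule, and the height-based filter are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """One recognized character and its bounding box (hypothetical type)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float


def v_overlap(a, b):
    """Vertical overlap between two glyph boxes (0 if they do not overlap)."""
    return max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))


def synthesize(glyphs):
    """Naive line grouping: scan left to right and attach each glyph to the
    first line whose most recent glyph it overlaps vertically."""
    lines = []
    for g in sorted(glyphs, key=lambda g: g.x):
        for line in lines:
            if v_overlap(line[-1], g) > 0:
                line.append(g)
                break
        else:
            lines.append([g])  # no overlapping line found: start a new one
    lines.sort(key=lambda line: (line[0].y, line[0].x))
    return ["".join(g.text for g in line) for line in lines]


def drop_artifacts(glyphs, low=0.5, high=1.5):
    """Assumed stand-in for artifact removal: keep only glyphs whose height
    is close to the median glyph height."""
    ref = median(g.h for g in glyphs)
    return [g for g in glyphs if low * ref <= g.h <= high * ref]


# Two text lines ("ABCD" above "efgh") plus a tall stray mark between the
# second and third characters that spans both lines.
page = [
    Glyph("A", 0, 0, 8, 10), Glyph("B", 10, 0, 8, 10),
    Glyph("C", 30, 0, 8, 10), Glyph("D", 40, 0, 8, 10),
    Glyph("e", 2, 14, 8, 10), Glyph("f", 12, 14, 8, 10),
    Glyph("g", 28, 14, 8, 10), Glyph("h", 38, 14, 8, 10),
    Glyph("|", 20, 2, 1, 20),  # the artifact
]

print(synthesize(page))                  # ['AB|gh', 'CD', 'ef']
print(synthesize(drop_artifacts(page)))  # ['ABCD', 'efgh']
```

With the artifact present, the top line picks up the tail of the bottom line, which is the kind of mid-line transposition described in the original post. Once the artifact is filtered out, the same grouping pass recovers both lines. A height filter this crude would also throw away legitimate small punctuation, so treat it only as an illustration of why cleaning the image before synthesis helps.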
