OCR Correction

GrooperGuru Posts: 366 admin
This discussion was created from comments split from: Permanently Change Bad OCR for Text Searchable PDF.
Matt Harrison
Director of Strategy
[email protected]

Comments

  • GrooperGuru Posts: 366 admin
    edited February 2018
    Below is a screenshot of the configuration panel of the Correct OCR activity. This activity offers three techniques for improving existing OCR results, and you can use any combination of them independently.

    1. Text Removal - The removal feature simply runs a "Removal Extractor" against the OCR of the document. Any place a match is found, that text will be removed from the OCR results.

    2. Spell Correction - This feature runs a "Correction Extractor" against the OCR of the document. For each match found on the document, the matched input text is replaced with the output text produced by the extractor. With an exact match and no output formatting, the output is identical to the input, so nothing changes. This feature therefore only affects the document when the extractor uses fuzzy matching or output formatting.

    3. Word Splitting - OCR sometimes does a poor job of detecting and inserting spaces in text. This results in multiple words being crammed together as one. The word splitting feature is used to attempt to insert spaces where this situation is detected. Let's discuss a scenario where OCR generated the following sentence:
      "This is a samplesentence where somewordsrun together."

      The feature starts by running a pattern that detects each sequence of non-space characters. At that point, it would see the following results:
      "This", "is", "a", "samplesentence", "where", "somewordsrun", and "together".

      Next, it determines whether any of these are not words. By default (that is, when no Word Split Vocabulary is specified), it compares each entry to the complete English Words lexicon that ships with Grooper. If you'd like to run against a different set of eligible words, you can provide your own Word Split Vocabulary lexicon. We'll just use the out-of-the-box lexicon. At that point, the analysis has determined that two of these entries are not words: "samplesentence" and "somewordsrun".

      Even though OCR didn't see the spaces here, we (as humans) can see that there is at least a little space between adjacent letters... just not enough for OCR to know to add a space. We can also see that some of these gaps are slightly larger than others. The splitting algorithm starts where the largest gap occurs, splits the string at that point, and checks whether either of the resulting pieces is a word in your lexicon. If so, the split is saved, and the feature moves on to the next non-word it found.

      The minimum word length is a number of characters. If a non-word is found, word splitting will only be attempted if the string is at least this many characters long. A good starting point is usually somewhere around 3-5 characters. The idea is that you wouldn't want to split extremely small strings that are only one or two characters long.
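
    The word splitting steps above can be sketched roughly in Python. This is only an illustration, not Grooper's actual implementation: the function names, the tiny VOCABULARY set (a stand-in for the English Words lexicon), and the per-character gap widths are all made up for the example, and splitting each side again after a successful split is my assumption about how a string like "somewordsrun" ends up as three words.

```python
import re

# Hypothetical stand-in for the English Words lexicon (or a custom
# Word Split Vocabulary). A real lexicon would be far larger.
VOCABULARY = {"this", "is", "a", "sample", "sentence", "where",
              "some", "words", "run", "together"}


def is_word(text, vocabulary):
    """Case-insensitive lookup that ignores surrounding punctuation."""
    return text.strip(".,;:!?\"'").lower() in vocabulary


def split_token(token, gaps, vocabulary, min_length=4):
    """Split one run-together token at its widest gaps.

    gaps[i] is the (assumed) measured space between token[i] and
    token[i + 1], so len(gaps) == len(token) - 1.
    """
    # Leave real words and very short strings alone.
    if is_word(token, vocabulary) or len(token) < min_length:
        return [token]
    # Try candidate split points from the widest gap to the narrowest.
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    for i in order:
        left, right = token[:i + 1], token[i + 1:]
        if is_word(left, vocabulary) or is_word(right, vocabulary):
            # Assumption: keep the split, then try each side again.
            return (split_token(left, gaps[:i], vocabulary, min_length)
                    + split_token(right, gaps[i + 1:], vocabulary, min_length))
    return [token]  # no acceptable split found


def split_run_together_words(text, gap_widths, vocabulary, min_length=4):
    """Apply split_token to every sequence of non-space characters."""
    pieces = []
    for token in re.findall(r"\S+", text):
        gaps = gap_widths.get(token)
        if gaps is None:
            pieces.append(token)  # no gap measurements for this token
        else:
            pieces.extend(split_token(token, gaps, vocabulary, min_length))
    return " ".join(pieces)


# Made-up gap widths: the wider values mark where the words meet.
gap_widths = {
    "samplesentence": [1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1],
    "somewordsrun": [1, 1, 1, 3, 1, 1, 1, 1, 4, 1, 1],
}
text = "This is a samplesentence where somewordsrun together."
print(split_run_together_words(text, gap_widths, VOCABULARY))
```

    In this sketch the min_length argument plays the role of the minimum word length setting: any non-word shorter than that is returned untouched.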

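    For the first two techniques (Text Removal and Spell Correction), here is a similarly rough Python sketch of the idea. Grooper's extractors are configured in the product rather than written like this, so treat everything here as an assumption for illustration: remove_matches and correct_matches are hypothetical names, a plain regular expression stands in for the Removal Extractor, and difflib's similarity cutoff stands in for fuzzy matching.

```python
import difflib
import re


def remove_matches(text, removal_pattern):
    """Text Removal idea: delete every match of the pattern."""
    return re.sub(removal_pattern, "", text)


def correct_matches(text, expected_terms, cutoff=0.8):
    """Spell Correction idea: replace each fuzzy match with the
    expected term it most closely resembles."""
    def replace(match):
        word = match.group(0)
        close = difflib.get_close_matches(word, expected_terms,
                                          n=1, cutoff=cutoff)
        # No close term: leave the OCR text as it was.
        return close[0] if close else word
    return re.sub(r"[A-Za-z0-9]+", replace, text)


terms = ["Invoice", "Number"]  # hypothetical list of expected terms
print(remove_matches("Page 1 of 3 Invoice", r"Page \d+ of \d+\s*"))
print(correct_matches("lnvoice Number 123", terms))
print(correct_matches("Invoice Number 123", terms))
```

    The last call shows why exact matching alone changes nothing: text that already matches an expected term exactly is replaced with itself.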