Correcting OCR Data after Form Identification

hjanum · October 2018

We handle multiple forms in our solution. After the documents have been processed by Grooper they proceed to a downstream process where they are digitally signed. The digital signature solution places visible PDF signature(s) based on the positions of certain words and phrases in the OCR text of the document. It is therefore crucial that certain words/phrases are (mostly) correct in the document. We do have some fuzzy matching, but not as advanced as in Grooper.

We do use a multi-engine OCR approach in Grooper that uses an extractor to fix some of the labels, but that will not always work for us.

Take the following example.

Say we include the string "Form 118" in the lexicon and set a match percentage of, say, 90%. This will fix the string "Form l18", where the first "1" has been read as the lowercase letter L. We have one form 118 document where the OCR result is "Form 116". I could allow replacement of "6" with "8" using weights, and that would fix that particular document. The problem is that we also have a form 116 whose text correctly reads "Form 116", and in that case we would not want to correct the "6" to an "8".
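To make the ambiguity above concrete, here is a minimal sketch using plain, unweighted Levenshtein distance. This is an assumption for illustration only: Grooper's fuzzy matching uses its own weighted scoring, and `edit_distance` is a hypothetical helper, not a Grooper function. It shows that both OCR readings are a single substitution away from the lexicon entry, so a distance threshold alone cannot tell a misread form 118 from a genuine form 116.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: each insert, delete, or substitution costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


lexicon_entry = "Form 118"
for ocr_text in ("Form l18", "Form 116"):
    # Both readings differ from the lexicon entry by exactly one character.
    print(ocr_text, edit_distance(ocr_text, lexicon_entry))
```

Weights can make the "l" to "1" substitution cheaper than "6" to "8", but once "6" to "8" is allowed at all, the text alone cannot say which form the page came from.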

We do, however, have a well-tuned form identification in Grooper using weights. It works VERY well and has not misidentified a form yet. So later in the process we know for sure that we are dealing with a form 118, and in that case it is perfectly fine to replace the "6" with an "8" in the OCR results.

Is there a "Correct"-type process that can be used to correct OCR results based on replacements/lexicons, like the multi-engine OCR approach?
I see the Correct process, but I don't see the option to use extractors in it. I would also need some way to pick different lexicons based on document type, as the corrections I want to make depend on the form type.

Is this possible?

GrooperGuru · October 2018

There is a Correct activity designed to change the raw OCR results of the doc/page it is run against. This is configured with a Correction extractor. The idea here is that you can write whatever pattern you want, and the input string of each match is replaced with the output of the pattern. Therefore, you can use techniques like Fuzzy Matching, Hardcoded output formats, and translation within the pattern itself.
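As a rough illustration of the idea of replacing each match with a fixed output, here is a small Python sketch. It is not Grooper configuration: the pattern, the `MISREAD_118` name, and the `correct_text` function are all hypothetical, and a real Correction extractor would be set up inside Grooper rather than in code.

```python
import re

# Hypothetical pattern: "Form" followed by "118" where the first digit
# may have been misread as a lowercase L, an uppercase I, or a pipe.
MISREAD_118 = re.compile(r"Form\s+[1lI|]18\b")


def correct_text(ocr_text: str) -> str:
    """Replace every match of the pattern with the hardcoded output "Form 118"."""
    return MISREAD_118.sub("Form 118", ocr_text)


print(correct_text("See Form l18 below"))
```

Text that does not match the pattern, such as "Form 116", is left unchanged.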

There is also a removal extractor you can use within the Correct activity. Its job is simply to find text in the OCR results and delete it.

hjanum · October 2018

Thanks for the response.

The question was mostly about doing the correction specific to the identified form type. Having thought about it some more, I think I can do it with the "Correct" activity if I add a "Should Submit Expression" that checks the form type, so that for Form A I run one type of correction and for Form B I run another.

I would then have one Correct activity per form type each with its own Correction Extractor.
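The logic of that setup can be sketched in Python as a lookup from the identified form type to its own correction. This is only a model of the idea, under stated assumptions: `CORRECTIONS_BY_FORM` and `correct_for_form` are hypothetical names, the form-type check stands in for the "Should Submit Expression", and the real gating would be configured in Grooper.

```python
import re
from typing import Callable, Dict

Correction = Callable[[str], str]

# Hypothetical per-form corrections, one entry per form type.
# Only form 118 is allowed to turn a misread "Form 116" into "Form 118".
CORRECTIONS_BY_FORM: Dict[str, Correction] = {
    "Form 118": lambda text: re.sub(r"Form\s+116\b", "Form 118", text),
}


def correct_for_form(form_type: str, ocr_text: str) -> str:
    """Apply the correction registered for the identified form type, if any."""
    correction = CORRECTIONS_BY_FORM.get(form_type)
    # No entry means no correction should run for this form type.
    return correction(ocr_text) if correction else ocr_text


print(correct_for_form("Form 118", "Form 116 page 1"))  # corrected
print(correct_for_form("Form 116", "Form 116 page 1"))  # left as-is
```

Because the correction is chosen by the identified form type rather than by the text, a genuine form 116 is never rewritten.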
