Recognition incorrect text issue

Prasadchitikela · March 2022

Hi all,
When I process PDF or Excel files, the recognized text is wrong in some cases, for example 'S' becomes 5 or 'G' becomes 6. How can we fix this? Is there any technique to extract the text correctly without these mismatches? Please see the screenshot below and help us.
Thanks in advance.

tgarnett · March 2022

Hi @Prasadchitikela,

First, I just want to check - since you're dealing with PDFs, do these files already have native text? If the original PDF allows you to highlight text like this, you can use Native Text Extraction instead of an OCR Profile to get perfect results.

Image: https://us.v-cdn.net/6030453/uploads/editor/pr/yk5tic742g46.png

If we do need to rely on OCR, there are a lot of ways we can try to improve the results but they will never be perfect. A good place to start would be the IP Profile attached to your OCR profile. Go to this IP profile and make sure the text on the black & white image it creates looks nice and clean. This looks like a nice clean image though, so I don't think you will find many issues. I've attached a generic OCR profile of mine. You can import it and see if it gets better results.

If we can't get the OCR to read the letters properly, you can still fix this at Extract. A pattern looking for 3 digits, a hyphen, 2 digits and a letter will not pick up 111-345 (the OCR read the final letter as a digit), but if you turn on Fuzzy matching with weightings applied, it will correct the value.

In the example below, my pattern is \d{3}-\d{2}[A-Z]. By applying the Fuzzy Match Weightings lexicon, it gets the correct result with 96% confidence.
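
To make the idea concrete outside of Grooper, here is a rough Python sketch of the same kind of correction. It is not how Fuzzy matching is implemented in the product; the confusion table, the `repair` and `extract` helpers, and the confidence formula are all assumptions made up for illustration. It finds anything with the right shape, swaps commonly confused characters into the class the pattern expects at each position, and then checks the result against the strict pattern.

```python
import re

# Assumed confusion pairs: digit the OCR produced -> letter it may have been.
LETTER_FOR_DIGIT = {"5": "S", "6": "G", "0": "O", "1": "I", "8": "B"}
DIGIT_FOR_LETTER = {letter: digit for digit, letter in LETTER_FOR_DIGIT.items()}

STRICT = re.compile(r"\d{3}-\d{2}[A-Z]")        # the pattern from the post
LOOSE = re.compile(r"[0-9A-Z]{3}-[0-9A-Z]{3}")  # same shape, any character class
SHAPE = "ddd-ddL"                               # d = digit, L = letter

DIGITS = "0123456789"


def repair(token):
    """Swap confusable characters so each position has the expected class.

    Returns (repaired_text, number_of_swaps), or None if a character
    has no known look-alike in the expected class.
    """
    fixed, swaps = [], 0
    for ch, kind in zip(token, SHAPE):
        if kind == "d" and ch not in DIGITS:
            ch, swaps = DIGIT_FOR_LETTER.get(ch), swaps + 1
        elif kind == "L" and ch in DIGITS:
            ch, swaps = LETTER_FOR_DIGIT.get(ch), swaps + 1
        if ch is None:
            return None
        fixed.append(ch)
    return "".join(fixed), swaps


def extract(text):
    """Return (value, confidence) pairs for every repairable candidate."""
    results = []
    for match in LOOSE.finditer(text):
        repaired = repair(match.group())
        if repaired is None:
            continue
        value, swaps = repaired
        if STRICT.fullmatch(value):
            # Made-up score: lose an equal share of confidence per swap.
            results.append((value, round(1 - swaps / len(SHAPE), 2)))
    return results


print(extract("Part no. 111-345 received"))  # [('111-34S', 0.86)]
```

The score here only counts swaps, so it will not line up with the 96% that the Fuzzy Match Weightings lexicon reports; a real weighting scheme would charge less for likely confusions (S/5) than for unlikely ones.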
