Introduction to Grooper OCR

jongoesboom · January 2018

OCR stands for optical character recognition. Put simply, it’s how a computer reads letters from images. Before someone learns to read, letters on a page just look like an assortment of symbols without any underlying meaning. For a computer it’s even worse: in a scanned document, the computer doesn’t even know a symbol is a letter; it sees nothing more than a combination of pixels. Running a document through the OCR process is how the computer takes an image and, line by line, finds combinations of pixels that it ultimately determines are letters, spaces, special characters, etc.

To use all the power of Grooper, pages must go through an OCR process. There are some very simple batch processes that only convert an image to a PDF, and in those cases OCR is not needed. However, that’s like buying a Ferrari and only driving it to and from the grocery store – not practical. All the power of Grooper, from separation to classification, extraction, and so forth, leverages the computer’s ability to read the document and therefore requires a page to be OCR’d.

How OCR Works

To get the most out of Grooper OCR, it is important to have a basic understanding of how OCR engines work.

Image Segmentation: The first step in the OCR process is finding where the lines are in the image. This is done by computing and analyzing a horizontal projection profile. As you can see below, to the left of each line is a graphical count of how many black pixels appear on each row of the document. The line breaks occur on the rows that contain no black pixels.
Line Segmentation: Now that we have the lines, we have to break the characters out in a similar fashion. Instead of running the projection profile horizontally, we now run it vertically to find the white space between the letters. In the image below, you can see the characters break at the white space the same as the line breaks.
Character Identification: Now that we know where the line breaks and character breaks are, we take each character, break it into a 5x5 matrix, and measure the intensity in each cell. This gives us a calculated vector that we can then compare to the training data that exists for each character. The result is a confidence percentage for each possible character.
Synthesis: The last step is to take the character data and order it into a text flow based on where the characters appeared on the page. Each character’s position is converted into an x,y coordinate, and spaces and control characters are inserted where necessary.
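
To make the first three steps more concrete, here is a minimal Python sketch. It is not Grooper’s code or any particular engine’s code. It assumes a page that has already been converted to black-and-white and stored as a list of rows (1 = black pixel, 0 = white pixel), and the names projection_profile, runs_of_ink, zone_vector and similarity are made up for this illustration. The similarity function in particular is just one simple stand-in for the "compare to the training data" step; real engines may score matches very differently.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed format: 1 = black pixel, 0 = white pixel


def projection_profile(image: Image, axis: str) -> List[int]:
    """Count black pixels per row (axis="rows") or per column (axis="cols")."""
    if axis == "rows":
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def runs_of_ink(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) ranges where the profile is non-zero.

    The gaps between the ranges are the all-white rows or columns,
    i.e. the line breaks and character breaks.
    """
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def zone_vector(glyph: Image, grid: int = 5) -> List[float]:
    """Split a glyph into grid x grid cells and return each cell's ink density."""
    height, width = len(glyph), len(glyph[0])
    vector = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cells = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            vector.append(sum(cells) / len(cells) if cells else 0.0)
    return vector


def similarity(a: List[float], b: List[float]) -> float:
    """Hypothetical score: 1.0 for identical vectors, lower as they differ."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
lines = runs_of_ink(projection_profile(page, "rows"))         # [(1, 3), (4, 5)]
top, bottom = lines[0]
chars = runs_of_ink(projection_profile(page[top:bottom], "cols"))  # [(0, 2), (3, 5)]
```

Each character range found this way would then be cropped out, turned into a 25-value vector with zone_vector, and scored against a stored vector for every known character.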

Why Grooper?

OCR engines are responsible for most of the heavy lifting to recognize characters, but Grooper has unique functionality, as listed below, that works with the engine to come up with some amazing results.

Iterative OCR

OCR engines, while powerful, are not perfect. Grooper’s iterative OCR accounts for this by letting the OCR engine go back and rerun OCR on areas it did not get the first time. Iterative OCR improves accuracy by performing OCR in multiple passes. On each successive pass, characters recognized in previous passes are digitally dropped out of the image.
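
The multi-pass idea can be sketched in a few lines of Python. This is only an illustration of the loop described above, not Grooper’s implementation: the engine argument is a hypothetical callable that takes an image and returns the text it recognized along with a bounding box for each hit, and iterative_ocr, drop_out and max_passes are names invented here.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]                # assumed format: 1 = black, 0 = white
Box = Tuple[int, int, int, int]        # left, top, right, bottom (exclusive)
Hit = Tuple[str, Box]                  # recognized text and where it was found
Engine = Callable[[Image], List[Hit]]  # hypothetical OCR engine interface


def drop_out(image: Image, boxes: List[Box]) -> Image:
    """Return a copy of the image with the given boxes painted white."""
    result = [row[:] for row in image]
    for left, top, right, bottom in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                result[y][x] = 0
    return result


def iterative_ocr(image: Image, engine: Engine, max_passes: int = 3) -> List[Hit]:
    """Run the engine repeatedly, erasing recognized text before each new pass."""
    hits: List[Hit] = []
    remaining = image
    for _ in range(max_passes):
        new_hits = engine(remaining)
        if not new_hits:
            break  # nothing more to find
        hits.extend(new_hits)
        remaining = drop_out(remaining, [box for _, box in new_hits])
    return hits
```

With the already-recognized characters removed, each later pass gives the engine a cleaner image containing only the areas it missed.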

Cell Validation

When you think of words on paper, you most likely think of a book. Books normally contain words that start on the left and flow to the right. OCR engines work by reading a document as if it were written like a book. What if you have a document with multiple columns, or documents with tables that don’t have a normal book flow? Cell validation essentially divides the page into a grid and then performs OCR on each piece of the grid independently. It then takes these results and merges them into the full page results.
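
Here is a rough Python sketch of the grid idea, under the same assumptions as before: the engine is a hypothetical callable that returns recognized text with bounding boxes, and the names split_into_cells and ocr_by_cell are invented for this example. How Grooper actually chooses the grid and reconciles the cell results with the full-page results is not covered here; the sketch only crops each cell, runs OCR on it, and shifts the results back into page coordinates.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]                # assumed format: 1 = black, 0 = white
Box = Tuple[int, int, int, int]        # left, top, right, bottom (exclusive)
Hit = Tuple[str, Box]                  # recognized text and where it was found
Engine = Callable[[Image], List[Hit]]  # hypothetical OCR engine interface


def split_into_cells(width: int, height: int, cols: int, rows: int) -> List[Box]:
    """Divide a width x height page into a cols x rows grid of boxes."""
    cells = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            cells.append((left, top, right, bottom))
    return cells


def ocr_by_cell(image: Image, engine: Engine, cols: int, rows: int) -> List[Hit]:
    """OCR each grid cell on its own and report hits in page coordinates."""
    height, width = len(image), len(image[0])
    hits: List[Hit] = []
    for left, top, right, bottom in split_into_cells(width, height, cols, rows):
        crop = [row[left:right] for row in image[top:bottom]]
        for text, (l, t, r, b) in engine(crop):
            # Shift the cell-relative box back to where it sits on the page.
            hits.append((text, (l + left, t + top, r + left, b + top)))
    return hits
```

Because each cell is read on its own, text in one column or table cell cannot run together with text in its neighbor.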

Synthesis

OCR engines work best when text is in straight lines across the page. I’m sure we’ve all run across images where the text is angled, either at a consistent angle across the page or with only the beginning or end of the lines skewed. In some cases, this can be corrected by running a simple de-skew on the image. However, if the text is not all at the same angle, de-skew cannot fix it. With synthesis, Grooper can take the results from the OCR engine and re-synthesize them.
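
As a rough illustration of what re-synthesizing could look like, the Python sketch below rebuilds text lines from individual character positions. It is an assumption-heavy toy, not Grooper’s algorithm: each character is a tuple of (text, left, right, y), and the synthesize function, y_tolerance and space_width are all hypothetical. Instead of expecting every character on a line to share the same y value, it attaches each character to the line whose most recent character is vertically closest, so a line that drifts gradually up or down still stays together.

```python
from typing import List, Tuple

Char = Tuple[str, float, float, float]  # text, left, right, y (vertical center)


def synthesize(chars: List[Char], y_tolerance: float, space_width: float) -> str:
    """Rebuild a text flow from character positions, following sloped lines."""
    lines: List[List[Char]] = []
    for char in sorted(chars, key=lambda c: c[1]):  # left to right
        y = char[3]
        best = None
        for line in lines:
            distance = abs(line[-1][3] - y)
            if distance <= y_tolerance and (
                best is None or distance < abs(best[-1][3] - y)
            ):
                best = line
        if best is None:
            lines.append([char])  # no nearby line, so start a new one
        else:
            best.append(char)

    lines.sort(key=lambda line: line[0][3])  # top to bottom
    rendered = []
    for line in lines:
        text = line[0][0]
        for prev, cur in zip(line, line[1:]):
            if cur[1] - prev[2] >= space_width:
                text += " "  # gap is wide enough to count as a space
            text += cur[0]
        rendered.append(text)
    return "\n".join(rendered)
```

A single page-wide de-skew assumes one angle for everything; following each line locally like this is one way to cope with lines that bend or tilt independently.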

Segments

When Grooper uses synthesis to process OCR results, it breaks the results into segments. Segments are blocks of text separated by white space that is larger than an average space character (see below: all the green highlighted blocks are segments). Each segment has a confidence value that is essentially how confident Grooper is that the characters the OCR engine identified are correct. With segment processing, you can specify a confidence value, and Grooper will go back and re-run OCR on only the segments that fall below it.
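
A small Python sketch of segment processing follows. The Char and Segment classes, the max_gap parameter (standing in for the width of an average space), and the choice to average character confidences into a segment confidence are all assumptions made for this example; the article does not say how Grooper computes the value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    left: int
    right: int
    confidence: float  # 0.0 to 1.0, assumed to come from the OCR engine


@dataclass
class Segment:
    chars: List[Char]

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.chars)

    @property
    def confidence(self) -> float:
        # Assumption: a segment's confidence is the mean of its characters'.
        return sum(c.confidence for c in self.chars) / len(self.chars)


def split_segments(chars: List[Char], max_gap: int) -> List[Segment]:
    """Start a new segment wherever the white space exceeds max_gap."""
    segments: List[Segment] = []
    for char in sorted(chars, key=lambda c: c.left):
        if segments and char.left - segments[-1].chars[-1].right <= max_gap:
            segments[-1].chars.append(char)
        else:
            segments.append(Segment([char]))
    return segments


def segments_to_reprocess(segments: List[Segment], threshold: float) -> List[Segment]:
    """Pick the segments whose confidence falls below the threshold."""
    return [s for s in segments if s.confidence < threshold]


line = [
    Char("A", 0, 5, 0.90), Char("B", 6, 11, 0.95),
    Char("C", 30, 35, 0.40), Char("D", 36, 41, 0.50),
]
segments = split_segments(line, max_gap=8)
weak = segments_to_reprocess(segments, threshold=0.8)
```

In this example the line splits into the segments "AB" and "CD", and only "CD" falls below the 0.8 threshold, so only that block would be sent back through OCR.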