Extracting Data horizontally and vertically

hcox · February 2022

Hello,

Is there a way to extract data from a document, horizontally first [collecting court case number, date filed, etc.] and then rotate the page to collect data listed vertically on the same page [for example, Judge name]? Unfortunately, not all of our documents will follow this pattern. Sometimes the Judge name will also be listed horizontally along with the other data we're extracting.

Thank you,
Heather

tgarnett · February 2022

Hi @hcox,

I spoke with our developer team and they confirmed that Grooper will not be able to obtain vertical text segments native to the PDF when running "Native Text Extraction" during Recognize. However, you could rotate the page 90 degrees and run a first pass at Recognize just to extract this one line of text.

Here's how I see this working. If you aren't already doing so, first use the Split Pages activity to create page objects from your imported PDF file. Then:
1. Run the pages through an Image Processing (IP) step set to rotate them 90 degrees.
2. Recognize activity (page scope) with no OCR profile specified and Native Text Extraction set to Full. This should grab the vertical text from the right side of the page.
3. An Extract step populates any necessary data from this vertical text based on your extractors.
4. Another IP step set to rotate pages back to their initial orientation.
5. Repeat step 2. Native Text Extraction will now grab the horizontal lines from the page.
6. Another Extract step, this time set to Additive mode. This will grab all other data necessary from the document, and the Additive mode will prevent it from overriding your previous results.
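To illustrate what the two-pass approach above is meant to achieve, here is a minimal sketch in plain Python. It is not Grooper's API or its actual implementation of Additive mode. The function name `additive_merge`, the field names, and the sample values are all hypothetical. It only shows the intended behavior: the second pass fills fields that are still empty and leaves earlier results alone.

```python
def additive_merge(existing, new_values):
    """Return a copy of `existing` with only its empty fields filled from `new_values`.

    Hypothetical illustration of "additive" behavior; not Grooper code.
    """
    merged = dict(existing)
    for field, value in new_values.items():
        # Treat None and empty strings as "not yet extracted".
        if merged.get(field) in (None, ""):
            merged[field] = value
    return merged


# Pass 1 (page rotated): only the vertical text is readable. Sample values are made up.
first_pass = {"judge_name": "J. Smith", "case_number": None, "date_filed": None}

# Pass 2 (original orientation): the horizontal fields are readable.
second_pass = {"judge_name": "", "case_number": "22-CV-0001", "date_filed": "2022-02-14"}

result = additive_merge(first_pass, second_pass)
print(result)
```

The same merge also covers documents where the Judge name is printed horizontally: the first pass leaves that field empty, so the second pass fills it.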

hcox · March 2022

Hi @tgarnett,

I gave this a try. It keeps the data extracted first from the vertical text; however, I couldn't get Additive mode to work on the second Extract step. I also couldn't get steps 2 and 5 to run without an OCR profile. This is the error message I received when I tried that: "No OCR profile is specified. Recognize cannot process image-based content without an OCR Profile."

tgarnett · March 2022

@hcox

Which version of Grooper are you currently using? You can check by clicking Help -> About in the top left menu. As long as the PDF you're working off of has native text like the example you provided, Recognize should work without OCR whether you run it on a page or folder object.

If the document you're working with doesn't have native text, you'll just have to use an OCR profile instead on steps 2 and 5.

hcox · March 2022

We have Version 2021.00.0023. We don't get the error message in a published batch process, only when we try it in an ad hoc apply activity.
