Extracting Data horizontally and vertically

hcox Posts: 4
edited February 2022 in The Astronauts (Q&A)
Hello,

Is there a way to extract data from a document, horizontally first [collecting court case number, date filed, etc.] and then rotate the page to collect data listed vertically on the same page [for example, Judge name]? Unfortunately, not all of our documents will follow this pattern. Sometimes the Judge name will also be listed horizontally along with the other data we're extracting.

Thank you,
Heather

Heather S. Cox  |  Business Analyst  |  Manley Deas Kochalski LLC

Answers

  • tgarnett Posts: 76 ✭✭✭
    edited February 2022
    Hi @hcox ,

    I spoke with our developer team and they confirmed that Grooper will not be able to obtain vertical text segments native to the PDF when running "Native Text Extraction" during Recognize. However, you could rotate the page 90 degrees and run a first pass at Recognize just to extract this one line of text.

    Here's how I see this working:
    Before you start: if you aren't already doing so, use the Split Pages activity to create page objects from your imported PDF file.
    1. Run images through an Image Processing step set to rotate pages 90 degrees.
    2. Recognize activity (page scope) with no OCR profile specified and Native Text Extraction set to Full. This should grab the vertical text from the right side of the page.
    3. An Extract step populates any necessary data from this vertical text based on your extractors.
    4. Another Image Processing step set to rotate pages back to their initial orientation.
    5. Repeat step 2. Native Text Extraction will now grab the horizontal lines from the page.
    6. Another Extract step, this time set to Additive mode. This will grab all other data necessary from the document, and Additive mode will prevent it from overwriting your previous results.
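
To make the two-pass idea concrete, here is a minimal sketch in plain Python. It is not Grooper's API: the function names, the regex patterns, and the sample strings are all made up for illustration, and the "additive" merge shown is only an assumption about the behavior described above (keep what the first pass captured, fill in only the fields that are still empty).

```python
import re
from typing import Callable, Dict, Optional

Fields = Dict[str, Optional[str]]


def find(pattern: str) -> Callable[[str], Optional[str]]:
    """Build a toy extractor that returns the first capture group, or None."""
    compiled = re.compile(pattern)

    def run(text: str) -> Optional[str]:
        match = compiled.search(text)
        return match.group(1) if match else None

    return run


# Hypothetical extractors; real patterns depend on your documents.
EXTRACTORS = {
    "judge": find(r"Judge:\s*([A-Za-z .]+)"),
    "case_number": find(r"Case No\.\s*(\S+)"),
    "date_filed": find(r"Date Filed:\s*(\S+)"),
}


def extract_pass(text: str) -> Fields:
    """Run every extractor over one pass of text; a miss is stored as None."""
    return {name: extractor(text) for name, extractor in EXTRACTORS.items()}


def merge_additive(existing: Fields, new: Fields) -> Fields:
    """Keep values already captured; only fill fields that are still empty."""
    merged = dict(existing)
    for name, value in new.items():
        if merged.get(name) is None and value is not None:
            merged[name] = value
    return merged


def two_pass(rotated_text: str, upright_text: str) -> Fields:
    """First pass reads the rotated page text, second pass reads the upright text."""
    first = extract_pass(rotated_text)
    second = extract_pass(upright_text)
    return merge_additive(first, second)
```

Because the second pass only fills gaps, the same flow also covers documents where the Judge name is printed horizontally: the rotated pass finds nothing for that field, and the upright pass supplies it.
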
  • hcox Posts: 4
    Hi @tgarnett,

    I gave this a try. It keeps the data extracted first from the vertical text; however, I couldn't get Additive mode to work on the second Extract step. I also couldn't get steps 2 and 5 to run without an OCR profile. This is the error message I received when I tried that: "No OCR profile is specified. Recognize cannot process image-based content without an OCR Profile."


  • tgarnett Posts: 76 ✭✭✭
    @hcox

    Which version of Grooper are you currently using? You can check by clicking Help -> About in the top-left menu. As long as the PDF you're working from has native text like the example you provided, Recognize should work without OCR whether you run it on a page or folder object.

    If the document you're working with doesn't have native text, you'll just have to use an OCR profile instead on steps 2 and 5.
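
As a rough illustration of that rule, the decision can be written as a small plain-Python function. Everything here is hypothetical (the function name, the return values, and the error text are not Grooper settings); it only encodes the logic stated above: native text means no OCR profile is needed, and image-only content requires one.

```python
from typing import Optional


def choose_recognize_mode(has_native_text: bool, ocr_profile: Optional[str] = None) -> str:
    """Return 'native' when the PDF carries its own text, otherwise 'ocr'."""
    if has_native_text:
        # Native text can be read directly; any OCR profile is simply unused.
        return "native"
    if ocr_profile is None:
        # Image-only content cannot be read without OCR.
        raise ValueError("image-only content needs an OCR profile")
    return "ocr"
```
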
  • hcox Posts: 4
    We have Version 2021.00.0023. We don't get the error message in a published batch process, only when we try it in an ad hoc apply activity.

