
Header - Value Table Extractor (Values Extend Beyond Headers)

I ran into a table where the values are wider than the header and begin to the left of the header text. When using a Header - Value extractor, the values only include the portion directly under the header. Is there a way to capture all the data?


In the example below, the extraction for Mol% will return
497  
75
74

but I need to extract
82.497
8.075
4.074


Answers

  • RandoCalrisian Posts: 182 admin
    The short answer is to make the column a Primary Column; this, of course, requires that values always be present under that header. Alternatively, if it's not a Primary Column, give the column a very specific value extractor and don't define the header extractor for that column. There's logic built into this extraction method that will understand the value and where it falls, and associate it with the correct column.

    If you need more details, I'd be glad to build a table extractor based on the image you submitted to show you how I'd do it.

    Lemme know...
    Randall Kinard
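To illustrate the "very specific extractor" idea outside of Grooper: a pattern that only accepts a complete decimal value cannot return a fragment such as 497. The sketch below is plain Python, not Grooper configuration, and the pattern itself is an assumption inferred from the three sample values in the question (one to three integer digits, a decimal point, exactly three decimals). Adjust it to the real data before relying on it.

```python
import re

# Hypothetical pattern, inferred only from the sample values 82.497, 8.075, 4.074:
# 1-3 integer digits, a decimal point, exactly 3 decimal digits.
# The lookarounds reject matches that start or end in the middle of a longer number.
MOL_PCT = re.compile(r"(?<![\d.])\d{1,3}\.\d{3}(?!\d)")


def find_values(text):
    """Return every complete decimal value found in the text."""
    return MOL_PCT.findall(text)


# Complete values match; truncated fragments do not.
print(find_values("82.497"))
print(find_values("497"))
```

Because the fragment no longer satisfies the pattern, a truncated capture shows up as a missing value instead of a silently wrong one.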

  • tcoates Posts: 7
    I would appreciate the help. This one has the value-width challenge, plus missing values within columns. It would be good to see how you would approach it. On a tangent, I was also wondering if we could leverage the lines (boxes) on our forms to help with the extraction. It might not help with this specific table, but I know we talked about something similar in the past. Thanks, and I look forward to learning from your extractor.
  • RandoCalrisian Posts: 182 admin
    Okay, I'll build something from what you sent. It'll work for what is shown, and you'll have to let me know how/if it doesn't work on your end.
    Gimme a minute to build something...

    Yes, lines can absolutely be leveraged to assist with extraction. If you have an example of a table using lines, or something else in general, feel free to ask about it in another topic and I'll try to help by showing.
    Randall Kinard

  • RandoCalrisian Posts: 182 admin
    edited April 2019
    Okay, finally got a chance to build something out, and the approach is interesting.
    I'd like to write up the explanation, but it'll take a moment. In the meantime, I'll attach a Grooper ZIP file that you can import into your environment.
    Let me know when you take a look.


    Randall Kinard

  • RandoCalrisian Posts: 182 admin
    I went ahead and mocked the previous table up with lines to demonstrate how useful lines can be.
    Attached is a Grooper ZIP file with a Content Model and Batch object.  Open it up and poke around to see how I built it.  I'll also make a write up for it.
    A brief explanation: because there are lines present in the table, a new feature on Data Types known as Post Processing•OCR Reader allows the green "box" drawn around a value to expand to the detected "table box" around the value. The lines are found either on the fly each time OCR Reader is run (slower, uses more compute), or from metadata stored on the page object when an IP Profile with Line Removal is run during an OCR step (faster, since the lines are detected once and reused each time line detection is needed). This will change slightly in an upcoming version of Grooper, which will allow you to Recognize page layout information, like lines, without running an OCR Profile with an associated IP Profile.
    Anyway, because the value of a header, like GPM, expands to its box container, a very simple InferGrid table can be created. This is a much simpler setup than the previously attached file, but obviously only usable when lines are present. Without the lines and the ability to expand to the header box, the columns for GPM and even Mol% would be drawn only around the header values (Mol% and GPM) and would end up truncating the desired values underneath those headers.


    Randall Kinard
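The truncation and the box expansion described above can be modeled with a few lines of geometry. This is only a conceptual sketch in Python, not Grooper's implementation: the `Span` type, the `expand_to_box` and `clip_text` helpers, the fixed character width, and every coordinate are hypothetical values chosen so that the header-only column reproduces the 497 result from the question.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Horizontal extent of a header or column, in arbitrary page units."""
    left: int
    right: int


def expand_to_box(header, line_xs):
    """Widen a header span to the nearest vertical line on each side.

    If no line exists on a side, that edge of the header is kept.
    """
    left = max((x for x in line_xs if x <= header.left), default=header.left)
    right = min((x for x in line_xs if x >= header.right), default=header.right)
    return Span(left, right)


def clip_text(text, start_x, char_w, col):
    """Keep only the characters that lie fully inside the column span."""
    return "".join(
        ch
        for i, ch in enumerate(text)
        if start_x + i * char_w >= col.left
        and start_x + (i + 1) * char_w <= col.right
    )


# Hypothetical layout: the value starts at x=10 with 10-unit-wide characters,
# while the header text "Mol%" only covers x=40..70.
header = Span(40, 70)
vertical_lines = [5, 85, 160]  # assumed x positions of detected table lines

print(clip_text("82.497", 10, 10, header))   # column drawn around the header text only
boxed = expand_to_box(header, vertical_lines)
print(clip_text("82.497", 10, 10, boxed))    # column expanded to the surrounding box
```

With the column limited to the header text, only the characters under the header survive; once the span is widened to the enclosing lines, the whole value falls inside the column.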
