Header - Value Table Extractor (Values Extend Beyond Headers)
I ran into a table where the values are wider than the header and start to the left of the header text. When using a Header - Value extractor, the extracted values only include the portion directly under the header. Is there a way to capture all the data?
In the example below, the extraction for Mol% returns:
497
75
74
but I need to extract:
82.497
8.075
4.074

Answers
If you need more details, I'd be glad to build a table extractor based on the image you submitted to show you how I'd do it.
Lemme know...
Gimme a minute to build something...
Yes, lines can absolutely be leveraged to assist with extraction. If you have an example of a table using lines ... or something else in general, feel free to ask about it in another topic and I'll try to help by showing.
I'd like to write up the explanation, but it'll take a moment. In the meantime, I'll attach a Grooper ZIP file that you can import into your environment.
Let me know when you take a look. In the meantime, I'll be preparing a write up for this approach.
Attached is a Grooper ZIP file with a Content Model and Batch object. Open it up and poke around to see how I built it. I'll also make a write up for it.
A brief explanation: because there are lines present in the table, a new feature on Data Types known as Post Processing•OCR Reader allows the green "box" drawn around a value to expand to the detected "table box" around the value. The lines are found in one of two ways: either on the fly, processed each time OCR Reader runs (slower, uses more compute), or as metadata stored on the page object when an IP Profile with Line Removal is run during an OCR step (faster, because the lines are detected once and reused each time line detection is needed). This will change slightly in an upcoming version of Grooper, which will allow you to Recognize page layout information, like lines, without running an OCR Profile with an associated IP Profile.
Anyway, because the box for a header like GPM expands to its containing table box, a very simple InferGrid table can be created. This is a much simpler setup than the previously attached file, but it is obviously only usable when lines are present. Without the lines, and the ability to expand to the header box, the columns for GPM and even Mol% would be drawn only around the header text itself (Mol% and GPM) and would end up truncating the desired values underneath those headers.
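For anyone who wants to see the geometry behind this, here is a rough Python sketch of the idea. It is not Grooper code and does not use any Grooper API. The names (`Span`, `clip_text_to_span`, `expand_to_lines`), the fixed character width, and every coordinate are made up for illustration. It only shows why a column that is as wide as the header text cuts off the values, and why growing that column out to the nearest detected vertical lines recovers them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A horizontal extent on the page, in arbitrary units (hypothetical)."""
    left: int
    right: int


def clip_text_to_span(text, text_left, char_width, span):
    """Keep only the characters that sit fully inside the span horizontally.

    Assumes a fixed character width, which real OCR output does not guarantee.
    """
    kept = []
    for i, ch in enumerate(text):
        ch_left = text_left + i * char_width
        ch_right = ch_left + char_width
        if ch_left >= span.left and ch_right <= span.right:
            kept.append(ch)
    return "".join(kept)


def expand_to_lines(span, vertical_lines):
    """Grow the span to the nearest vertical line on each side, if one exists."""
    lefts = [x for x in vertical_lines if x <= span.left]
    rights = [x for x in vertical_lines if x >= span.right]
    return Span(max(lefts) if lefts else span.left,
                min(rights) if rights else span.right)


# Made-up layout: the "Mol%" header text spans x=135..175, but the values
# underneath start further left, at x=112, with 10-unit-wide characters.
header = Span(135, 175)
values = ["82.497", "8.075", "4.074"]
value_left, char_width = 112, 10

# Column drawn only around the header text: the values get truncated.
truncated = [clip_text_to_span(v, value_left, char_width, header) for v in values]

# Column expanded to the table lines at x=100 and x=200: full values survive.
cell = expand_to_lines(header, [100, 200, 300])
full = [clip_text_to_span(v, value_left, char_width, cell) for v in values]

print(truncated)
print(full)
```

With these made-up coordinates, the header-width column yields the same truncated pieces as in the question, and the line-expanded column yields the full values. If there are no lines on either side, `expand_to_lines` returns the original span, which matches the point above that this approach only helps when lines are present.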