Grooper 21.00.0082 is available as of 12-12-2023! Check the Downloads Discussion for the release notes and to get the latest version.
Grooper 23.00.0044 is available as of 06-20-2024! Check the Downloads Discussion for the release notes and to get the latest version.
Grooper 23.1.0024 is available as of 09-03-2024! Check the Downloads Discussion for the release notes and to get the latest version.
Using Fuzzy Matching in Data Rules
hjanum
Posts: 113 ✭✭
I am wondering if I can use Fuzzy Matching in Data Rules.
I am using data rules to remove certain values/prefixes in tables and it only works if the OCR is 100% correct, so I would like to use Fuzzy Matching rather than just traditional string compare.
I have a test table. The following is the empty table. Note the preprinted cell content.
Next here is the table with some user content.
I want to exclude the PrePrintedX values from the cell values, as it is not user input.
I set up a test table.
I do a raw extract of the table. The data validation is part of another test, so ignore that for now.
I define a simple data rule for Column 1 where I remove the PrePrintedX part of the cell value.
The rule uses simple string compare, and works to remove the prefix if it exists.
As you can see the simple rule works and we get the correct result.
We do however run into issues if the OCR has any imperfections. Say we read "PrePrintedI" instead of "PrePrinted1", or other variations on OCR imperfections.
I would like to use fuzzy matching/replacement instead of just traditional string compares in my Data Rule. Is this possible, or is there another way to go about this? I did use subtraction extractors initially
I am using data rules to remove certain values/prefixes in tables and it only works if the OCR is 100% correct, so I would like to use Fuzzy Matching rather than just traditional string compare.
I have a test table. The following is the empty table. Note the preprinted cell content.
Next here is the table with some user content.
I want to exclude the PrePrintedX values from the cell values, as it is not user input.
I set up a test table.
I do a raw extract of the table. The data validation is part of another test, so ignore that for now.
I define a simple data rule for Column 1 where I remove the PrePrintedX part of the cell value.
The rule uses simple string compare, and works to remove the prefix if it exists.
As you can see the simple rule works and we get the correct result.
We do however run into issues if the OCR has any imperfections. Say we read "PrePrintedI" instead of "PrePrinted1", or other variations on OCR imperfections.
I would like to use fuzzy matching/replacement instead of just traditional string compares in my Data Rule. Is this possible, or is there another way to go about this? I did use subtraction extractors initially
0
Answers
Fuzzy Matching is related entirely to extraction. Data Rules normalize data after extraction takes place.
My question would be can you use fuzzy matching to correct the data before passing it to a Data Rule?
Using your example "PrePrintedI" would match "PrePrinted1" with fuzzy regex enabled. If your pattern is literally "PrePrinted1", fuzzy matching will not just match the OCR data as is, but swap the "I" for a "1", resulting in the normalized value you want "PrePrinted1", allowing your Data Rule to function correctly.
Now, this can get trickier with numerical fuzzy matching as you might use a pattern like "PrePrinted\d". Obviously, Grooper can fuzzy match it. You'd just swap the "I" for a digit to satisfy the "\d" in the regex. But how does it "know" that "I" character is a "1"? You can get there with weightings. If you were to drop the weighting cost to swap an "I" for a "1", Grooper would not only make the match, but swap the "I" for a "1", resulting in the value you want "PrePrinted1".
Does this help answer your question? Let me know if you need further clarification.
Here are my column extractors. Only for 1 and 3 in this case.
Three cases.
1. Only the preprinted label
2. The empty field values (NA, N/A etc)
3. The preprinted label and a valid value
(note: I do not handle preprinted value and NA at this point)
The Only Label matches the preprinted label and replaces it with a period.
The empty field subtraction is a reference as it reused a lot.
We will get back to that.
The preprinted label and value looks like this.
It basically subtracts the preprinted label.
Here is the empty field substitution. It find s the empty markers such as NA and replaces them with a period.
Here is the NA example.
Then I have to define a data rule for each column.
If the value is a single period then I clear the cell value.
Here is the extraction of the table we saw first.
I now run the rule on column 1 and we see the values are blanked out.
This seems a lot of steps for something relatively simple.
It would be simpler if a subtraction extractor was allowed to subtract the whole value, or if in your output of a subtractor you could put an empty string somehow instead of the period (or another marker) where you then have to do a data rule to clear the field.
Bonus question: Is there any way to do a test extraction in Grooper Design Studio and have all the Data Rules applied. This would be nice as you are working on fields. The only way I found was to do the extraction and then go up to the data rule and then check one column at a time. It would be nice to be able to have the data rules run as you work on and test the data extraction.
So if the value is "NA" or "N/A" or "--" you want an empty value for the cell. My strong intuition is you're using the Grid Layout (previously known as Infer Grid) to extract your table data.
Let's start with some background. You've probably already encountered an issue where a Subtraction Extractor subtracting the whole string "NA", "N/A" or "--" doesn't get you what you want. This is because subtracting the entire text string in a result actually removes the result entirely, rather than returning a blank/empty value.
So, take this extractor extracting the line "What's up doc?"
If you set a Subtraction Extractor up to subtract part of the string, you'll get a mutated result. No surprises there.
But if you match the entire string, you won't get an empty result. Rather the entire result will be removed. Or in other words, you won't extract anything at all.
With Grid Layout specifically, it's designed to populate table cells with OCR text data once the grid has been established. It just extracts whatever the underlying text data is given the positional coordinates of the cells in the table.
While you can run an extractor against each cell's text data, if you're using a Subtraction Extractor that subtracts everything in the cell, you're essentially returning no result. With no extraction result, Grid Layout falls back on whatever the text data is, returning the OCR text. The current way Subtraction Extractors are designed, there is no way to say "If you match everything in the string return a result, but make it an empty value". Extractors return text, not no text.
Which is why you've probably gone the empty field string replacement route, swapping those values ("NA" N/A" and "--") for a period character.
I have three possible solutions to simplify things.
Option 1) Instead of replacing with a period, replace with a space. Then trim whitespace. No need for Data Rules
I have not tested this robustly, but I think this should work.
Again, the problem with using a Subtraction Extractor to subtract everything is you end up with no result whatsoever. Not an empty result. But I think I have a workaround.
Instead of entering a period in your Field Empty Substitution child extractors, enter a single space character.
On your parent Data Type, find the "Result Options" properties under "Output".
Press the ellipsis button to bring up the Results Options property panel. Then, under "Processing Options" set "Trim Whitespace" to "True".
This will delete that space character you used in the Output Format, resulting in empty/blank results.
This is a bit of a workaround but should functionally accomplish what you want. And, you don't need to touch Data Rules to clear out the values. They should be empty at this point.
You should test this for yourself though. The theory is sound, but I haven't 100% verified it.
Option 2) Get rid of your substitution extractors. Just use Data Rules.
If there is a known list of empty field values (ie "NA", "N/A" or "--"), there's no need to match them and replace them with a period. You can just alter your Trigger condition a bit.
See here, I've altered your Trigger a bit, using "Col1 = "NA" OR Col1 = "N/A" OR Col1 = "--""
Now, I'm assuming there's a fuzzy problem with these specific empty field values, such that it's not predictable the string values will always exactly be "NA", N/A" or "--". While you can't use fuzzy in Data Rules, you can use the Match function.
If (and I understand this is a big "if") you can craft a pattern that will match bad OCR permutations (and not match whatever the valid data in that column is), the Match function will also work.
Obviously, this pattern matches the clean text versions of these empty values, but you can at least verify the expression works in the screenshot below.
Option 3) Use Tabular Layout
The Tabular Layout method was added in version 2021 to (among other things) improve on the Header-Value method (and to some extent Infer Grid/Grid Layout).
Unless you absolutely must use Grid Layout for some reason, I would encourage you to explore Tabular Layout as a new way of doing what you want to do.
This table extraction method has a variety of "Secondary Extract Modes" that may achieve your intended goals without using Data Rules at all.
While I can't go too deep into detail on how to set up Tabular Layout here. A good place to start is the Tabular Layout article on the Grooper Wiki, linked below. You'll want to pay close attention to the "Primary VS Secondary Extraction" section for your use case.
https://wiki.grooper.com/index.php?title=Tabular_Layout_-_2021
BONUS QUESTION: You can, but be careful.
On Data Models, Data Tables, and Data Sections, you will find a property called "Validate Rule" under the "Behavior" section.
This will execute the selected Data Rule at extraction time. The rule I assigned here is the example I used earlier. Upon testing extraction, you can see the ultimate end results. Not just the extracted table results, but the extracted results after the Data Rule is applied.
BUT BEWARE! I say be careful because this may or may not be what you want to happen as far as your actual Batch Process goes. Whenever the Extract step executes, it will execute whatever Data Rules you've assigned to the Validate Rule property too (just like it did when you tested extraction). If you need to execute the rule after Extract runs, you'll need to make sure that property is blank by the time your Data Model is ready for production (and use the Apply Rules activity instead).
I hope this was helpful and informative. Let me know if anything needs further clarification.
Option 1:
The suggestion was to replace the filler values NA, N/A etc with a space character instead of the period I was using. You could then use the 'trim' function to get rid of the space and then end up with the blank value. I had tried this earlier and ran into issues. Let me show you what happens.
First here's my table extraction with nothing special applied:
I then do the substitution in the Pattern Match. Note the single space in the Output Format.
Here is the table after the Output Format has been applied. Note the single space in the value I have highlighted in blue.
The same happened in the last row in the table.
This means that the substitution works.
Funny thing happens though if I look at the field inspector. It does not show a space, but an empty string. This is the first column in the second row (the one that said PrePrinted2).
I now add the Trim to the Processing Options.
The problem is that when the Trim removes the space Grooper 'panics' and replaces the value with the original field value. The same result as when using the subtraction extractor.
I really think that this is a bug. Also I am still not sure why the subtraction extractor is not allowed to remove the entire value... if this is really what I want.
Option 2:
Using just a string compare is not really an option for us. We need some type of fuzzy matching as we receive scanned PDF's in our application. They are typically not of great quality as they have been printed/scanned sometime multiple times. I can do a faux fuzzy matching by doing regex substitutions. I do this in another application downstream, but it is not as reliable as Groopers fuzzy matching. Here's an example of my regex substitions:
Option 3:
I have been playing with the tabular layout, but it does not really seem like it adds a lot of functionality for the case I am trying to solve. I still have the substitutions to deal with and my filler values.
Bonus Question:
I like the idea of configuring the Validation Rule to be the Data Rule that does the substitution, but the problem is that I will have one Data Rule per column, so in my test case that is four data rules, and I can only configure one at a time. This means I cannot see the entire table extracted and all relevant rules applied while working on it.
I found another way to do it. I an configure scripting at the data model layer to do the substitution. This way I have a few lines of code for my entire data model for every table rather than having to configure one rule (that I cannot effectively test) for every column in ever table in my data model.
Here is the script:
The sub CheckAndFixFillerValue does the substitution. I fuzzy matched all preprinted labels and filler values with 'NA'. I can then check using a simple string compare and replace with a blank value. This blank value is NOT replaced with the entire cell content and everything works.
Here is the original extraction again:
Here is the end result when I extract the data model.
The only thing that does not work is unit test on just the table. When I do that the script does not run as it is on the data model. I have to extract the entire data model for it to run.
See below the extractor I will use. It's a Data Type extractor, using a simple Pattern Match to match the invalid cell values. The Output Format is set to a single space character. The Trim Whitespace property is enabled in the Data Type's Result Options. I'm getting 5 empty results for the invalid values.
Here's the table results without the extractor assigned. I'm using Grid Layout with no configuration besides the X Axis Extractor.
Here's the results with the extractor assigned. I can confirm the blank spaces are truly blank. There isn't a space character hiding in there. I am not seeing your issue where Grooper reverts back to the OCR text. Let me know what version you're using, and I'll see if there's any difference between versions.
Bonus Question:
You can nest child Data Rules under a parent Data Rule. Like so:
In this case the parent "Removal Test" only has its "Scope" property configured. No "Trigger" or "True Action" is defined. When a parent Data Rule has no "Trigger" defined, it always evaluates to "True", meaning all children Data Rules in the hierarchy will then execute. This is how you can execute multiple Data Rules for your Data Table using the Data Table's "Validation Rule" property.
One rule. All three child Data Rules applied during testing.
Grooper 2022.00.0007.
Did not know you could nest the Data Rules. That is nice!
I wish you could have the Data Rules under the object they are working on (in our case a table) instead of all the way up in the Data Model hierarchy, but I can live with that.
Thanks for all of your help.
FYI: Data Tables and Data Sections have a Validate Rule property too. Scroll down towards the end of the property grid. It should be the second to last property.