Request for Examples and Details on Natural Language Processing (NLP)
Are there more examples of the natural language processing concept within Grooper? I have used FuzzyRegEx and zones for extraction where values are near features. In one scenario we have a list of names and need to minimize false-positive identifications. Using a Lexicon with Fuzzy matching helps, but results would improve if we could help Grooper understand the context around a match: verifying that it is not embedded in a longer word, and that it is not joining multiple words simply because their combined length matches an entry. We do not always have features or anchor words in the text flow to work from.
In general, I find the concept of NLP difficult to explain and to differentiate from flow or pattern matching. Anyone else have thoughts or ideas?
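To make the embedded-match problem concrete outside of Grooper: in plain regex terms, a bare substring search will happily match a name inside a longer word, while word boundaries reject those hits. (This is a generic Python illustration, not Grooper syntax; the sample text and name are made up.)

```python
import re

# A bare substring test matches "rob" inside "Roberts" and "robotics",
# producing exactly the false positives described above.
text = "Contact Rob Roberts about the robotics proposal."
print("rob" in text.lower())  # True, but for the wrong reasons

# \b word boundaries only accept the standalone name.
hits = re.findall(r"\brob\b", text, re.IGNORECASE)
print(hits)  # only the standalone "Rob"
```

The difficulty in our scenario is that fuzzy matching loosens exactly the character-level strictness that makes `\b` reliable, which is why some notion of surrounding context is needed.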
Answers
To answer the specific scenario above: we are introducing some new properties in Fuzzy matching that will assist with this. In Grooper 2.6, you cannot tell fuzzy matching how to behave in terms of least cost vs. best match. Let me explain with an example:
Say the word "properties" appears on your document, but there is a typo on one character, so the OCR result reads "properlies".
You are using a fuzzy match list with a lexicon that contains "prop" and "properties". In this scenario, properties would be the best match, since it is greater in length and is more closely related to what is on the document. That's ultimately what you would want it to find in this scenario. However, finding "prop" out of the OCR string would be a perfect match, letter for letter, and thus it has the least cost. You currently have no configuration that lets you decide whether you'd prefer the best match or the least cost solution. In Grooper 2.7, you will be able to decide for every Data Pattern.
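The least-cost vs. best-match tension can be sketched with a plain edit-distance calculation (a generic Python illustration of the concept, not Grooper's internal algorithm; the variable names and the exact scoring rules are my own stand-ins):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

ocr = "properlies"                 # OCR token with one bad character
lexicon = ["prop", "properties"]

# Least cost: "prop" matches the first four OCR characters letter for
# letter (cost 0), while "properties" needs one substitution (cost 1).
least_cost = min(lexicon, key=lambda w: levenshtein(w, ocr[:len(w)]))

# Best match: score each entry against the whole token, so the longer,
# more complete entry wins despite its nonzero cost.
best_match = min(lexicon, key=lambda w: levenshtein(w, ocr))

print(least_cost)  # "prop"
print(best_match)  # "properties"
```

This is the choice the new per-Data-Pattern setting exposes: whether the cheapest partial match or the most complete overall match should win.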
However, there is a bit of a trick that you could use today. Rather than using fuzzy lists, you could use normal regex with the pattern [^\t\f]+ and with tab marking enabled and properly configured to your sample set. You would then bounce this off your lexicon and set a fuzzy percentage in your pattern's lookup options. This would force your pattern to first find complete text segments, then compare each whole segment to the lexicon, so it always takes the more complete result.
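The segment-first trick above can be sketched in Python using the standard library's similarity ratio as a stand-in for Grooper's fuzzy percentage (the threshold value and sample line are assumptions for illustration, not Grooper configuration):

```python
import difflib
import re

lexicon = ["prop", "properties"]
threshold = 0.85  # stand-in for the pattern's fuzzy percentage

line = "properlies\tvalue"
matches = []

# Step 1: grab whole segments between tabs/form feeds, per [^\t\f]+.
for segment in re.findall(r"[^\t\f]+", line):
    # Step 2: fuzzy-compare each COMPLETE segment to the lexicon, so a
    # short entry can no longer win by matching a fragment exactly.
    best, score = max(
        ((entry, difflib.SequenceMatcher(None, segment, entry).ratio())
         for entry in lexicon),
        key=lambda pair: pair[1],
    )
    if score >= threshold:
        matches.append((segment, best))

print(matches)  # [('properlies', 'properties')]
```

Because "properlies" is compared as a whole, "properties" scores far higher than "prop", and the unrelated segment "value" falls below the threshold entirely.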
A last approach involves making a Data Type that accurately finds all of the field labels on your standardized forms, then using it as the features data type for all field classes in the Data Model. I had already planned to do a full writeup on that technique in the next week or so; I'll see if I can knock it out today.
PM me with your phone number if you'd like a call today to discuss in further detail. I have a quick meeting with our team at 10:00, then I am available for most of the rest of the day.
Product Manager
mharrison@bisok.com