
Search Phrase Context

I am looking for a solution to a configuration challenge I am having. Users would like to see the context of the findings that we have. We have certain words/phrases we are looking for in a document. There will typically be multiple hits in a document and users want to quickly review the findings and their context. To illustrate this look at the example below.

The yellow text is the finding and the green text is the context.


In Grooper this would be displayed as a table.


The solution is pretty simple if I only want one word before/after the finding, but I am trying to make it more flexible to include several words or characters before/after the finding. 

Here is the solution for a single word before/after the finding. We can easily add cases for the beginning/end of the document where there are no leading/trailing matches.


The "Row Match Before/After" is a datatype set to an Ordered Array with Flow Layout.


The leading/trailing datatypes are just word selectors.


The solution works because the leading/trailing datatypes find every word in the document and thus we have an ordered array of three elements in flow layout.


Let's say though that I wanted 10 characters before/after the search phrase. It will not work if I do a simple select 10 characters.


The problem is that this expression selects characters 1..10 and 11..20 and 21..30, so unless we are lucky that the Search Phrase ends up on one of those boundaries it will not work.
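To make the boundary problem concrete, here is a minimal sketch in Python rather than Grooper. It assumes the "select 10 characters" expression is equivalent to `.{10}`, which is a guess since the original pattern is not shown; the sample text is made up.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Each match consumes its 10 characters, so the next match can only
# start where the previous one ended: offsets 0, 10, 20, 30.
chunks = [(m.start(), m.end(), m.group()) for m in re.finditer(r".{10}", text)]

# A chunk is usable as "leading context" only if it ends exactly
# where the search phrase starts.
phrase_start = text.index("fox")
usable = [c for c in chunks if c[1] == phrase_start]

for start, end, value in chunks:
    print(start, end, repr(value))
print("chunks ending at the phrase:", usable)
```

Here "fox" starts at offset 16, which is not a multiple of 10, so no chunk lines up with it.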


Ideally, I would want to select characters 1..10 and 2..11 and 3..12 so I have a sliding window of 10 characters. It will be a lot of hits, but it should cover every case (except beginning and end of the document, but those are solvable).

This can be done by using a positive lazy look ahead as follows:


Unfortunately, when I run that RegEx within Grooper I don't get any hits in the text. Did I hit a RegEx feature that Grooper does not support, or is it because they are all overlapping, so they are discarded?
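For comparison, this is how a capturing lookahead behaves in Python's `re` module. It is only a sketch: the pattern `(?=(.{10}))` is an assumed stand-in for the expression above, and it says nothing definite about how Grooper's engine treats the same pattern.

```python
import re

text = "abcdefghijklmno"  # 15 characters

# The lookahead itself matches nothing (zero width); the 10-character
# window is only available through capture group 1.
matches = list(re.finditer(r"(?=(.{10}))", text))

windows = [(m.start(), m.group(1)) for m in matches]
full_matches = [m.group(0) for m in matches]

for start, window in windows:
    print(start, window)
```

The windows start at offsets 0 through 5, one character apart, which is the sliding behaviour wanted. Note that every full match is an empty string, so a tool that reports only the full match and ignores capture groups would show no usable hits. Whether that is the cause inside Grooper is not confirmed here.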

Is there another/better way of solving this problem?

Answers

  • jclark Posts: 60 ✭✭✭
    @hjanum
    Hello,
    For your circumstance would it be possible to simply use named groupings within a single Data Type for the Search Phrase and the Context?
    Please see below for an example:



    Output example showing Search Phrase Keyword as "Number" with the Context showing up to 10 characters to the left and right of the Search Phrase.


    The table Search Phrase column would have a sub-element name set from the Search Phrase Data Type  Value Extractor. The Context column would have the entire Search Phrase Data Type set as the value extractor to get all named groups collected.
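The named-group idea can be sketched outside Grooper as follows. The group names `before`, `phrase`, and `after` and the sample text are assumptions for illustration; the actual pattern from the example is not reproduced here.

```python
import re

# Up to 10 characters on each side of the keyword, all in one pattern.
pattern = re.compile(r"(?P<before>.{0,10})(?P<phrase>Number)(?P<after>.{0,10})")

text = "Invoice Number 12345 was issued; see Account Number 987 for details."

# One row per hit: the named group feeds the Search Phrase column,
# the whole match feeds the Context column.
rows = [(m.group("phrase"), m.group(0)) for m in pattern.finditer(text)]

for phrase, context in rows:
    print(phrase, "|", context)
```

One limitation of this approach: the context is consumed by the match, so a second keyword that falls inside the trailing context of the first would not be reported as its own hit.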


    Please let us know if this is helpful or if there is more information available to understand in greater detail what is trying to be accomplished.
    Thank you
  • hjanum Posts: 110 ✭✭
    edited September 2020
    Thank you for your answer.

    I did play around with that, but just starting out on the project we have 34 search terms that we are looking for (plus variants). I think that we would be adding more complex things that we are looking for. One thing we would be adding would be looking for PII indicators in documents. A home phone number would be considered PII by us.

    In this case, we would be looking for something along the lines of a number that looks like a phone number and words like "home/private/cell" within a certain distance of the phone number left/over/under/right. This would be pretty easy to do in Grooper, but the regex above would not work for that.

    I am wondering if I could just find the hits of the "search phrase" using Grooper data types and then assign the context using scripting. That would leave me free to use Grooper's advanced search features to find the hits and then programmatically get the context.
  • jclark Posts: 60 ✭✭✭
    Another possibility may be to use a key value pair, or a key value list, where you find your search phrase hits and look around that for the context. 