RegEx 101

jongoesboom

Introduction

This tutorial will teach you the basics of using regular expressions in Grooper. A regular expression is a pattern that describes the text you want to find. The most basic regular expression matches a literal text string: the string happy is a perfectly valid regular expression that will match the word happy in a block of text. Regular expressions can also get quite complicated. As an example, the string \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b is also a valid regular expression that will find an email address.

Definitions

Before we go any further, there are a few terms that we will be using frequently. Here are the definitions:

Literal
A literal is any character in a search or matching expression that stands for itself. For example, to find ind in the string "windows", ind is the literal string: each character plays a part in the search, and it is literally the string we want to find.
Metacharacter
A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression, for example, the caret or circumflex character (^) is a metacharacter.
Regular expression or regex
This term describes the expression that we will be using to search our target string, that is, the pattern we use to find what we want.

Following Along

If you have Grooper installed, you can follow the directions below to get set up. If Grooper is not yet installed, you can go to https://regexr.com/3huj0 to follow along.

Included is a zip file that you will need to import into Grooper to follow along with the tutorial. This zip file includes a Batch in the Test folder that has already been rendered with text extracted. It also includes a Data Type called RegEx Tutorial in the Data Extraction > Data Types folder.

If you need help importing, please see the Importing and Exporting into Grooper article.

Literals

The most basic regular expression consists of a single literal character. It can be as simple as the letter a. This will match all occurrences of the letter a in a block of text.

There are special metacharacters in regular expressions: the backslash (\), the caret (^), the dollar sign ($), the period or dot (.), the vertical bar or pipe symbol (|), the question mark (?), the asterisk or star (*), the plus sign (+), the opening parenthesis ((), the closing parenthesis ()), the opening square bracket ([), and the opening curly brace ({). These metacharacters have special usage, which we will see as we proceed, and can produce errors when used as literals.

If you want to use any of these characters as a literal in a regular expression, you can escape them with a backslash. For example, if you want to match 1+1=2 in a string of text, the regex would be 1\+1=2. Using the + without the backslash has a completely different meaning (see Quantifiers).
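The effect of escaping can be sketched outside Grooper. The snippet below uses Python's re module purely for illustration (Grooper has its own pattern editor and engine), and the sample sentence is made up for this example.

```python
import re

# Hypothetical sample text, not taken from the tutorial batch.
text = "We all know that 1+1=2, but 11=2 is wrong."

# Unescaped: + is a quantifier, so this means "one or more 1s, then 1=2".
unescaped = re.findall(r"1+1=2", text)

# Escaped: \+ is a literal plus sign.
escaped = re.findall(r"1\+1=2", text)

# re.escape() adds the backslashes for you when the input is meant literally.
auto_escaped = re.escape("1+1=2")

print(unescaped)  # ['11=2']
print(escaped)    # ['1+1=2']
print(re.fullmatch(auto_escaped, "1+1=2") is not None)  # True
```

The unescaped pattern silently matches the wrong text instead of raising an error, which is why forgetting the backslash can be hard to spot.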

  1. Go to the RegEx Tutorial Data Type in the Data Extraction folder.

  2. To the right of the Grooper Tree Pane you will see the properties pane.

  3. Click on Pattern under Data Extraction and then click on the ellipsis (...) on the right-hand side to open up the pattern editor.

  4. Once in the Pattern Editor, select the RegEx Tutorial batch from the Batch dropdown.

  5. With the RegEx Tutorial Batch selected, select Document (1) in the Batch Viewer. You should see the document in the Page Viewer to the right.

  6. Click on the Text tab to see the text that was extracted from the document. You may notice the pair of \r\n characters at the end of every line. These are carriage return (\r) and new line (\n) characters that the OCR engine has inserted at the end of every line.

  7. In the Value Pattern box enter abc. You should have 2 results in the results window at the bottom and in the Page Viewer you should see the first 3 characters on the first 2 lines highlighted.

  8. The regular expression abc will match the literal characters abc when found in any block of text on the document.

  9. Literals also work with numbers. Erase abc and enter 123 into the Value Pattern and you should see 4 results. Scroll through to see which strings had a match.

  10. Enter different combinations of strings in the Value Pattern window to see what matches and what does not match.

  11. When you are done click Cancel to exit the Pattern Editor without saving changes.

Character Classes

Next, we will go over character classes. A character class matches only one out of several characters. Simply place the characters you want to match between square brackets. For example, to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. A character class matches only a single character. gr[ae]y does not match graay or graey because there are now 5 characters and it is only looking for 4 characters. The order of the characters inside a character class does not matter.

  1. Go to the RegEx Tutorial Data Type and open the Pattern Editor by clicking on Pattern and then clicking the ellipsis on the right-hand side.

  2. Select Document (1) in the Batch Viewer panel at the bottom left.

  3. In the Value Pattern box enter [abcdefg]. You will notice that it matches every single occurrence of the characters inside the brackets.

  4. Erase the [abcdefg] and replace it with [1234567890]. You can see that it now matches every digit in the document.

  5. Erase [1234567890] and enter [nrt]apping. You will see that it finds matches for the words napping, rapping, and tapping.
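As a rough illustration of the gr[ae]y example, here is a small Python sketch. Python's re module stands in for Grooper's pattern editor, and the sample sentence is invented for the example.

```python
import re

# Hypothetical sample text.
text = "The sky was gray, the sea was grey, but graay and graey are typos."

# A character class matches exactly one character from the set.
matches = re.findall(r"gr[ae]y", text)

# The order of characters inside the class does not matter.
reordered = re.findall(r"gr[ea]y", text)

print(matches)    # ['gray', 'grey']
print(reordered)  # ['gray', 'grey']
```

The misspellings graay and graey are skipped because the class consumes only one character between gr and y.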

Ranges

You can use a hyphen inside a character class to specify a range of characters. Instead of having to enter [1234567890], you can use [0-9] to match all single digits, or [A-Z] instead of [ABCDEFGHIJKLMNOPQRSTUVWXYZ] to match all single letter characters. You can use more than one range inside a single character class. [0-9A-Z] matches all single alphanumeric characters. You can combine ranges and single characters so that [0-9X] will match all single digits or the letter X.

  1. With the Pattern Editor still open, delete anything that is currently in the Value Pattern box and enter [A-Z]. All 425 matching characters will show up in the results window at the bottom, but not all of them are highlighted. This is because Grooper only highlights the first 260 results to help limit memory consumption.

  2. The Value Pattern box in Grooper is case-insensitive by default, so entering [A-Z] or [a-z] will yield the same results. However, this can be changed on the Properties tab if needed.

  3. Click on the Properties tab and change the Case Sensitive property to True. You can see that the results immediately change to return only capital letters.

  4. Click back on the Pattern Editor tab and change [A-Z] to [a-z0-9]. It now returns all lowercase alpha characters and all numeric characters. Again, it will only highlight the first 260 results in the Image Viewer, but notice in the Results pane on the bottom it has 506 results.

  5. Go back to the Properties tab and set Case Sensitive back to False.
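Ranges and case sensitivity can also be tried in a few lines of Python. This is only a sketch: Python's re module is case-sensitive by default (the opposite of the Grooper default described above), so the re.IGNORECASE flag plays the role of setting Case Sensitive to False. The sample strings are made up.

```python
import re

# Hypothetical sample text.
text = "Order A7 shipped to Unit 12B"

# Case-sensitive: only capital letters match.
upper_only = re.findall(r"[A-Z]", text)

# Case-insensitive: [A-Z] now behaves like [A-Za-z].
any_case = re.findall(r"[A-Z]", text, flags=re.IGNORECASE)

# A range combined with a single character.
digits_or_x = re.findall(r"[0-9X]", "X-ray 42")

print(upper_only)     # ['O', 'A', 'U', 'B']
print(len(any_case))  # 20
print(digits_or_x)    # ['X', '4', '2']
```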

Shorthand Character Classes

Since certain character classes are used often, a series of shorthand character classes are available.

Shorthand Value   RegEx Equivalent   Matches
\d                [0-9]              All digits 0 to 9
\w                [A-Za-z0-9_]       All alphanumeric characters (a to z in either case, 0 to 9) and the underscore
\s                [ \t\r\n\f]        Space, tab, carriage return, line break, or form feed

Negations

Typing a caret (^) after the opening square bracket negates the character class. The result is that the character class matches any character that is not in the character class. For example, to match all characters that are non-numeric, you would enter [^0-9]. This will return all alpha, whitespace, and punctuation characters in the text.

Negations also have some default shorthand values. These are simply the regular shorthand values but capitalized.

Shorthand Value   RegEx Equivalent   Matches
\D                [^0-9]             Anything not a digit 0 to 9
\W                [^A-Za-z0-9_]      Anything not an alphanumeric character (A to Z, 0 to 9) or the underscore
\S                [^ \t\r\n\f]       Anything not a space, tab, carriage return, line break, or form feed

  1. Delete anything already in the Value Pattern window and enter [^a-z0-9]. This will select all punctuation, space, and line break characters; in other words, anything that is not an alphanumeric character.

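A short Python sketch of negated classes and their shorthand forms follows. It assumes Python's re module behaves like Grooper's engine for plain ASCII text, which may not hold for every character; the sample strings are invented.

```python
import re

# Hypothetical sample text.
text = "Room 42, Floor 3!"

# A negated class and its shorthand give the same result on this text.
non_digits = re.findall(r"[^0-9]", text)
non_digits_short = re.findall(r"\D", text)

# Case-insensitive [^a-z0-9]: everything that is not a letter or digit.
non_alnum = re.findall(r"[^a-z0-9]", text, flags=re.IGNORECASE)

# \w also covers the underscore, so \W skips it while [^A-Za-z0-9] does not.
shorthand = re.findall(r"\W", "a_b-c")
explicit = re.findall(r"[^A-Za-z0-9]", "a_b-c")

print(non_alnum)  # [' ', ',', ' ', ' ', '!']
print(shorthand)  # ['-']
print(explicit)   # ['_', '-']
```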

Groups

By placing part of a regular expression inside parentheses, you can group that part of the regular expression together. A group by itself does not have much impact on the regular expression. When it is combined with alternation (which you will see shortly) and quantifiers it becomes extremely useful.

Only parentheses can be used for grouping. Square brackets define a character class, and curly braces are used by a quantifier with specific limits.

Alternation

You can use the vertical bar or pipe (|) character to match any one of a series of patterns, where the | character separates each pattern. For example, to match the word dog or cat in a string you could enter dog|cat and it would search for either word in the string.

  1. Delete anything already in the Value Pattern window and enter abc|xyz|123.

  2. All instances of any one of those 3 strings have been returned.

  3. You can also use alternation with groups. Delete abc|xyz|123 and enter o(nce|ver|nly), and you can see that it now finds the words once, over, and only.

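To see why the group matters in o(nce|ver|nly), here is a minimal Python sketch. The sample sentence and the whole_matches helper are made up for illustration; the helper uses re.finditer because Python's re.findall returns only the group contents when a pattern contains a capturing group.

```python
import re

# Hypothetical sample text.
text = "Only once did the dog jump over the cat."

def whole_matches(pattern, s, flags=0):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, s, flags)]

# Plain alternation: either word.
animals = whole_matches(r"dog|cat", text)

# Grouped alternation: the leading o applies to all three endings.
grouped = whole_matches(r"o(nce|ver|nly)", text, re.IGNORECASE)

# Without the group, the o belongs only to the first alternative.
ungrouped = whole_matches(r"once|ver|nly", text)

print(animals)    # ['dog', 'cat']
print(grouped)    # ['Only', 'once', 'over']
print(ungrouped)  # ['nly', 'once', 'ver']
```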

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found. The following table lists the quantifiers supported in Grooper.

Quantifier   Description
*            Match the previous character, group, or character class zero or more times.
+            Match the previous character, group, or character class one or more times.
?            Match the previous character, group, or character class zero or one time.
{n}          Match the previous character, group, or character class exactly n times.
{n,}         Match the previous character, group, or character class at least n times.
{n,m}        Match the previous character, group, or character class from n to m times.

  1. Delete anything already in the Value Pattern window and enter http(s)?. This looks for the literal string http followed by an optional s. The ? after the (s) matches the group 0 or 1 time.

  2. Delete http(s)? and enter \d+. This looks for a digit and then matches as many additional consecutive digits as it can.

  3. Notice that this grabs all single and consecutive digits.

  4. Delete the \d+ and enter \w+. This does the same thing, but now grabs all single and consecutive word characters (a-z, 0-9, and the underscore).

  5. Delete \w+ and enter \d{3}. This matches exactly 3 digits in a row. A longer run of digits still produces a match, but only 3 digits of it are returned at a time.

  6. Change the \d{3} to \d{2,4}. It is now looking for groups of digits that are from 2 to 4 characters long.

  7. You may notice that the first result is 1234 and not 12 or 123. This is because quantifiers are greedy: they grab the longest string they can.
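The quantifier steps above can be reproduced on a tiny string. This is a sketch in Python's re module, not Grooper itself; the sample text and the whole_matches helper are hypothetical.

```python
import re

# Hypothetical sample text.
text = "1234 56 7 https http"

def whole_matches(pattern, s):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, s)]

one_or_more = whole_matches(r"\d+", text)       # greedy: longest run of digits
exactly_three = whole_matches(r"\d{3}", text)   # exactly 3 digits per match
two_to_four = whole_matches(r"\d{2,4}", text)   # 2 to 4 digits, longest first
at_least_two = whole_matches(r"\d{2,}", text)   # 2 or more digits
optional_s = whole_matches(r"http(s)?", text)   # the s is optional

print(one_or_more)    # ['1234', '56', '7']
print(exactly_three)  # ['123']
print(two_to_four)    # ['1234', '56']
print(at_least_two)   # ['1234', '56']
print(optional_s)     # ['https', 'http']
```

Note that \d{3} returns 123 from 1234 and leaves the 4 unmatched, while \d{2,4} takes all four digits because it prefers the longest match.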

Dot

In regular expressions, the dot or period is one of the most commonly used metacharacters. The dot matches a single character, without caring what that character is. The dot is a very powerful regex metacharacter. However, it allows you to be lazy. Put in a dot, and everything matches just fine when you test the regex on valid data. The problem is that the regex also matches in cases where it should not match.

Delete anything already in the Value Pattern window and enter e. (the letter e followed by a dot). This returns every instance of the letter e together with whatever character follows it. You can see that this grabs sets of 2 letters, but it also grabs e followed by a space, and e followed by the carriage return at the end of a line.
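Here is a small Python sketch of how permissive the dot is. It assumes Python's default behavior, where the dot matches any character except the new line character \n; the sample text is made up.

```python
import re

# Hypothetical sample text with a \r\n line ending, like the OCR output.
text = "we see the\r\nend"

# The dot accepts a space and a carriage return as readily as a letter.
loose = re.findall(r"e.", text)

# A character class states what is actually wanted: e followed by a letter.
strict = re.findall(r"e[a-z]", text)

# By default the dot does not match the new line character itself.
before_newline = re.findall(r"e.", "e\n")

print(loose)           # ['e ', 'ee', 'e\r', 'en']
print(strict)          # ['ee', 'en']
print(before_newline)  # []
```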

Anchors

Anchors do not match any character at all. Instead, they match a position before, after, or between characters. They can be used to “anchor” your regex match at a certain position. The ^ matches the position before the first character of a content block. Similarly, $ matches right after the last character in a content block. In the test data, the ^ character will only match the position immediately before the a on the first line of the page, and the $ will only match the position immediately behind the 9 on the last line of the page.
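A minimal sketch of anchors in Python follows, with invented sample text. One assumption to keep in mind: in Python's default mode, $ also matches just before a single trailing new line at the end of the string, so the sample text deliberately has no trailing line break.

```python
import re

# Hypothetical two-line text block.
text = "abc 123\r\nabc 789"

# Without an anchor, abc matches on both lines.
everywhere = re.findall(r"abc", text)

# With ^, only the abc at the very start of the block matches.
at_start = re.findall(r"^abc", text)

# With $, only the digits at the very end of the block match.
at_end = re.findall(r"\d+$", text)

# An anchor on its own matches a position, so the match is zero-length.
start_pos = re.search(r"^", text)

print(everywhere)  # ['abc', 'abc']
print(at_start)    # ['abc']
print(at_end)      # ['789']
print(start_pos.start(), start_pos.end())  # 0 0
```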

Word Boundaries

The metacharacter \b is an anchor like the ^ and the $. It matches at a position that is called a "word boundary". This match is also zero-length. There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.

  • After the last character in the string, if the last character is a word character.

  • Between two characters in the string, where one is a word character and the other is not a word character.

Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".

\B is the negated version of \b, so it matches at any position that is not a word boundary.

  1. In the Value Pattern, enter \ba\b. This should only find a's that are immediately preceded and followed by a word boundary.

  2. Notice that it only found 3 results, all of which are the single “word” a.

  3. Go ahead and replace \ba\b with \Ba\B. The results are all the a characters that are in the middle of a word.

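The difference between \b and \B can be checked by looking at where each match starts. This Python sketch uses a made-up sentence rather than the tutorial batch.

```python
import re

# Hypothetical sample text.
text = "a cat sat on a mat near a banana"

# \ba\b: the letter a standing alone as a whole word.
whole_word = [m.start() for m in re.finditer(r"\ba\b", text)]

# \Ba\B: the letter a with a word character on both sides.
inside_word = [m.start() for m in re.finditer(r"\Ba\B", text)]

print(whole_word)   # [0, 13, 24]
print(inside_word)  # [3, 7, 16, 21, 27, 29]
```

The final a in banana appears in neither list: it has a word character before it but the end of the string after it, so one side is a boundary and the other is not.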

Whitespace

Whitespace characters are another type of character that will produce a match. While you can match them directly, they can also work like anchors. See below for the supported whitespace characters.

Syntax   Description
\s       All whitespace characters
\n       New line character
\r       Carriage return (end of line)
\t       Tab character
\v       Vertical tab
\f       Form feed character (page break)

  1. In the Value Pattern, enter \n\w+. This will match all words that occur at the beginning of a line.

  2. Notice that it did not grab the first line. This is because a \n character does not occur at the beginning of the document. To match that line as well you would change the regex to (^|\n)\w+ to say match the beginning of a content block, or a new line followed by a word.

  3. To grab all the words that occur at the end of a line, we can change the pattern to \w+\r.

  4. You can see that it grabbed all the "words" that occur directly before the carriage return. Lines that end in punctuation have a non-word character separating the word from the carriage return, so they will not produce a match.
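The line-start and line-end patterns from these steps can be sketched in Python as follows. The three-line sample text is hypothetical and uses \r\n line endings like the OCR output described earlier; whole_matches is an illustrative helper.

```python
import re

# Hypothetical text: the second line ends in punctuation, the last has no line break.
text = "first line here\r\nsecond line ends.\r\nthird one"

def whole_matches(pattern, s):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, s)]

# \n\w+ misses the first line because no \n comes before it.
after_newline = whole_matches(r"\n\w+", text)

# (^|\n)\w+ also accepts the start of the text block.
line_starts = whole_matches(r"(^|\n)\w+", text)

# \w+\r needs a word character directly before the carriage return.
line_ends = whole_matches(r"\w+\r", text)

print(after_newline)  # ['\nsecond', '\nthird']
print(line_starts)    # ['first', '\nsecond', '\nthird']
print(line_ends)      # ['here\r']
```

Each match includes the \n or \r itself, since whitespace characters are real characters that get consumed, unlike the zero-length ^ anchor.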
