Using Regular Expressions in Triggers - Part 2

Posted by John Carl Villanueva on Thu, Mar 15, 2012 @ 01:59 PM

Let's continue dissecting the regular expression ".*\.doc".

In Part 1, I introduced you to literal characters and explained that the characters '.', 'd', 'o', and 'c' of the .doc portion of that regex are actually literal characters. I also mentioned that the metacharacter '\' lets our server know that any metacharacter placed right after it should be treated as a literal character.

That leaves us with the two metacharacters: '.' and '*'. Let's have a closer look at these two now. 

The metacharacters dot (.) and asterisk (*) and how to match specific file types

When the metacharacter '.' (assuming it is not preceded by a '\') is compared with any single character, the two automatically match. It doesn't matter whether the other character is a digit, a letter, or anything else; the server simply treats the pair as a match. There are a few exceptions (in most regex engines, a line break is the usual one), but we won't be needing those at this stage.

The '*', on the other hand, is not compared with another character. Instead, it tells the server to match whatever precedes it zero or more times, taking as many repetitions as it can.
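To see the two metacharacters in isolation, here is a minimal Python sketch. It is only an illustration: the helper name `matches` is made up for this post, and I'm assuming (as the examples in this series do) that a pattern has to match the entire value, which is what Python's `re.fullmatch` checks.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# '.' stands for exactly one character of any kind.
print(matches(r".", "a"))    # True
print(matches(r".", "7"))    # True
print(matches(r".", "ab"))   # False: two characters, but only one dot

# '*' repeats whatever precedes it, zero or more times.
print(matches(r"x*", ""))      # True: zero repetitions
print(matches(r"x*", "xxxx"))  # True: four repetitions
print(matches(r"x*", "xxy"))   # False: 'y' is not an 'x'
```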

In our example, the character that precedes the '*' is the '.', which, as we mentioned, automatically matches any character. So let's see what happens when we try to match afile.doc against the regex ".*\.doc".

The last four characters, .doc, are matched by the literal portion '\.doc', so we can set them aside. That leaves us with afile and '.*'. Upon seeing the '.*', the server knows that it should try to match '.' as many times as possible. It starts by comparing it with 'a'. Since '.' can match any character, it can certainly match 'a'. Because of the '*', the server proceeds to compare '.' with 'f'. Again, they naturally match.

This goes on until '.' is compared with 'e'. Of course, they match too. Since every character in the filename has now been matched, the server concludes that afile.doc and the regex ".*\.doc" match.

If afile.doc is the value stored in the variable LocalPath, then the trigger condition:

LocalPath ~ ".*\.doc" would return TRUE

You can apply the same concept to the regexes: ".*\.txt", ".*\.pdf", ".*\.jpg", and so on.
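If you'd like to try this outside the server, here is a rough Python sketch. It assumes the `~` operator requires the whole value to match, which is how the examples in this post behave. The names `condition` and `local_path` are hypothetical stand-ins for illustration, not part of JSCAPE MFT Server.

```python
import re

# Hypothetical stand-in for the trigger variable LocalPath.
local_path = "afile.doc"

def condition(value: str, regex: str) -> bool:
    """Mimic `value ~ "regex"`, assuming the whole value must match."""
    return re.fullmatch(regex, value) is not None

print(condition(local_path, r".*\.doc"))    # True
print(condition(local_path, r".*\.txt"))    # False
print(condition("notes.txt", r".*\.txt"))   # True
print(condition("afiledoc", r".*\.doc"))    # False: '\.' needs a literal dot
```

The last line shows why the backslash matters: without a real '.' before "doc", the escaped dot has nothing to match.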

Now you know what regex to use when you want a trigger condition to apply to a specific file type.

More examples using ".*"

The ".*" doesn't always have to be placed at the beginning of a regex. Consider the regex "image.*a\.jpg". Since image and a.jpg (again, remember the '.' here is considered a literal character) are made up of literal characters, this regex will match any jpg file with a filename beginning with "image" and ending in 'a.jpg'. So,

  • image001a.jpg,

  • image12xpza.jpg, and

  • imageABCa.jpg

all match with the regex "image.*a\.jpg".

Note that even if there are no characters between "image" and 'a', such as imagea.jpg, the match will still hold. That's because, if you recall, '*' will instruct the server to try to match '.' from zero to as many times as possible.
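Here is a small Python check of those filenames, again assuming whole-string matching (`re.fullmatch`). The two extra filenames at the end are my own additions, included to show what does not match.

```python
import re

pattern = re.compile(r"image.*a\.jpg")

names = [
    "image001a.jpg",
    "image12xpza.jpg",
    "imageABCa.jpg",
    "imagea.jpg",       # '.*' matches zero characters here
    "image001b.jpg",    # ends in 'b.jpg', not 'a.jpg'
    "myimage001a.jpg",  # does not begin with "image"
]

results = {name: pattern.fullmatch(name) is not None for name in names}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

The first four print True and the last two print False.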

Feeling empowered already? You ain't seen nothing yet.  

Why don't you test the filename "imageABCA.jpg" against our regex? It's almost like "imageABCa.jpg", which matched earlier. Almost, but not quite: regexes are case-sensitive by default, so the uppercase 'A' before .jpg does not match the lowercase 'a' in our regex. We can modify the regex a bit so that it accepts either case at that position, like this:

"image.*[aA]\.jpg"

That should match that last filename (imageABCA.jpg) as well as those three other filenames we tried earlier. It also introduces us to a new pair of special characters. 
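A quick Python comparison of the two regexes, under the same whole-string assumption as before. The helper `full` and the third filename are hypothetical, added only to show that the square brackets relax the case of that one letter and nothing else.

```python
import re

STRICT = r"image.*a\.jpg"
RELAXED = r"image.*[aA]\.jpg"

def full(pattern: str, name: str) -> bool:
    """Hypothetical helper: True when the pattern matches the whole name."""
    return re.fullmatch(pattern, name) is not None

for name in ["imageABCa.jpg", "imageABCA.jpg", "ImageABCA.jpg"]:
    print(name, full(STRICT, name), full(RELAXED, name))

# imageABCa.jpg True True
# imageABCA.jpg False True
# ImageABCA.jpg False False   (the leading 'I' is still case-sensitive)
```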

Introducing character classes or character sets and case-insensitive regexes

A pair of square brackets, together with the characters they enclose, is called a character class or character set. When the server finds a character class in its simplest form, i.e., a pair of square brackets enclosing two or more alphanumeric characters, it will try to match exactly one character in the filename against any one of the characters inside.

For example, the regex image[aAbBcC]\.jpg can match any of the following, which in a way deals with the problem of case sensitivity:

  • imagea.jpg

  • imageA.jpg

  • imageb.jpg

  • imageB.jpg

  • imagec.jpg

  • imageC.jpg 

but it cannot match:

  • imaged.jpg,

  • imagez.jpg, or even

  • imageab.jpg

You might be wondering how come imageab.jpg doesn't match. Remember that, when the server tries to find a match from among the characters in a character class, it will only pick one and only one character. Once the 'a' in the character class has already matched, 'b' can no longer match. And since the regex does not have a literal character 'b' right after the character class, the 'b' in imageab.jpg fails to match at all.
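The "one and only one character" rule is easy to confirm with a short Python sketch (whole-string matching assumed, as in the rest of this post):

```python
import re

pattern = re.compile(r"image[aAbBcC]\.jpg")

should_match = ["imagea.jpg", "imageA.jpg", "imageb.jpg",
                "imageB.jpg", "imagec.jpg", "imageC.jpg"]
should_fail = ["imaged.jpg", "imagez.jpg", "imageab.jpg"]

matched = [n for n in should_match + should_fail if pattern.fullmatch(n)]
print(matched)  # only the six names from should_match
```

imageab.jpg is rejected because the class consumes the 'a', and the regex then expects a literal '.', not a 'b'.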

There are two ways to accommodate imageab.jpg if we want to employ only what we've discussed so far.

The first one is to simply insert the literal character 'b' between the character class and the '\.jpg' like this:

image[aAbBcC]b\.jpg

The other option would be to place an '*' right after the character class:

image[aAbBcC]*\.jpg

This will tell the server to match any of the characters in the character class from zero to as many times as possible. But be careful, because the second option will also match the following:

  • imageabcABC.jpg and

  • image.jpg

which may not be what you would like to achieve. 
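To compare the two options side by side, here is a minimal Python sketch (whole-string matching assumed; the variable names are my own):

```python
import re

OPTION_LITERAL_B = r"image[aAbBcC]b\.jpg"   # first option: extra literal 'b'
OPTION_STAR = r"image[aAbBcC]*\.jpg"        # second option: '*' after the class

names = ["imageab.jpg", "imagea.jpg", "image.jpg",
         "imageabcABC.jpg", "imaged.jpg"]

results = {
    name: (re.fullmatch(OPTION_LITERAL_B, name) is not None,
           re.fullmatch(OPTION_STAR, name) is not None)
    for name in names
}
for name, (literal_b, star) in results.items():
    print(f"{name}: literal-b={literal_b}, star={star}")
```

Running it shows the trade-off: the first option accepts imageab.jpg but no longer accepts imagea.jpg, since the 'b' is now mandatory. The second option accepts everything in the list except imaged.jpg.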

Numbers and number ranges in character classes

Character classes don't just match letters of the alphabet. They can also match digits or a range of digits.

To match all .doc files whose names end in the digit 1, 3, or 5, use ".*[135]\.doc". The following files will match:

  • word1.doc

  • myletter3.doc

  • myletter135.doc 

  • report33.doc

In each of these filenames, the character matched by the character class is the last digit before .doc (the 5 in myletter135.doc, the second 3 in report33.doc). This emphasizes that a character class matches only one character. All preceding characters in each of those filenames were of course matched by '.*'.
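To see which character the class actually consumes, here is a small Python sketch. The parentheses are my own addition: they are a standard regex feature (a capturing group) that I use here only to report the character, and they do not change which filenames match. The function name `class_char` and the filename report2.doc are hypothetical.

```python
import re

# Same regex as above, with the class wrapped in a capturing group.
pattern = re.compile(r".*([135])\.doc")

def class_char(name: str):
    """Return the character matched by [135], or None if there is no match."""
    m = pattern.fullmatch(name)
    return m.group(1) if m else None

for name in ["word1.doc", "myletter3.doc", "myletter135.doc",
             "report33.doc", "report2.doc"]:
    print(name, class_char(name))

# word1.doc 1
# myletter3.doc 3
# myletter135.doc 5
# report33.doc 3
# report2.doc None
```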

To match a range of digits, put a hyphen between the first and last digit of the range. So let's say you want to match all .doc files whose names begin with any digit from 6 to 9. You can use this regex: "[6-9].*\.doc". This should match the following files:

  • 6Abatch2.doc

  • 8245file.doc

  • 9.doc

You can even combine letters, numbers, and ranges of numbers like this: "[6-9jpv124]". I leave it to you to try this out.
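If you want a head start on that exercise, here is a minimal Python sketch covering both the range regex and the combined class. It assumes whole-string matching, and the non-matching samples are my own.

```python
import re

# Range at the start of the filename.
range_regex = re.compile(r"[6-9].*\.doc")
range_results = {
    name: range_regex.fullmatch(name) is not None
    for name in ["6Abatch2.doc", "8245file.doc", "9.doc",
                 "5report.doc", "file6.doc"]
}
print(range_results)  # the first three are True, the last two False

# Combined class: one character that is 6-9, j, p, v, 1, 2, or 4.
combined = re.compile(r"[6-9jpv124]")
accepted = [s for s in ["7", "j", "4", "3", "5", "a", "12"]
            if combined.fullmatch(s)]
print(accepted)  # ['7', 'j', '4']
```

Note that "12" is rejected even though both digits are in the class, because the class still matches only one character.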

Regexes for multiple file types

So far, we've only looked at regexes that match one type of file at a time. What if we wanted to match all jpg, doc, and xls files?

For this, you'll need the vertical bar '|'. The vertical bar will allow the server to match one regex from a selection of regexes. So, if the individual regular expressions for matching jpg, doc and xls files are:

  • ".*\.jpg"

  • ".*\.doc"

  • ".*\.xls"

then the combined regex would simply be:  ".*\.jpg|.*\.doc|.*\.xls". 
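Here is a short Python sketch of the combined regex, with whole-string matching assumed as before. The second pattern, which groups the extensions in parentheses, is a common regex shorthand that this post has not covered; I include it only for comparison and haven't verified it against the trigger syntax.

```python
import re

COMBINED = r".*\.jpg|.*\.doc|.*\.xls"
GROUPED = r".*\.(jpg|doc|xls)"   # shorthand, not covered in this post

names = ["photo.jpg", "report.doc", "sheet.xls", "notes.txt"]

combined_results = {n: re.fullmatch(COMBINED, n) is not None for n in names}
grouped_results = {n: re.fullmatch(GROUPED, n) is not None for n in names}

print(combined_results)  # notes.txt is the only False
print(combined_results == grouped_results)  # True for these samples
```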

There are lots of areas in JSCAPE MFT Server where regular expressions can be employed. One of them is the DLP (Data Loss Prevention) module. The regular expressions there are considerably more complicated than those we've seen so far, so I'll cover them in a separate post. Hope to see you there.
