Exploring Regular Expressions in DLP

Learn how the regular expressions in JSCAPE MFT Server's DLP module work in finding sensitive credit card numbers.

There are a number of areas in JSCAPE MFT Server where regular expressions can be employed. The regular expressions in the DLP (Data Loss Prevention) module, for instance, play a crucial role in finding sensitive credit card numbers among data stored in your managed file transfer server. This allows you to automate the tedious task of finding and protecting (e.g. by encryption) this sensitive data, and helps you comply with PCI DSS requirements.

In this post, we'll tackle one of the regular expressions in the DLP module. The regexes in that module are some of the most sophisticated regular expressions you'll ever encounter here.

As an example, I'm going to show you the regular expression used in the DLP Rule for Master Card cards:

[Screenshot: the Master Card regular expression as shown in the DLP rule]

Let me paste that regex here:

5[1-5][0-9]{14}|5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}|[ -]*5[ -]*[1-5][ -]*(?:[0-9][ -]*){14}

Looks pretty ugly, huh?

If you review Part 1 and Part 2 of our Using Regular Expressions in Triggers series, you'll notice that there are a couple of characters in this regex that weren't discussed there. We'll start with those characters.

Take a look at the first instance of {14}. That's new, so let's tackle that first.

Curly brackets { } in regular expressions

Curly brackets { } are repetition operators, just like the *. They tell the server to repeat the character, character class, or group placed right before them. The advantage of using { } is that you can specify how many repetitions are allowed. You can choose from three basic syntaxes to do that:

1. {min,max} - where min is the minimum number of times and max is the maximum number of times matches can be made.

2. {min,} - where min is the minimum number of times matches can be made and the maximum is infinite.

3. {exact} - where exact is the exact number of times matches should be made.
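To see the three syntaxes side by side, here is a minimal sketch in Python's re module. Python is used only for illustration; the helper name matches_whole is hypothetical, and the sketch assumes that the whole string has to match the pattern.

```python
import re

def matches_whole(pattern, text):
    """Return True only if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# {min,max}: between 2 and 4 digits
print(matches_whole(r"[0-9]{2,4}", "123"))     # True
print(matches_whole(r"[0-9]{2,4}", "12345"))   # False (5 digits is too many)

# {min,}: at least 2 digits, with no upper limit
print(matches_whole(r"[0-9]{2,}", "1234567"))  # True
print(matches_whole(r"[0-9]{2,}", "1"))        # False (only 1 digit)

# {exact}: exactly 3 digits
print(matches_whole(r"[0-9]{3}", "123"))       # True
print(matches_whole(r"[0-9]{3}", "1234"))      # False (4 digits is not 3)
```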

Since we only see that last syntax in the Master Card regex, we'll just focus on that for now.

[0-9]{14} therefore means that exactly 14 consecutive digits, each ranging from 0 to 9, should match. If you test the regex [0-9]{14} in the JSCAPE MFT Server regex tester, you'll see that the following strings, which all have exactly 14 numeric characters, match:

  • 01234567891234

  • 12345555592221

  • 11111222223333

  • 98767877656565

But these don't, because the whole string has to consist of exactly 14 digits:

  • 012345678912345667

  • 00123

  • 01
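The same check can be sketched outside the server. The snippet below uses Python's re module as a stand-in for the regex tester and assumes whole-string matching (fullmatch), which is what the results above imply; the variable names are made up for this example.

```python
import re

FOURTEEN_DIGITS = re.compile(r"[0-9]{14}")

should_match = ["01234567891234", "12345555592221",
                "11111222223333", "98767877656565"]
should_not_match = ["012345678912345667", "00123", "01"]

# fullmatch() succeeds only if the whole string is exactly 14 digits.
for text in should_match + should_not_match:
    print(text, FOURTEEN_DIGITS.fullmatch(text) is not None)

# A substring search is more permissive: the 18-digit string
# contains a run of 14 digits, so search() finds a match inside it.
print(FOURTEEN_DIGITS.search("012345678912345667") is not None)
```

Keep that last line in mind whenever you test a pattern: whether a long run of digits "matches" depends on whether the tool looks at the whole string or at any part of it.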

\d in regular expressions

So far, all the alphanumeric characters we've encountered (except those in character classes) have been treated as literals. But there are some alphanumeric characters which, when placed right after a '\', can take on a special role. 'd' is one such character.

When a 'd' is placed right after a '\' to form '\d', the managed file transfer server can match it with a single digit from 0 to 9. In other words, it is a shorthand notation for the character class [0-9]. Therefore, \d{2} will match two consecutive digits, each within the range 0 to 9. For example:

  • 23

  • 77

  • 81

  • 00
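Here is a small sketch that compares \d with [0-9], again using Python's re module for illustration. One assumption to flag: in Python, \d can also match non-ASCII digits unless the re.ASCII flag is set, so the flag is used here to keep \d equivalent to [0-9].

```python
import re

# re.ASCII restricts \d to the characters 0-9 (a Python-specific detail).
DIGIT_SHORTHAND = re.compile(r"\d{2}", re.ASCII)
DIGIT_CLASS = re.compile(r"[0-9]{2}")

samples = ["23", "77", "81", "00", "7", "a1", "123"]

for text in samples:
    shorthand = DIGIT_SHORTHAND.fullmatch(text) is not None
    char_class = DIGIT_CLASS.fullmatch(text) is not None
    # Both columns should always agree.
    print(repr(text), shorthand, char_class)
```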

What's that ?: after the opening round bracket?

There's one more set of characters that should be new to you. It's the ?: found between the opening round bracket ( and the character class [0-9]. That's just a special syntax that tells the server to treat the round brackets as a plain (non-capturing) group, so that the text they match isn't stored for use in what is known as a "backreference".

Normally, when a portion of a regex enclosed by a pair of round brackets matches a portion of a string, the regex engine stores the matched characters in memory. Such a pair of round brackets is called a capturing group, and a backreference is the way to reuse what the group stored. Let's say you have this string:

secretjpgfilesjpg-0001.jpg

This regex can be used to match it:

secret(jpg)files\1-0001.\1

Once the server successfully matches secret(jpg), the capturing group stores jpg in memory. That stored text can then be reused in later parts of the regex through the backreference \1. You can also use other numbers like \2, \3, and so forth if you have multiple capturing groups in one regex. Because we only have one group in our example, we just use \1. (Strictly speaking, the unescaped . before the last \1 matches any single character, not just a literal dot; write \. if only a dot should match.)
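The example can be tried with a short sketch in Python's re module (used here for illustration only; the second file name is made up to show a failing case).

```python
import re

pattern = re.compile(r"secret(jpg)files\1-0001.\1")

match = pattern.fullmatch("secretjpgfilesjpg-0001.jpg")
print(match is not None)  # True
print(match.group(1))     # jpg (the text stored by the round brackets)

# \1 must repeat exactly what the brackets matched, so "png" in
# place of the second "jpg" makes the whole match fail.
print(pattern.fullmatch("secretjpgfilespng-0001.jpg"))  # None
```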

Capturing groups and backreferences can be very handy. But because storing the matched text takes extra work, they can add some overhead on your server. Whenever you don't need the contents of a pair of round brackets to be stored, you place a ?: right after the opening (. Once the server sees that, it will just use the round brackets for grouping that part of the regex without capturing anything.

For example, you can use this regex:

(?:[4-9]){3}

to match these strings:

  • 478

  • 555

  • 946

without capturing anything for a backreference.
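As a quick sketch of the difference, the snippet below (Python's re module, for illustration) compiles the regex with and without ?: and compares them. The groups attribute reports how many sets of round brackets store their match.

```python
import re

capturing = re.compile(r"([4-9]){3}")
non_capturing = re.compile(r"(?:[4-9]){3}")

print(capturing.groups)      # 1 (the brackets store what they match)
print(non_capturing.groups)  # 0 (grouping only, nothing stored)

# Both versions accept exactly the same strings.
for text in ["478", "555", "946", "123"]:
    print(text,
          capturing.fullmatch(text) is not None,
          non_capturing.fullmatch(text) is not None)
```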

Breaking down regular expressions in the DLP module

We're now in a position to discuss the regular expression used in JSCAPE MFT Server's built-in DLP rule for Master Card cards. Let me paste that again here so you won't have to scroll back up so much.

5[1-5][0-9]{14}|5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}|[ -]*5[ -]*[1-5][ -]*(?:[0-9][ -]*){14}

Notice that there are two vertical bars in that expression. If you recall from the end of Part 2 in our Using Regular Expressions in Triggers series, the vertical bar functions like an "or" operator. More specifically, it gives the server the option to match any one of several alternative regular expressions.

That lengthy regular expression above actually consists of three smaller regular expressions, which are separated by vertical bars (|):

  • 5[1-5][0-9]{14}

  • 5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}

  • [ -]*5[ -]*[1-5][ -]*(?:[0-9][ -]*){14}

In other words, any set of characters that matches any of the three regular expressions above satisfies the DLP rule for Master Card cards. Once the server finds such a set of characters, it will identify it as a Master Card card number. That will prompt it (the server) to perform any action you've instructed it to do (e.g. protect the file containing that card number using encryption).
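The "or" behaviour can be sketched as follows. This is an illustration in Python's re module, not how the server is implemented; the helper matching_alternatives is hypothetical, and splitting the rule on | is a shortcut that only works here because no | appears inside a group or character class.

```python
import re

MASTERCARD_RULE = (
    r"5[1-5][0-9]{14}"
    r"|5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}"
    r"|[ -]*5[ -]*[1-5][ -]*(?:[0-9][ -]*){14}"
)

# Naive split into the three smaller regular expressions.
ALTERNATIVES = MASTERCARD_RULE.split("|")

def matching_alternatives(text):
    """Return the 1-based positions of the alternatives that match the whole text."""
    return [i for i, alt in enumerate(ALTERNATIVES, start=1)
            if re.fullmatch(alt, text)]

print(matching_alternatives("5501234567891234"))                 # [1, 2, 3]
print(matching_alternatives("5423-4576-2298-5523"))              # [2, 3]
print(matching_alternatives("5-1-0-1-2-3-4-5-6-7-8-9-1-2-3-4"))  # [3]
print(matching_alternatives("4501234567891234"))                 # []
```

A string satisfies the full rule as soon as that list is non-empty, which is exactly what the vertical bars express.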

Let's dissect the first regular expression: 5[1-5][0-9]{14}

This one's easy. It means that any string of characters starting with the number 5, followed by any number from 1 to 5, and then followed by 14 digits, each ranging from 0 to 9, can be considered a Master Card card number. All in all, there should be 16 digits.

Here are some sets of characters that satisfy this particular condition:

  • 5501234567891234

  • 5555555555555555

  • 5102468135792233

And here are some that don't:

  • 4501234567891234 - because it starts with 4

  • 5601234567891234 - because the 2nd digit (6) is outside the range 1-5

  • 531234234234234234234 - because it has more than 16 digits (when the whole string has to match)
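These examples can be checked with a short sketch (Python's re module, for illustration; the helper name is hypothetical). It assumes whole-string matching, and the last lines show why that assumption matters for the overly long number.

```python
import re

FIRST_ALTERNATIVE = re.compile(r"5[1-5][0-9]{14}")

def is_plain_card_number(text):
    """True if the whole text is 5, then 1-5, then exactly 14 more digits."""
    return FIRST_ALTERNATIVE.fullmatch(text) is not None

for candidate in ["5501234567891234", "5555555555555555", "5102468135792233",
                  "4501234567891234", "5601234567891234",
                  "531234234234234234234"]:
    print(candidate, is_plain_card_number(candidate))

# With a substring search, the first 16 digits of the long
# number would still be picked up as a match.
found = FIRST_ALTERNATIVE.search("531234234234234234234")
print(found.group())
```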

Obviously, the general requirements for constructing other regexes for this particular DLP rule should be the same as that first regex. That is, a Master Card card number should consist of 16 digits all in all, with 5 as the first digit and any number from 1 to 5 as the second digit.

However, not all users and software applications will record these card numbers using just pure digits. Others may include dashes and/or spaces.

That's when the second and third regexes can come into play.

Let's now dissect the 2nd regex, 5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}, one portion at a time:

5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}

5[1-5] is just like the one you saw in the first regex. As for \d{2}, remember that \d matches any single digit from 0-9, while {2} tells the server to attempt matches for \d exactly two times.

5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}

[ -]* means that the characters matching \d{2} may be followed by a space, a dash, or nothing. This should then be followed by any 4 digits (\d{4}). After that comes either a space, a dash, or nothing ([ -]*).

5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}

Lastly, it should have another 4 digits; followed by either a space, dash, or nothing; and then another 4 digits.

Here are some simple examples that would match:

  • 5344 2345 7373 4512

  • 5423-4576-2298-5523

  • 5423457622985523

But these would match as well:

  • 5344    2345    7373    4512 - multiple space characters between each set of 4 digits

  • 5423-4576 2298 5523 - dashes and spaces

  • 5423 - 4576 - 2298 - 5523 - adjacent dashes and spaces placed between sets of 4 digits

How is this possible?

Well, if you take a closer look at [ -]*, it actually tells the server to find matches for a space or a dash zero or more times, as many times as possible. So after matching a space or a dash, the server will continue to look for another space or dash right after that match until it no longer finds any.
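Here is a minimal sketch of the second regex at work, using Python's re module for illustration. The last two samples are made up to show where separators are not tolerated.

```python
import re

SECOND_ALTERNATIVE = re.compile(r"5[1-5]\d{2}[ -]*\d{4}[ -]*\d{4}[ -]*\d{4}")

samples = [
    "5344 2345 7373 4512",
    "5423-4576-2298-5523",
    "5423457622985523",
    "5344    2345    7373    4512",   # runs of spaces
    "5423-4576 2298 5523",            # dashes and spaces mixed
    "5423 - 4576 - 2298 - 5523",      # adjacent dashes and spaces
    "5423_4576_2298_5523",            # underscore is not in [ -]
    "54-23 4576 2298 5523",           # separator inside a 4-digit set
]

for text in samples:
    print(repr(text), SECOND_ALTERNATIVE.fullmatch(text) is not None)
```

The first six samples match; the last two don't, because this regex only accepts spaces and dashes, and only between the sets of 4 digits.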

Pretty cool, huh?

Now it's time for the last regex: [ -]*5[ -]*[1-5][ -]*(?:[0-9][ -]*){14}

This regex can identify a Master Card card number even if:

  • the first digit is preceded by any number (even zero) of dashes and/or spaces;

  • the first digit is followed by any number of dashes and/or spaces;

  • the second digit, which can be any number from 1 to 5, is followed by any number of dashes and/or spaces; and

  • each of the 14 succeeding digits is followed by any number of dashes and/or spaces.

Thus, each of the following examples can match the regex:

  • 5 1 0 1 2 3 4 5 6 7 8 9 1 2 3 4

  • 5-1-0-1-2-3-4-5-6-7-8-9-1-2-3-4

  • 5-1 0 1 2345-6-7-89123 4

  •   5- -1-0 1  2---3- 4--5 - 678-9 1  -2-  3 4
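A quick sketch of the third regex, again in Python's re module for illustration. The two rejected samples at the end are made up: one has only 15 digits and the other uses dots as separators.

```python
import re

THIRD_ALTERNATIVE = re.compile(r"[ -]*5[ -]*[1-5][ -]*(?:[0-9][ -]*){14}")

samples = [
    "5 1 0 1 2 3 4 5 6 7 8 9 1 2 3 4",
    "5-1-0-1-2-3-4-5-6-7-8-9-1-2-3-4",
    "5-1 0 1 2345-6-7-89123 4",
    "  5- -1-0 1  2---3- 4--5 - 678-9 1  -2-  3 4",
    "5 1 0 1 2 3 4 5 6 7 8 9 1 2 3",    # only 15 digits
    "5.1.0.1.2.3.4.5.6.7.8.9.1.2.3.4",  # dots are not in [ -]
]

for text in samples:
    print(repr(text), THIRD_ALTERNATIVE.fullmatch(text) is not None)
```

However the dashes and spaces are scattered, the regex still insists on 16 digits in total: a 5, a digit from 1 to 5, and 14 more.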

It's really quite amazing how regexes can make your DLP rules very flexible.

What you've learned regarding regexes up to this point should be enough to help you interpret the DLP rules for the following credit card numbers:

  • MasterCard;

  • American Express;

  • Diners Club;

  • Discover; and

  • Visa

or even US Social Security numbers. I encourage you to check out those rules in the DLP module. That way, you can see if you really absorbed what we've been talking about so far.

Try this out yourself. Download the free, fully-functional evaluation edition of JSCAPE MFT Server.