Tutorial 15 - Regular Expressions

Regular expressions are a compact notation for describing patterns in strings, used for matching, extracting and reformatting text. Their most common use is to cut down the work of validating data input. This tutorial covers the special syntax, how to use a regular expression in form validation, and several useful examples.

Escape Sequences and Character Classes

Escape sequences represent formatting and other non-printing characters, and prevent characters that have a special meaning from being misinterpreted. Each escape sequence starts with a backslash. The available sequences are:

Seq      Usage
\b       backspace (only inside a character set, as in [\b])
\f       formfeed
\n       newline
\r       carriage return
\t       horizontal tab
\v       vertical tab
\\       backslash
\B       non-word boundary, not a backslash (see Boundary Matches below)
\xnn     character defined by hex code nn
\nnn     character defined by octal code nnn (legacy syntax)
\unnnn   Unicode character defined by the four hex digits nnnn
\cX      control character Ctrl-X, where X is a letter
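
A few of these sequences can be checked directly. The snippet below is a minimal sketch; the variable names and sample strings are assumptions chosen for illustration.

```javascript
// Each line tests one escape sequence against a known character.
var hexA  = /\x41/.test("A");        // true: hex code 41 is "A"
var uniA  = /\u0041/.test("A");      // true: Unicode 0041 is also "A"
var ctrlJ = /\cJ/.test("\n");        // true: Ctrl-J is the newline character
var slash = /\\/.test("C:\\temp");   // true: \\ matches one literal backslash

// \b means backspace only inside a character set.
var backspaceInSet = /[\b]/.test("\b");  // true
// Outside a set it is a word boundary, and a lone backspace has none.
var boundaryOnly = /\b/.test("\b");      // false
```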

 

Character class abbreviations save typing when writing a regular expression. For example, \w stands for any letter, digit or the underscore character, and is equivalent to [A-Za-z0-9_].

Seq        Matches
\d         Any digit 0-9
\D         Any non-digit
\s         Any whitespace character
\S         Any single non-whitespace character
\w         Any letter, number or underscore
\W         Any character except letter, number or underscore

Character  Matches
.          Any character except newline
[abcde]    Any character in the enclosed set
[^abcde]   Any character not in the enclosed set
[a-e]      Any character in the enclosed range
x|y        Either x or y (ie. logical OR)
()         Grouping that is stored (back referenced) for later use ($1, $2 etc.)
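
The classes and grouping above can be tried out as follows. This is a small sketch only; the variable names and sample strings are assumptions for illustration.

```javascript
// \d and \w are shorthand for longer bracketed sets.
var digitsOnly = /^\d+$/;   // same as /^[0-9]+$/
var wordOnly   = /^\w+$/;   // same as /^[A-Za-z0-9_]+$/

var d1 = digitsOnly.test("2014");       // true
var d2 = digitsOnly.test("20a4");       // false: "a" is not a digit
var w1 = wordOnly.test("user_name1");   // true
var w2 = wordOnly.test("user name");    // false: a space is not in \w

// x|y matches either alternative.
var pet = /^(cat|dog)$/.test("dog");    // true

// Parentheses store what they matched; $1 and $2 refer back to the groups.
var swapped = "Smith, John".replace(/(\w+), (\w+)/, "$2 $1");
// swapped is "John Smith"
```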

 

Boundary Matches and Greedy Quantifiers

Character  Matches
^          Beginning of string
$          End of string
\b         Word boundary
\B         Non-word boundary

Character  Matches previous character
*          Zero or more times
+          One or more times
?          Zero or one time
{n}        Exactly n occurrences
{n,}       At least n occurrences
{n,m}      Between n and m occurrences
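
The following sketch shows boundaries and quantifiers at work, including what "greedy" means in practice. The patterns and sample strings are assumptions chosen for illustration.

```javascript
// \b anchors a match at the edge of a word.
var wholeWord = /\bcat\b/;
var b1 = wholeWord.test("the cat sat");   // true
var b2 = wholeWord.test("concatenate");   // false: "cat" is inside a word

// ? makes the previous character optional.
var colour = /^colou?r$/;
var q1 = colour.test("color");            // true
var q2 = colour.test("colour");           // true

// {n} and {n,m} count occurrences of the previous character.
var q3 = /^\d{4}$/.test("2014");          // true: exactly four digits
var q4 = /^\d{4,6}$/.test("1234567");     // false: seven digits is too many

// Quantifiers are greedy: .* takes as much as it can,
// so the match runs to the LAST </b>, not the first.
var greedy = "<b>one</b><b>two</b>".match(/<b>.*<\/b>/)[0];
// greedy is "<b>one</b><b>two</b>"
```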

 

Regular Expression Modifiers

Regular expression modifiers change how the entire expression is applied. They are placed at the end of the expression, after the closing forward slash, as in /[abc]+/i

Character  Modification
g          global search for all matches
i          case-insensitive search
m          multi-line search (^ and $ also match at line breaks)
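
The effect of each modifier is easiest to see on one sample string. This is a minimal sketch; the sample text and variable names are assumptions for illustration.

```javascript
var text = "Cat\ncat\nCAT";   // three lines

// No modifiers: only the first case-sensitive match is found.
var first = text.match(/cat/)[0];           // "cat" (from the second line)

// g and i together: every match, regardless of case.
var all = text.match(/cat/gi);              // ["Cat", "cat", "CAT"]

// Without m, ^ and $ refer to the whole string, so nothing matches.
var wholeOnly = text.match(/^cat$/gi);      // null

// With m, ^ and $ also match at each line break.
var perLine = text.match(/^cat$/gim);       // ["Cat", "cat", "CAT"]
```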

Using Regular Expressions in Scripts

To use a regular expression for validating an entry in JavaScript, first set up a variable that contains the expression.
Note: Forward slashes are used to quote a regular expression while ' and " are used to quote a string expression.

var re = /whatever/;

Then apply the regular expression test method on the string to be tested.

if (re.test(entryValue)) {return true;}
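
Put together, the two steps might look like the sketch below. The function name isFiveDigits and the five-digit rule are assumptions made up for illustration; substitute whatever pattern the entry requires.

```javascript
// Hypothetical validator: accepts an entry made of exactly five digits.
function isFiveDigits(entryValue) {
  var re = /^\d{5}$/;          // ^ and $ reject any extra characters
  return re.test(entryValue);  // test() returns true or false
}

var okEntry  = isFiveDigits("90210");   // true
var badEntry = isFiveDigits("9021a");   // false
```

Returning the result of test() directly avoids the separate if statement when the function only needs to report true or false.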

To use a regular expression to extract a matching string, first set up a regular expression variable as above. Next use the regular expression exec method on the string. A match is returned as an array (the whole match first, followed by any stored groups); null indicates no match.

var ar = re.exec(var_string);
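
As an illustration of what exec returns, the sketch below pulls a date out of a longer string. The pattern, the sample string and the variable names are assumptions for illustration.

```javascript
// Three groups: year, month, day.
var dateRe = /(\d{4})-(\d{2})-(\d{2})/;
var dateParts = dateRe.exec("Updated 2014-03-02 by JR");

// dateParts[0] is the whole match, dateParts[1..3] are the groups.
var whole = dateParts[0];      // "2014-03-02"
var year  = dateParts[1];      // "2014"
var month = dateParts[2];      // "03"
var day   = dateParts[3];      // "02"
var where = dateParts.index;   // 8, the position where the match starts

// No match gives null, so check before using the result.
var none = dateRe.exec("no date here");   // null
```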

To use a regular expression for modifying a string in JavaScript, first set up a regular expression variable as above. Next use the string replace method, which returns a new string and leaves the original unchanged. Note that you can use back references in the replacement if required.

var x = y.replace(re,"$1");
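
Two short replacements show back references and the g modifier in use. This is a sketch only; the sample strings and variable names are assumptions for illustration.

```javascript
// Reorder a date using back references to the three groups.
var isoDate = "2014-03-02";
var dmy = isoDate.replace(/^(\d{4})-(\d{2})-(\d{2})$/, "$3/$2/$1");
// dmy is "02/03/2014"; isoDate itself is unchanged

// With g, every run of whitespace is replaced, not just the first.
var collapsed = "a   b    c".replace(/\s+/g, " ");
// collapsed is "a b c"
```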

You should always test a regular expression before using it in your own scripts. A good on-line site to use is provided by Locher.

A cookbook of useful RegExp enabled functions is provided by O'Sullivan.

Example: Trim() Function

The following trim() function removes leading and trailing whitespace characters from a string. If a second parameter is used, it is used instead of the standard whitespace characters. ltrim() and rtrim() can be used independently!
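
One possible way to write these three functions is sketched below. It is a reconstruction from the description above, not necessarily the original listing; the helper charSet and the parameter name chars are assumptions introduced here.

```javascript
// Hypothetical helper: builds the contents of a [...] set.
// With no second parameter it falls back to \s (standard whitespace).
// Characters that are special inside a set (\ ] ^ -) are escaped.
function charSet(chars) {
  return chars ? chars.replace(/[\\\]\^\-]/g, "\\$&") : "\\s";
}

// Remove leading characters.
function ltrim(str, chars) {
  return str.replace(new RegExp("^[" + charSet(chars) + "]+"), "");
}

// Remove trailing characters.
function rtrim(str, chars) {
  return str.replace(new RegExp("[" + charSet(chars) + "]+$"), "");
}

// Remove both by combining the two.
function trim(str, chars) {
  return ltrim(rtrim(str, chars), chars);
}

var t1 = trim("  hello  ");        // "hello"
var t2 = trim("--hello--", "-");   // "hello"
var t3 = ltrim("  hello  ");       // "hello  "
var t4 = rtrim("xxhelloxx", "x");  // "xxhello"
```

The RegExp constructor is used instead of a /.../ literal because the set of characters is only known when the function is called.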

Example: URLs and Files

Validating a URL or filename often means requiring a specific extension. One regular expression that catches all such filenames (and more!) is:

/^\S+\.(gif|jpg|jpeg|png)$/

The above expression matches only names that end in a standard Web image extension. It is not foolproof, as it permits subfolders with null names such as a//b.gif and specs like a:/b:/c.gif. It is also case sensitive, so the i modifier is needed to accept names such as B.GIF.
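
The behaviour described above can be confirmed with a few sample names. This is a minimal sketch; the filenames and variable names are assumptions for illustration.

```javascript
var imageFile = /^\S+\.(gif|jpg|jpeg|png)$/;

var f1 = imageFile.test("photo.jpg");        // true
var f2 = imageFile.test("images/logo.png");  // true
var f3 = imageFile.test("notes.txt");        // false: wrong extension
var f4 = imageFile.test("my photo.png");     // false: \S does not allow a space
var f5 = imageFile.test("photo.png.exe");    // false: $ requires the extension last

// The weaknesses noted above: odd paths are accepted...
var f6 = imageFile.test("a//b.gif");         // true
// ...and upper case is rejected unless i is added.
var f7 = imageFile.test("PHOTO.JPG");        // false
var f8 = /^\S+\.(gif|jpg|jpeg|png)$/i.test("PHOTO.JPG");   // true
```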

Example: Canada Postal Code

The Canada Postal Code rules are:

  1. Letters and numbers alternate for exactly six characters (eg L0S1E0).
  2. D, F, I, O, Q and U are never used as they can cause optical reader issues.
  3. W and Z are not used as the first letter (region designator).

A 'first version' regular expression for Canadian postal codes is:

/^([a-z]\d){3}$/i

This expression makes sure that there are exactly three groups {3} of a letter [a-z] followed by a digit \d. The i suffix makes the match case insensitive (ie capitals are allowed). The ^ and $ guarantee that no other data is provided. However, this easy-to-understand expression does not allow for an optional space after the third character, nor does it restrict each letter to its permitted subset. It also doesn't allow for leading or trailing whitespace. The solution is to write out the repetition explicitly, add \s* at each end, place a (\s)? after the third character to allow zero or one space, and narrow the letter matches to the specific subsets.

/^\s*[a-ceghj-npr-tvxy]\d[a-ceghj-npr-tv-z](\s)?\d[a-ceghj-npr-tv-z]\d\s*$/i
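
Comparing the two versions on the same inputs shows what the longer expression gains. The sketch below uses sample codes and variable names that are assumptions for illustration.

```javascript
var firstVersion = /^([a-z]\d){3}$/i;
var postalCode = /^\s*[a-ceghj-npr-tvxy]\d[a-ceghj-npr-tv-z](\s)?\d[a-ceghj-npr-tv-z]\d\s*$/i;

// Both accept the plain six-character form.
var p1 = firstVersion.test("L0S1E0");   // true
var p2 = postalCode.test("L0S1E0");     // true

// Only the second allows the optional space and surrounding whitespace.
var p3 = firstVersion.test("L0S 1E0");  // false
var p4 = postalCode.test("L0S 1E0");    // true
var p5 = postalCode.test(" k1a 0b1 ");  // true

// Only the second enforces the letter rules.
var p6 = firstVersion.test("D0D0D0");   // true, although D is never used
var p7 = postalCode.test("D0D0D0");     // false
var p8 = postalCode.test("W1A 0B1");    // false: W cannot be the first letter
var p9 = postalCode.test("L0S  1E0");   // false: only one space is allowed
```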

Example: E-mail Addresses

E-mail addresses are of the form xxx@yyy where xxx is the specific mailbox (which can contain underscores, periods and dashes) and yyy is the domain, which can end in a series of suffixes such as .co.uk. One regular expression that matches the great majority of addresses in common use is:

/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/

This is a complex expression and deserves explanation. A regular expression literal starts and ends with a forward slash to differentiate it from an ordinary string expression. Most validation expressions anchor the match at the first character with ^ and at the last with $.

Next comes the mailbox name, which can include periods and dashes. \w+ states that one or more word characters (letters, digits or underscore) must be at the start of the name. ([\.-]?\w+)* allows periods or dashes to be included in the mailbox name, with the trailing \w+ ensuring that those characters cannot finish the name. The @ is the mandatory separator.

The domain name can have several .xx or .xyz suffixes such as .co.uk. Once again \w+ ensures that the domain starts with a word character and ([\.-]?\w+)* allows for the dashes and periods. Finally (\.\w{2,3})+ ensures that there is at least one suffix of two or three characters preceded by a period.

Note: This is not a completely foolproof validation, as it rejects addresses whose final suffix has four or more characters (such as .info). Also, not all two and three letter combinations are legitimate domains!
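
The expression and its limits can be checked against a handful of addresses. This is a sketch only; the addresses and variable names are made up for illustration.

```javascript
var emailRe = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/;

// Accepted: ordinary mailbox and domain forms.
var e1 = emailRe.test("john.smith@example.com");     // true
var e2 = emailRe.test("j_r@mail.example.co.uk");     // true
var e3 = emailRe.test("mary-ann@my-site.org");       // true

// Rejected: a period cannot start the name or appear twice in a row.
var e4 = emailRe.test(".john@example.com");          // false
var e5 = emailRe.test("john..smith@example.com");    // false

// Rejected: the domain needs at least one suffix.
var e6 = emailRe.test("john@example");               // false

// The limitation from the note: a four-letter suffix is refused.
var e7 = emailRe.test("john@example.info");          // false
```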



JR's HomePage | Comments [jstutorf.htm:2014 03 02]