Tuesday, July 11, 2006

REGULAR EXPRESSIONS IN FIVE MINUTES

As I mentioned above, regular expressions are arguably a complete
language. A regular expression is a string of special characters
interspersed with sets of ordinary characters, used as a mask against
strings in files and HTML entry fields. The regular expression engine
compares a line of text with your regular expression mask. It can
either simply return a Boolean saying whether your text string matched
the mask, or it can update characters in that string. The following,
for instance, is a regular expression that you can use to validate
phone numbers.

/^\(\d\d\d\) \d\d\d-\d\d\d\d$/

That regular expression can be used in JavaScript to test an input
string:

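function checkPhoneNumber(phoneNo) {
    // Matches only the exact form (ddd) ddd-dddd
    var phoneRE = /^\(\d\d\d\) \d\d\d-\d\d\d\d$/;
    // test() returns true or false without building a match array
    if (phoneRE.test(phoneNo)) {
        return true;
    } else {
        alert("The phone number entered is invalid!");
        return false;
    }
}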

But that regular expression requires the area code in parentheses, a
space after it, and a hyphen between the exchange and the four-digit
number. The following accepts an optional area code (with optional
parentheses), a three-digit exchange with a single space, a hyphen,
or nothing after the area code, and a four-digit number with a single
space, a hyphen, or nothing between it and the exchange:

/^(\()?(\d{3})?(\))?( |-)?(\d{3})( |-)?(\d{4})$/
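To see how a more forgiving mask behaves, here is a minimal sketch that runs one against a few sample inputs. It assumes every optional piece is marked with ? and each separator is written as the alternation ( |-). The function name isFlexiblePhoneNumber and the sample numbers are my own illustrations.

```javascript
// Hypothetical helper: the name and the sample inputs are illustrative only.
// Assumes each optional piece carries "?" and the separator is ( |-).
const flexiblePhoneRE = /^(\()?(\d{3})?(\))?( |-)?(\d{3})( |-)?(\d{4})$/;

function isFlexiblePhoneNumber(phoneNo) {
  // test() returns a Boolean and leaves the string untouched
  return flexiblePhoneRE.test(phoneNo);
}

const phoneSamples = ["(800) 555-1212", "800 555 1212", "5551212", "555-121"];
for (const sample of phoneSamples) {
  console.log(sample + " -> " + isFlexiblePhoneNumber(sample));
}
```

Because each parenthesis is optional on its own, this sketch also accepts an unbalanced input such as "(800 555-1212". Requiring matched parentheses would take a stricter mask.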

Regular expressions have a number of special characters in them to
control how the mask works. The caret (^), for instance, if at the
beginning of the mask, says to match the following mask from the
beginning of the string. The dollar sign ($) says to match the
preceding mask at the end of the string. The escape-d, written with
the backslash (\) and the lowercase letter d, says to match a digit.
The vertical bar symbol (|) is the regular expression Boolean "or"
character. The caret (^), when it is the first character inside
square brackets, is the Boolean "not" character: [^0-9] matches any
character that is not a digit. The forward slash (/) is the commonly
used delimiter for the complete mask. In tools such as Perl and sed,
it can be replaced with another character if necessary -- say, for
instance, if you are validating a URL, which itself contains forward
slashes.
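A few one-line masks make these control characters easier to see in action. The following is a small sketch with made-up sample strings. Note that a JavaScript regular expression literal always uses the slash delimiter, so the last line uses the RegExp constructor to avoid escaping the slashes in a URL.

```javascript
// Illustrative masks only; the sample strings are invented for this sketch.
const startsWithDigit = /^\d/;      // ^ anchors the match at the start
const endsWithDigit = /\d$/;        // $ anchors the match at the end
const catOrDog = /^(cat|dog)$/;     // | means "or" between alternatives
const noDigits = /^[^\d]+$/;        // ^ inside [...] negates the class
const urlStart = new RegExp("^https?://"); // no slash delimiter to escape

console.log(startsWithDigit.test("4 score"));        // true
console.log(endsWithDigit.test("route 66"));         // true
console.log(catOrDog.test("dog"));                   // true
console.log(catOrDog.test("dogs"));                  // false
console.log(noDigits.test("abc1"));                  // false
console.log(urlStart.test("http://example.com/"));   // true
```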

If this is your first exposure to regular expressions, don't be
concerned if I just lost you. Just be aware that regular expressions
are cryptic yet powerful. You could do the same checks with code, but
your code would become lengthy and far more error prone than regular
expressions.

Once I became accustomed to the use of regular expressions, I wanted
a way of globally replacing Java code in all source files of my app.
That's where the Unix sed utility comes into play. The sed utility
takes an input file and runs all its text through a regular
expression. What I do is write a quick shell script that runs files
in a directory (or directories) recursively through sed. I once wrote
a ten-line shell script to convert a client's JavaServer Pages from
the syntax of JSP 0.91 to JSP 1.1.
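The substitute command in sed has a close cousin in JavaScript's String.replace used with the g flag. The sketch below applies one global substitution to a multi-line string. The rule (renaming a hypothetical oldLogger.log call to logger.info) and the sample text are invented for illustration; it does not reproduce the JSP conversion script.

```javascript
// Hypothetical rule: rename calls like oldLogger.log(...) to logger.info(...).
function replaceAllCalls(sourceText) {
  // The g flag replaces every occurrence, much like the trailing g
  // in sed's s/old/new/g
  return sourceText.replace(/\boldLogger\.log\(/g, "logger.info(");
}

const before = "oldLogger.log(a);\noldLogger.log(b);";
console.log(replaceAllCalls(before));
```

Without the g flag, only the first occurrence in the string would be replaced.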

An alternative to sed is Perl. The Perl programming language is
probably 80-to-90 percent regular expressions. And now, with the EPIC
Perl plug-in, you can develop and run your Perl scripts in Eclipse.

I've been using regular expressions in JavaScript for a while, but
with the advent of Struts 1.1, I now use them in my server-side Java
Web applications. Struts 1.1 added the ability to use declarative
edits for HTML input fields. The declarations are placed in an XML
file called validator.xml. The following is a validator.xml snippet
that declares edits for the input form called visit:

<form name="visit">
    <field property="sendToCopy" depends="required,mask">
        <arg0 key="form.visit.sendToCopy"/>
        <var>
            <var-name>mask</var-name>
            <var-value>^[a-zA-Z]*$</var-value>
        </var>
    </field>
    <field property="contactPhone" depends="required,mask">
        <arg0 key="form.visit.contactPhone"/>
        <var>
            <var-name>mask</var-name>
            <var-value>^(\()?(\d{3})?(\))?( |-)?(\d{3})( |-)?(\d{4})$</var-value>
        </var>
    </field>
    <field property="contactEmail" depends="mask">
        <arg0 key="form.visit.contactEmail"/>
        <var>
            <var-name>mask</var-name>
            <var-value>^.+@.+\..{2,3}$</var-value>
        </var>
    </field>
</form>

Note that Struts will automatically edit the qualified fields on the
server. Struts can also add JavaScript code to the JSP input form
that performs the same regular expression edits that are performed on
the server via Java. Selecting that client-side edit option improves
performance because invalid input is caught without a round trip to
the server. And you don't even have to write the JavaScript code.
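A client-side edit of this kind boils down to testing each field's value against its mask. The following is a hand-written sketch of that idea, not the JavaScript that Struts generates; the function name matchesMask and the sample values are hypothetical. It uses the letters-only and e-mail masks from the snippet above.

```javascript
// Hand-written sketch; this is NOT the script Struts generates.
// matchesMask and the sample values are hypothetical.
function matchesMask(mask, value) {
  // Build the expression from a string, the way a mask read from
  // configuration would arrive
  return new RegExp(mask).test(value);
}

const lettersOnly = "^[a-zA-Z]*$";
const emailMask = "^.+@.+\\..{2,3}$"; // backslash doubled inside a string literal

console.log(matchesMask(lettersOnly, "Smith"));              // true
console.log(matchesMask(lettersOnly, "Smith2"));             // false
console.log(matchesMask(emailMask, "someone@example.com"));  // true
console.log(matchesMask(emailMask, "someone@example.info")); // false
```

The last line shows a limit of the e-mail mask: it allows only two or three characters after the final dot, so longer top-level domains are rejected.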

As I said earlier, regular expressions are directly supported in
JDK 1.4. But don't wait until you are using JDK 1.4. You can use
Jakarta's ORO package today with JDK 1.2 and 1.3. Jakarta's ORO
package provides a dozen or so different mechanisms for running
regular expressions (while the Java 1.4 API has only
java.util.regex.Matcher and java.util.regex.Pattern). My favorite ORO
API is the org.apache.oro.text.perl.Perl5Util class. As its name
suggests, the Perl5Util class adds Perl-like behavior on top of the
org.apache.oro.text.regex package. The following is a sample utility
class whose static methods convert 8-digit BigDecimal date values
(yyyyMMdd) to java.sql.Date objects:

package com.vpia.utils;

import java.math.BigDecimal;
import java.sql.Date;
import org.apache.oro.text.perl.Perl5Util;

public class ObjectConverter {

    // other converter methods omitted

    // Converts an 8-digit yyyyMMdd string, such as "20060711", to a Date
    public static Date toDate(String obj) {
        return ObjectConverter.toDate(new BigDecimal(obj));
    }

    // Converts an 8-digit yyyyMMdd number to a Date
    public static Date toDate(BigDecimal obj) {
        String str = Integer.toString(obj.intValue());
        Perl5Util util = new Perl5Util();
        // Insert hyphens to get yyyy-mm-dd, the format Date.valueOf expects
        str = util.substitute("s/([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])/$1-$2-$3/", str);
        return Date.valueOf(str);
    }
}
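The Perl-style substitution at the heart of toDate can be tried without ORO. This sketch performs the same yyyyMMdd to yyyy-mm-dd rewrite with JavaScript's String.replace. The function name toIsoDateString is hypothetical, the mask is anchored with ^ and $ (an addition of mine), and unlike the Java version it returns a string rather than a java.sql.Date.

```javascript
// Hypothetical helper mirroring the substitution used in toDate.
function toIsoDateString(eightDigits) {
  // Three capture groups: year, month, day; $1..$3 refer back to them
  return String(eightDigits).replace(/^(\d{4})(\d{2})(\d{2})$/, "$1-$2-$3");
}

console.log(toIsoDateString(20060711)); // 2006-07-11
```

If the input is not exactly eight digits, the mask does not match and the string comes back unchanged, so a caller would still need to check the result.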

If you want to learn more about regular expressions, try the
following books. The first two have several chapters on regular
expressions, and the last is considered to be the definitive guide to
regular expressions.

  • "JavaScript: The Definitive Guide," 4th Edition by David Flanagan,
    O'Reilly
  • "Learning Perl," 3rd Edition by Randal Schwartz and Tom Phoenix,
    O'Reilly
  • "Mastering Regular Expressions," 2nd Edition by Jeffrey Friedl,
    O'Reilly
