Background and General Information

Updated and Reviewed March 2011

There are various implementations and flavors of "regular expressions", including POSIX and Perl. We've chosen the Perl flavor because it is generally considered better (more features, more predictable, faster) and seems to be more popular, with Java, Python and .NET, for example, using similar or derived implementations. The specific implementation we are using is called PCRE (Perl Compatible Regular Expressions); see the PCRE website for more details.

Under Windows, it is housed in PCRE3.DLL, which is now included with the standard A-Shell distribution and loaded dynamically the first time you access a regular expression function. For the dynamically linked versions of A-Shell/Linux, there are RPMs named PCRE-#.##, where #.## is the current version. Most likely you have one installed already, but if not, you can easily locate and install the current version. The "generic" (statically linked) version of A-Shell/Linux includes the library within the executable module. As of this writing, the AIX version is not yet ready, but it will include the library within the A-Shell executable, as the generic Linux version does.

For details on the syntax of regular expressions, see the Perl Regular Expression documentation or any of the many websites that offer tutorials and examples.

The two most common uses of regular expressions are to extend the power and flexibility of string searches, and to check for valid syntax in a string. Another possible use is for parsing and extracting specific portions of strings, using the capture group mechanism to return the subexpression matches of interest.
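
The three uses just described can be sketched as follows. This is only an illustration in Python, whose `re` module is a Perl-derived flavor; it is not A-Shell code and does not show the REGEX.SBR or INSTR() calling conventions. The sample text, patterns and the helper name `looks_like_date` are invented for the example.

```python
import re

text = "Order 1234 shipped on 2011-03-15 to warehouse B7"

# 1. Flexible search: find the first run of four digits, wherever it occurs.
found = re.search(r"\d{4}", text)
first_number = found.group(0) if found else None

# 2. Validation: does the ENTIRE string have the shape of a yyyy-mm-dd date?
def looks_like_date(s):
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) is not None

# 3. Extraction: each parenthesized capture group returns one piece of interest.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
year, month, day = m.groups() if m else (None, None, None)
```

Note that the validation case checks only the shape of the string; it would accept an impossible date such as month 13.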

The main downside of regular expressions is that they are rather cryptic, and they can become so complex that they consume massive computing resources (although that is generally not an issue in common usage). As an example, here is a simple (incomplete) regular expression to match a valid email address, word-delimited within a larger string. Note that, as written, it matches only uppercase letters unless case-insensitive matching is in effect:

"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"

A more complete expression for validating email addresses (based on RFC 2822) is:

"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|""(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

While it may be doubtful that many users would be able to conjure up such an expression during an ad-hoc search, application developers may build up a collection of useful regular expressions (often you can just copy them from helpful websites devoted to the subject) that can be used for common search or validation purposes.
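
As one possible way to keep such a pattern on hand, the simple email expression shown earlier can be wrapped in a small named helper. The sketch below uses Python's `re` module purely as a stand-in for PCRE; the names `EMAIL` and `find_emails` are hypothetical, and the case-insensitive flag is added because the pattern itself lists only uppercase letters.

```python
import re

# The simple (incomplete) email pattern from the text, compiled once and
# stored under a descriptive name. IGNORECASE is an addition for this sketch.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

def find_emails(text):
    """Return every substring of text that the simple pattern accepts."""
    return EMAIL.findall(text)

sample = "Contact jack@example.com or SALES@EXAMPLE.ORG for details."
emails = find_emails(sample)
```

The "incomplete" label is easy to demonstrate: because the final part allows only two to four letters, an address ending in a longer top-level domain, such as user@example.museum, is not matched at all.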

Regular expression processing internally consists of two separate operations:

• compiling the expression (checking for syntax errors)

• using the expression to match against subject string(s).
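
The two operations can be seen separately in most regex libraries. The following minimal sketch uses Python's `re` module as an analogy only (the helper names `compile_pattern` and `count_matches` are hypothetical, and A-Shell's own interface is not shown): a syntax error surfaces at compile time, and a successfully compiled pattern is then applied to any number of subject strings.

```python
import re

def compile_pattern(pattern):
    """Operation 1: compile; return (compiled, None) or (None, error text)."""
    try:
        return re.compile(pattern), None
    except re.error as exc:
        return None, str(exc)

def count_matches(compiled, subjects):
    """Operation 2: apply one compiled pattern to many subject strings."""
    return sum(1 for s in subjects if compiled.search(s))

good, good_err = compile_pattern(r"\berror\b")
bad, bad_err = compile_pattern(r"(unclosed")   # missing ")" is a syntax error

hits = count_matches(good, ["no error here", "errors", "fatal error"])
```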

Because compiling the expression can be as CPU-intensive as matching it against subject strings, both REGEX.SBR and INSTR() support means of reusing previously compiled expressions. In the default case, if the current pattern matches the previously used one, then the previous compilation will be used automatically. This strategy works well when applying a single expression repetitively against many subject strings (as when searching for a pattern in a text file). But it doesn't work so well if you are searching through a file or database and comparing each line/record against more than one regular expression, because each switch from one pattern to another forces a recompilation. To maximize efficiency in such cases, as well as for cases where you have a collection of common patterns used throughout your application, you can precompile and store up to 20 patterns, which can then be used on demand without having to recompile them.
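
To make the trade-off concrete, here is a rough model of the two reuse strategies, written in Python rather than A-Shell BASIC. It is not the actual REGEX.SBR implementation: the class `PatternCache`, its methods and the slot numbering are all assumptions made for illustration, and only the limit of 20 comes from the text above.

```python
import re

MAX_SLOTS = 20  # the text states that up to 20 patterns can be precompiled

class PatternCache:
    """Hypothetical model: last-pattern reuse plus numbered precompiled slots."""

    def __init__(self):
        self._last_text = None
        self._last_compiled = None
        self._slots = {}
        self.compile_count = 0   # counts explicit compile requests made here

    def _compile(self, pattern):
        self.compile_count += 1
        return re.compile(pattern)

    def match(self, pattern, subject):
        # Default case: recompile only if the pattern text differs from the last one.
        if pattern != self._last_text:
            self._last_compiled = self._compile(pattern)
            self._last_text = pattern
        return self._last_compiled.search(subject) is not None

    def precompile(self, slot, pattern):
        if not 1 <= slot <= MAX_SLOTS:
            raise ValueError("slot must be between 1 and %d" % MAX_SLOTS)
        self._slots[slot] = self._compile(pattern)

    def match_slot(self, slot, subject):
        return self._slots[slot].search(subject) is not None

lines = ["alpha 1", "beta", "gamma 22"]

one = PatternCache()                       # one pattern, many subjects
single_hits = [one.match(r"\d+", ln) for ln in lines]

two = PatternCache()                       # two patterns alternating per line
for ln in lines:
    two.match(r"\d+", ln)
    two.match(r"^b", ln)

pre = PatternCache()                       # same two patterns, precompiled
pre.precompile(1, r"\d+")
pre.precompile(2, r"^b")
slot_hits = [(pre.match_slot(1, ln), pre.match_slot(2, ln)) for ln in lines]
```

In this model the single-pattern scan compiles once, the alternating scan compiles on every call (six times for three lines), and the precompiled version compiles each pattern once no matter how many lines are processed.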

Note: the original implementation of REGEX.SBR in 5.1.1100 treated a null pattern string as referring to the previously compiled pattern. This mechanism has been dropped, since it is somewhat confusing to implement at the application level, and also introduces the problem of having to specifically check for null patterns. The new implementation just compares the current pattern to the last one (for non-precompiled patterns) to determine when recompilation can be avoided. Null patterns return 0 (failed match) in all cases.

See the sample programs in [908,46] of the EXLIB.

See Substring search right-to-left on the A-Shell Forum for an example of using REGEX to split a path spec into the directory and filename.
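
For readers without access to the forum thread, the general idea of splitting a path spec with capture groups can be sketched as below. This is not the forum's code and not A-Shell BASIC; it is a Python illustration with a hypothetical pattern and helper name, and it assumes only "/" and "\" act as separators. The greedy first group runs to the last separator, which gives the effect of a right-to-left search.

```python
import re

# Group 1: everything up to and including the LAST separator (optional).
# Group 2: the remaining characters, which contain no separator.
PATH_SPLIT = re.compile(r"^(.*[\\/])?([^\\/]*)$")

def split_path(spec):
    """Return (directory, filename); directory keeps its trailing separator."""
    m = PATH_SPLIT.match(spec)
    if m is None:            # e.g. a spec containing embedded newlines
        return "", spec
    return m.group(1) or "", m.group(2)

unix_parts = split_path("/usr/local/bin/ashell")
windows_parts = split_path(r"C:\VM\MIAME\ashell.exe")
bare_parts = split_path("readme.txt")
```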