SQL | Using regular expressions to carry out pattern matching
A. The Greatest Thing Since Sliced Bread
Anyone familiar with regular expressions will no doubt be aware of the incredible power they offer in regards to carrying out text searches. If you’re not familiar with regular expressions (or ‘regex’ as they’re often abbreviated to), you’ll hopefully be sold by the end of this post, looking to use them whenever you can.
Regular expressions provide another way to search for a pattern of text within a string. So in this respect, they have some similarity with an SQL ‘like’:
select country_name from country where country_name like '%land';
However, because regular expressions come with their own unique ‘language’ for carrying out text searches, they offer a much richer range of options. They also execute faster than the equivalent ‘like’ search. We won’t get into the full details of regular expressions as part of this post, apart from those constructs that relate to the Oracle SQL functions. And even what’s mentioned in this post is only a small subset of what’s available. Carry out a Google search for ‘oracle sql regular expressions’ for more examples.
B. REGEXP_LIKE
If you want to do a search of a text field, the ‘regexp_like’ command works in a similar way to the regular ‘like’ command, returning either True or False depending on whether a match was found. The basic version of ‘regexp_like’ works like this:
regexp_like(field_to_search, 'search string', 'optional flags')
Here’s a simple example, searching for all countries containing the word ‘land’:
select country_name from country where regexp_like(country_name, 'land');
The search is automatically looking for a string of characters within a larger block of text, so there is no need to include search parameters such as ‘*’. Pattern matching is implied by the use of the command.
The regular expression search can be extended by taking advantage of a wide range of special characters. Here are some common examples below:
. – match any single character (including a space)
^ – search for text at the start of the string
$ – search for text at the end of the string
(abc|xyz) – search for either ‘abc’ or ‘xyz’
[:upper:] – search for an upper case character only
[:alpha:] – search for an alphabetic character only
[:digit:] – search for a numeric character only
* – match the preceding string 0 or more times
+ – match the preceding string 1 or more times
{n} – match the preceding string n times exactly (curly brackets)
\ – treat the next character as a normal character (not a special character)
The ‘optional flags’ of ‘regexp_like’ take the following values:
i – ignore case when doing the search
c – perform a case sensitive search
So putting all that together, here are five examples of using ‘regexp_like’:
(a) Search for all degree descriptions that contain the word ‘cert’, irrespective of case:
select degree from degrees where regexp_like(degree, 'cert', 'i');
(b) Search for all phone numbers that start with a UK country code (44).
select phonenum from phone where regexp_like(phonenum, '^\+44');
In this case, we have to add a ‘\’ character before the ‘+’, since we want the ‘+’ to be interpreted as a normal character. Recall that in our table above, the ‘+’ has a special meaning.
(c) Search for all names that start with ‘Mc’ or ‘Mac’, immediately followed by an uppercase character (in other words, we want to retrieve ‘MacDonald’, but not ‘Mackay’):
select first_name, last_name from students where regexp_like(last_name, '^(Mac|Mc)[:upper:]', 'c');
(d) Select all inventory item IDs that do not match our preferred format of 4 alpha characters followed by 2 numeric characters:
select item_id from inventory where not regexp_like(item_id, '[:alpha:]{4}[:digit:]{2}');
(e) Select all email addresses that do not contain at least one ‘@’ and one ‘.’ character somewhere in the middle of the string:
select email_addr from cust_email where not regexp_like(email_addr, '.*@.*\..*');
I’ll leave the exact interpretation of the final example as a challenge for the reader. (Step through it carefully, one character at a time. All the information is provided in this post.)
C. REGEXP_INSTR
The ‘regexp_instr’ function works in a similar way to ‘regexp_like’. Instead of returning a True/False value though, the string returns the position number of a match.
regexp_instr(field_to_search, 'search string', 'optional flags')
Consider the following example, based on our original country example:
select regexp_instr(country_name, 'land') from country where country_code = 'FIN';
This returns a value of ‘4’, indicating the position at which the string ‘land’ exists in ‘Finland’.
In the next example, we are checking for all phone numbers that contain a ‘space’, returning the position that the space exists. If there is more than one space, the position of the first space only will be returned:
select phonenum, regexp_instr(phonenum, ' ') from cust_phone where regexp_like(phonenum, ' ');
D. REGEXP_REPLACE
Our next command is used for performing a search and replace operation. The basic format of the command is as follows:
regexp_replace(field_to_search, 'search string', 'replacement string', 'optional flags')
The ‘regexp_replace’ function returns either two values: (1) if no match is found, the original string is returned without any changes, or (2) if a match is found, the string with the replacement pattern is returned.
So starting with our country table example again, if we wanted to display all countries and replace the word ‘land’ with ‘****’, we could use the following:
select regexp_replace(country_name, 'land','****') from country;
In the next example, we want to ‘mask’ all numeric digits in a phone number, while leaving the spaces and the punctuation intact:
select regexp_replace(phonenum, '[[:digit:]]', 'x') from cust_phone;
When specifying the replacement string, you can use the ‘\1’ construct to indicate that the search string should re-appear as part of the replacement text. This only works if the search string has been enclosed in normal brackets. In the next example, we are searching for either ‘error’ or ‘warning’ in a line of text (this already uses brackets, so there is no need to specify another set). Then as part of the replacement text, we are adding three asterisks at either side of the word ‘error’ or ‘warning’. So ‘error’ will now appear as ‘*** error ***’:
select regexp_replace(message_line, '(error|warning)', '*** \1 ***') from logs;
E. REGEXP_SUBSTR
The final command is the regular expression equivalent of the Oracle ‘substr’ (substring) command. Similar to the other commands, ‘regexp_substr’ searches for a match based on the passed string. If a match is found, a string is returned, but only of the matched portion. So if we searched for ‘^.{3}’ in ‘Finland’, ‘regexp_substr’ would give us ‘Fin’ as a response (the first three characters at the start of the string). If a match is not found, then a null value is returned.
regexp_substr(field_to_search, 'search string', 'optional flags')
In the example below, the SQL scans a table of file names, and displays the extension only of each file. The ‘where’ clause ensures that only files with extensions are selected.
select regexp_substr(filename,'\..*$') from file_tmp where regexp_substr(filename,'\..*$') is not null;
See Also:
Tip 051: Database-Wide Text Searches