A regular expression, or regex for short, is a string of characters that specifies a search pattern. It is a powerful tool used in a wide variety of applications, from search (and replace) and string validation to lexical analysis (the first step in a compiler, where source code is converted into a stream of tokens).
The class of languages described by regular expressions coincides with the class of languages recognized by finite-state automata, which means that for every regular expression there exists an automaton that recognizes exactly the strings it matches.
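To make the link with automata concrete, here is a minimal sketch in Python (chosen only for illustration). It hand-builds a tiny automaton for the pattern zz* (one z followed by zero or more z) and compares it with Python's re module. The names TRANSITIONS, ACCEPTING and dfa_accepts are made up for this example.

```python
import re

# Hypothetical hand-built automaton for the pattern zz*.
# State 0 is the start state; state 1 means "at least one z seen" and is accepting.
TRANSITIONS = {(0, "z"): 1, (1, "z"): 1}
ACCEPTING = {1}


def dfa_accepts(text: str) -> bool:
    """Return True if the automaton ends in an accepting state after reading text."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


# The automaton and the regex agree on every sample string.
for sample in ["", "z", "zzz", "zaz", "a"]:
    assert dfa_accepts(sample) == (re.fullmatch(r"zz*", sample) is not None)
```

Real regex engines build such automata automatically (or use backtracking instead); this sketch only shows that the two descriptions accept the same strings.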
Most programming languages support regex, either natively or through a standard or widely available library, including for example Python, C, C++, Java, JavaScript and Dart.
If you don’t have a cat, bad news, you’ll have to learn it 😀
Basics of regex
A single character is itself a regular expression that matches that character and only that character.
Let’s consider the following string: ‘I am excited to learn regular expressions!’
Using the regular expression r = ‘a’ will match every a in ‘I am excited to learn regular expressions!’: the ones in am, learn and regular.
We can use the alternation (or) operator | to match either r1 or r2: r = r1|r2
r = ‘am|to’ matches the substrings am and to in ‘I am excited to learn regular expressions!’
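Here is a small sketch of these two examples, assuming Python's built-in re module (any of the languages mentioned earlier would work similarly). The variable names are only for illustration.

```python
import re

text = "I am excited to learn regular expressions!"

# A single character matches every occurrence of itself.
a_matches = re.findall(r"a", text)
print(a_matches)  # ['a', 'a', 'a']

# The | operator matches either alternative.
am_or_to = re.findall(r"am|to", text)
print(am_or_to)  # ['am', 'to']
```

re.findall returns all non-overlapping matches from left to right, which is why am comes before to.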
Operators can be applied recursively to extend the regular expression.
Let’s look at what we can do with regex:
Regex | Matched set of strings |
hi | {hi} |
hi|hello | {hi, hello} |
zz* | a mandatory z followed by zero or more z: {z,zz,zzz,zzzz,zzzzz,…} |
(haha)+ | at least one occurrence of haha: {haha,hahahaha,hahahahahaha,…} |
analy(s|z)e | {analyse, analyze} |
analog(ue)? | ? means optional: {analog, analogue} |
o{2} | exactly two occurrences of letter o: {oo} |
[a-z] | matches any lowercase letter in the range from a to z: {a,b,c,d,e,f,g,…,z} |
^football | matches any string that begins with football |
football$ | matches any string that ends with football |
\d | matches any digit: {0,1,2,3,4,5,6,7,8,9} |
[0-9] | matches any digit: {0,1,2,3,4,5,6,7,8,9} |
. | matches any single character (by default, any character except a newline) |
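The rows of the table can be checked with a short sketch, again assuming Python's re module. re.fullmatch requires the whole string to match the pattern, while re.search looks for a match anywhere, which is what the ^ and $ anchors need. The dictionary name examples is made up for this illustration.

```python
import re

# Each pattern from the table, paired with strings it should match in full.
examples = {
    r"hi|hello": ["hi", "hello"],
    r"zz*": ["z", "zzzz"],
    r"(haha)+": ["haha", "hahahaha"],
    r"analy(s|z)e": ["analyse", "analyze"],
    r"analog(ue)?": ["analog", "analogue"],
    r"o{2}": ["oo"],
    r"[a-z]": ["q"],
    r"\d": ["7"],
    r"[0-9]": ["7"],
    r".": ["x"],
}

for pattern, strings in examples.items():
    for s in strings:
        assert re.fullmatch(pattern, s), (pattern, s)

# Anchors constrain where a match may start or end.
assert re.search(r"^football", "football season")
assert re.search(r"football$", "I love football")
assert not re.search(r"^football", "I love football")
```

If every assertion passes, the script finishes silently.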
Regex for validation
Let’s say we are building an application for a specific university or workplace, and we would like to allow users to register only with email addresses that belong to the domain of the university or company. The application also restricts which special characters may appear before the @: only - or _.
Allowed emails:
user@mycompany.com
firstname_lastname@mycompany.com
firstname-lastname@mycompany.com
The string we are trying to match here is:
[Any non-empty alphanumeric string that may also contain - or _]@mycompany.com
We can use the following regex: [A-Za-z0-9_-]+@mycompany\.com (we need \. to escape the dot, since an unescaped . has the special meaning of matching any character). For validation, the regex must match the whole string, not just a part of it.
There are three parts we can focus on here:
The character class: [A-Za-z0-9_-] matches a single character that is either A-Z or a-z or 0-9 or _ or -
The quantifier: + matches at least one occurrence of the preceding character class
followed by a fixed string @mycompany\.com
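As a minimal sketch of how this validation could look in code, here is a Python version using the re module. The names EMAIL_RE and is_allowed_email are hypothetical, and re.fullmatch is used so that the whole address has to match rather than just a part of it.

```python
import re

# The regex from the text; fullmatch() makes it cover the entire string.
EMAIL_RE = re.compile(r"[A-Za-z0-9_-]+@mycompany\.com")


def is_allowed_email(address: str) -> bool:
    """Return True only if the whole address matches the allowed pattern."""
    return EMAIL_RE.fullmatch(address) is not None


print(is_allowed_email("firstname_lastname@mycompany.com"))  # True
print(is_allowed_email("user@othercompany.com"))             # False
print(is_allowed_email("first.last@mycompany.com"))          # False (dot not allowed)
```

Using re.search instead would also accept strings such as user@mycompany.com.example.org, because the pattern would only need to appear somewhere inside the input.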
You can find and interact with the above regex example here.
Thanks for reading 🙂
Enjoy regexing!