INTRO TO REGULAR EXPRESSIONS
As a software developer, you’ve probably encountered regular expressions several times and been confused by a daunting set of characters grouped together like this:
/^\w+([\.-]?\w)+@\w+([\.]?\w)+(\.[a-zA-Z]{2,3})+$/
And you may have wondered what this gibberish means...
Regular expressions (Regex or Regexp) are extremely useful in stepping up your algorithm game and will make you a better problem solver. The structure of regular expressions can be intimidating at first, but it is so rewarding once you grasp the patterns and implement them in your work properly.
WHAT IS REGEX AND WHY IS IT IMPORTANT?
A Regex, or regular expression, is a type of object that is used to help you extract information from any string data by searching through text to find what you need. Whether it’s numbers, letters, punctuation, or even white space, Regex allows you to check and match any character combination in strings.
For example, let’s say you needed to match the format of a social security number or email address. You can utilize Regex to check for patterns in the text strings and use it to replace or validate another substring. Think of Regex as your own search bar - it gives you the freedom to define your own search criteria for a pattern that fits your needs and assists you in finding what you were looking for.
TWO WAYS TO CREATE A REGULAR EXPRESSION
1. To create a regular expression literal, enclose the Regex pattern between forward slashes (/).
Syntax:
/pattern/flags
2. To use the RegExp constructor, call it with the pattern as a string (or another regular expression); it builds the expression for you at runtime.
Syntax:
new RegExp(pattern[, flags])
Rule of thumb:
If your regular expression is constant and does not change its value, you should use the regex literal for better performance. If the pattern is dynamic (for example, built from a variable or user input rather than written out as a fixed literal), it is best to use the regex constructor.
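A minimal sketch of the two forms, to make the rule of thumb concrete. The variable names and sample strings are assumptions made up for illustration. Note that a backslash has to be doubled inside a constructor string.

```javascript
// Literal: the pattern is fixed when the code is written.
const literalRegex = /cat/;

// Constructor: the pattern is built at runtime from a string.
// `searchTerm` is a hypothetical value standing in for dynamic input.
const searchTerm = "dog";
const dynamicRegex = new RegExp(searchTerm);

console.log(literalRegex.test("my cat")); // true
console.log(dynamicRegex.test("my cat")); // false
console.log(dynamicRegex.test("my dog")); // true

// Inside a string the backslash must be escaped, so these two are equivalent.
const digitsLiteral = /\d+/;
const digitsFromString = new RegExp("\\d+");
console.log(digitsLiteral.source === digitsFromString.source); // true
```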
REGULAR EXPRESSION METHODS
There are three common methods for working with Regex that you should be familiar with: test, match, and replace. Note that test is called on the regular expression, while match and replace are called on the string.
- The .test method returns a boolean indicating whether the string contains a match for the search pattern.
- Instead of using RegExp.test(String), which only returns a boolean, you can use the String.match(RegExp) method. Though it’s great to have the test method check whether the pattern matches, there will be times when we want the matched text itself. That’s where the match method comes in handy! It returns an array containing the matched string (or null if nothing matches), which can be helpful information depending on your use case.
Here is a very basic example below. Later on, you will see how match can be a powerful tool when combining the Regex with flags.
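A minimal sketch of test and match side by side. The sentence and the patterns are assumptions chosen for illustration, not the article's original example.

```javascript
// Hypothetical sample text.
const sentence = "I have 2 cats";

// test is called on the regex and returns only true or false.
const hasCats = /cats/.test(sentence);
const hasDogs = /dogs/.test(sentence);
console.log(hasCats); // true
console.log(hasDogs); // false

// match is called on the string and returns an array describing the match.
const catsMatch = sentence.match(/cats/);
console.log(catsMatch[0]);    // 'cats'
console.log(catsMatch.index); // 9 (position where the match starts)

// When nothing matches, match returns null rather than an empty array.
const dogsMatch = sentence.match(/dogs/);
console.log(dogsMatch); // null
```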
- The .replace method searches a string for a specified value (or regular expression) and returns a new string where the specified value is replaced.
NOTE:
With a plain string as the search value, .replace only replaces the first occurrence. To replace every occurrence, use a Regex with the g flag. The example below uses a plain string value.
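A minimal sketch of the difference described in the note, first with a plain string and then with a Regex. The sample string is an assumption, and the g flag used here is explained in the FLAGS section.

```javascript
// Hypothetical sample text.
const original = "red fish, red boat";

// A plain string search value replaces only the first occurrence.
const withString = original.replace("red", "blue");
console.log(withString); // 'blue fish, red boat'

// A regex with the g flag replaces every occurrence.
const withRegex = original.replace(/red/g, "blue");
console.log(withRegex); // 'blue fish, blue boat'

// replace returns a new string; the original is left unchanged.
console.log(original); // 'red fish, red boat'
```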
BRACKET EXPRESSIONS
Inside a bracket expression, you list the characters (or ranges of characters) that are allowed at that position in the match. This is called a character set.
For example, const regex = /[A-Z]/. Notice that A-Z is inside the square brackets, so this will match any single uppercase letter of the alphabet.
- [a-z] matches any single lowercase letter of the alphabet
- [A-Z] matches any single uppercase letter of the alphabet
- [abcd] matches any one of the characters a, b, c, or d
- [a-d] exactly the same as the previous example, so you can either list each character or group them as a range
- [a-gA-C0-7] matches any one of the lowercase letters a-g, the uppercase letters A-C, or the numbers 0-7
- [^a-zA-Z] matches any single character that is NOT a lowercase or uppercase letter
*Inside a character set, a leading ^ character negates the set: it matches any character that is NOT in a-z or A-Z.
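The character sets above can be tried out with a short sketch like the following. The sample string is an assumption made up for illustration.

```javascript
// Hypothetical sample text: one uppercase letter, one lowercase letter,
// a space, a digit, and a punctuation mark.
const sample = "Hi 5!";

const hasUpper = /[A-Z]/.test(sample);           // 'H' is in A-Z
const hasLower = /[a-z]/.test(sample);           // 'i' is in a-z
const hasAtoD = /[a-d]/.test(sample);            // no a, b, c, or d present
const hasMixed = /[a-gA-C0-7]/.test(sample);     // '5' is in 0-7
const firstNonLetter = sample.match(/[^a-zA-Z]/); // first character outside a-z and A-Z

console.log(hasUpper);          // true
console.log(hasLower);          // true
console.log(hasAtoD);           // false
console.log(hasMixed);          // true
console.log(firstNonLetter[0]); // ' ' (the space at index 2)
```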
FLAGS
Flags go after the closing slash of the Regex, and you can use a single flag or combine several. Flags change how the pattern is searched for, for example whether matching stops at the first match.
Before we go into the specific flags, you should keep in mind that flags are optional, like in the example below:
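A minimal sketch of a Regex with no flags. The sentence is an assumption (the original example is not available); it was chosen so that its first uppercase letter is T.

```javascript
// Hypothetical sentence; any text starting with "Th" behaves the same way.
const sentence = "The Quick Brown Fox";

// No flags after the closing slash.
const regex = /[A-Z]/;

const found = sentence.match(regex);
console.log(found[0]);     // 'T'
console.log(found.length); // 1 (only the first match is returned)
```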
Without flags, match returns an array containing only the first substring that matches the pattern between the slashes. So in this case, our code will return ['T'] because it found the first uppercase letter in the sentence.
- The g flag stands for global, which means the search continues through the entire string. In other words, it will not stop after the first match, but will return ALL the occurrences that matched.
If we added the g flag after the closing slash, it would return all the uppercase characters in the sentence.
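A minimal sketch of the g flag in action, using an assumed sample sentence (not the article's original one).

```javascript
// Hypothetical sentence with four uppercase letters.
const sentence = "The Quick Brown Fox";

// Same pattern as before, now with the g (global) flag.
const globalRegex = /[A-Z]/g;

const allUppercase = sentence.match(globalRegex);
console.log(allUppercase); // [ 'T', 'Q', 'B', 'F' ]
```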
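Let’s say we changed the const variable to be const regex = /[a-z]/m. The m flag stands for multiline: it makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole string. It does not change how [a-z] matches, so without the g flag this still returns only the first lowercase letter from a-z: ['h'].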
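To see what the m flag changes, here is a minimal sketch that combines it with the ^ anchor on a two-line string. Both sample strings are assumptions made up for illustration.

```javascript
// Hypothetical two-line text.
const twoLines = "first line\nsecond line";

// Without m, ^ only matches at the very start of the string.
const withoutM = twoLines.match(/^[a-z]+/g);
console.log(withoutM); // [ 'first' ]

// With m, ^ also matches at the start of every line.
const withM = twoLines.match(/^[a-z]+/gm);
console.log(withM); // [ 'first', 'second' ]

// On a pattern with no ^ or $, the m flag makes no difference.
const firstLower = "The Quick Brown Fox".match(/[a-z]/m);
console.log(firstLower[0]); // 'h'
```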
As an additional sidenote, there are three shorthand character classes that can help when you need to match common character sets: \d (any digit, same as [0-9]), \w (any word character, same as [a-zA-Z0-9_]), and \s (any whitespace character).
The negations of \d, \w, and \s are \D, \W, and \S. They match the following:
- \D matches any non digit character (same as [^0-9])
- \W matches any non word character (same as [^a-zA-Z0-9_])
- \S matches any non whitespace character
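A minimal sketch comparing each shorthand class with its negation on the same input. The sample string is an assumption made up for illustration.

```javascript
// Hypothetical sample text: letters, digits, and one space.
const sample = "a1 b2";

const digits = sample.match(/\d/g);        // digit characters
const nonDigits = sample.match(/\D/g);     // everything that is not a digit
const wordChars = sample.match(/\w/g);     // letters, digits, underscore
const nonWordChars = sample.match(/\W/g);  // everything else
const spaces = sample.match(/\s/g);        // whitespace
const nonSpaces = sample.match(/\S/g);     // everything that is not whitespace

console.log(digits);       // [ '1', '2' ]
console.log(nonDigits);    // [ 'a', ' ', 'b' ]
console.log(wordChars);    // [ 'a', '1', 'b', '2' ]
console.log(nonWordChars); // [ ' ' ]
console.log(spaces);       // [ ' ' ]
console.log(nonSpaces);    // [ 'a', '1', 'b', '2' ]
```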
QUANTIFIERS
Quantifiers are symbols in regular expressions that specify how many times the preceding item may repeat. The list below also includes the anchors ^ and $ and the wildcard ., which are not quantifiers but are commonly used alongside them.
- * matches previous item zero or more times
- + matches previous item one or more times
- ? matches previous item zero or one times; makes preceding item optional
- ^ matches the beginning of the string
- $ matches the end of the string
- . matches any single character (except line breaks)
- {min,max} matches previous item at least min and at most max times, where min is 0 or a positive integer and max is an integer equal to or greater than min (write it without a space after the comma)
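The symbols in the list above can be checked one at a time with a sketch like this. The patterns and test words are assumptions chosen for illustration.

```javascript
// ? makes the preceding item optional: the "u" may or may not be there.
const optionalU = /colou?r/;
console.log(optionalU.test("color"));  // true
console.log(optionalU.test("colour")); // true

// + needs at least one "o"; * accepts zero or more.
const onePlus = /go+d/;
const zeroPlus = /go*d/;
console.log(onePlus.test("gd"));   // false
console.log(onePlus.test("good")); // true
console.log(zeroPlus.test("gd"));  // true

// ^ and $ anchor the pattern to the whole string; {2,4} allows 2 to 4 digits.
const twoToFourDigits = /^\d{2,4}$/;
console.log(twoToFourDigits.test("1"));     // false
console.log(twoToFourDigits.test("123"));   // true
console.log(twoToFourDigits.test("12345")); // false

// . stands for exactly one character (other than a line break).
const anyMiddle = /^c.t$/;
console.log(anyMiddle.test("cat"));  // true
console.log(anyMiddle.test("cart")); // false
```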
Let’s go through this example to demonstrate our understanding of quantifiers.
Consider a regular expression such as /[a-z]+/g. It matches lowercase letters from a-z and uses the + symbol to match one or more of them in a row, so each run of letters is returned as a single match. So when you console log found, it will return [ 'for', 'if', 'rof', 'fi' ].
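A minimal sketch that reproduces this result. The input string is an assumption (the original example is not available); any text made of those four words separated by non-letters would behave the same way.

```javascript
// Hypothetical input: four lowercase words separated by spaces.
const text = "for if rof fi";

// + groups consecutive lowercase letters into one match; g returns all matches.
const regex = /[a-z]+/g;

const found = text.match(regex);
console.log(found); // [ 'for', 'if', 'rof', 'fi' ]
```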
Let’s say that + symbol was not there and the Regex was only /[a-z]/g.
Then each lowercase letter is matched on its own, and it will return [ 'f', 'o', 'r', 'i', 'f', 'r', 'o', 'f', 'f', 'i' ].
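The same check without the + quantifier, again as a sketch on an assumed input string:

```javascript
// Hypothetical input: four lowercase words separated by spaces.
const text = "for if rof fi";

// Without +, every lowercase letter is a separate match.
const singleLetters = text.match(/[a-z]/g);
console.log(singleLetters);
// [ 'f', 'o', 'r', 'i', 'f', 'r', 'o', 'f', 'f', 'i' ]
```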
PUTTING IT ALL TOGETHER
Remember this long string of characters we saw at the beginning of this article?
/^\w+([\.-]?\w)+@\w+([\.]?\w)+(\.[a-zA-Z]{2,3})+$/
This is actually a very common use case where the regular expression is applied for email address formatting. Now that we have learned the basic methods and terminologies used in Regex, let’s break down this once daunting but now understandable string of characters one step at a time.
- First, let’s take a look at this Regex piece by piece. At the beginning, we have ^\w+. The ^ character anchors the match to the beginning of the string, and \w is a character class (not a flag) that matches an alphanumeric or underscore character. The + quantifier requires one or more of those characters. From our example, this first piece covers the ‘student’ characters from the email: student-id@alumni.school.edu
- Next, the second piece of the Regex is ([\.-]?\w)+. The parentheses form the first capturing group. Inside, the character set [\.-] matches either a “.” character or a “-” character in our email. The ? quantifier matches zero or one of the preceding item, so at most one “-” or “.” may appear before each character matched by \w. That is why there cannot be more than one of those characters consecutively in a valid email. The + after the group repeats it one or more times. So this second piece represents the ‘-id’ characters from the email example. If it was ‘student--id@alumni.school.edu’ with two hyphens, this would come out to be an invalid email.
- The third piece is @\w+, which checks for the @ character followed by one or more alphanumeric or underscore characters (\w with the + quantifier). This covers the ‘@alumni’ piece of the email.
- The following piece, ([\.]?\w)+, is the same search pattern as our second piece except that it only allows an optional “.” character before each \w character, excluding our “-” symbol. This represents “.school” in the email.
- The next chunk, (\.[a-zA-Z]{2,3})+, is a crucial piece in checking an email format. This piece is for the top-level domain (TLD) of an email address. It’s the part of a domain that comes after the last dot, for example com, org, or net. This Regex matches a “.” character followed by a character set of lowercase and uppercase letters. The {2,3} quantifier repeats that character set between 2 and 3 times, where 2 is the minimum number of matches and 3 is the maximum. So the TLD can only be 2 or 3 letters long. In this case, it is ‘.edu’.
- Finally, we have the $ character, which anchors the match to the end of the string, so nothing may follow the TLD.
And that’s it! Now we know how to use Regex for a basic email validation. Additionally, you can use brackets, flags, and/or quantifiers in your Regex to handle other edge cases not covered by our Regex string.
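Putting the pieces back together gives a sketch like the one below. The regex is assembled from the pieces discussed above; the function name isValidEmail and the extra test inputs are assumptions added for illustration. This is a basic format check only, not a complete email validator.

```javascript
// The six pieces from the walkthrough, joined in order.
const emailRegex = /^\w+([\.-]?\w)+@\w+([\.]?\w)+(\.[a-zA-Z]{2,3})+$/;

// Hypothetical helper: returns true when the text fits the pattern.
function isValidEmail(text) {
  return emailRegex.test(text);
}

console.log(isValidEmail("student-id@alumni.school.edu"));  // true
console.log(isValidEmail("student--id@alumni.school.edu")); // false (two hyphens in a row)
console.log(isValidEmail("student-id@alumni"));             // false (no top-level domain)
console.log(isValidEmail("not an email"));                  // false
```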
CONCLUSION
Regex is a very beneficial skill for developers to learn. It is most commonly used in situations where input validation is needed. Another use is when you need to parse through some text and extract certain information, such as a date in the yyyy-mm-dd format. You can use Regex to extract the year, month, and day from a known match. One last example is when developers need to match URLs or extract links from specific HTML pages; routes, for instance, can be defined using regular expressions under the hood. Regex is everywhere!
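As a minimal sketch of the date use case, capturing groups (the parentheses seen in the email example) can pull out the year, month, and day. The sample text and variable names are assumptions made up for illustration, and the pattern only checks the digit layout, not whether the date is real.

```javascript
// Hypothetical text containing a date in yyyy-mm-dd format.
const logLine = "Report generated on 2024-03-15 at noon";

// Three capturing groups: four digits, two digits, two digits.
const dateRegex = /(\d{4})-(\d{2})-(\d{2})/;

const dateMatch = logLine.match(dateRegex);

// Index 0 is the whole match; indexes 1 to 3 are the captured groups.
const [fullDate, year, month, day] = dateMatch;
console.log(fullDate); // '2024-03-15'
console.log(year);     // '2024'
console.log(month);    // '03'
console.log(day);      // '15'
```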
People can easily excuse themselves from knowing Regex because it seems difficult to understand. But it doesn’t have to be. You can see it as a gradual curve and start from the basics today.
Thanks for reading and I hope you all feel more comfortable using Regex in your algorithms!