
Regex for the dilettante

Regular expressions are pretty great. There’s just an early hump to get over where they seem confusing; after that it’s all good times. Let’s get you over that hump.

So, let’s find the letter ‘a’. Ready? Here’s the regex to find ‘a’: a

Yep. You’re peeking behind the curtain and seeing the wizard in all his glory. All regex does is search for characters, that’s it. Now you can find the letter ‘a’ in any string of text. Let’s use an example (nb: I’ll be using JavaScript for my examples, but regex is very similar in most languages, so you should be able to follow along just fine).

How about:
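Here’s a minimal sketch in JavaScript. The sample sentence and the variable names are made up for illustration. The regex for finding ‘a’ is just the character a itself; JavaScript wants regex literals wrapped in slashes, so it gets written as /a/. The built-in match() method hands back the first match when there’s no g flag.

```javascript
// A made-up sample sentence (an assumption, purely for illustration).
const sentence = "Regex is pretty rad.";

// In JavaScript a regex literal lives between two slashes.
const findA = /a/;

// Without the g flag, match() returns the first match, or null if none.
const firstA = sentence.match(findA);

console.log(firstA[0]);    // the text that matched: "a"
console.log(firstA.index); // its position in the string: 17
```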

Useful!

Ok, not useful. We need to ramp things up a bit. What if we want to find any character from a to z? That might be a bit more useful. Enter the square brackets: [ ], which in regex mean “anything inside of me counts as a search for a single character”. This means we can put [abcd] and it will find one character that is either a, b, c, or d. It’s still only finding one character; this is the bit that some people find tricky. Using our most recent example again:
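A quick sketch of the square brackets at work, again on a made-up sample sentence (the string and variable names are assumptions). The g flag is added in the second call only so we can see every hit, and each hit is still just one character.

```javascript
// A made-up sample sentence (an assumption, purely for illustration).
const sentence = "Regex is pretty rad.";

// [abcd] matches ONE character that is a, b, c, or d.
const firstHit = sentence.match(/[abcd]/);
console.log(firstHit[0]); // "a"

// With the g flag, match() returns every hit, one character apiece.
const allHits = sentence.match(/[abcd]/g);
console.log(allHits); // ["a", "d"]
```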

What if we wanted to find any alphabet character from a to z? I guess we could type out all 26 letters into the square brackets, but that doesn’t seem too cool. Instead we can use ranges like so: [a-z], which will find any one lower case letter from a to z. Notice that I specifically noted lower case. If we wanted to find a capital letter we’d use [A-Z]. If we wanted to find a number we’d use [0-9]. If we wanted to find all three, letters of both cases and numbers, we’d use: [a-zA-Z0-9]. Again this doesn’t seem so great, we’re starting to get into silly territory again, so there’s another shortcut to find “word characters”. In regex, word characters are things you’d find in words: a-z of both cases, numbers, and (a little surprisingly) the underscore. The shortcut, still for a single character, is: \w
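To see ranges and \w side by side, here’s a small sketch on a made-up sentence (the string and variable names are assumptions). One thing worth knowing: in JavaScript \w matches the underscore as well as letters and numbers, so it behaves like [a-zA-Z0-9_].

```javascript
// A made-up sample sentence (an assumption, purely for illustration).
const sentence = "Regex is pretty rad.";

// Each pattern matches a single character; g collects every hit.
const lower = sentence.match(/[a-z]/g);  // every lower case letter
const upper = sentence.match(/[A-Z]/g);  // every capital letter
const wordChars = sentence.match(/\w/g); // every word character

console.log(lower.length);     // 15
console.log(upper);            // ["R"]
console.log(wordChars.length); // 16 (spaces and the period are skipped)

// \w also accepts the underscore, but not a hyphen.
console.log(/\w/.test("_")); // true
console.log(/\w/.test("-")); // false
```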

Now we have something that might be useful. A way to use this that springs to mind: basic form validation, for example if we want to check users aren’t trying to enter simple XSS strings. Double-checking an email address for legal characters, perhaps? That said, we’re still limited to finding a single character, and putting \w\w\w one after the other to keep finding word characters is moving back into silly territory. So let’s take a look at one of regex’s special modifiers, the plus symbol: +. It means “one or more of the thing before me”, so \w+ will greedily grab as many word characters in a row as it can. e.g.:
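Something along these lines, using a made-up sentence (the string and variable names are assumptions). The + keeps going until it hits a character that isn’t a word character, which here is the first space.

```javascript
// A made-up sample sentence (an assumption, purely for illustration).
const sentence = "Regex is pretty rad.";

// \w+ grabs one or more word characters in a row.
const firstWord = sentence.match(/\w+/);
console.log(firstWord[0]); // "Regex"

// With the g flag we get every run of word characters.
const everyWord = sentence.match(/\w+/g);
console.log(everyWord); // ["Regex", "is", "pretty", "rad"]
```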

Getting better! But what if we wanted to find the whole string, including spaces? Well we can just bring ol’ friend square bracket back in: [\w ]+

Remember that inside the square brackets we are looking for a single character. This can be a word character (\w) or the space character ( ) <– there’s a space in there. We add the + operator so it greedily grabs as many of those characters in a row as it can. Now we get:
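A sketch of what that looks like on a made-up sentence (the string and variable names are assumptions):

```javascript
// A made-up sample sentence (an assumption, purely for illustration).
const sentence = "Regex is pretty rad.";

// [\w ]+ : one or more characters that are each a word character or a space.
const wordsAndSpaces = sentence.match(/[\w ]+/);

console.log(wordsAndSpaces[0]); // "Regex is pretty rad" (no trailing period)
```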

Great, but it missed the period, which sucks ’cause we’re proud of how cool that string is and we want to make sure our statement seems authoritative. How about [\w .]+? That does work, but only because the period sits inside square brackets, where it’s treated as a plain period. There’s a bit of a trick with some special characters, mostly punctuation of various types. Outside of square brackets, ‘.’ in regex means ‘match any character’ (except a line break, by default), which is useful sometimes, but we need a way to escape the special interpretation of these characters and just find the literal version. You’ve actually already seen the way we do this, or the opposite I guess: it’s the \ character. The \ acts as a modifier on the next character, either making it special or removing its special nature (depending on how it starts). So in the case of \w it turns a lowly w character into a powerful word-char seeker. In the case of \. it turns the amazing wildcard ‘.’ back into a regular workaday full stop. The backslash is the thing that makes regular expressions look so freakin’ weird when you don’t know what’s going on. Special bonus: if \ is a special modifier, how do we search for a literal \? We turn its power inwards, forcing it to work on itself: \\.
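Here’s a small sketch of the period in its different moods. The sample strings and variable names are made up for illustration. Note that inside square brackets a period is already taken literally, so the wildcard behaviour only shows up outside of them.

```javascript
// A made-up sample sentence (an assumption, purely for illustration).
const sentence = "Regex is pretty rad.";

// Inside square brackets the period is a plain period.
const whole = sentence.match(/[\w .]+/)[0];
console.log(whole); // "Regex is pretty rad."

// Outside square brackets, an unescaped . matches any single character...
const wildcardHitsLetter = /r.d/.test("rad"); // true
const wildcardHitsHyphen = /r.d/.test("r-d"); // true

// ...while \. only matches an actual period.
const literalHitsPeriod = /r\.d/.test("r.d"); // true
const literalHitsLetter = /r\.d/.test("rad"); // false

// A literal backslash needs escaping in the regex (and in the string too).
const windowsPath = "C:\\temp"; // the text C:\temp
const hasBackslash = /\\/.test(windowsPath); // true

console.log(wildcardHitsLetter, wildcardHitsHyphen);
console.log(literalHitsPeriod, literalHitsLetter, hasBackslash);
```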

So let’s put it all together in a simple example: checking an email address for validity. First, we can mentally break down the bits that appear in a normal email address:

Rad-Guy_420.XX-69@terrible-domain5.com

So we have a couple of parts to check: prior to the @ we can have word characters (that’s letters, numbers, and underscores), but we can also have periods and hyphens. The @ always needs to be an @ symbol, cool. The domain name is always word characters and hyphens, and the extension is three alphabet characters only (at least let’s pretend that’s the case, for sake of example).

Thus:

Name:

\w: for a-z, A-Z, 0-9 (and _)

\- : for -’s

\. : for .’s

\_ : for _’s (\w already covers these, but it doesn’t hurt to be explicit)

wrapped in [ ]: to check each character for each of the above.

modified by +: to make it greedy.

@: we can just search for @. Smooth.

Domain name:

\w: for a-z, A-Z, 0-9 (and _)

\- : for -’s

+ : to make it greedy.

period: \.

domain extension:

a-zA-Z: to match both cases.

{3} : limit the preceding to three characters only.

I snuck that last bit in there. Curly braces, { }, tell regex to repeat the preceding character search exactly as many times as the number you put inside the braces, so we use three for our three-letter com/net/org address.
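A quick sketch of the curly braces on their own, with made-up test strings and variable names. Notice the last case: without anything pinning the pattern to the end of the string, {3} happily matches the first three letters of a longer run.

```javascript
// [a-zA-Z]{3} : exactly three letters in a row.
const threeLetters = /[a-zA-Z]{3}/;

const fromCom = "com".match(threeLetters);   // matches "com"
const fromIo = "io".match(threeLetters);     // null, only two letters
const fromInfo = "info".match(threeLetters); // matches "inf", the first three

console.log(fromCom[0]);  // "com"
console.log(fromIo);      // null
console.log(fromInfo[0]); // "inf"
```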

Putting it together we get:

[\w\-\.\_]+@[\w\-]+\.[a-zA-Z]{3}
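To try it out, here’s a minimal sketch in JavaScript. The function name looksLikeEmail and the test addresses (other than the one above) are made up for illustration. I’ve also wrapped the pattern in ^ and $, which aren’t covered in this post: they pin the match to the start and end of the string, so the whole input has to fit the pattern rather than just some part of it.

```javascript
// The pattern from above, anchored with ^ and $ (an addition, see the note above).
const emailPattern = /^[\w\-\.\_]+@[\w\-]+\.[a-zA-Z]{3}$/;

// Hypothetical helper: true if the whole string fits our simplified pattern.
function looksLikeEmail(text) {
  return emailPattern.test(text);
}

console.log(looksLikeEmail("Rad-Guy_420.XX-69@terrible-domain5.com")); // true
console.log(looksLikeEmail("no-at-sign.com"));                         // false
console.log(looksLikeEmail("someone@example.info"));                   // false
```

The last one fails because of our pretend rule that extensions are exactly three letters, which is a good reminder that this is a teaching pattern and not a real email validator.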

And now you know just enough regex to do something silly like try to protect your website from XSS attacks with the above, hooray! Get yourself over to https://regex101.com/ for some more practice!

xoxo