
Match everything between two words

Quite often we need to find out whether there is something specific, or anything at all, between two given strings, words, or characters. Let’s use a simple shopping list as an example: “Fruits to buy: apples, bananas, kiwis, lemons, oranges.” Now, if we want to check whether there is anything between, let’s say, apples and lemons, we can use our friend, regex, like so:

.*apples.+lemons.*

If this expression evaluates as a match, that means there is something between the two constraining words (.+ requires at least one character there). But what if we need to see what that is (extract it)? Well, worry not, all we need to do is slightly modify our expression:

.*apples(.+)lemons.*

And this will capture everything between our two strings, so we can later replace the whole match with any of our captured groups. If this still doesn’t make sense, don’t worry, we’ll cover it through some examples later 🙂
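To make the difference between testing and extracting concrete, here is a minimal sketch in Python. The choice of Python and its re module is an assumption of this example (the expressions themselves are not tied to any language), and the variable names are made up for illustration.

```python
import re

shopping_list = "Fruits to buy: apples, bananas, kiwis, lemons, oranges."

# Without a group we only learn whether something sits between the two words.
has_something_between = re.fullmatch(r".*apples.+lemons.*", shopping_list) is not None

# With a capturing group we can read back what was found...
match = re.fullmatch(r".*apples(.+)lemons.*", shopping_list)
between = match.group(1)

# ...or replace the whole match with the captured group (\1).
replaced = re.sub(r".*apples(.+)lemons.*", r"\1", shopping_list)

print(has_something_between)
print(repr(between))
print(repr(replaced))
```

Note that the captured text keeps the surrounding commas and spaces, since the group takes everything between the two words as it is.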

Match everything between characters

This is basically the same use case, though this time our “constraints” are not words but single characters. Let’s use a list of numbers to demonstrate this:

“4 1 8 3 9 7 2 5”

The expression stays the same; only the values we use as bounds change. So if we want to match everything between 8 and 2, it will look like this (note the literal space after 8, which keeps that space out of the captured group):

.*8 (.+)2.*

See, simple enough 🙂
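Here is a quick way to check this one yourself, again as a small Python sketch (Python is just the language assumed for illustration):

```python
import re

numbers = "4 1 8 3 9 7 2 5"

# The space after 8 is part of the left bound, so it is not captured.
match = re.fullmatch(r".*8 (.+)2.*", numbers)
between = match.group(1)

print(repr(between))
print(between.split())
```

The group still ends with the space that precedes 2, so splitting (or stripping) the result is a handy last step.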

Let’s take a look at some examples with the most common use cases.

Match everything between parentheses

Since we are dealing with the same pattern, only with different values (characters), we can use any kind of parentheses or brackets we want, but I think one of the most common use cases is extracting values from html/xml tags. So, let’s say we want to get the name of every list item, meaning the value between li tags:

<li>apples</li>

<li>bananas</li>

<li>kiwis</li>

<li>lemons</li>

<li>oranges</li>

The expression to get everything between the tags (“>” and “<”) would be as follows:

.*>(.*)<.*

What the expression does is basically match everything up to the closing angle bracket “>”, which ends our opening tag, then capture everything up to the opening angle bracket “<” of the closing tag, and match everything after it. Now, some of you might think that this wouldn’t work, since we are matching greedily. Given that, it should match the last “>”, right? Well, looking at the expression as a whole, we can see there has to be a “<” after the “>” (since “<” is not an optional match), meaning it will match the last “>” that still has a “<” somewhere after it, which in our case works perfectly 🙂
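The backtracking described above is easy to verify. The sketch below assumes Python and applies the expression to one list item at a time, as in the example; for real-world html a proper parser is the more robust tool.

```python
import re

html_lines = [
    "<li>apples</li>",
    "<li>bananas</li>",
    "<li>kiwis</li>",
    "<li>lemons</li>",
    "<li>oranges</li>",
]

# The greedy .* first runs to the end of the line, then backs off until
# the rest of the expression (a ">" that is later followed by "<") fits.
pattern = re.compile(r".*>(.*)<.*")

names = [pattern.fullmatch(line).group(1) for line in html_lines]
print(names)
```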

Match everything between quotes

So, let’s consider the following sentence:

He said: "Hi, my name is Bob.", and she replied "Alice, nice to meet you!".

Let’s try to capture everything between double quotes. Using our previous pattern, expression should look like this:

.*"(.*)".*

This expression works, but if you try it on the provided sentence, you will notice it captures the content between the last pair of double quotes ("), which is Alice’s response, not Bob’s. So how can we capture the part between the first pair? Let’s cover that in the next topic 🙂
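You can see the greedy behaviour for yourself with a short Python sketch. It assumes plain straight double quotes (") in both the sentence and the expression, since typographic quotes are different characters and would not match each other.

```python
import re

sentence = 'He said: "Hi, my name is Bob.", and she replied "Alice, nice to meet you!".'

# The leading greedy .* swallows as much as it can, so the group ends up
# between the LAST pair of double quotes.
greedy = re.fullmatch(r'.*"(.*)".*', sentence).group(1)
print(greedy)
```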

Match everything between first occurrence of two characters

Using our previous example, let’s see how we can modify the expression to catch Bob’s introduction:

^[^"]*"([^"]*).*

or, as a variant:

^.*?"(.*?)".*

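What happens here is that we stop at the first closing quote instead of matching greedily as we did in our earlier examples: the first expression does it by excluding the quote character from what it may match, and the second by matching lazily. Let’s break down the first expression to get a better idea of how this works: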

^ – Stands for the beginning of the line, meaning the expression will start matching from the beginning.

[^"] – A character class (range) specifies a set of characters to be matched, or excluded if we prefix the character(s) with ^, which is exactly what we are doing.

* – Stands for zero or more, so let’s see what that means in our situation. Given that the preceding class matches any character that is not a double quote ("), * will match all of the characters until it reaches either a double quote (") or the end of the line.

" – After reaching the double quote ("), we want to match it as well, so we can start capturing from the next character.

() – Stands for a capturing group, meaning whatever is captured here can be used later. The expression between () is the same as the one we used before, since we again want to match everything that is not a double quote (").

.* – Dot (.) stands for any character, and we already explained *, so combined they basically mean match everything that comes after our capture group.

The second expression is basically the same; the only part that differs is .*?" which lazily matches every character until it encounters what comes after the ? (here, the double quote). This is quite a bit more flexible, since it can be used with a sequence of characters as the bound, not only a single one like our character class.
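Both variants can be compared side by side in a small Python sketch. Straight double quotes are assumed throughout, and the second sample string with `<b>` tags is a hypothetical one, added only to show a lazy match bounded by multi-character sequences.

```python
import re

sentence = 'He said: "Hi, my name is Bob.", and she replied "Alice, nice to meet you!".'

# Negated character class: [^"]* can never run past a double quote.
negated = re.match(r'^[^"]*"([^"]*).*', sentence).group(1)

# Lazy quantifiers: .*? takes as little as possible before the next quote.
lazy = re.match(r'^.*?"(.*?)".*', sentence).group(1)

# Hypothetical sample: lazy matching also works with multi-character bounds,
# which a single negated character class cannot express.
tagged = "one <b>two</b> three <b>four</b>"
first_bold = re.match(r"^.*?<b>(.*?)</b>.*", tagged).group(1)

print(negated)
print(lazy)
print(first_bold)
```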

Match everything between delimiters

Let’s take a line from a csv (comma separated values) file for this example, something like:

First Name, Last Name, Age, Address

Using the expressions we mentioned so far, we can match the value between the last pair of delimiters (“,”, a comma), which is what the greedy default gives us, or the value between the first pair, like so:

.*,(.*),.*

Or

^.*?,([^,]*).*

This can be what we need in many instances, but what if we need a specific occurrence? Let’s discuss that next 🙂
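For reference, this is what the two expressions return on our csv line, sketched in Python (an assumed language choice; the variable names are illustrative):

```python
import re

line = "First Name, Last Name, Age, Address"

# Greedy: the value between the last two commas.
between_last = re.fullmatch(r".*,(.*),.*", line).group(1)

# Lazy start plus negated class: the value after the first comma.
between_first = re.match(r"^.*?,([^,]*).*", line).group(1)

print(repr(between_last))
print(repr(between_first))
```

Both results keep the leading space that follows the comma in the sample line, so a strip() call is usually the next step.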

Match nth occurrence between characters/delimiter

Let’s continue with the last example, but this time let’s say we want to match specifically Age: 

^((.*?,)([^,]*)){2}.*

This one is a bit more complex, but if you pay close attention you’ll see it just combines everything we mentioned before, so we should be fairly familiar with most of it 🙂

We are using multiple capturing groups (()), namely we are wrapping two separate groups in a single, bigger one. The first “nested” group catches everything up to and including the next “,” (not the last one, as the greedy approach would do), and the second group captures everything up to (but not including) the next “,”. We wrap those two groups in a single one that can be repeated using the {} operator. This way we can say how many times we want to catch the part up to the next “,” and everything after it. Since we wanted to get Age, we specified 2 repetitions, but that number may vary according to your needs.

Let’s demonstrate how this works by breaking it down into each repetition, so there won’t be any confusion 🙂

(.*?,) – First group will get everything including next “,”, so it will look like this:

(First Name,)

([^,]*) – Second group will get the part until the next “,”, meaning:

( Last Name)

So our big group containing both of these groups will look like this:

((First Name,)( Last Name))

If we repeat it once again ({2} – means 2 repetitions), it will be like this:

((First Name,)( Last Name)) x 1

((,)( Age)) x 2

A repeated group keeps only what it matched in its last repetition. Since we’re interested only in the “Age” part, which is what the second “nested” group (the third one opened, counting in order) captures, we can simply print only the 3rd group 🙂
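The repetitions above can be checked with the following Python sketch. The helper nth_field is hypothetical (it is not part of any library) and simply builds the same expression with a variable repetition count; stripping the surrounding spaces is an extra step added here for convenience.

```python
import re

line = "First Name, Last Name, Age, Address"

match = re.match(r"^((.*?,)([^,]*)){2}.*", line)

# After the 2nd repetition each group holds what it matched LAST.
outer, up_to_comma, field = match.groups()
print(repr(outer), repr(up_to_comma), repr(field))


def nth_field(text: str, n: int) -> str:
    """Return the field after the n-th comma (hypothetical helper)."""
    # {{ and }} are literal braces in an f-string, so this yields e.g. {2}.
    pattern = rf"^((.*?,)([^,]*)){{{n}}}.*"
    found = re.match(pattern, text)
    if found is None:
        raise ValueError(f"text has fewer than {n} delimiters")
    return found.group(3).strip()


print(nth_field(line, 1))
print(nth_field(line, 2))
print(nth_field(line, 3))
```

Since every repetition has to consume a comma first, this approach cannot return the very first field (First Name); for that, the plain ^([^,]*) is enough.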