Everything after the word
We often need to match a piece of text based on where it appears, for example before or after some other string, word, or pattern. We can use that other word as a position reference and match everything after it. For example, if we want to greet someone, we might want to capture their name. Let’s see how it works on our favorite “Hello, world” example:
Hello(,?.*)
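This looks for the word Hello, then captures everything after it. The ,? matches an optional “,” (comma), so the expression works with or without it. Note that because ,? sits inside the parentheses, the comma (when present) is part of the captured group, so for “Hello, world” the group holds “, world”.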
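To see exactly what the group holds, here is a minimal sketch in Python. The choice of Python and the helper name after_hello are assumptions made for illustration; the pattern itself is the one shown above.

```python
import re

# Pattern from the text: literal "Hello", then one group holding an
# optional comma followed by the rest of the line.
PATTERN = re.compile(r"Hello(,?.*)")


def after_hello(text):
    """Return the captured group, or None if "Hello" is not found.

    after_hello is a hypothetical helper name used only for this sketch.
    """
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(repr(after_hello("Hello, world")))  # ', world'
print(repr(after_hello("Hello world")))   # ' world'
```

The comma and the space end up inside the group because both sit inside the parentheses. If only the text after them is wanted, they can be matched outside the group instead, for example with Hello,?\s*(.*).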
Everything after string
This is practically the same as the previous section, “Everything after the word”. Using the example text “Hello, world”, we can match everything after the string “Hello” with this regex:
Hello(,?.*)
Everything after character
Capturing everything after a single character works in much the same way. Let’s say we want to get everything after the letter l in our previous example. We can use something like this:
.*l(.*)
The above will capture everything after the last l. Our example contains three “l” characters, two in Hello and one in world, yet the regex captures only “d”. Why? By default, the .* we use to get to “l” is greedy, meaning it consumes as much as possible and therefore runs up to the last occurrence of the string/word/character that follows it. Don’t worry, there is a way to stop at the first occurrence instead, and we’ll touch on that a bit later.
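The greedy behaviour is easy to check with a short Python sketch (the helper name after_last_l is an assumption for illustration). Keep in mind that “Hello, world” contains three l characters, and the last one is in “world”.

```python
import re


def after_last_l(text):
    """Return what follows the LAST "l" in text, or None if there is no "l".

    The leading .* is greedy, so it first swallows the whole string and then
    backtracks only as far as needed to find an "l", which is the last one.
    """
    match = re.search(r".*l(.*)", text)
    return match.group(1) if match else None


print(after_last_l("Hello, world"))  # d
print(after_last_l("Hello"))         # o
```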
Everything after comma
So, let’s use a more practical example, like matching everything after a comma. To demonstrate, let’s use the following sentence: “Hey, look, there is a bird on that branch”. To match everything after the last “,” (comma), we can simply use the regex from our previous example, with a slight modification:
.*,\s*(.*)
This time, instead of “l”, the greedy .* consumes everything up to the last “,”, and the group captures the rest. You may also notice \s* before the parentheses, so let’s break that expression down real quick:
\s – stands for any kind of white space
* – stands for zero or more occurrences, meaning the expression will match whether or not there is a space after the comma
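As a quick check of the comma pattern, here is a hedged Python sketch (the helper name after_last_comma is made up for this example). Because \s* sits outside the group, any whitespace right after the comma is not part of the captured text.

```python
import re


def after_last_comma(text):
    """Return the text after the last comma, without leading whitespace.

    Returns None when the text contains no comma at all.
    """
    match = re.search(r".*,\s*(.*)", text)
    return match.group(1) if match else None


print(after_last_comma("Hey, look, there is a bird on that branch"))
# there is a bird on that branch
print(after_last_comma("a,b"))  # b
```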
Everything after space
Following the same formula, here are examples for two of the most commonly used characters in this scenario, “ ” (space) and / (slash). First, the space:
.* (.*)
This will greedily “consume” everything up to the last space character and “catch” only what’s left. We can use the same syntax to match any whitespace character with \s:
.*\s(.*)
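The difference between a literal space and \s only shows up when the text contains other whitespace, such as a tab. A small Python sketch (both helper names are assumptions for illustration):

```python
import re


def after_last_space(text):
    """Text after the last literal space character, or None."""
    match = re.search(r".* (.*)", text)
    return match.group(1) if match else None


def after_last_whitespace(text):
    """Text after the last whitespace character of any kind, or None."""
    match = re.search(r".*\s(.*)", text)
    return match.group(1) if match else None


sample = "one two\tthree"  # a space, then a tab
print(repr(after_last_space(sample)))       # 'two\tthree'
print(repr(after_last_whitespace(sample)))  # 'three'
```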
Everything after last slash
This works in the same fashion. For our last example, “/”, I’m sure you already have a pretty good idea:
.*/(.*)
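A typical use for this pattern is pulling the last component out of a path. Here is a minimal Python sketch, with after_last_slash as a hypothetical helper name. In practice a dedicated path library is usually a better fit, so treat this purely as a regex illustration.

```python
import re


def after_last_slash(text):
    """Return whatever follows the last "/" in text, or None if there is none."""
    match = re.search(r".*/(.*)", text)
    return match.group(1) if match else None


print(after_last_slash("/home/bob"))         # bob
print(repr(after_last_slash("/home/bob/")))  # ''
```

Note the second call: when the text ends with a slash, the match still succeeds, but the group is empty.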
Greedy vs Non greedy approach
Thus far we have only seen examples that match in a greedy fashion: if several positions fit the pattern, the match runs up to the last one. But what if we want to catch the text after the first, or say the n-th, occurrence? Luckily, regex is quite powerful, and doing that is relatively simple.
Let’s demonstrate by matching the word after the first comma in the following sentence: “Hello, Bob, pleasure to meet you!” Say we want to capture the name after the greeting. If we use the same technique as before, the match runs up to the last “,” (comma) and we are left with “pleasure to meet you!”. Instead, let’s write an expression that matches everything up to the first “,” (comma) and keeps only the name.
^[^,]*,?\s*(\w*).*
This expression is a bit more complex than the previous ones, so let’s break it down and explain each segment:
^ – stands for the beginning of the line, meaning it will start the matching process at the beginning of the line
[^,] – [] stands for a character class (often called a range), meaning it matches any single character from the set listed inside.
Let’s say we are trying to match all the late-spring and summer dates in a file, e.g.
1.3.2022
1.4.2022
1.5.2022
1.6.2022
1.7.2022
1.8.2022
1.9.2022
We can simply use a range to select all the months between 5 and 8 (inclusive), which
would represent the period from May to August. The expression for that would look like
this:
\.[5678]\.
or shorter:
\.[5-8]\.
Now, you might ask why we need a range when all we are interested in is a single
character, “,” (comma). Well, a range also allows us to exclude certain characters
by using ^, which leads us to the “^,” part inside [].
NOTE the difference between the ^ at the beginning of the expression,
standing for the beginning of the line, and the ^ inside a range, standing for exclusion.
So, with all that, we are basically just saying: match any character other than
“,” (comma)
* – Stands for zero or more occurrences, as we saw earlier. Given the previous segment,
which says match any character that is not “,” (comma), this just says match zero or
more characters which are not “,” (comma)
,? – stands for optional comma, which we saw earlier as well. If we match all the characters
different from “,” (comma), the only character after that can be either “,” (comma) or the
end of the text, so we want to check if it’s “,” (comma) and match that as well
\s* – By this point, we should be quite familiar with the * operator, this will match zero or more
white spaces
(\w*) – \w stands for a word character [a-zA-Z0-9_], meaning this will capture the run of word characters
right after the “,” (comma), in this case the name of the person we are greeting
.* – The rest of the text
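The breakdown above uses character classes in two forms: a range such as [5-8] and a negated class such as [^,]. Both can be tried out with the following Python sketch. The names word_after_first_comma and months_5_to_8 are assumptions made for illustration, and re.ASCII is passed so that \w means exactly [a-zA-Z0-9_] as described above (without it, Python’s \w also matches other Unicode word characters).

```python
import re

# Negated class: [^,]* can never cross a comma, so the match stops at the
# first one even though * itself is still greedy.
FIRST_COMMA = re.compile(r"^[^,]*,?\s*(\w*).*", re.ASCII)

# Range class: a literal dot, one digit from 5 to 8, a literal dot.
MONTH_5_TO_8 = re.compile(r"\.[5-8]\.")


def word_after_first_comma(text):
    """Return the word following the first comma ('' if there is no comma)."""
    match = FIRST_COMMA.match(text)
    return match.group(1) if match else None


def months_5_to_8(dates):
    """Keep only the dates whose month is 5, 6, 7 or 8."""
    return [d for d in dates if MONTH_5_TO_8.search(d)]


print(word_after_first_comma("Hello, Bob, pleasure to meet you!"))  # Bob

dates = ["1.3.2022", "1.4.2022", "1.5.2022", "1.6.2022",
         "1.7.2022", "1.8.2022", "1.9.2022"]
print(months_5_to_8(dates))
# ['1.5.2022', '1.6.2022', '1.7.2022', '1.8.2022']
```

One thing worth noticing: because the comma is optional, the first pattern still matches a text without any comma, but [^,]* then consumes the whole line and the captured group is empty.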
We can also match the n-th occurrence of any particular character(s), but that’s out of scope for this article, so stay tuned if you’re interested to learn more! 🙂
Some examples using common Unix matching tools
This is just a brief practical demonstration of the examples above, using the most common Unix tools for this kind of task, grep and sed. The most relevant difference for our use case is that grep can only print a matched line (or the matched part of a line), but it can’t print a captured group, while sed can. Let’s say we have the dates from the earlier example saved in a file called dates.txt. We can use our regex to print all the matching lines with both grep and sed:
$ grep '\.[5-8]\.' dates.txt
$ sed -n '/\.[5-8]\./p' dates.txt
Both of which produce:
1.5.2022
1.6.2022
1.7.2022
1.8.2022
But if we want to “catch” the months instead, then of these two tools sed is the way to go:
$ sed -nE 's/.*\.([5-8])\..*/\1/p' dates.txt
Which will print:
5
6
7
8
So, given that we are using capturing in most of our examples, sed is our friend.
Everything after the word
$ sed -E 's/Hello(,?.*)/\1/' <<< 'Hello, world!'
, world!
Everything after character
$ sed -E 's/.*l(.*)/\1/' <<< 'Hello, world!'
d!
Everything after comma
$ sed -E 's/.*,\s*(.*)/\1/' <<< 'Hey, look, there is a bird on that branch.'
there is a bird on that branch.
Now, it’s a common convention to use / as the delimiter with sed, but when we also want to match a / in the pattern, sed will interpret it as a delimiter rather than part of the pattern unless we escape it. To do that, we prefix the character we want to escape with “\” (backslash), like this:
$ sed -E 's/.*\/\s*(.*)/\1/' <<< 'Path to my home directory: /home/bob'
bob
The alternative, which avoids escaping /, is to use a different delimiter. sed lets us use almost any character as the delimiter, so since we have the freedom to choose, it’s best to pick one that does not appear in our expression. In this example, we can use “|” (pipe):
$ sed -E 's|.*/\s*(.*)|\1|' <<< 'Path to my home directory: /home/bob'
bob
Everything after the first comma (non greedy approach)
$ sed -E 's/^[^,]*,?\s*(\w*).*/\1/' <<< 'Hello, Bob, pleasure to meet you!'
Bob