Gentoo Wiki

Regex

This article is part of the HOWTO series.

Introduction

Regular expressions can be used to match almost any text pattern. "Regex" is an abbreviation of "regular expression". For example, suppose you want to find everything in a list that starts with "abe", but you need to exclude words like "label" that merely contain it. The regular expression ^abe would do this.
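
As a rough illustration of the ^abe example, here is a minimal sketch using Python's standard re module. The choice of Python and the word list are assumptions made for this example; the article itself does not name a tool.

```python
import re

# Hypothetical word list, made up for this example.
words = ["abet", "label", "abeam", "cabernet"]

# ^abe only matches when "abe" sits at the very start of the string,
# so "label" and "cabernet" are skipped even though they contain "abe".
starts_with_abe = [w for w in words if re.search(r"^abe", w)]
print(starts_with_abe)  # ['abet', 'abeam']
```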

What is a regexp?

Regexp is short for "regular expression", a way of describing a pattern to match against a string. Usually when you search for something, you type an exact string into a "search" field and the search looks for that whole string as-is. You might also be familiar with so-called wildcards, like the asterisk "*", which matches "anything and nothing".
A regular expression is more like a sentence explaining what you're looking for. (In the examples below, the slashes are only delimiters marking where the pattern starts and ends.) For instance, the regexp:
"/[abc]/" could be read as "matches 'a', 'b' or 'c'", and matches exactly one character if it is 'a', 'b' or 'c'. There are more advanced "sentences" too, for instance:
"/[abc]*/", meaning "matches any combination of 'a', 'b' and 'c', including none at all", e.g. "abc", "a", "aabcacb". It also matches an empty string (please refer to "The * wildcard" under "Wildcards" for a more in-depth explanation).
"/d[iuo]g/", meaning "matches 'd' followed by any one of 'i', 'u' or 'o', followed by 'g'". For such a short and simple regexp it is much easier to say "matches 'dig', 'dug' or 'dog'". (Had we used [a-z] instead of [iuo], the latter list would be a little too long ;)

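The bracket examples above can be tried out with a small sketch in Python's re module (Python is an assumption here, and the candidate words are invented). re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# Hypothetical candidates; "dag" is there to show a non-match.
candidates = ["dig", "dug", "dog", "dag"]
d_words = [w for w in candidates if re.fullmatch(r"d[iuo]g", w)]
print(d_words)  # ['dig', 'dug', 'dog']

# [abc]* accepts any mix of a, b and c, and also the empty string.
abc_empty = re.fullmatch(r"[abc]*", "") is not None
abc_mixed = re.fullmatch(r"[abc]*", "aabcacb") is not None
print(abc_empty, abc_mixed)  # True True
```
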
Wildcards

Wildcards are characters that stand for something other than themselves. They are used when you want to match a range of possibilities rather than one exact string.

The . (dot) wildcard

The "." character will match any one character. This is equivalent to the "?" in some shells. If you need to match a literal dot, precede it with a backslash. So the regex \.. will match anything that is a dot followed by one character. So ".a" would match, ".6" would match, and so would ".mp3". The reason that ".mp3" matches is that a regex matches anywhere in the text (here, the ".m"). So unless you tell it that the character after the dot should be the last character (this will be explained later), \.. will match any dot that isn't at the end of a line.
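
A quick way to check the \.. behaviour described above is the following sketch, again assuming Python's re module; the sample strings are made up.

```python
import re

# Hypothetical sample strings; "end." has nothing after its dot.
samples = [".a", ".6", ".mp3", "end."]
dot_hits = {s: bool(re.search(r"\..", s)) for s in samples}
print(dot_hits)

# The match is only the dot plus ONE following character.
first = re.search(r"\..", "song.mp3").group()
print(first)  # .m
```

Note that ".mp3" counts as a match only because ".m" is found inside it, not because the whole string fits the pattern.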

The * wildcard

The * will match any number of repetitions of the character before it, including zero. Also, keep in mind that it applies only to the character immediately preceding it. This is quite different from what most people are used to using the * wildcard for in most shells.

The expression O*h will match "h", "Oh", and "OOOOOOOOOOOh".
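
The O*h example can be verified with a short sketch (Python's re module is an assumption; "Oops" is an extra string added to show a non-match).

```python
import re

# "Oops" has plenty of O's but no "h", so O*h cannot match it.
samples = ["h", "Oh", "OOOOOOOOOOOh", "Oops"]
star_hits = [s for s in samples if re.search(r"O*h", s)]
print(star_hits)  # ['h', 'Oh', 'OOOOOOOOOOOh']
```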

Now, remember the "." character that matches any one character? If you combine it with a "*", the result (.*) matches a string of any length, including an empty one. Why would you match everything? Usually it's just there in the middle, when you want to match something specific at the beginning and at the end.

So if you were editing HTML, the expression <font .*> would match "<font", then a space, then any number of characters, followed by a ">". You'd use this to match the tag, and not the sentence "Look at this cool font" or "I hate you! >.<Fonts suck!" (the last one was a stretch, but you see what I'm getting at).
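
Here is a minimal sketch of the <font .*> example, assuming Python's re module. The tag with the size attribute is a made-up sample line; the two sentences come from the text above.

```python
import re

lines = [
    '<font size="2">',             # hypothetical tag, invented for the example
    "Look at this cool font",
    "I hate you! >.<Fonts suck!",
]

# Needs a literal "<font", a space, anything at all, then a ">".
tag_hits = [l for l in lines if re.search(r"<font .*>", l)]
print(tag_hits)  # ['<font size="2">']
```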

Matching either/or with []

As you're probably aware, GNU/Linux is case sensitive. As a result, distribution is a totally different word than Distribution. If you want to match either of them, put both the D and the d in brackets.

The expression [Dd]istribution will match both the word Distribution and distribution.

Note: If a "]" comes immediately after the opening "[", the "]" loses its special meaning and is treated as a literal character. So if you wanted to match either a "[" or a "]" you'd use the expression [][]. (Confusing, no?)
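
Both bracket tricks can be checked with a small sketch. It assumes Python's re module, which happens to treat a leading "]" inside brackets as a literal just like the note describes; other tools may differ. The sample strings are invented.

```python
import re

# Hypothetical sentence containing three spellings of the word.
text = "Distribution or distribution, never DISTRIBUTION"
d_hits = re.findall(r"[Dd]istribution", text)
print(d_hits)  # ['Distribution', 'distribution']

# [][] is a bracket set holding a literal "]" and a literal "[".
bracket_hits = re.findall(r"[][]", "x[0]")
print(bracket_hits)  # ['[', ']']
```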

Positioning

As mentioned earlier, \.. would match a dot followed by any character, but not necessarily at the end of a line. If you want to match a trailing ".6" (perhaps while looking through /usr/lib entries), you'll need to specify that there is nothing after the second character.

The $ Character

The "$" character stands for the end of the line. Pretty simple, nothing fancy -- yet.

The expression \..$ will match "somelib.a" and "somelib.so.6", but not "somemusic.mp3".
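
The three file names above can be run through \..$ with a sketch like this one (Python's re module is an assumption).

```python
import re

names = ["somelib.a", "somelib.so.6", "somemusic.mp3"]

# A dot, exactly one more character, then the end of the string.
end_hits = [n for n in names if re.search(r"\..$", n)]
print(end_hits)  # ['somelib.a', 'somelib.so.6']
```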

The ^ Character

The ^ will match the beginning of a line. Pretty simple, no?

The expression ^at will match "attack" but not "Mathew".
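
The same kind of check works for ^at. This is a sketch assuming Python's re module; "cat" is an extra word added to show that "at" in the middle is not enough.

```python
import re

# "Mathew" and "cat" contain "at", but not at the start.
names = ["attack", "Mathew", "cat"]
start_hits = [n for n in names if re.search(r"^at", n)]
print(start_hits)  # ['attack']
```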

Combining ^ and $

Frequently you'll need to match a blank line (more often for exclusion than not), and simply slamming ^ and $ together will do so (^$). However, if you remember what you learned earlier, you can expand that to include a space and a tab within the brackets to match lines that appear blank, yet may have spaces on them:
^[ 	]*$

i.e. (showing the chars):

^[<space here><tab here>]*$
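
To see the difference between ^$ and the expanded pattern, here is a minimal sketch assuming Python's re module, where the tab is written as \t inside the brackets. The sample lines are made up.

```python
import re

# Hypothetical lines: empty, spaces only, space plus tab, and two with text.
lines = ["", "   ", " \t", "text", "  indented"]

# ^$ only accepts a truly empty line.
empty_only = [l for l in lines if re.search(r"^$", l)]

# ^[ \t]*$ also accepts lines made up solely of spaces and tabs.
looks_blank = [l for l in lines if re.search(r"^[ \t]*$", l)]

print(len(empty_only), len(looks_blank))  # 1 3
```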

Retrieved from "http://www.gentoo-wiki.info/Regex"

Last modified: Wed, 23 Jul 2008 04:56:00 +0000 Hits: 13,366