Updated: 2016-10-28 17:25 EDT
There are two different pattern matching facilities that we use in Unix/Linux: GLOB patterns and Regular Expressions.
Regular Expressions are another way to match patterns in text, similar to but more powerful than simple GLOB patterns.
Pay close attention to which of the two situations you’re in, because some of the same special characters common to GLOB and Regular Expressions have different meanings!
There are several major places where GLOB patterns are used:
*.txt
In the shell, GLOB patterns may be used to match existing pathnames in the file system:
$ ls *.txt
$ echo ?????.txt
$ touch [ab]*.txt
The shell tries to expand the GLOB to match existing pathnames before the associated command runs.
case statement GLOB in the Shell
GLOB patterns are used in shell case statements to match the text at the top of the case statement:
case "$1" in
/* ) type='Absolute Pathname' ;;
* ) type='Relative Pathname' ;;
esac
find command: -name '*.txt'
The find command -name operator also matches GLOB patterns against the file system, but it does so recursively in every directory, not just in one directory:
$ find . -name '*.txt'
$ find . -name '?????.txt'
$ find . -name '[ab]*.txt'
We quote the patterns above to hide them from the shell so that the find command receives the patterns and the shell doesn’t try to expand them.
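The difference between shell expansion and find can be tried in a scratch directory. This is only a sketch: the directory comes from mktemp and the file names (a.txt, b.log, sub/c.txt) are made up for the illustration.

```shell
# Hypothetical sandbox: a throwaway directory with one nested subdirectory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
touch "$demo_dir/a.txt" "$demo_dir/b.log" "$demo_dir/sub/c.txt"
cd "$demo_dir" || exit 1

# Unquoted: the shell expands the GLOB in the current directory only,
# so echo receives the file name and never sees the pattern.
shell_matches=$(echo *.txt)

# Quoted: find receives the pattern itself and applies it in every directory.
find_matches=$(find . -name '*.txt' | sort)

echo "shell expansion: $shell_matches"
echo "find -name:"
echo "$find_matches"

# Clean up the sandbox.
cd / && rm -rf "$demo_dir"
```

The shell finds only the file in the current directory, while find also reports the file in the subdirectory.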
Regular Expressions (short form: regexp) are text matching patterns similar to GLOB patterns but more powerful. Regexp patterns use all the GLOB pattern matching characters and add more. The characters work slightly differently between GLOB and regexp.
Regexp are used by many Unix/Linux programs and programming languages such as grep, sed, awk, vim, less, more, man, Perl, python, etc.
In an editor (such as vim or sed), a Regular Expression may be used to select characters to be deleted, replaced, or exchanged:
:%s/colou*r/COLOUR/g # vim replacement regular expression
$ echo "Colouur bad. Colour red. Color tan." | sed -e 's/Colou*r/COLOUR/g'
COLOUR bad. COLOUR red. COLOUR tan.
Regexp have a Basic set of pattern matching characters and an Extended set of characters. The grep program family is a very popular user of both Basic and Extended Regular Expressions.
The grep command itself accepts Basic Regular Expression syntax, and needs backslashes in front of some operators to access Extended Regular Expression features. The egrep command accepts Extended Regular Expression syntax and does not need the backslashes. You can do the same text search using either command, but the syntax changes:
$ grep 'publickey for \(idallen\|cst8207[abc]\?\)' /var/log/auth.log # Basic
$ egrep 'publickey for (idallen|cst8207[abc]?)' /var/log/auth.log # Extended
From the section REGULAR EXPRESSIONS in the man page for the grep command:
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |,
(, and ) lose their special meaning; instead use the backslashed
versions \?, \+, \{, \|, \(, and \).
Even the bash shell has extended syntax that allows the use of regular expressions instead of simple GLOB patterns.
IMPORTANT: Regular Expressions use some of the same special characters as GLOB patterns, but they mean different things! In particular, *, ?, and . work differently! There are others!
GLOB patterns are said to be anchored to the start and end of the line; they must always match the entire text string (usually a file name) from the start to the end.
The GLOB pattern a*b matches only text that starts with a and ends with b – that GLOB pattern doesn’t match just the ab in the middle of xxxabxxx.
The modified GLOB pattern *a*b* now matches the whole text that contains a followed by b anywhere in the text. The modified GLOB pattern does match the entire text xxxabxxx.
Regular Expressions are not by default anchored. They “float” down the text and they may match anywhere in the text string unless you explicitly anchor them to either the start or end of the text using the regexp characters ^ and/or $.
The Regular Expression a.*b matches inside any text that contains a followed by b anywhere in the text. The floating regexp does match the ab in the middle of xxxabxxx.
The modified Regular Expression ^a.*b$ is now anchored to the start and end of the text. The modified expression now matches exactly the same text as the GLOB pattern a*b because it forces the a to match at the start and the b to match at the end. It does not match inside xxxabxxx.
You must remember to anchor the ends of your Regular Expressions if you want to be sure that they match the whole piece of text and not just some part of the text.
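A small sketch of floating versus anchored matching, assuming a grep that supports the common -o option (print only the matched text); the sample strings are the ones used in this section.

```shell
text='xxxabxxx'

# Floating regexp: -o prints only the part of the line that matched.
floating=$(echo "$text" | grep -o 'a.*b')

# Anchored regexp: counts zero matching lines, because the line does not
# start with a and end with b.  (|| true hides grep's "no match" status.)
anchored_count=$(echo "$text" | grep -c '^a.*b$' || true)

# The anchored regexp does match when a and b are at the two ends.
whole_line=$(echo 'axxxb' | grep -o '^a.*b$')

echo "floating match:   $floating"
echo "anchored matches: $anchored_count"
echo "whole-line match: $whole_line"
```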
Summary:
- The Regular Expression a*b matches (only) the text ab inside xxxabxxx.
- The Regular Expression ^a*b$ does not match the ab inside xxxabxxx because the a has to be at the start and the b has to be at the end. It does match the string aaaaab.
Like algebraic expressions, more complex Regular Expressions are built up by combining simpler expressions. Regular Expressions have operators similar to algebraic operators, but they mean different things than in algebra. Like algebraic operators, Regular Expression operators have bindings and precedence when combined with other operators.
Before we look at Regular Expressions, let’s take a look at some Algebraic Expressions you’re already comfortable with. Larger Algebraic Expressions are formed by putting smaller expressions together:
Expression | Meaning | Comment |
---|---|---|
a | a | a simple expression |
b | b | another simple expression |
ab | a x b | ab is a larger expression formed from two smaller ones; concatenating two expressions together means to multiply them |
b² | b x b | we might have represented this with b^2, using ^ as an exponentiation operator |
ab² | a x (b x b) | not (a x b) x (a x b) |
(ab)² | (a x b) x (a x b) | parentheses for grouping |
Repetition * (zero or more) and parentheses
Similar to an algebraic exponent, the asterisk/star * Regular Expression operator binds tightly to the immediately preceding Regular Expression and repeats it zero or more times. Parentheses (a feature of Extended Regular Expressions) can be used for grouping, e.g.
$ grep 'suc*eed' document.txt # find sueed, suceed, succeed, succceed, etc.
$ grep 'Bar\(bar\)*a' document.txt # find Bara, Barbara, Barbarbara, etc.
$ egrep 'Bar(bar)*a' document.txt # use egrep Extended regexp syntax
Rhabarbara: https://www.youtube.com/watch?v=dD2mhVc6C_8
Parentheses need backslashes in front of them when using a program such as grep that uses Basic Regular Expression syntax. The egrep program accepts Extended Regular Expression syntax and does not need the backslashes.
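The repetition and grouping examples can be checked against a handful of words. This is a sketch with a made-up word list; it uses grep -E, which the notes describe as the long form of egrep.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' sueed suceed succeed Bara Barbara Barbarbara Barbra)

# * repeats only the single preceding character c.
star_hits=$(printf '%s\n' "$words" | grep -c 'suc*eed')

# Grouping lets * repeat the whole string bar (Basic syntax needs backslashes).
basic_hits=$(printf '%s\n' "$words" | grep -c 'Bar\(bar\)*a')

# The same search in Extended syntax, without the backslashes.
extended_hits=$(printf '%s\n' "$words" | grep -E -c 'Bar(bar)*a')

echo "suc*eed: $star_hits  Basic group: $basic_hits  Extended group: $extended_hits"
```

Barbra is not counted by either grouped expression because bra is not a repetition of bar.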
Expression | Meaning | Comment |
---|---|---|
a | match single ‘a’ | a simple expression |
b | match single ‘b’ | another simple expression |
ab | match strings consisting of single ‘a’ followed by single ‘b’ | “ab” is a larger expression formed from two smaller ones; concatenating two regular expressions together means “followed immediately by” and we’ll say “followed by” |
b* | match zero or more ‘b’ characters | a big difference in meaning from the ‘*’ in globbing! This is the regular expression repetition operator. |
ab* | ‘a’ followed by zero or more ‘b’ characters | why not repeating the two characters ‘ab’ zero or more times? Hint: think of “ab²” in algebra. |
(ab)* | (‘a’ followed by ‘b’), zero or more times | We can use parentheses; in Basic Regular Expressions, we use \( and \) |
* and \(...\)
As with algebraic multiplication, there is no operator to concatenate Regular Expressions to match longer strings. Simply write one expression and follow it with the next one.
Similar to an algebraic exponent, the asterisk/star * Regular Expression operator binds tightly to the immediately preceding Regular Expression and repeats it zero or more times. Parentheses can be used for grouping, e.g.
Expression | Matches | Example | Example Matches | Comment |
---|---|---|---|---|
one expression followed by another | first followed by second | xy | “xy” | like globbing |
expression followed by * | zero or more matches of the immediately preceding expression | x* | “” or “x” or “xx” or “xxx” …etc | NOT like the * in globbing |
expression in parentheses | the expression | \(ab\) | “ab” | parentheses are used for groups |
expression in parentheses, followed by * | the expression repeated zero or more times | \(ab\)* | “” or “ab” or “abab” or “ababab”, etc. | parentheses are used for groups |
Regular Expressions have more special characters than GLOB patterns. Some special characters need backslashes in front of them to enable them in Basic Regular Expressions.
Character | Matches | Example | Example Matches | Comment |
---|---|---|---|---|
non-special character | itself | x | “x” | like globbing |
. | any single character | . | “x” or “y” or “!” or “.” or “*” …etc | like the ‘?’ in globbing |
^ | beginning of a line of text | ^x | “x” if it’s the first character on the line | anchors the match to the beginning of a line |
^ | | a^b | “a^b” | ^ has no special meaning unless it’s first |
$ | end of a line of text | x$ | “x” if it’s the last character on the line | anchors the match to the end of a line |
$ | | a$b | “a$b” | $ has no special meaning unless it’s last |
\ before a special character | that character with its special meaning removed | \. | “.” | like globbing |
\ before a non-special character | the non-special character (no change) | \a | “a” | \ before a non-special character is ignored |
[ ] | character class | [abc] | “a” or “b” or “c” | see Class below |
^ and $
GLOB Patterns are said to be anchored to the start and end of the string being matched. The GLOB pattern a*b matches text axb but not abx or xab. The a has to be at the start, and the b has to be at the end.
To allow a GLOB pattern to be unanchored and match anywhere inside a string, you need to pad the GLOB with * on both sides:
$ echo a*b # anchored: matches axb not abx or xab
$ echo *a*b* # now matches abx or xab or xabx or xaxbx
The GLOB pattern has to match the whole string, and may need * at each end to allow it to do that.
Unlike GLOB Patterns, which are anchored, Regular Expressions are not anchored unless you make them so using the explicit anchor characters ^ and/or $. Unanchored Regular Expressions “float” down the string until a match is found, and they don’t have to extend to the end of the string.
Regular Expressions can match just a piece of text in the middle of a line; they don’t have to match the whole line.
The GLOB pattern a*b doesn’t match the string xabx because GLOB is anchored and has to match the whole string, but the Regular Expression a.*b does match inside the line, because it is unanchored at both ends: it floats down the string and matches the ab in the middle of the string. The regexp starts unanchored (no ^ at the start) and thus “floats” down the string to do the match.
Use the line start ^ and line end $ meta-characters to anchor a Regular Expression to the start or end of a line. Here are some examples of how GLOB patterns and regexp compare:
GLOB Regular Expression (may use anchors)
---- ------------------------------------
foo ^foo$
bar[abc] ^bar[abc]$
[!abc] ^[^abc]$ # note in complement GLOB uses ! vs. ^
foo? ^foo.$
a*b ^a.*b$
*foo* foo # unanchored GLOB needs * at ends
*a*b* a.*b # unanchored GLOB needs * at ends
Remember that an unanchored Regular Expression may match only part of a line, e.g. the regexp ab matches only the ab part of xxxabxxx, not the whole xxxabxxx. GLOB patterns must always match the entire line from start to end; they can’t match a substring inside a line the way regexp can.
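Some rows of the comparison can be tested side by side. The sketch below uses two hypothetical helper functions, glob_matches and regexp_matches (names invented here): the first matches with a shell case statement, the second with grep.

```shell
# glob_matches STRING PATTERN: succeeds if the GLOB matches the whole string.
glob_matches() {
    case "$1" in
        $2) return 0 ;;   # unquoted so that $2 is treated as a pattern
        *)  return 1 ;;
    esac
}

# regexp_matches STRING REGEXP: succeeds if the regexp matches anywhere.
regexp_matches() {
    printf '%s\n' "$1" | grep -q -- "$2"
}

# report STRING GLOB REGEXP: print how each pattern fares on the string.
report() {
    if glob_matches "$1" "$2"; then g=yes; else g=no; fi
    if regexp_matches "$1" "$3"; then r=yes; else r=no; fi
    echo "$1: GLOB $2 -> $g, regexp $3 -> $r"
}

report food  'foo?'   '^foo.$'
report xfoox 'foo'    'foo'
report xfoox '*foo*'  'foo'
report d     '[!abc]' '^[^abc]$'
```

The second line of output is the interesting one: the anchored GLOB foo fails on xfoox while the floating regexp foo succeeds.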
When testing regular expressions with grep, use the colour option to highlight the text that matched:
grep --color=auto
These grep commands select lines that match these Basic Regular Expressions:
grep 'ab' # a followed by b
grep 'a*b' # zero or more a followed by b
grep 'aa*b' # one or more a followed by b
grep 'aaa*b' # two or more a followed by b
grep 'a.b' # a then one of anything then b
grep 'a.*b' # a then zero or more of anything, then b
grep 'a..*b' # a then one or more of anything then b
grep 'a...*b' # a then two or more of anything then b
grep '^a' # a must be the first character
grep 'b$' # b must be the last character
grep '^a.*b$' # a must be first, zero or more anything, b must be last
Find any line that contains at least (or exactly) one, two, or three characters of any kind (“any kind” includes spaces and unprintable characters):
grep '.' # contains at least one character (or more)
grep '..' # contains at least two characters (or more)
grep '...' # contains at least three characters (or more)
grep '^.$' # contains exactly one character
grep '^..$' # contains exactly two characters
grep '^...$' # contains exactly three characters
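A quick way to check these counts is grep -c on a few lines of known length. The sample below is invented: an empty line followed by lines of one to four characters, and count_lines is a hypothetical helper.

```shell
# Five sample lines: empty, then 1, 2, 3 and 4 characters long.
sample=$(printf '%s\n' '' 'a' 'ab' 'abc' 'abcd')

# count_lines REGEXP: print how many sample lines the regexp selects.
count_lines() {
    printf '%s\n' "$sample" | grep -c -- "$1" || true
}

at_least_one=$(count_lines '.')
at_least_two=$(count_lines '..')
exactly_two=$(count_lines '^..$')

echo "at least one: $at_least_one, at least two: $at_least_two, exactly two: $exactly_two"
```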
[...] – similar to GLOB
- Character classes in Regular Expressions are similar to GLOB character classes, e.g. [az3]
- [az3] matches one single character that is a or z or 3
- GLOB uses [!az3] to invert but regexp uses [^az3] to mean: any single character that is not a or z or 3
- A regexp character class can be repeated using * (something you can’t do with GLOB)
The characters inside the square brackets of a character class form a set of characters where order doesn’t matter and repeats don’t affect the meaning. All these below are equivalent and match only one single character a or z or 3:
grep '[az3]' # match one single a or z or 3
grep '[3az]' # same - order doesn't matter
grep '[aaazzzz3333]' # same - bad form - no need to repeat characters
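The equivalence is easy to confirm by counting matching lines. A minimal sketch with four made-up sample words:

```shell
# Sample lines; only melon contains none of a, z, or 3.
sample=$(printf '%s\n' apple zebra 3rd melon)

in_order=$(printf '%s\n' "$sample" | grep -c '[az3]')
reordered=$(printf '%s\n' "$sample" | grep -c '[3az]')
repeated=$(printf '%s\n' "$sample" | grep -c '[aaazzzz3333]')

echo "$in_order $reordered $repeated"
```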
Most Regular Expression special characters lose their meaning when inside square brackets, but watch out for ^, ], and - which do have special meaning inside square brackets, depending on where they occur.
Expression | Matches | Example | Example Matches | Comment |
---|---|---|---|---|
character classes | a SINGLE character from the list | [abc] | “a” or “b” or “c” | like globbing |
complement of a character class | a SINGLE character not in the list | [^abc] | any SINGLE character not a or b or c | NOT like GLOB! GLOB uses ! as in [!abc] |
special character inside [ ] | as if the character is not special | | | conditions apply to ^, ], and - |
^ to complement a character class set: [^abc]
- A ^ used immediately inside the opening square bracket of a class complements the whole character class set: [^az3]. The resulting character class expression matches any single character that is not in the set.
- [^az3] means “any single character that is not a, z, or 3”
- The ^ only works this way if it is the first character inside the square brackets, otherwise it has no special meaning.
- [a^z3] or [az^3] or [az3^] all match one of a, z, 3, or ^
- A ^ used in a Regular Expression outside of square brackets has the special meaning “match at beginning of line”. Don’t confuse it with ^ used inside a character class.
Note that GLOB patterns complement character sets using ! and not ^:
GLOB Regular Expression
[!abc] [^abc]
Don’t confuse GLOB with Regular Expressions.
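The two roles of ^ inside square brackets can be compared on a few sample lines. This is a sketch; the three lines are invented for the purpose.

```shell
sample=$(printf '%s\n' 'zaza3' 'melon' 'x^y')

# ^ first: complement. Selects lines containing any character other than a, z, 3.
complement_lines=$(printf '%s\n' "$sample" | grep '[^az3]')

# ^ not first: ordinary character. Selects lines containing a, z, ^, or 3.
literal_caret_lines=$(printf '%s\n' "$sample" | grep '[az^3]')

echo "complement:"
echo "$complement_lines"
echo "literal caret:"
echo "$literal_caret_lines"
```

The line zaza3 is missing from the first result because every one of its characters is in the set.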
] as part of a character class set
A ] character can be placed inside square brackets to be part of the character class set, but it has to be the first character in the set. []az3] means one of the four characters ], a, z, or 3 and [^]az3] means any single character that is not one of the four characters ], a, z, or 3.
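Attempting to put a closing square bracket ] inside square brackets in any other position does not work, because the first ] after the opening bracket ends the character class:
- [ab]d] is a failed attempt at []abd]; it really means the class [ab] followed by the two literal characters d]
- [] is a failed attempt at []]; it is an unterminated character class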
You can put an opening [ anywhere in a character class, e.g.
$ grep '[([{]' doc.txt # search for lines with '(' or '[' or '{'
[:digit:]
POSIX Character Class expressions represent an entire range of characters, such as “all the digits” or “all the letters”. The classes have an awkward syntax: The POSIX class name is preceded by [: and followed by :], e.g. [:digit:]. These are the resulting class names:
POSIX Class | Description |
---|---|
[:alnum:] | alphanumeric characters |
[:alpha:] | alphabetic characters |
[:cntrl:] | control characters |
[:digit:] | digit characters |
[:lower:] | lower case alphabetic characters |
[:print:] | visible characters, plus [:space:] |
[:punct:] | Punctuation and other symbol characters |
[:space:] | White space (space, tab, CR, LF) characters |
[:upper:] | upper case alphabetic characters |
[:xdigit:] | Hexadecimal digit characters |
[:graph:] | visible characters (anything except spaces and control characters) |
- In the C (POSIX) locale, the alphabetic characters are only the letters a-z and A-Z.
- In other locales, the alphabetic characters may also include accented letters such as é, ç, etc.
These POSIX class names only work inside an enclosing Regular Expression character class expression using (more) square brackets. What looks like double square brackets is really an enclosing square bracket character class expression containing a POSIX class name (which unfortunately also uses square brackets and colons as part of its name), e.g.
grep '[0123456789]' # a digit (a list of all the digits)
grep '[[:digit:]]' # a digit - the POSIX class name [:digit:] inside []
grep '[abcd[:digit:]]' # a digit or letter a or b or c or d
grep '[ab[:digit:]cd]' # same -- a digit or a or b or c or d
grep '[[:digit:]abcd]' # same -- a digit or a or b or c or d
Of course you can use multiple POSIX class names inside the character class expression:
grep '[[:alpha:][:digit:]]' # a letter or a digit
grep '[^[:alpha:][:digit:]]' # *NOT* a letter or a digit
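A short sketch of these class expressions in action, using four invented sample lines made only of ASCII characters so the result does not depend on the locale:

```shell
sample=$(printf '%s\n' 'abc' '123' 'a1' '---')

# Lines containing at least one digit.
with_digit=$(printf '%s\n' "$sample" | grep -c '[[:digit:]]')

# Lines containing at least one letter or digit.
with_alnum=$(printf '%s\n' "$sample" | grep -c '[[:alpha:][:digit:]]')

# Lines containing at least one character that is neither letter nor digit.
with_other=$(printf '%s\n' "$sample" | grep -c '[^[:alpha:][:digit:]]')

echo "digit: $with_digit, letter or digit: $with_alnum, other: $with_other"
```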
WARNING: You cannot interchange the [:alpha:] class and a list of all the upper- and lower-case letters; they are not always the same because the POSIX [:alpha:] class changes depending on the local language:
grep '[[:alpha:]]' # a letter, using the POSIX class name [:alpha:]
grep '[a-zA-Z]' # NOT THE SAME AS [:alpha:] - DO NOT USE !
grep '[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]' # NOT THE SAME
^[[:digit:]]*$
These expressions could be given to grep:
Any line containing nothing or only alphabetic characters from start to end:
^[[:alpha:]]*$
Any line containing only alphabetic characters from start to end, but must have at least one such character (can’t be an empty line):
^[[:alpha:]][[:alpha:]]*$
Any line that begins with a digit (followed by anything or nothing):
^[[:digit:]]
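The three expressions can be tried on a small sample. This is a sketch; the four sample lines (one of them empty) are made up, and count_lines is a hypothetical helper.

```shell
# Sample: an empty line, a purely alphabetic line, and two mixed lines.
sample=$(printf '%s\n' '' 'abc' 'a1' '9lives')

# count_lines REGEXP: print how many sample lines the regexp selects.
count_lines() {
    printf '%s\n' "$sample" | grep -c -- "$1" || true
}

alpha_or_empty=$(count_lines '^[[:alpha:]]*$')             # empty line and abc
alpha_only=$(count_lines '^[[:alpha:]][[:alpha:]]*$')      # abc only
starts_with_digit=$(count_lines '^[[:digit:]]')            # 9lives only

echo "$alpha_or_empty $alpha_only $starts_with_digit"
```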
[.-.]
- A dash - between two characters inside square brackets, e.g. [0-9], represents a range of characters between the two, unless the dash is first or last in the set of characters, e.g. [-09] or [09-]
- [-09] and [09-] mean one of the three characters 0, 9, or -
- [0-9] means any one character from the set of characters located between characters 0 and 9 inclusive.
What determines which characters lie between other characters? The result depends on your current Locale and is not well-defined.
[a-z]
Do not use alphabetic ranges (e.g. [a-z])! The ranges change depending on your system Locale and may change in unexpected ways:
$ touch A B C Z a b c z
$ LC_ALL=C
$ echo *
A B C Z a b c z
$ echo [a-z]
a b c z
$ LC_ALL=en_CA.UTF-8
$ echo *
A a B b C c Z z
$ echo [a-z]
a B b C c Z z
- [a-z] meaning “any one character between a and z inclusive” used to mean something when there was only one ASCII English locale.
- Now, “between a and z inclusive” is ambiguous because it means different things in different locales.
- Use a POSIX class name instead, e.g. [:alpha:].
- For a partial range, list the characters: use [abcdefgh] not [a-h]
? + | { } ( )
Some features of Regular Expressions are called Extended features. These features are described below and use more special characters: ? + | { } ( )
\| vs. |
The difference between Basic and Extended Regular expressions is whether the program requires you to use a backslash to make use of the Extended features:
Basic: . * ^ $ \ \| \? \+ \{ \} \( \) # must use backslash
Extended: . * ^ $ \ | ? + { } ( ) # do *NOT* use backslash
The ordinary grep program uses Basic Regular Expressions, so you have to use backslashes in front of the Extended characters to turn on Extended features. The egrep Extended Regular Expression program (short for grep -E) doesn’t need the backslashes:
$ grep 'Accepted publickey for \(idallen\|cst8207[abc]\?\)' /var/log/auth.log
$ egrep 'Accepted publickey for (idallen|cst8207[abc]?)' /var/log/auth.log
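The two spellings can be shown to select the same lines. This is a sketch only: the log lines and addresses below are invented stand-ins for /var/log/auth.log, and the backslashed \| and \? forms assume GNU grep.

```shell
# Hypothetical log excerpt (not real data).
log='Accepted publickey for idallen from 192.0.2.1
Accepted publickey for cst8207a from 192.0.2.2
Accepted publickey for cst8207 from 192.0.2.3
Accepted publickey for root from 192.0.2.4'

# Basic syntax: backslashes turn on the Extended operators.
basic=$(printf '%s\n' "$log" | grep 'publickey for \(idallen\|cst8207[abc]\?\)')

# Extended syntax: the same operators without backslashes.
extended=$(printf '%s\n' "$log" | grep -E 'publickey for (idallen|cst8207[abc]?)')

if [ "$basic" = "$extended" ]; then
    echo "Basic and Extended selected the same lines:"
    echo "$basic"
fi
```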
Basic Regular Expressions are used in these programs and you need to use backslashes to turn on Extended features:
- vi, more, sed, grep
Extended Regular Expressions are used in these programs and you do not need backslashes to enable Extended features:
- less (e.g. man pages)
- egrep and grep -E
- awk
- perl and grep -P
The perl program (and grep -P) has its own set of special Perl-compatible Regular Expression features, not described here.
? + {n,m}
Extended Regular Expressions give you more options when repeating a preceding expression:
Basic | Extended | Repetition Meaning |
---|---|---|
* | * | zero or more times |
\? | ? | zero or one times |
\+ | + | one or more times |
\{n\} | {n} | n times, n is an integer |
\{n,\} | {n,} | n or more times, n is an integer |
\{,m\} | {,m} | m or fewer times, m is an integer (GNU extension) |
\{n,m\} | {n,m} | at least n, at most m times, n and m are integers |
Examples:
$ egrep 'colou?r' doc.txt # color or colour not colouur
$ egrep 'has +spaces' doc.txt # one or more spaces between
$ egrep '[0-9]{9}' doc.txt # 123456789
$ egrep '[0-9]{3}-[0-9]{3}-[0-9]{4}' doc.txt # 123-456-7890
$ egrep '^.{80}$' doc.txt # 80 character lines
$ grep '^.\{,80\}$' doc.txt # 80 character or fewer lines
Note that the {,m} capability is not available in all Extended Regular Expressions, since it is a GNU extension.
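A sketch of the repetition operators on a few invented sample lines (the telephone number is fictitious), counting the lines each expression selects:

```shell
sample=$(printf '%s\n' 'color' 'colour' 'colouur' '613-555-0123' '5550123')

# ? makes the preceding u optional: color and colour, but not colouur.
optional_u=$(printf '%s\n' "$sample" | grep -E -c 'colou?r')

# {n} repeats the preceding class exactly n times.
phone=$(printf '%s\n' "$sample" | grep -E -c '[0-9]{3}-[0-9]{3}-[0-9]{4}')

# The same interval in Extended and in Basic (backslashed) syntax.
seven_extended=$(printf '%s\n' "$sample" | grep -E -c '^[0-9]{7}$')
seven_basic=$(printf '%s\n' "$sample" | grep -c '^[0-9]\{7\}$')

echo "$optional_u $phone $seven_extended $seven_basic"
```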
ab|cd
Extended Regular Expressions give you a way of matching one expression or another expression using the logical or bar | operator:
$ grep -E 'dog|cat' doc.txt # find lines with dog or cat
$ grep 'dog house\|cat fight' doc.txt # find lines with "dog house" or "cat fight"
You can do a crude form of alternation using the -e option to give the alternatives (as many as you like) in the grep family of programs:
$ fgrep -e 'dog' -e 'cat' doc.txt # find lines containing dog or cat
$ grep -e '^dog$' -e '^cat$' doc.txt # find lines with *only* dog or cat
The or | operator binds very loosely. Everything else has higher precedence:
$ grep -E '^a|b$' doc.txt # lines starting with a or ending with b
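The different ways of writing an alternation can be checked against each other. A sketch with three invented lines; the Basic \| form assumes GNU grep, and grep -F is used as the current spelling of fgrep.

```shell
sample=$(printf '%s\n' 'hot dog' 'cat nap' 'bird')

extended_or=$(printf '%s\n' "$sample" | grep -E 'dog|cat')       # Extended |
basic_or=$(printf '%s\n' "$sample" | grep 'dog\|cat')            # Basic \|
multi_e=$(printf '%s\n' "$sample" | grep -e 'dog' -e 'cat')      # two -e options
fixed=$(printf '%s\n' "$sample" | grep -F -e 'dog' -e 'cat')     # fixed strings

echo "$extended_or"
```

All four commands select the same two lines; bird is never selected.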
a(b|c)d
Parentheses ( and ) are an Extended feature that can be used to group Regular Expressions for repetition, and to override the precedence rules.
$ egrep 'ab|cd' doc.txt # ab or cd
$ egrep 'a(b|c)d' doc.txt # a followed by "b or c" followed by d
$ grep -E '^a|b$' doc.txt # lines starting with a or ending with b
$ grep -E '^(a|b)$' doc.txt # lines containing only a or only b
$ egrep 'Bar(bar)+a' doc.txt # Barbara, Barbarbara, etc.
(Visit Barbara at the Rhababer-Barbara-Bar.)
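The effect of the parentheses on precedence shows up clearly when counting matches. A sketch with an invented sample list:

```shell
sample=$(printf '%s\n' 'apple' 'crab' 'banana' 'a' 'b' 'Bara' 'Barbara')

# Without grouping: (starts with a) OR (ends with b).
loose=$(printf '%s\n' "$sample" | grep -E -c '^a|b$')

# With grouping: the whole line is a single a or a single b.
grouped=$(printf '%s\n' "$sample" | grep -E -c '^(a|b)$')

# Grouping for repetition: one or more bar between Bar and a.
barbara=$(printf '%s\n' "$sample" | grep -E -c 'Bar(bar)+a')

echo "loose: $loose, grouped: $grouped, Bar(bar)+a: $barbara"
```

The ungrouped form selects apple, crab, a, and b; the grouped form selects only the two one-character lines.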
Alternation has the loosest or lowest precedence (think addition).
$ grep 'ab*' doc.txt # repetition binds first: a followed by zero or more b
$ grep -E 'ab|cd' doc.txt # alternation binds last: ab or cd
As in mathematics, Regular Expression precedence can be overridden with explicit parentheses to do grouping.
Operation | Regex | Algebra |
---|---|---|
grouping | () or \(\) | parentheses (brackets) |
repetition | * or ? or + or {n} or {n,} or {n,m} (Extended); * or \? or \+ or \{n\} or \{n,\} or \{n,m\} (Basic) | exponentiation |
concatenation | ab | multiplication or division |
alternation | the or bar, plain in Extended and backslashed in Basic | addition or subtraction |
\.
To remove the Regular Expression meaning of any Regular Expression meta character, put a backslash in front of it. This applies to both Basic and Extended Regular Expressions. In all types of Regular Expressions:
- \* matches a literal asterisk
- \. matches a literal period
- \\ matches a literal backslash
- \$ matches a literal dollar sign
- \^ matches a literal circumflex
In Extended Regular Expressions, you need more backslashes to hide the additional Extended Regular Expression meta-characters, e.g. \+ hides the meaning of + and matches a real plus sign in an Extended Regular Expression, just as \? matches a real question mark:
$ egrep 'foo\++' doc.txt # match one or more plus signs (Extended)
$ grep 'foo+\+' doc.txt # match one or more plus signs (Basic)
$ egrep 'foo\??' doc.txt # match an optional question mark (Extended)
$ grep 'foo?\?' doc.txt # match an optional question mark (Basic)
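A sketch of what the backslash changes, using four invented sample lines and counting the lines each expression selects:

```shell
sample=$(printf '%s\n' 'a.b' 'axb' 'foo' 'foo+')

# Unescaped . matches any character; escaped \. matches only a period.
any_char=$(printf '%s\n' "$sample" | grep -c 'a.b')
literal_dot=$(printf '%s\n' "$sample" | grep -c 'a\.b')

# Extended: + is an operator (one or more o), \+ is a real plus sign.
extended_operator=$(printf '%s\n' "$sample" | grep -E -c 'foo+')
extended_literal=$(printf '%s\n' "$sample" | grep -E -c 'foo\+')

# Basic: an unescaped + is already a real plus sign.
basic_literal=$(printf '%s\n' "$sample" | grep -c 'foo+')

echo "$any_char $literal_dot $extended_operator $extended_literal $basic_literal"
```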
The POSIX class name includes the surrounding colons and square brackets and nothing should ever be placed inside those brackets. This is a common mistake:
grep '[[^:digit:]]' # WRONG ! no longer a POSIX class name !
grep '[^[:digit:]]' # correct - match any single non-digit character
Using what you think is a POSIX character class outside of the enclosing character class square brackets does not work. On some systems, grep will warn you that it doesn’t work:
$ grep '[:alnum:]' # WRONG !
grep: character class syntax is [[:space:]], not [:space:]
On other systems, the character class expression will quietly match the list of characters inside the outer square brackets, i.e. match one of the characters :, a, l, n, u, or m!
Any Regular Expression match will be as long as possible. They are called “greedy”:
- a.*c matches all of abc___abc – it doesn’t only match the first abc.
- (Non-greedy matching is available using the perl expression *?, also available as grep -P.)
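Greedy matching can be observed with grep -o, which prints only the matched text (an option assumed to be available, as in GNU and BSD grep). The second command is one portable way to get the shorter match without Perl syntax: it replaces .* with a class that refuses to run past the first c.

```shell
text='abc___abc'

# Greedy: .* swallows as much as it can, so the match runs to the last c.
greedy=$(echo "$text" | grep -o 'a.*c')

# Workaround: [^c]* cannot cross a c, so each match stops at the first c.
shortest_first=$(echo "$text" | grep -o 'a[^c]*c' | head -n 1)

echo "greedy:   $greedy"
echo "shortest: $shortest_first"
```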
All the expressions below match the same set of lines containing a letter a, but the first expression uses a lot less processing power than the others:
$ grep 'a' file.txt # this is the cleanest and fastest one
$ grep 'aa*' file.txt
$ grep 'a.*' file.txt
$ grep '.*a' file.txt
$ grep '.*a.*' file.txt
If you’re looking for lines containing a piece of text, don’t complicate the regexp with repeat operators that waste computer time but don’t change which lines the regexp finds.
Regular Expressions match text within a single line; they do not match across line ends. The vim editor is an exception – it has a special syntax for matching across line ends, e.g. abc\ndef and abc\_.def, but this doesn’t work anywhere else (so don’t worry about it here).
^ and $
Unlike GLOB Patterns, which are anchored, Regular Expressions are not anchored unless you make them so using the explicit anchor characters ^ and/or $. Unanchored Regular Expressions “float” down the string until a match is found, and they don’t have to extend to the end of the string.
$ echo a*b # anchored: matches axb not abx or xab
$ ls | grep '^a.*b$' # equivalent anchored Regular Expression
$ ls | grep 'a.*b' # NOT equivalent unanchored Regular Expression
Regular Expressions “float” down the string unless they are anchored.
Remember that * means “zero or more”, so a Regular Expression using * can match zero characters. For example, if you have a line with any 10 characters in it, the zero-length Regular Expression x* (meaning zero or more x characters) could match 11 times, before and after every one of the 10 characters (if it doesn’t match any of the characters themselves):
$ echo '0123456789' | sed -e 's/x*/-/g'
-0-1-2-3-4-5-6-7-8-9-
The grep colour option and web tools such as http://regexpal.com cannot highlight matches of zero characters, but the matches are there! The vim editor will highlight the entire line when a zero-length expression matches between all the characters.
This Regular Expression below sometimes works, and sometimes does not, depending on what file names match the aa* GLOB pattern in the current directory:
grep aa* foo.txt # no quotes, GLOB expands: bad idea
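The risk of the unquoted form can be reproduced in a scratch directory. This is a sketch: the directory comes from mktemp, and the file names foo.txt and aardvark are invented so that the GLOB has something to match.

```shell
# Hypothetical sandbox containing a data file and a file whose name matches aa*.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'aaa\nb\n' > foo.txt
touch aardvark

# What grep would really be given: the shell has replaced aa* with aardvark.
unquoted_args=$(echo grep aa* foo.txt)

# Quoted: grep receives the regexp aa* and finds the line aaa.
quoted_hits=$(grep -c 'aa*' foo.txt)

# Unquoted: grep searches for the text aardvark and finds nothing.
unquoted_hits=$(grep -c aa* foo.txt || true)

echo "unquoted command line: $unquoted_args"
echo "quoted hits: $quoted_hits, unquoted hits: $unquoted_hits"

# Clean up the sandbox.
cd / && rm -rf "$demo_dir"
```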
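- The shell will try to expand the GLOB pattern aa*, possibly changing it into existing filenames that begin with aa before grep runs: we don’t want that.
- Always quote the Regular Expression to hide it from the shell, so that grep sees the regex.
- Do not use alphabetic ranges such as [a-m]; they do not match what you think they match (only lower-case letters) in many common locales.
- Use a POSIX class where one fits, e.g. [[:lower:]]
- Otherwise list the characters: use [abcdefghijklm] not [a-m]
- Digit ranges such as [0-9] are accepted. (Are there any locales where [0-9] does not mean the ten digits zero through nine?)
- Alternatively, force the traditional ASCII ordering by setting LC_ALL=C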
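Three locale-safe ways of asking for a lower-case letter can be compared on a two-line sample. This is a sketch: the sample is invented, and the first command sets LC_ALL=C for that one grep only, which is the setting under which the range has its traditional ASCII meaning.

```shell
sample=$(printf '%s\n' 'A' 'b')

# Range, but with the C locale forced for this one command.
c_range=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[a-z]')

# POSIX class name inside a character class.
lower_class=$(printf '%s\n' "$sample" | grep -c '[[:lower:]]')

# Explicit list of the characters.
explicit_list=$(printf '%s\n' "$sample" | grep -c '[abcdefghijklmnopqrstuvwxyz]')

echo "$c_range $lower_class $explicit_list"
```

Each command selects only the line b.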
vi sed less
vi reference: http://www.tutorialspoint.com/unix/unix-vi-editor.htm
You can search and replace in vi using a Basic Regular Expression in a Substitution line command. The substitution command by default uses slashes to delimit the text to match and the replacement text:
:%s/colou\?r/COLOUR/g # make all color and colour upper-case
The program sed (Stream EDitor) can apply a Basic Regular Expression substitution non-interactively by reading a file (or standard input) and writing to standard output:
$ sed -e 's/colou\?r/COLOUR/g' input.txt >output.txt
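The substitution can be tried on standard input instead of a file. This sketch swaps in the interval \{0,1\} for \?, on the assumption that \? is a GNU extension to Basic Regular Expressions while the interval form is standard; the input sentence is made up.

```shell
# Replace color and colour, but leave the misspelled colouur alone.
result=$(echo 'color colour colouur' | sed -e 's/colou\{0,1\}r/COLOUR/g')

echo "$result"
```

Only the first two words are replaced, because the expression allows at most one u.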
You can search using Regular Expressions in the interactive programs vi, more, and less (and also man, which uses less) by typing a slash followed by the Regular Expression to search for:
/^ *read # find "read" at the start of a line, after optional spaces
(Remember that vi and more use Basic Regular Expressions and less uses Extended Regular Expressions.)
vi
Task: Any lower-case letter following a period and two spaces should be made upper-case. Easy to do using Regular Expressions in vi:
- In vi, type: /\.  [[:lower:]] (a period, two spaces, then the class)
- Type 4~ to make four characters upper-case (the period, the two spaces, and the letter; only the letter changes).
- Type n (next match) and . (repeat change) as many times as necessary.
- The n command moves to the next occurrence, and . repeats the capitalization command.
Task: Any upper-case character following a lower case character should be made lower case, e.g. uNcapitalize or aWkward or iN
- In vi, type: /[[:lower:]][[:upper:]]
- Type l to move one to the right (off of the lower-case letter)
- Type ~ to change the capitalization
- Repeat with nl. as necessary
- The l is needed because vi will position the cursor on the first character of the match, which in this case is a character that doesn’t change.
Advanced: In vim you can also use the syntax /[[:lower:]][[:upper:]]/b1 to both match the text and move the cursor right one position. Then you can just repeat the two characters n. as many times as necessary. The vim editor has very advanced pattern search and cursor position capabilities; type :help regexp
Lynda.com has a course on regular expressions
The problem is that it covers our material as well as some more advanced topics that we won’t cover
It is a good presentation, and the following chapters should have minimal references to the “too advanced” material
For a quick interactive tutorial on Regular Expressions, see http://regexone.com/ but be aware that this tutorial uses some short-hand expressions that we don’t use in this course because they don’t work everywhere:
Shortcut | POSIX Character Class |
---|---|
\w | similar to [[:alnum:]_] |
\W | similar to [^[:alnum:]_] |
\s | similar to [[:space:]] |
\S | similar to [^[:space:]] |
The tutorial does not use or understand the POSIX character classes that are more standard in Unix/Linux programs.