
Unix regular expressions. Using regular expressions (regex) in Linux

Original: Linux Fundamentals
Posted by: Paul Cobbaut
Publication date: October 16, 2014
Translation: A.Panin
Translation Date: December 17, 2014

Chapter 19. Regular expressions

Regular expressions are a very powerful Linux tool. They can be used with a variety of programs, such as bash, vi, rename, grep, sed and others.

This chapter presents basic information about regular expressions.

Regular expression syntax versions

There are three different versions of regular expression syntax: basic regular expressions (BRE), extended regular expressions (ERE) and Perl regular expressions (PCRE).

Depending on the tool used, one or more of these syntaxes can be used.

For example, the grep tool has the -E option, which forces the pattern to be read as an extended regular expression (ERE), while -G forces the basic syntax (BRE) and -P forces the Perl syntax (PCRE).

Note that grep also has the -F option, which forces the pattern to be read literally, without any regular expression processing.

The sed tool also has options to select the regular expression syntax.

Always read the manual of the tools you use!
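The effect of these grep options can be seen by running one and the same pattern in several modes. The following is only a minimal sketch; the three sample lines and the variable names are invented for illustration and do not come from the original chapter.

```shell
# Illustrative input: three lines, one of them containing a literal pipe.
sample=$(printf '%s\n' 'a' 'b' 'a|b')

# -G (BRE): an unescaped | is an ordinary character, so only 'a|b' matches.
bre=$(printf '%s\n' "$sample" | grep -G 'a|b')

# -E (ERE): | means "or", so every line containing a or b matches.
ere=$(printf '%s\n' "$sample" | grep -E 'a|b')

# -F (fixed string): nothing is special, only the literal text a|b matches.
fixed=$(printf '%s\n' "$sample" | grep -F 'a|b')

printf 'BRE:\n%s\nERE:\n%s\nfixed:\n%s\n' "$bre" "$ere" "$fixed"
```

The same three characters select one line or all three lines, depending only on the syntax option.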

grep

Printing lines that match a pattern

The grep utility is a popular Linux tool for finding lines that match a given pattern. Below are examples of the simplest regular expressions that can be used with it.

This is the content of the test file used in the examples. The file contains three lines (or three newline characters).

$ cat names
Tania
Laura
Valentina

When searching for a single character, only the lines that contain that character are displayed.

$ grep u names
Laura
$ grep e names
Valentina
$ grep i names
Tania
Valentina

The pattern matching in this example is straightforward: if the given character occurs in a line, grep prints that line.

Concatenating characters

To search for a combination of characters, concatenate the characters in the regular expression in the same way.

This example shows that the regular expression ia matches Tania but not Valentina, and that the regular expression in matches Valentina but not Tania.

$ grep a names
Tania
Laura
Valentina
$ grep ia names
Tania
$ grep in names
Valentina

One or the other

Both the PCRE and the ERE syntax allow the pipe symbol (|), which here represents a logical OR. In this example we use grep to search for lines that contain the character i or the character a.

$ cat list
Tania
Laura
$ grep -E 'i|a' list
Tania
Laura

Note that we use the -E option of grep to force our regular expression to be interpreted as an extended regular expression (ERE).

In the basic syntax (BRE), the pipe symbol has to be escaped to be interpreted as a logical OR.

$ grep -G 'i|a' list
$ grep -G 'i\|a' list
Tania
Laura

One or more occurrences

The * matches zero, one or more occurrences of the previous character, and the + matches one or more occurrences of the previous character.

$ cat list2
ll
lol
lool
loool
$ grep -E 'o*' list2
ll
lol
lool
loool
$ grep -E 'o+' list2
lol
lool
loool

Matching at the end of a line

The following examples use this file:

$ cat names
Tania
Laura
Valentina
Fleur
Floor

The two examples below show how to use the dollar sign to match at the end of a line.

$ grep a$ names
Tania
Laura
Valentina
$ grep r$ names
Fleur
Floor

Matching at the beginning of a line

The caret character (^) matches at the beginning (that is, from the first characters) of a line.

These examples use the same file as above.

$ grep ^Val names
Valentina
$ grep ^F names
Fleur
Floor

The dollar sign and the caret used in regular expressions are called anchors.

Separating words

Padding the searched word with spaces is not a good solution (because other characters can also act as word separators). The example below shows how to use \b to find lines that contain a given word rather than just a sequence of characters:

$ grep '\bover\b' text
The winter is over.
Can you get over there?

Note that grep also has the -w option, which forces a search for whole words.

$ cat text
The governer is governing.
The winter is over.
Can you get over there?
$ grep -w over text
The winter is over.
Can you get over there?
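To see why padding with spaces falls short, the two approaches can be compared on the same three lines. This is a small sketch that keeps the text in a shell variable instead of a file; the variable names are chosen here for illustration only.

```shell
# The same three lines as in the text file used above.
text=$(printf '%s\n' 'The governer is governing.' \
                     'The winter is over.' \
                     'Can you get over there?')

# Padding with spaces misses "over." because a full stop follows the word.
padded=$(printf '%s\n' "$text" | grep ' over ')

# -w treats any non-word character (or a line edge) as a word boundary.
whole=$(printf '%s\n' "$text" | grep -w 'over')

printf 'padded with spaces:\n%s\n\nwith -w:\n%s\n' "$padded" "$whole"
```

The space-padded pattern finds only one of the two lines that contain the word, while -w finds both and still skips "governer".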

grep options

Sometimes it is easier to combine a simple regular expression with grep options than to write a more complex regular expression. These options were discussed earlier:

grep -i
grep -v
grep -w
grep -A5
grep -B5
grep -C5

Preventing shell expansion of a regular expression

The dollar sign is a special character both for regular expressions and for the shell (remember shell variables and embedded shells). It is therefore advisable to always quote regular expressions, since quoting prevents the shell from expanding them.

$ grep 'r$' names
Fleur
Floor
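What the shell does to a pattern before grep ever sees it can be made visible by storing the quoted text in variables. The sketch below assumes a made-up variable v, defined only to provoke an expansion; it is not part of the original examples.

```shell
names=$(printf '%s\n' Tania Laura Valentina Fleur Floor)
v=ia   # hypothetical shell variable, defined only for this demonstration

single='Tan$v'   # single quotes: the shell leaves $v untouched
double="Tan$v"   # double quotes: the shell substitutes the value of v

printf 'single-quoted pattern: %s\n' "$single"
printf 'double-quoted pattern: %s\n' "$double"

# In single quotes, r$ reaches grep unchanged and anchors at the end of the line.
matches=$(printf '%s\n' "$names" | grep 'r$')
printf '%s\n' "$matches"
```

Single quotes are the safer habit, because nothing inside them is interpreted by the shell.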

rename

The rename implementation

On Debian Linux, /usr/bin/rename is a link to the /usr/bin/prename script, which is installed by the perl package.

$ dpkg -S $(readlink -f $(which rename))
perl: /usr/bin/prename

Distributions based on Red Hat do not create such a symbolic link to the described script (unless, of course, a link to a manually installed script is created), so this section does not describe the rename utility shipped with Red Hat.

Discussions about the rename utility on the Internet are often confusing, because solutions that work perfectly on Debian (as well as Ubuntu, xubuntu, Mint, ...) cannot be used on Red Hat (as well as CentOS, Fedora, ...).

The perl package

The rename command is in fact implemented as a script that uses Perl regular expressions. The full manual for these regular expressions can be read with the perldoc perlrequick command (after installing the perl-doc package).

# aptitude install perl-doc
The following NEW packages will be installed:
  perl-doc
0 packages upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 8,170 kB of archives. After unpacking 13.2 MB will be used.
Get: 1 http://mirrordirector.raspbian.org/raspbian/ wheezy/main perl-do...
Fetched 8,170 kB in 19s (412 kB/s)
Selecting previously unselected package perl-doc.
(Reading database ... 67121 files and directories currently installed.)
Unpacking perl-doc (from .../perl-doc_5.14.2-21+rpi2_all.deb) ...
Adding 'diversion of /usr/bin/perldoc to /usr/bin/perldoc.stub by perl-doc'
Processing triggers for man-db ...
Setting up perl-doc (5.14.2-21+rpi2) ...
# perldoc perlrequick

Well-known syntax

Most often, the rename utility is used to find files whose names match a given string and to replace that string with another one.

This is typically written as the regular expression s/string/other string/, as shown in the example:

$ ls
abc       allfiles.TXT  bllfiles.TXT  Scratch   tennis2.TXT
abc.conf  backup        cllfiles.TXT  temp.TXT  tennis.TXT
$ rename 's/TXT/text/' *
$ ls
abc       allfiles.text  bllfiles.text  Scratch    tennis2.text
abc.conf  backup         cllfiles.text  temp.text  tennis.text

Below is another example, which uses the same well-known rename syntax to change the extensions of these files once more:

$ ls
abc       allfiles.text  bllfiles.text  Scratch    tennis2.text
abc.conf  backup         cllfiles.text  temp.text  tennis.text
$ rename 's/text/txt/' *.text
$ ls
abc       allfiles.txt  bllfiles.txt  Scratch   tennis2.txt
abc.conf  backup        cllfiles.txt  temp.txt  tennis.txt

These two examples work only because the strings we used occur exclusively in the file extensions. Do not forget that file extensions have no meaning in the bash shell.

The following example demonstrates a problem you can run into when using this syntax.

$ touch atxt.txt
$ rename 's/txt/problem/' atxt.txt
$ ls
abc       allfiles.txt  backup        cllfiles.txt  temp.txt     tennis.txt
abc.conf  aproblem.txt  bllfiles.txt  Scratch       tennis2.txt

This command replaces only the first occurrence of the searched string.

Global replacement

The syntax used in the previous example can be described as s/regular expression/replacement/. This is simple and straightforward: you place a regular expression between the first two slashes and a replacement string between the last two slashes.

The following example extends this syntax slightly by adding a modifier.

$ rename -n 's/TXT/txt/g' aTXT.TXT
aTXT.TXT renamed as atxt.txt

The syntax we used can now be described as s/regular expression/replacement/g, where s stands for the replacement operation (switch) and the g modifier requests a global replacement.

Note that this example uses the -n option to show what would be done (instead of performing the operation itself, that is, actually renaming the file).
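Since the Perl rename script is not available on every distribution, the difference between replacing the first occurrence and replacing all occurrences can also be observed with sed, whose s command follows the same s/regular expression/replacement/ form. This is a sketch that uses sed in place of rename and only transforms a file name as text; no file is renamed.

```shell
name=atxt.txt

# Without a modifier, only the first occurrence of txt is replaced.
first_only=$(printf '%s\n' "$name" | sed 's/txt/problem/')

# With the g modifier, every occurrence on the line is replaced.
global=$(printf '%s\n' "$name" | sed 's/txt/problem/g')

printf 'first only: %s\nglobal:     %s\n' "$first_only" "$global"
```

Previewing the result on plain text like this serves a similar purpose to the -n option of rename.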

Case-insensitive replacement

Another modifier that can be useful is i. The example below shows how to replace a string with another one regardless of case.

$ ls
file1.text  file2.TEXT  file3.txt
$ rename 's/.text/.txt/i' *
$ ls
file1.txt  file2.txt  file3.txt

Changing extensions

The Linux command line has no notion of file extensions like those used in MS-DOS, but many users and graphical applications do use them.

This section gives an example of using rename to change only the file extension. The example uses the dollar sign to indicate that the replacement is anchored at the end of the file name.

$ ls *.txt
allfiles.txt  bllfiles.txt  cllfiles.txt  really.txt.txt  temp.txt  tennis.txt
$ rename 's/.txt$/.TXT/' *.txt
$ ls *.TXT
allfiles.TXT  bllfiles.TXT  cllfiles.TXT  really.txt.TXT  temp.TXT  tennis.TXT

Note that the dollar sign in the regular expression means the end of the line. Without the dollar sign, this command would fail on the really.txt.txt file.
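The role of the dollar sign can be checked on file names treated as plain text. The sketch below uses sed instead of rename, so nothing is renamed on disk, and it escapes the dot so that it matches only a literal dot; both choices are assumptions made for this illustration.

```shell
for name in allfiles.txt really.txt.txt; do
    # Anchored: only a .txt at the very end of the name is replaced.
    anchored=$(printf '%s\n' "$name" | sed 's/\.txt$/.TXT/')
    # Unanchored: the first .txt found anywhere in the name is replaced.
    unanchored=$(printf '%s\n' "$name" | sed 's/\.txt/.TXT/')
    printf '%-16s anchored: %-16s unanchored: %s\n' "$name" "$anchored" "$unanchored"
done
```

For really.txt.txt the unanchored pattern changes the middle part of the name and leaves the real extension untouched, which is exactly the failure described above.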

sed

Stream editor

The stream editor, or sed for short, uses regular expressions to modify a data stream.

In this example, sed is used to replace a string.

$ echo Monday | sed 's/Mon/Tues/'
Tuesday

The slashes can be replaced by other characters, which can be more convenient and can make the command more readable in some cases.

$ echo Monday | sed 's:Mon:Tues:'
Tuesday
$ echo Monday | sed 's_Mon_Tues_'
Tuesday
$ echo Monday | sed 's|Mon|Tues|'
Tuesday

Interactive editor

Although sed is designed to process data streams, it can also be used to edit files in place.

$ echo Monday > today
$ cat today
Monday
$ sed -i 's/Mon/Tues/' today
$ cat today
Tuesday

The ampersand can be used to refer to the searched (and found) string.

In this example, the ampersand is used to double the string that was found.

$ echo Monday | sed 's/Mon/&&/'
MonMonday
$ echo Monday | sed 's/day/&&/'
Mondayday

Parentheses are used to group parts of a regular expression so that they can be referenced later.

Consider the following example:

$ echo Sunday | sed 's_\(Sun\)_\1ny_'
Sunnyday
$ echo Sunday | sed 's_\(Sun\)_\1ny \1_'
Sunny Sunday

A dot for any character

In a regular expression, a simple dot can stand for any character.

$ echo 2014-04-01 | sed 's/....-..-../YYYY-MM-DD/'
YYYY-MM-DD
$ echo abcd-ef-gh | sed 's/....-..-../YYYY-MM-DD/'
YYYY-MM-DD

When more than one pair of parentheses is used, each of them can be referenced by consecutive numbers.

$ echo 2014-04-01 | sed 's/\(....\)-\(..\)-\(..\)/\1+\2+\3/'
2014+04+01
$ echo 2014-04-01 | sed 's/\(....\)-\(..\)-\(..\)/\3:\2:\1/'
01:04:2014

This feature is called grouping.
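As mentioned earlier, sed has options to select the regular expression syntax. The sketch below repeats the date example with the -E option (extended syntax, as accepted by GNU and BSD sed), where parentheses and braces need no backslashes. Replacing the dots with [0-9] is a choice made for this illustration and is not part of the original example.

```shell
# BRE form, as in the example above: groups are written \( ... \).
bre=$(echo 2014-04-01 | sed 's/\(....\)-\(..\)-\(..\)/\3:\2:\1/')

# ERE form: plain parentheses group, {n} counts repetitions, [0-9] restricts to digits.
ere=$(echo 2014-04-01 | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3:\2:\1/')

# Because of [0-9], text that merely looks like a date is left untouched.
not_a_date=$(echo abcd-ef-gh | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3:\2:\1/')

printf '%s\n%s\n%s\n' "$bre" "$ere" "$not_a_date"
```

Both forms reorder the date in the same way; the digit classes only make the pattern stricter than the dots.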

Whitespace

The sequence \s can be used to refer to a whitespace character such as a space or a tab.

This example searches globally for whitespace characters (\s) and replaces each of them with one space.

$ echo -e 'today\tis\twarm'
today   is      warm
$ echo -e 'today\tis\twarm' | sed 's_\s_ _g'
today is warm

Optional occurrence

The question mark indicates that the previous character is optional.

The example below searches for a sequence of three o characters, the third of which is optional.

$ cat list2
ll
lol
lool
loool
$ grep -E 'ooo?' list2
lool
loool
$ cat list2 | sed 's/ooo\?/A/'
ll
lol
lAl
lAl

Exactly n repetitions

You can specify the exact number of repetitions of the previous character.

This example searches for lines with exactly three o characters.

$ cat list2
ll
lol
lool
loool
$ grep -E 'o{3}' list2
loool
$ cat list2 | sed 's/o\{3\}/A/'
ll
lol
lool
lAl

Between n and m repetitions

In this example, we explicitly state that the character must be repeated at least (2) and at most (3) times.

$ cat list2
ll
lol
lool
loool
$ grep -E 'o{2,3}' list2
lool
loool
$ grep 'o\{2,3\}' list2
lool
loool
$ cat list2 | sed 's/o\{2,3\}/A/'
ll
lol
lAl
lAl
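One detail is easy to overlook with repetition counts: grep looks for the pattern anywhere in the line, so a longer run of the character also contains the shorter run. The sketch below adds a fifth word, looool, to the test data (an addition made here for illustration) and shows how anchors enforce the exact count.

```shell
words=$(printf '%s\n' ll lol lool loool looool)

# Unanchored: any line that contains three o characters in a row matches,
# including looool, whose four o characters contain three.
unanchored=$(printf '%s\n' "$words" | grep -E 'o{3}')

# Anchored: the whole line must be l, exactly three o characters, then l.
anchored=$(printf '%s\n' "$words" | grep -E '^lo{3}l$')

printf 'unanchored:\n%s\nanchored:\n%s\n' "$unanchored" "$anchored"
```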

Bash history

The bash shell can also interpret some regular expressions.

This example shows how to manipulate the exclamation mark search mask in the bash history.

$ mkdir hist
$ cd hist/
$ touch file1 file2 file3
$ ls -l file1
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file1
$ !l
ls -l file1
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file1
$ !l:s/1/3
ls -l file3
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file3

This technique also works when commands are recalled from the bash history by number.

$ history 6
 2089  mkdir hist
 2090  cd hist/
 2091  touch file1 file2 file3
 2092  ls -l file1
 2093  ls -l file3
 2094  history 6
$ !2092
ls -l file1
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file1
$ !2092:s/1/2
ls -l file2
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file2

To process text properly in bash scripts with sed and awk, you need to understand regular expressions. Implementations of this useful tool can be found almost everywhere, and although all regular expressions are built on the same ideas, working with them differs somewhat between environments. Here we will talk about regular expressions that are suitable for use in Linux command-line scripts.

This material is intended as an introduction to regular expressions for readers who may know nothing at all about them. So let's start from the very beginning.

What are regular expressions

Many people, when they first see regular expressions, immediately think they are looking at a meaningless jumble of characters. That, of course, is far from the truth. Take a look, for example, at this regular expression:


In our opinion, even an absolute novice will immediately understand how it works and why it is needed :) If you do not, just read on and everything will fall into place.
A regular expression is a pattern that programs like sed or awk use to filter text. Patterns are built from ordinary ASCII characters, which represent themselves, and so-called metacharacters, which play a special role, for example by allowing you to refer to groups of characters.

Types of regular expressions

Implementations of regular expressions in different environments, for example in programming languages like Java, Perl and Python, or in Linux tools like sed, awk and grep, have certain peculiarities. These depend on the so-called regular expression engines, which interpret the patterns.
Linux has two regular expression engines:
  • The engine supporting the POSIX Basic Regular Expression (BRE) standard.
  • The engine supporting the POSIX Extended Regular Expression (ERE) standard.
Most Linux utilities conform at least to the POSIX BRE standard, but some utilities (sed among them) understand only a subset of BRE. One reason for this restriction is the desire to make such utilities process text as fast as possible.

The POSIX ERE standard is often implemented in programming languages. It offers a larger set of tools for writing regular expressions, for example special character sequences for frequently used patterns such as searching text for individual words or sets of digits. awk supports the ERE standard.

There are many ways to write regular expressions, depending both on the programmer's preferences and on the features of the engine they are written for. It is not easy to write universal regular expressions that any engine can understand. We will therefore focus on the most commonly used regular expressions and look at how they are implemented in sed and awk.

POSIX BRE regular expressions

Perhaps the simplest BRE pattern is a regular expression that searches the text for an exact sequence of characters. Here is what searching for a string looks like in sed and awk:

$ echo "This is a test" | sed -n '/test/p'
$ echo "This is a test" | awk '/test/{print $0}'

Search text by template in sed


Text Search by Template in AWK

Note that the search for the given pattern does not depend on where exactly the text is located in the line. The number of occurrences does not matter either. As soon as the regular expression finds the given text anywhere in the line, the line is considered a match and is passed on for further processing.

When working with regular expressions, keep in mind that they are case-sensitive:

$ echo "This is a test" | awk '/Test/{print $0}'
$ echo "This is a test" | awk '/test/{print $0}'

Regular expressions are case-sensitive

The first regular expression found no match, because the word "Test", starting with a capital letter, does not occur in the text. The second one, which searches for the word written in lowercase letters, found a matching line in the stream.
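When case should not matter, the tools themselves offer ways around it. The sketch below shows two common ones, the -i option of grep and the tolower() function of awk; neither appears in the original example, so treat them as one possible approach.

```shell
line="This is a test"

# Case-sensitive: "Test" with a capital T does not occur, so nothing is printed.
strict=$(printf '%s\n' "$line" | awk '/Test/{print $0}')

# grep -i ignores case differences between the pattern and the text.
grep_i=$(printf '%s\n' "$line" | grep -i 'Test')

# awk: convert the line to lowercase first, then match a lowercase pattern.
awk_lower=$(printf '%s\n' "$line" | awk 'tolower($0) ~ /test/ {print $0}')

printf 'strict: [%s]\ngrep -i: [%s]\nawk tolower: [%s]\n' "$strict" "$grep_i" "$awk_lower"
```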

Regular expressions can contain not only letters, but also spaces and digits:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'

Searching for a text fragment that contains spaces and digits

The regular expression engine treats spaces as ordinary characters.

Special characters

Some care is needed when using certain characters in regular expressions. There are special characters, or metacharacters, that require special treatment when used in a pattern. Here they are:

.*[]^${}\+?|()
If one of them is needed in a pattern, it must be escaped with a backslash (\).

For example, to find a dollar sign in the text, it must be included in the pattern preceded by the escape character. Say there is a file named myfile with this text:

There is $10 on my pocket
The dollar sign can be detected with this pattern:

$ awk '/\$/{print $0}' myfile

Using a special character in a pattern

The backslash is itself a special character, so if you need it in a pattern, it must be escaped as well. This looks like two backslashes in a row:

$ echo "\ is a special character" | awk '/\\/{print $0}'

Escaping a backslash

Although the forward slash is not in the list of special characters above, trying to use it unescaped in a regular expression written for sed or awk results in an error:

$ echo "3 / 2" | awk '///{print $0}'

Incorrect use of a forward slash in a pattern

If it is needed, it must be escaped too:

$ echo "3 / 2" | awk '/\//{print $0}'

Escaping a forward slash
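The slash is special here only because it delimits the pattern. Two alternatives that avoid escaping it are sketched below: in awk the regular expression can be given as a string, and in sed an address may start with a backslash followed by a different delimiter character. The choice of % as the delimiter is arbitrary.

```shell
# awk: a regular expression written as a string has no slash delimiters,
# so the slash inside it is an ordinary character.
awk_hit=$(echo "3 / 2" | awk '$0 ~ "/" {print $0}')

# sed: \% starts an address that uses % instead of / as its delimiter.
sed_hit=$(echo "3 / 2" | sed -n '\%/%p')

printf 'awk: %s\nsed: %s\n' "$awk_hit" "$sed_hit"
```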

Anchor characters

There are two special characters for anchoring a pattern to the beginning or to the end of a line of text. The caret (^) describes character sequences located at the beginning of a line. If the searched pattern appears elsewhere in the line, the regular expression does not match it. Using this character looks like this:

$ echo "welcome to likegeeks website" | awk '/^likegeeks/{print $0}'
$ echo "likegeeks website" | awk '/^likegeeks/{print $0}'

Pattern search at the beginning of the line

The ^ character searches for a pattern at the beginning of the line, and character case is taken into account as well. Let's see how this affects the processing of a text file:

$ awk '/^this/{print $0}' myfile


Template search at the beginning of the line in the text from the file

In sed, if the caret is placed anywhere other than the beginning of the pattern, it is treated like any other ordinary character:

$ echo "This ^ is a test" | sed -n '/s ^/p'

A caret that is not at the beginning of the pattern in sed

In awk, this character must be escaped when the same pattern is used:

$ echo "This ^ is a test" | awk '/s \^/{print $0}'

A caret that is not at the beginning of the pattern in awk

That covers searching for text at the beginning of a line. What if you need to find something at the end of a line?

The dollar sign ($), which is the end-of-line anchor, helps here:

$ echo "This is a test" | awk '/test$/{print $0}'

Text search at the end of the string

Both anchor characters can be used in the same pattern. Let's process the file myfile, whose contents are shown in the figure below, with this regular expression:

$ awk '/^this is a test$/{print $0}' myfile


A pattern that uses both the start and the end anchors

As you can see, the pattern matched only the line that corresponds exactly to the given sequence of characters and their position.

Here is how to filter out empty lines using anchor characters:

$ awk '!/^$/{print $0}' myfile
This pattern uses the negation character, the exclamation mark (!). The pattern searches for lines that contain nothing between the beginning and the end of the line, and thanks to the exclamation mark only the lines that do not match this pattern are printed.
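The same filtering can be done with grep and sed, and the pattern can be widened so that lines consisting only of spaces or tabs are dropped too. The following is a sketch with invented sample text; [[:space:]] is one of the POSIX character classes.

```shell
# Four lines: text, an empty line, a line with three spaces, text.
text=$(printf 'first\n\n   \nlast\n')

# ^$ matches only truly empty lines; grep -v prints the lines that do not match.
no_empty=$(printf '%s\n' "$text" | grep -v '^$')

# The d command of sed deletes the lines that match the address.
sed_way=$(printf '%s\n' "$text" | sed '/^$/d')

# Allowing any amount of whitespace between the anchors also drops blank-looking lines.
no_blank=$(printf '%s\n' "$text" | grep -v '^[[:space:]]*$')

printf 'without empty lines:\n%s\n\nwithout blank lines:\n%s\n' "$no_empty" "$no_blank"
```

The line of three spaces survives the first two filters, because it is not empty from the point of view of the regular expression.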

The dot character

The dot matches any single character except the newline character. Let's apply this regular expression to the file myfile, whose contents are shown below:

$ awk '/.st/{print $0}' myfile


Using a dot in regular expressions

As the output shows, the pattern matches only the first two lines of the file, since they contain the sequence "st" preceded by another character. The third line contains no suitable sequence, and the fourth does contain one, but at the very beginning of the line.

Character classes

The dot matches any single character, but what if you need to limit the set of searched characters more flexibly? In that case you can use character classes.

With this approach, you can search for any character from a given set. Square brackets ([]) are used to describe a character class:

$ awk '/[oi]th/{print $0}' myfile


Describing a character class in a regular expression

Here we are looking for the sequence "th" preceded by the character "o" or the character "i".

Classes are very handy when searching for words that may begin with either an uppercase or a lowercase letter:

$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'
$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'

Finding words that can start with a lowercase or an uppercase letter

Character classes are not limited to letters; other characters can be used as well. It is impossible to say in advance in which situations classes will be needed, it all depends on the task at hand.

Negating character classes

Character classes can also be used for the opposite task. Instead of searching for the characters in the class, you can search for everything that is not in the class. To get this behavior, place a ^ sign before the list of class characters. It looks like this:

$ awk '/[^oi]th/{print $0}' myfile


Searching for characters that are not in the class

In this case, the sequences "th" are found that are preceded by neither "o" nor "i".

Character ranges

In character classes, ranges of characters can be described using a dash:

$ awk '/[e-p]st/{print $0}' myfile


Describing a character range in a character class

In this example, the regular expression matches the sequence "st" preceded by any character that lies, in alphabetical order, between "e" and "p".

Ranges can also be built from digits:

$ echo "123" | awk '/[0-9][0-9][0-9]/'
$ echo "12a" | awk '/[0-9][0-9][0-9]/'

A regular expression that searches for any three digits

A character class may include several ranges:

$ awk '/[a-fm-z]st/{print $0}' myfile


A character class consisting of several ranges

This regular expression finds all "st" sequences preceded by characters from the a-f and m-z ranges.

Special character classes

BRE has special character classes that can be used when writing regular expressions:
  • [[:alpha:]] - matches any alphabetic character, in upper or lower case.
  • [[:alnum:]] - matches any alphanumeric character, that is, characters in the ranges 0-9, A-Z, a-z.
  • [[:blank:]] - matches a space and a tab character.
  • [[:digit:]] - any digit from 0 to 9.
  • [[:upper:]] - uppercase alphabetic characters, A-Z.
  • [[:lower:]] - lowercase alphabetic characters, a-z.
  • [[:print:]] - matches any printable character.
  • [[:punct:]] - matches punctuation characters.
  • [[:space:]] - whitespace characters, in particular space, tab, NL, FF, VT, CR.
Special classes are used in patterns like this:

$ echo "abc" | awk '/[[:alpha:]]/{print $0}'
$ echo "abc" | awk '/[[:digit:]]/{print $0}'
$ echo "abc123" | awk '/[[:digit:]]/{print $0}'
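The special classes become more useful when combined with anchors and repetition. The sketch below uses grep to pick out capitalized words from an invented word list; the list and the variable names are assumptions made for this illustration.

```shell
words=$(printf '%s\n' Tania laura VALENTINA Fleur2 Floor)

# One uppercase letter followed by one or more lowercase letters, and nothing else.
capitalized=$(printf '%s\n' "$words" | grep -E '^[[:upper:]][[:lower:]]+$')

# Any line that contains at least one digit.
with_digit=$(printf '%s\n' "$words" | grep '[[:digit:]]')

printf 'capitalized:\n%s\nwith a digit:\n%s\n' "$capitalized" "$with_digit"
```

Note that the class name, including its colons and inner brackets, must itself be placed inside the brackets of a character class, hence the double brackets.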


Special classes of symbols in regular expressions

The asterisk

If an asterisk follows a character in the pattern, the regular expression matches when that character appears in the line any number of times, including the case where it does not appear at all.

$ echo "test" | awk '/tes*t/{print $0}'
$ echo "tessst" | awk '/tes*t/{print $0}'


Using the * character in regular expressions

This pattern character is usually used for words that are often misspelled or that have alternative spellings:

$ echo "I like green color" | awk '/colou*r/{print $0}'
$ echo "I like green colour" | awk '/colou*r/{print $0}'

Searching for words with alternative spellings

In this example, the same regular expression matches both the word "color" and the word "colour". This works because the character "u", which is followed by an asterisk, may either be absent or occur several times in a row.

Another useful feature of the asterisk is that it can be combined with the dot. This combination lets a regular expression match any number of any characters:

$ awk '/this.*test/{print $0}' myfile


A pattern that matches any number of any characters

In this case, it does not matter how many characters there are between the words "this" and "test".

The asterisk can also be used with character classes:

$ echo "st" | awk '/s[ae]*t/{print $0}'
$ echo "sat" | awk '/s[ae]*t/{print $0}'
$ echo "set" | awk '/s[ae]*t/{print $0}'


Using the asterisk with character classes

In all three examples the regular expression matches, because the asterisk after the character class means that the line matches the pattern whether any number of "a" or "e" characters is found or none at all.

POSIX ERE regular expressions

POSIX ERE patterns, which some Linux utilities support, may contain additional characters. As already mentioned, awk supports this standard, while sed by default does not.

Here we will look at the most commonly used characters, which will be useful when you write your own regular expressions.

▍Question mark

The question mark indicates that the preceding character may occur in the text once or not at all. It is one of the repetition metacharacters. Here are some examples:

$ echo "tet" | awk '/tes?t/{print $0}'
$ echo "test" | awk '/tes?t/{print $0}'
$ echo "tesst" | awk '/tes?t/{print $0}'

Question mark in regular expressions

As you can see, in the third case the letter "s" occurs twice, so the regular expression does not match the word "tesst".

The question mark can also be used with character classes:

$ echo "tst" | awk '/t[ae]?st/{print $0}'
$ echo "test" | awk '/t[ae]?st/{print $0}'
$ echo "tast" | awk '/t[ae]?st/{print $0}'
$ echo "taest" | awk '/t[ae]?st/{print $0}'
$ echo "teest" | awk '/t[ae]?st/{print $0}'

Question mark and character classes

If no character from the class occurs in the string, or one of them occurs once, the regular expression matches. As soon as two characters from the class appear in the word, however, the text no longer matches the pattern.
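The same check can be written with grep -E, which also understands ERE. The following is a sketch under that assumption; check_optional is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: report whether a word matches the ERE t[ae]?st.
check_optional() {
    if printf '%s\n' "$1" | grep -Eq 't[ae]?st'; then
        echo "$1: match"
    else
        echo "$1: no match"
    fi
}

# The first three words contain zero or one vowel from the class, the last two contain two.
for word in tst test tast taest teest; do
    check_optional "$word"
done
```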

▍Plus sign

The plus sign in a pattern indicates that the regular expression matches if the preceding character occurs in the text one or more times. Unlike the asterisk, this construct does not match when the character is absent:

$ echo "test" | awk '/te+st/{print $0}'
$ echo "teest" | awk '/te+st/{print $0}'
$ echo "tst" | awk '/te+st/{print $0}'

The plus sign in regular expressions

In this example, if the word contains no "e", the regular expression engine finds no match for the pattern. The plus sign also works with character classes, just like the asterisk and the question mark:

$ echo "tst" | awk '/t[ae]+st/{print $0}'
$ echo "test" | awk '/t[ae]+st/{print $0}'
$ echo "teast" | awk '/t[ae]+st/{print $0}'
$ echo "teeast" | awk '/t[ae]+st/{print $0}'

Plus sign and character classes

In this case, if the string contains one or more characters from the class at that position, the text matches the pattern.
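To see exactly which part of the text the plus sign consumed, it can help to print only the matched fragment. The sketch below assumes a grep that supports the -o option (GNU and BSD grep do; it is not part of POSIX); show_plus_match is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the part of the input that t[ae]+st matched, if any.
show_plus_match() {
    printf '%s\n' "$1" | grep -oE 't[ae]+st'
}

show_plus_match "a teeast here"                    # prints: teeast
show_plus_match "tst" || echo "tst: no match"      # at least one a/e is required
```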

▍Curly braces

Curly braces, which can be used in ERE patterns, are similar to the metacharacters discussed above, but they let you specify more precisely how many times the preceding character must occur. The limit can be given in two forms:
  • n is a number that specifies the exact number of occurrences
  • n,m are two numbers that are interpreted as "at least n times, but no more than m"
Here are examples of the first form:

$ echo "tst" | awk '/te{1}st/{print $0}'
$ echo "test" | awk '/te{1}st/{print $0}'

Curly braces in patterns, searching for an exact number of occurrences

In old versions of awk you had to pass the command-line option --re-interval for the program to recognize intervals in regular expressions, but newer versions do not need it.

$ echo "tst" | awk '/te{1,2}st/{print $0}'
$ echo "test" | awk '/te{1,2}st/{print $0}'
$ echo "teest" | awk '/te{1,2}st/{print $0}'
$ echo "teeest" | awk '/te{1,2}st/{print $0}'

An interval specified in curly braces

In this example the character "e" must occur in the string 1 or 2 times for the regular expression to match the text.

Curly braces can also be used with character classes. The familiar principles apply here:

$ echo "tst" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "test" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teest" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teeast" | awk '/t[ae]{1,2}st/{print $0}'

Curly braces and character classes

The pattern matches the text if the character "a" or the character "e" occurs in it once or twice.
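The interval examples can be condensed into a single count. This is a minimal sketch assuming grep -E; count_interval_matches is a name made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: count how many of the given words match t[ae]{1,2}st.
count_interval_matches() {
    printf '%s\n' "$@" | grep -cE 't[ae]{1,2}st'
}

# tst has no vowel and teeast has three, so only test and teest are counted.
count_interval_matches tst test teest teeast   # prints: 2
```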

▍Logical "or"

The | character, a vertical bar, means a logical "or" in regular expressions. When processing a regular expression that contains several fragments separated by this character, the engine considers the text a match if it corresponds to any of the fragments. Here is an example:

$ echo "This is a test" | awk '/test|exam/{print $0}'
$ echo "This is an exam" | awk '/test|exam/{print $0}'
$ echo "This is something else" | awk '/test|exam/{print $0}'

Logical "or" in regular expressions

In this example the regular expression searches the text for the words "test" or "exam". Note that there must be no spaces between the pattern fragments and the | character that separates them.

Fragments of a regular expression can be grouped with parentheses. A grouped sequence of characters is treated by the engine as a single unit, so you can, for example, apply repetition metacharacters to it. Here is what it looks like:

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'
$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'

Grouping fragments of regular expressions

In these examples the word "Geeks" is enclosed in parentheses, and the group is followed by a question mark. Recall that the question mark means "0 or 1 repetition", so the regular expression matches both the string "Like" and the string "LikeGeeks".
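Grouping and alternation are often combined with the anchors ^ and $. The sketch below is an illustration only; filter_words and the word list are invented.

```shell
#!/bin/sh
# Hypothetical helper: keep only words that are exactly test, tests, exam or exams.
# The group (test|exam) limits the scope of "|"; the trailing s? then applies to either word.
filter_words() {
    printf '%s\n' "$@" | awk '/^(test|exam)s?$/{print $0}'
}

# "testing" and "contest" contain "test" but are rejected by the anchors.
filter_words test exams testing exam contest
```

Without the parentheses, ^test|exams?$ would mean "starts with test" or "ends with exam or exams", which is a different pattern.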

Practical examples

Now that we have covered the basics of regular expressions, it is time to do something useful with them.

▍Counting files

Let's write a bash script that counts the files in the directories listed in the PATH environment variable. To do this, we first need to build a list of directory paths. We will do it with sed, replacing the colons with spaces:

$ echo $PATH | sed 's/:/ /g'
The substitute command supports regular expressions as search patterns. In this case everything is extremely simple: we are looking for a colon. Nothing prevents you from using something more elaborate here; it all depends on the task.
Now we need to loop over the resulting list and perform the actions needed to count the files. The general outline of the script looks like this:

mypath=$(echo $PATH | sed 's/:/ /g')
for directory in $mypath
do
    : # the actions for each directory go here
done
Now let's write the full script, using the ls command to get the list of files in each directory:

#!/bin/bash
# Turn the colon-separated PATH into a space-separated list.
mypath=$(echo $PATH | sed 's/:/ /g')
count=0
for directory in $mypath
do
    # ls prints an error for directories that do not exist.
    check=$(ls "$directory")
    for item in $check
    do
        count=$((count + 1))
    done
    echo "$directory - $count"
    # Reset the counter before the next directory.
    count=0
done
When you run the script, it may turn out that some directories from PATH do not exist, but this does not prevent it from counting the files in the directories that do exist.

Counting files

The main value of this example is that the same approach can be used to solve much more complex tasks. Which ones exactly depends on your needs.
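As one possible variation on the script above, here is a sketch that splits the list on colons with IFS instead of sed, silently skips missing directories, and counts entries with a glob so that file names containing spaces are counted once. It assumes a POSIX shell; count_path_files is a hypothetical name, and like ls it does not count hidden files.

```shell
#!/bin/sh
# Hypothetical helper: print "<directory> - <number of entries>" for every
# existing directory in a colon-separated list. The body runs in a subshell,
# so changing IFS and the positional parameters does not affect the caller.
count_path_files() (
    IFS=':'
    for dir in $1; do
        [ -d "$dir" ] || continue        # ignore directories that do not exist
        set -- "$dir"/*                  # one positional parameter per entry
        # In an empty directory the pattern stays unexpanded; treat that as zero.
        [ -e "$1" ] || [ -L "$1" ] || set --
        echo "$dir - $#"
    done
)

count_path_files "$PATH"
```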

▍Validating email addresses

There are websites with huge collections of regular expressions for checking email addresses, phone numbers, and so on. However, taking something ready-made is one thing, and creating it yourself is quite another. So let's write a regular expression for checking email addresses. We will start by analyzing the input data. Here, for example, is an address:

[Email Protected]
The user name, username, may consist of alphanumeric characters and a few others, namely the dot, the dash, the underscore and the plus sign. The user name is followed by the @ sign.

Armed with this knowledge, let's start assembling the regular expression from its left part, which checks the user name. Here is what we get:

^([a-zA-Z0-9_.+-]+)@
This regular expression can be read as follows: "The line must start with at least one character from the set given in square brackets, followed by the @ sign."

Next comes the host name, hostname. The same rules apply here as for the user name, except for the plus sign, so its pattern looks like this:

([a-zA-Z0-9_.-]+)
The top-level domain name obeys special rules. It may contain only alphabetic characters, at least two of them (such domains usually contain a country code, for example) and no more than five. This means the pattern for checking the last part of the address is:

\.([a-zA-Z]{2,5})$
It can be read like this: "First there must be a dot, then 2 to 5 alphabetic characters, and after that the line ends."

Having prepared the patterns for the individual parts of the regular expression, we put them together:

^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$
Now it only remains to test the result:

$ echo "[Email Protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'
$ echo "[Email Protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'

Checking an email address with regular expressions

The fact that the text passed to awk is printed on the screen means that the engine recognized an email address in it.
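A pattern built from the three parts described above (user name, host name, top-level domain of 2 to 5 letters) can be wrapped in a small function for use in scripts. This is a sketch, not a complete validator: the character classes are one possible spelling of the rules in the text, is_email is a hypothetical name, and the sample addresses are made up.

```shell
#!/bin/sh
# ERE following the rules above, stored once so every check uses the same pattern.
EMAIL_RE='^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'

# Hypothetical helper: succeeds if its argument looks like an email address.
is_email() {
    printf '%s\n' "$1" | grep -Eq "$EMAIL_RE"
}

# Made-up sample addresses, for illustration only.
for addr in user.name+tag@example.com no-at-sign.example.com user@host.c; do
    if is_email "$addr"; then
        echo "$addr: looks valid"
    else
        echo "$addr: rejected"
    fi
done
```

The second address is rejected because it has no @ sign, the third because its top-level domain has only one letter.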

Summary

If the regular expression for checking email addresses that appeared at the very beginning of the article seemed completely incomprehensible then, we hope it no longer looks like a meaningless set of characters. If so, this material has served its purpose. Regular expressions are a topic one can study for a lifetime, but even the little we have covered here can already help you write scripts that process text in fairly advanced ways.

In this series of articles we have usually shown very simple bash scripts consisting of just a few lines. Next time we will look at something larger.

Dear readers! Do you use regular expressions when processing text in command-line scripts?

Good day, guests!

In today's article I want to touch on a huge topic: regular expressions. I think everyone knows that the topic of regexes (as regular expressions are called in slang) is too vast to fit into one post. So I will try to gather my thoughts and convey them to you briefly but as clearly as possible.

I will start with the fact that there are several varieties of regular expressions:

1. Traditional regular expressions (also known as basic regular expressions, BRE)

  • the syntax of these expressions is considered obsolete, but it is still widespread and used by many UNIX utilities
  • Basic regular expressions include the following metacharacters (their meanings are described below):
    • .
    • [ ]
    • [^ ]
    • ^
    • $
    • *
    • \{ \} - the original form of { } (in extended expressions)
    • \( \) - the original form of ( ) (in extended expressions)
    • \n, where n is a number from 1 to 9
  • Notes on using these metacharacters:
    • The asterisk must follow an expression that matches a single character. Example: [xyz]*.
    • The expression \(block\)* should be considered invalid. In some implementations it matches zero or more repetitions of block. In others it matches the string block*.
    • Inside a bracket expression, the special meanings of characters are mostly ignored. Special cases:
      • To add the ^ character to a set, it must not be placed first.
      • To add the - character to a set, it must be placed first or last. For example:
        • a pattern for a DNS name, which may include letters, digits, the minus sign and the dot separator: [-0-9a-zA-Z.];
        • any character except the minus sign and digits: [^-0-9].
      • To add the [ or ] character to a set, it must be placed first. For example:
        • [][ab] matches ], [, a or b.
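The special cases for bracket expressions listed above can be tried out directly. The sketch below assumes a grep with the -o option (a common extension, not POSIX); class_hits and the sample strings are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the characters of a string that a bracket
# expression matches, one per line.
class_hits() {
    printf '%s\n' "$1" | grep -o -- "$2"
}

class_hits 'a-b]c[d' '[][-]'      # "]" placed first, "-" placed last: matches - ] [
class_hits 'x^y-z9'  '[^-0-9]'    # leading ^ negates the set: matches x ^ y z
class_hits 'x^y-z9'  '[x^]'       # ^ that is not first is literal: matches x ^
```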

2. Extended regular expressions (ERE)

  • The syntax of these expressions is similar to that of basic expressions, with the following exceptions:
    • The backslash is no longer used for the metacharacters { } and ( ).
    • A backslash before a metacharacter cancels its special meaning.
    • The theoretically irregular construct \n is dropped.
    • The metacharacters +, ?, | are added.

3. Perl-compatible regular expressions (PCRE)

  • they have a richer and at the same time more predictable syntax than even POSIX ERE, so applications often use them.

Regular expressions consist of patterns, or rather they specify a search pattern. A pattern consists of search rules, which are built from characters and metacharacters.

Search rules are defined by the following operations:

Alternation |

The vertical bar (|) separates the allowed alternatives; you could call it a logical OR. For example, "gray|grey" matches gray or grey.

Grouping ()

Parentheses are used to define the scope and precedence of operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but both describe the set containing gray and grey.

Quantification {} ? * +

A quantifier after a character or group determines how many times the preceding expression may occur.

{m,n} the general form; the number of repetitions may be from m to n inclusive.

{m,} the general form; m or more repetitions.

{,n} the general form; no more than n repetitions.

{n} exactly n repetitions.

The question mark means 0 or 1 times, the same as {0,1}. For example, "colou?r" matches color and colour.

The asterisk means 0, 1 or any number of times ({0,}). For example, "go*gle" matches ggle, gogle, google, and so on.

The plus sign means at least 1 time ({1,}). For example, "go+gle" matches gogle, google, and so on (but not ggle).

The exact syntax of these regular expressions depends on the implementation (for example, in basic regular expressions the characters { } and ( ) are escaped with a backslash).
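The equivalences between ?, + and the brace forms can be verified with a few calls. This is a sketch assuming grep -E with the -x option (match the whole line); words_matching is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: list which of the sample words an ERE matches in full.
words_matching() {
    printf '%s\n' ggle gogle google gooogle | grep -Ex -- "$1" | tr '\n' ' '
    echo
}

words_matching 'go?gle'      # same result as go{0,1}gle
words_matching 'go{0,1}gle'
words_matching 'go+gle'      # same result as go{1,}gle
words_matching 'go{1,}gle'
words_matching 'go{2}gle'    # exactly two "o"
```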

Metacharacters, simply put, are characters that do not stand for themselves: for example, the character . (dot) is not a dot but any single character, and so on. Here are the metacharacters and their meanings:

.  Matches any single character.
[something]  Matches any single character from those enclosed in the brackets. Note that the "-" character is interpreted literally only if it comes immediately after the opening bracket or immediately before the closing one: [abc-] or [-abc]. Otherwise it denotes a range of characters. For example, [abc] matches "a", "b" or "c". [a-z] matches the lowercase letters of the Latin alphabet. These notations can be combined: [abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, z. To match the characters "[" or "]", it is enough for the closing bracket to be the first character after the opening one: [][ab] matches "]", "[", "a" or "b". If the contents of the square brackets start with the ^ character, the expression matches any single character that is not listed in the brackets. For example, [^abc] matches any character except "a", "b" or "c". [^a-z] matches any character except the lowercase letters of the Latin alphabet.
^  Matches the beginning of the text (or the beginning of any line in multiline mode).
$  Matches the end of the text (or the end of any line in multiline mode).
\(\) or ()  Declares a "marked subexpression" (a grouped expression) that can be used later (see the next item, \n). A marked subexpression is also called a "block". Unlike other operators, this one requires a backslash in the traditional syntax; in the extended and Perl syntaxes the \ is not needed.
\n  Where n is a digit from 1 to 9; matches the n-th marked subexpression (for example, in \(abcd\)\1 the characters abcd are marked as subexpression 1). This construct is theoretically irregular and was not adopted in the extended regular expression syntax.
*
  • An asterisk after an expression that matches a single character matches zero or more copies of that (preceding) expression. For example, "[xyz]*" matches the empty string, "x", "y", "zx", "zyx", and so on.
  • \n*, where n is a digit from 1 to 9, matches zero or more occurrences of the n-th marked subexpression. For example, "\(a.\)c\1*" matches "abcab" and "abcabab" in full, but not the whole of "abcac".

An expression enclosed in "\(" and "\)" and followed by "*" should be considered invalid. In some implementations it matches zero or more occurrences of the string enclosed in the brackets. In others it matches the bracketed expression followed by a literal "*" character.

\{x,y\}  Matches the preceding block occurring at least x and at most y times. For example, "a\{3,5\}" matches "aaa", "aaaa" or "aaaaa". Unlike other operators, this one requires a backslash in the traditional syntax.
.*  Denotes any number of any characters between two parts of a regular expression.
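Backreferences are easiest to understand by trying them. The sketch below checks the BRE \(a.\)c\1* against whole lines with grep -x; it assumes a grep that accepts a repeated backreference (GNU grep does), and backref_full_match is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the whole line matches the BRE \(a.\)c\1*,
# i.e. "a" plus any character, then "c", then zero or more copies of that same pair.
backref_full_match() {
    printf '%s\n' "$1" | grep -qx '\(a.\)c\1*'
}

for s in abc abcab abcabab abcac; do
    if backref_full_match "$s"; then
        echo "$s: full match"
    else
        echo "$s: no full match"
    fi
done
```

In "abcac" the marked pair is "ab", so the trailing "ac" is not a copy of it and the line as a whole is rejected.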

Metacharacters let us express various kinds of matches. But how do we make a metacharacter stand for the ordinary character, so that, say, [ (square bracket) means just a square bracket? Simple:

  • the metacharacter (. * + \ ? { }) must be escaped with a backslash. For example, \. or \[
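The effect of escaping can be seen by counting matching lines with and without the backslash. This is a minimal sketch; count_lines and the three sample lines are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count the lines of the sample text that match a BRE.
count_lines() {
    printf '%s\n' '1.5' '125' 'a[0]' | grep -c -- "$1"
}

count_lines '1.5'     # unescaped dot matches any character: 2 lines (1.5 and 125)
count_lines '1\.5'    # escaped dot matches only a literal dot: 1 line
count_lines 'a\[0]'   # escaped [ is a literal bracket: 1 line
```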

To make some character sets easier to specify, they have been combined into so-called character classes and categories. POSIX standardized the declaration of some character classes and categories, as shown in the following table:

POSIX class   Similar to          Meaning
[:upper:]     [A-Z]               uppercase characters
[:lower:]     [a-z]               lowercase characters
[:alpha:]     [A-Za-z]            uppercase and lowercase characters
[:alnum:]     [A-Za-z0-9]         digits, uppercase and lowercase characters
[:digit:]     [0-9]               digits
[:xdigit:]    [0-9A-Fa-f]         hexadecimal digits
[:punct:]     [.,!?:…]            punctuation characters
[:blank:]     [ \t]               space and tab
[:space:]     [ \t\n\r\f\v]       whitespace characters
[:cntrl:]                         control characters
[:graph:]     [^ \t\n\r\f\v]      printable characters
[:print:]     [^\t\n\r\f\v]       printable characters and the space
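A quick way to get a feel for these classes is to delete every character of a class from a string. The following sketch assumes sed; strip_class is a hypothetical name and the sample text is made up.

```shell
#!/bin/sh
# Hypothetical helper: delete every character of the given POSIX class from a string.
# A class name such as [:digit:] is only valid inside a bracket expression,
# hence the doubled brackets in the sed pattern.
strip_class() {
    printf '%s\n' "$1" | sed "s/[[:$2:]]//g"
}

strip_class 'Room 42, Floor 7!' digit   # prints: Room , Floor !
strip_class 'Room 42, Floor 7!' punct   # prints: Room 42 Floor 7
strip_class 'Room 42, Floor 7!' upper   # prints: oom 42, loor 7!
```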

Regular expressions also have a property known as:

Regex greediness

I will try to describe it as clearly as possible. Suppose we want to find all HTML tags in some text. To narrow the task down: we want to find everything enclosed between < and >, together with these brackets. But we know that tags differ in length, and that there are at least 50 different tags. Listing them all as alternatives is too laborious. But we know that we have the expression .* (dot asterisk), which stands for any number of any characters in a line. Using this expression, we will try to find in the text (

<p>So, How to create a 10/50 RAID on the LSI MegaRAID controller (relevant also for: Intel SRCU42X, Intel SRCS16):</p>

) all values between < and >. As a result, the ENTIRE line matches this expression. Why? Because a regex is greedy and tries to capture as many characters as possible between < and >, so the whole line, starting with <p>So, ... and ending with ...>, falls under this rule!

I hope the example makes it clear what greediness is. To get rid of it, you can proceed in one of the following ways:

  • exclude the characters that must not be part of the desired match (for example, <[^>]*> for the case described above)
  • remove the greediness by marking the quantifier as non-greedy (this is supported by Perl-compatible engines rather than by POSIX BRE or ERE):
    • *? - the "non-greedy" ("lazy") equivalent of *
    • +? - the "non-greedy" ("lazy") equivalent of +
    • {n,}? - the "non-greedy" ("lazy") equivalent of {n,}
    • .*? - the "non-greedy" ("lazy") equivalent of .*
To round off all of the above, here is the syntax of extended regular expressions:

POSIX extended regular expressions are similar to the traditional UNIX syntax, with some metacharacters added:

The plus sign indicates that the preceding character or group may be repeated one or more times. Unlike the asterisk, at least one occurrence is required.

The question mark makes the preceding character or group optional. In other words, it may be absent from the matching string or be present exactly once.

The vertical bar separates alternative variants of a regular expression. One bar gives two alternatives, but there can be more: just use more vertical bars. Keep in mind that this operator takes in as much of the expression as possible on each side. For this reason, the alternation operator is most often used inside parentheses.

The backslash is also no longer needed: \{...\} becomes {...} and \(...\) becomes (...).
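The same pattern written in both syntaxes can be compared side by side. This is a sketch assuming a grep whose default mode is BRE and whose -E mode is ERE; the function names and sample words are invented, and the treatment of an unescaped brace as a literal in BRE is the behaviour of GNU grep.

```shell
#!/bin/sh
# Sample lines shared by both helpers (hypothetical names throughout).
sample() { printf '%s\n' aa aaaa 'a{2}'; }

# List the sample lines that fully match a pattern in BRE and in ERE mode.
bre_matches() { sample | grep -x -- "$1" | tr '\n' ' '; echo; }
ere_matches() { sample | grep -Ex -- "$1" | tr '\n' ' '; echo; }

bre_matches '\(aa\)\{1,2\}'   # BRE: braces and parentheses are escaped
ere_matches '(aa){1,2}'       # ERE: the same pattern without backslashes
bre_matches 'a{2}'            # BRE: unescaped braces are literal characters
ere_matches 'a{2}'            # ERE: the braces are an interval, so this matches aa
```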

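At the end of the post, here are some examples of using regular expressions:

$ cat text1
1 apple
2 pear
3 banana
$ grep p text1
1 apple
2 pear
$ grep pea text1
2 pear
$ grep "p*" text1
1 apple
2 pear
3 banana
$ grep "pp*" text1
1 apple
2 pear
$ grep "x" text1
$ grep "x*" text1
1 apple
2 pear
3 banana
$ cat text1 | grep "l\|n"
1 apple
3 banana
$ echo -e "find an\n* here" | grep "\*"
* here
$ grep "pp\+" text1 # lines containing a p followed by one or more p
1 apple
$ grep "pl\?e" text1 # pe with an optional l in between
1 apple
2 pear
$ grep "p.*r" text1 # p followed later in the line by r
2 pear
$ grep "a.." text1 # lines with an a followed by at least 2 characters
1 apple
3 banana
$ grep "\(an\)\+" text1 # one or more repetitions of an
3 banana
$ grep "an\(an\)\+" text1 # an followed by one or more repetitions of an
3 banana
$ grep "[3p]" text1 # lines containing 3 or p
1 apple
2 pear
3 banana
$ echo -e "find an\n* here\nsomewhere." | grep "[.*]"
* here
somewhere.
$ # match a character from 3 to 7
$ echo -e "123\n456\n789\n0" | grep "[3-7]"
123
456
789
$ # a digit that is not followed by the letters n or r up to the end of the line
$ grep "[[:digit:]][^nr]*$" text1
1 apple
$ sed -e "/\(a.*a\)\|\(p.*p\)/s/a/A/g" text1 # replace a with A in every line where an a is followed by another a, or a p by another p
1 Apple
2 pear
3 bAnAnA
$ sed -e "/^[^lmnXYZ]*$/s/ear/each/g" text1 # replace ear with each in lines that contain none of the characters lmnXYZ
1 apple
2 peach
3 banana
$ echo "First. A phrase. This is a sentence." |\
> sed -e "s/ [^ ]*\./ LAST WORD./g" # replace the last word of each sentence with LAST WORD.
First. A LAST WORD. This is a LAST WORD.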

Sincerely, MC.SIM!



Regular expressions are a very powerful tool for pattern-based text search and string processing, and they can be used to solve a variety of tasks. Here are the main ones:

  • Checking text input;
  • Search and replace text in the file;
  • Batch renaming of files;
  • Interaction with services such as Apache;
  • Checking the string to match the template.

This is far from a complete list; regular expressions let you do much more. To new users they may seem too complex, since a special language is used to write them. But given the capabilities they provide, every system administrator should know Linux regular expressions and be able to use them.

In this article we look at regular expressions in Bash for beginners, so that you can get to grips with all the capabilities of this tool.

Two types of characters can be used in regular expressions:

  • ordinary characters;
  • metacharacters.

Ordinary characters are the letters, digits, and punctuation marks that make up any string. All text consists of such characters, and you can use them in regular expressions to find the desired position in the text.

Metacharacters are something else: they are what gives regular expressions their power. With metacharacters you can do much more than search for a single character. You can search for combinations of characters, allow a variable number of repetitions, and select ranges. All metacharacters fall into two types: substitution characters, which stand in for ordinary characters, and operators, which indicate how many times a character may repeat. The syntax of a regular expression looks like this:

ordinary_character operator_metacharacter

substitution_metacharacter operator_metacharacter

  • \ - the backslash begins the letter-based metacharacters, and it is also used when a punctuation metacharacter has to be treated as an ordinary character;
  • ^ - indicates the beginning of the line;
  • $ - indicates the end of the line;
  • * - indicates that the previous character can be repeated zero or more times;
  • + - indicates that the previous character must be repeated one or more times;
  • ? - the previous character may occur zero times or once;
  • {n} - indicates how many times (n) the previous character must be repeated;
  • {n,m} - the previous character can be repeated from n to m times;
  • . - any character except a newline;
  • [az] - any character listed in the brackets;
  • x|y - character x or character y;
  • [^az] - any character except those listed in the brackets;
  • [a-z] - any character from the specified range;
  • [^a-z] - any character that is not in the range;
  • \b - denotes a word boundary, for example next to a space;
  • \B - indicates that the position must be inside a word; for example, ux\B matches in uxb or tuxedo, but not in Linux;
  • \d - means that the character is a digit;
  • \D - a non-digit character;
  • \n - newline character;
  • \s - one of the whitespace characters: space, tab, and so on;
  • \S - any character except whitespace;
  • \t - tab character;
  • \v - vertical tab character;
  • \w - any word character: a letter, a digit, or an underscore;
  • \W - any character that is not a word character;
  • \uXXX - Unicode character.
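A few of the metacharacters above can be tried out with `grep -E`. This is only a sketch on hypothetical sample lines. It assumes GNU grep, where `\b` and `\B` are available as extensions; since `\d` is generally only accepted in Perl mode (`-P`), the sketch uses `[0-9]` instead.

```shell
# Hypothetical sample input.
sample=$(printf 'tux\ntuxedo\nlinux 42\nunix_like\n')

# [0-9]+ : one or more digits (a portable stand-in for \d+).
with_digits=$(printf '%s\n' "$sample" | grep -E '[0-9]+')

# ^[a-z]{3}$ : a line of exactly three lowercase letters.
three_letters=$(printf '%s\n' "$sample" | grep -E '^[a-z]{3}$')

# ux\b : "ux" at the end of a word (GNU extension).
ux_at_end=$(printf '%s\n' "$sample" | grep -E 'ux\b')

# ux\B : "ux" followed by another word character.
ux_inside=$(printf '%s\n' "$sample" | grep -E 'ux\B')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
  "$with_digits" "$three_letters" "$ux_at_end" "$ux_inside"
```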

It is important to note that the letter-based metacharacters must be preceded by a backslash to show that a metacharacter follows. The reverse is also true: if you want to use a metacharacter that is normally written without a backslash as an ordinary character, you have to add a backslash.

For example, suppose you want to find the string 1+2=3 in a text. If you use this string as a regular expression, you will find nothing, because the system interprets the plus as a metacharacter meaning that the preceding 1 must be repeated one or more times. It therefore has to be escaped: 1\+2=3. Without escaping, our regular expression would match only strings such as 12=3 or 112=3 and so on. No backslash is needed before the equals sign, because it is not a metacharacter.
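The effect of escaping can be checked with a short sketch. The three sample lines are hypothetical, the string is written without spaces as 1+2=3, and `grep -E` is assumed. The `-F` option, which treats the pattern as a fixed string, is shown as an alternative to escaping.

```shell
# Hypothetical sample input: the literal string and two look-alikes.
sample=$(printf '1+2=3\n12=3\n112=3\n')

# Unescaped: "+" means "one or more of the preceding 1".
unescaped=$(printf '%s\n' "$sample" | grep -E '1+2=3')

# Escaped: "\+" is a literal plus sign.
escaped=$(printf '%s\n' "$sample" | grep -E '1\+2=3')

# Fixed-string mode: no metacharacters are interpreted at all.
fixed=$(printf '%s\n' "$sample" | grep -F '1+2=3')

printf 'unescaped:\n%s\nescaped: %s\nfixed:   %s\n' \
  "$unescaped" "$escaped" "$fixed"
```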

Examples of using regular expressions

Now that we have covered the basics and you know how everything works, it remains to reinforce what you have learned about regular expressions with grep in practice. Two very useful metacharacters are ^ and $, which denote the beginning and the end of a line. For example, suppose we want to list all users registered on our system whose name begins with s. The regular expression "^s" does this. You can use the egrep command:

egrep "^s" /etc/passwd

If we want to select lines by the last characters of the line, we can use $. For example, let's select all system users without a shell; the records of such users end with false:

egrep "false$" /etc/passwd
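Since the contents of /etc/passwd differ from system to system, the two anchors can also be tried on a few made-up records. The user names and fields below are hypothetical, and `grep -E` is used as the equivalent of egrep:

```shell
# Hypothetical passwd-style records; real systems will differ.
sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'sys:x:3:3:sys:/dev:/usr/sbin/nologin' \
  'syslog:x:104:110::/home/syslog:/bin/false' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin')

# ^s : records whose first character is s; cut keeps only the name.
starts_with_s=$(printf '%s\n' "$sample" | grep -E '^s' | cut -d: -f1)

# false$ : records whose last field ends with "false".
ends_with_false=$(printf '%s\n' "$sample" | grep -E 'false$' | cut -d: -f1)

printf 'starts with s:\n%s\nends with false:\n%s\n' \
  "$starts_with_s" "$ends_with_false"
```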

To display user names that start with s or d, use this expression:

egrep "^[sd]" /etc/passwd

The same result can be obtained with the "|" character. The first variant is better suited to ranges, while the second is more often used for an ordinary either/or:

egrep "^(s|d)" /etc/passwd
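A quick way to check that a bracket expression and an alternation select the same lines is to run both on the same input. This sketch uses hypothetical passwd-style records and `grep -E`:

```shell
# Hypothetical passwd-style records; real systems will differ.
sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'sys:x:3:3:sys:/dev:/usr/sbin/nologin' \
  'syslog:x:104:110::/home/syslog:/bin/false' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin')

# Bracket expression: first character is s or d.
by_bracket=$(printf '%s\n' "$sample" | grep -E '^[sd]' | cut -d: -f1)

# Alternation in parentheses: the same condition written with "|".
by_bar=$(printf '%s\n' "$sample" | grep -E '^(s|d)' | cut -d: -f1)

printf 'bracket:\n%s\nbar:\n%s\n' "$by_bracket" "$by_bar"
```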

Now let's select all users whose name is exactly three characters long. The user name ends with a colon, so we can say that it consists of any word character repeated three times, followed by the colon:

egrep "^\w{3}:" /etc/passwd
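The repetition count can be tried on made-up records as well. This sketch assumes GNU grep, where `\w` is available with `-E`; the records are hypothetical. Adding `-v` inverts the selection and lists the names that are not three characters long.

```shell
# Hypothetical passwd-style records; real systems will differ.
sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'sys:x:3:3:sys:/dev:/usr/sbin/nologin' \
  'syslog:x:104:110::/home/syslog:/bin/false' \
  'bin:x:2:2:bin:/bin:/usr/sbin/nologin')

# ^\w{3}: : exactly three word characters, then the colon.
three=$(printf '%s\n' "$sample" | grep -E '^\w{3}:' | cut -d: -f1)

# -v inverts the match: every other name.
not_three=$(printf '%s\n' "$sample" | grep -Ev '^\w{3}:' | cut -d: -f1)

printf 'three characters:\n%s\nother lengths:\n%s\n' "$three" "$not_three"
```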

Conclusions

In this article we reviewed Linux regular expressions, but these were only the very basics. If you dig a little deeper, you will find that this tool lets you do many more interesting things. The time spent learning regular expressions is definitely worth it.

To finish, a lecture from Yandex about regular expressions: