Detailed explanation of Nginx regular expressions

Nginx (engine x) is a high-performance HTTP and reverse proxy server, as well as an IMAP/POP3/SMTP server. Nginx was developed by Igor Sysoev for the second most visited site in Russia, Rambler.ru (Russian: Рамблер). The first public version 0.1.0 was released on October 4, 2004.

Its source code is released under a BSD-like license, and it is known for its stability, rich feature set, sample configuration files, and low system resource consumption. On June 1, 2011, nginx 1.0.4 was released.

Nginx is a lightweight web server, reverse proxy server, and email (IMAP/POP3) proxy server released under a BSD-like license. It uses little memory and handles high concurrency well; its concurrency performance compares favorably with other web servers of the same type. Websites in mainland China that use nginx include Baidu, JD.com, Sina, NetEase, Tencent, and Taobao.

Today we will look at the rules for using regular expressions in Nginx. I will first explain the syntax, then give a few brief examples and walk through them.

What is a regular expression

A regular expression (often abbreviated as regex, regexp, or RE in code) is a concept in computer science. Regular expressions are typically used to find and replace text that matches a given pattern (rule).

Many programming languages support string manipulation using regular expressions. For example, Perl has a powerful regular expression engine built in. The concept of regular expressions was originally popularized by Unix tools such as sed and grep. The term is commonly abbreviated as "regex" or "regexp" in the singular, and "regexes", "regexps", or "regexen" in the plural.

Regular expressions consist of ordinary characters and metacharacters. Ordinary characters include uppercase and lowercase letters and digits, while metacharacters have special meanings, which are explained below.

In the simplest case, a regular expression looks like an ordinary search string. For example, the regular expression "testing" does not contain any metacharacters. It can match strings such as "testing" and "testing123", but cannot match "Testing".
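The literal match described above can be checked with a short Python sketch. Python's re module is used here only as a convenient stand-in (Nginx itself uses PCRE), and the variable names are illustrative. In Nginx, the `~` operator is case-sensitive and `~*` is case-insensitive, which corresponds roughly to the IGNORECASE flag below.

```python
import re

# A pattern without metacharacters behaves like an ordinary substring search.
pattern = re.compile("testing")

for text in ["testing", "testing123", "Testing"]:
    print(text, pattern.search(text) is not None)

# With a case-insensitive flag, "Testing" matches as well.
insensitive = re.search("testing", "Testing", re.IGNORECASE) is not None
print(insensitive)
```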

To really use regular expressions well, correctly understanding metacharacters is the most important thing. The following table lists all metacharacters and a brief description of them.

Metacharacter

Description

\

Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "\n" matches a newline character, while "\\n" matches a literal backslash followed by "n". The sequence "\\" matches "\" and "\(" matches "(". This corresponds to the concept of an "escape character" in many programming languages.

^

Matches the beginning of an input line. If the Multiline property of the RegExp object is set, ^ also matches the position after "\n" or "\r".

$

Matches the end of an input line. If the Multiline property of the RegExp object is set, $ also matches the position before "\n" or "\r".

*

Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo", and "zoo". "*" is equivalent to "{0,}".

+

Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". "+" is equivalent to "{1,}".

?

Matches the preceding subexpression zero or one time. For example, "do(es)?" matches either "do" or "does". "?" is equivalent to "{0,1}".

{n}

n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but does match the two o's in "food".

{n,}

n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".

{n,m}

m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood" as one match and the last three o's as another. "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and either number.

?

When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, in the string "oooo", "o+" matches as many o's as possible, giving ["oooo"], while "o+?" matches as few as possible, giving ["o", "o", "o", "o"].

. (dot)

Matches any single character except "\n" and "\r". To match any character including "\n" and "\r", use a pattern like "[\s\S]".

(pattern)

Matches pattern and captures the match. The captured matches can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match a parenthesis character, use "\(" or "\)".

(?:pattern)

Non-capturing match: matches pattern but does not capture the result or store it for later use. This is useful when combining parts of a pattern with the "or" character "|". For example, "industr(?:y|ies)" is a shorter expression than "industry|industries".

(?=pattern)

Non-capturing positive lookahead: matches at any position where the text that follows matches pattern. The lookahead text is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches "Windows" in "Windows2000", but not "Windows" in "Windows3.1". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.

(?!pattern)

Non-capturing negative lookahead: matches at any position where the text that follows does not match pattern. The lookahead text is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches "Windows" in "Windows3.1", but not "Windows" in "Windows2000".

(?<=pattern)

Non-capturing positive lookbehind: like positive lookahead, but looking in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches "Windows" in "2000Windows" but not "Windows" in "3.1Windows".

Note: "(?<=95|98|NT|2000)Windows" raises an error in the re module of Python 3.6, because the alternatives joined by "|" inside a lookbehind must all have the same length. Here "95", "98", and "NT" have length 2, while "2000" has length 4.

(?<!pattern)

Non-capturing negative lookbehind: like negative lookahead, but looking in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches "Windows" in "3.1Windows", but not "Windows" in "2000Windows".

Note: the same restriction applies here. In the re module of Python 3.6, the alternatives inside the lookbehind must all have the same length, but that length does not have to be 2. "(?<!95|98|NT|20)Windows" is valid and "(?<!95|980|NT|20)Windows" is an error, while "(?<!1995|1998|NTNT|2000)Windows" is also acceptable. A single alternative such as "(?<!2000)Windows" is not affected. PCRE, the library Nginx uses for regular expressions, allows the top-level alternatives of a lookbehind to have different lengths as long as each one is fixed.
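The lookahead and lookbehind entries above, including the length restriction in Python's re module, can be checked with a small sketch. The helper name `compiles` is made up for this illustration, and the results describe Python's re module only, not PCRE.

```python
import re

# Positive and negative lookahead, using the examples from the entries above.
ahead_2000 = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000") is not None
ahead_31 = re.search(r"Windows(?=95|98|NT|2000)", "Windows3.1") is not None
not_ahead_31 = re.search(r"Windows(?!95|98|NT|2000)", "Windows3.1") is not None

# Lookbehind whose alternatives all have the same length is accepted by re.
behind_ok = re.search(r"(?<=1995|1998|NTNT|2000)Windows", "2000Windows") is not None


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(ahead_2000, ahead_31, not_ahead_31)      # True False True
print(behind_ok)                               # True
print(compiles(r"(?<=95|98|NT|2000)Windows"))  # False: lengths 2 and 4 mixed
print(compiles(r"(?<!95|98|NT|20)Windows"))    # True: every alternative has length 2
print(compiles(r"(?<!2000)Windows"))           # True: single alternative
```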

x|y

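Matches x or y. For example, "z|food" matches "z" or "food". Be careful here: alternation applies to the whole expression on each side, so this does not mean "zood" or "food". To match "zood" or "food", use "[zf]ood" or "(z|f)ood".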

[xyz]

Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".

[^xyz]

Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches "p", "l", "i", and "n" in "plain".

[a-z]

Character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z".

Note: A hyphen represents a range only when it is inside a character set and appears between two characters; at the beginning of a character set it represents the hyphen itself.

[^a-z]

Negated character range. Matches any character not in the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".

\b

Matches a word boundary, that is, the position between a word character and a non-word character such as a space. (Regular expressions have two kinds of matching: matching characters and matching positions. \b matches a position.) For example, "er\b" matches the "er" in "never" but not the "er" in "verb"; "\b1_" matches the "1_" in "1_23" but not in "21_3".

\B

Matches a non-word boundary. "er\B" can match the "er" in "verb", but cannot match the "er" in "never".

\cx

Matches the control character indicated by x. For example, "\cM" matches Control-M, that is, a carriage return. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal "c" character.

\d

Matches a digit character. Equivalent to [0-9]. With grep, this requires the -P option (Perl-compatible regular expressions).

\D

Matches a non-digit character. Equivalent to [^0-9]. With grep, this requires the -P option (Perl-compatible regular expressions).

\f

Matches a form feed character. Equivalent to \x0c and \cL.

\n

Matches a newline character. Equivalent to \x0a and \cJ.

\r

Matches a carriage return character. Equivalent to \x0d and \cM.

\s

Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].

\S

Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].

\t

Matches a tab character. Equivalent to \x09 and \cI.

\v

Matches a vertical tab character. Equivalent to \x0b and \cK.

\w

Matches any word character including underscore. Similar to but not equivalent to "[A-Za-z0-9_]", where "word" characters use the Unicode character set.

\W

Matches any non-word character, that is, any character not matched by \w. For ASCII-only matching this is equivalent to "[^A-Za-z0-9_]".

\xn

Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.

\num

Matches num, where num is a positive integer. This is a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.

\n (where n is a digit)

Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, then n is a backreference. Otherwise, if n is an octal digit (0-7), then n is an octal escape value.

\nm

Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, then nm is a backreference. If \nm is preceded by at least n captured subexpressions, then n is a backreference followed by the literal m. If neither condition is met, and both n and m are octal digits (0-7), then \nm matches the octal escape value nm.

\nml

If n is an octal digit (0-7), and both m and l are octal digits (0-7), then matches the octal escape value nml.

\un

Matches n, where n is a Unicode character represented by four hexadecimal digits. For example, "\u00A9" matches the copyright symbol (©).

\p{P}

Lowercase p stands for property, indicating Unicode property, and is used as a prefix for Unicode regular expressions. The "P" in the brackets indicates one of the seven character attributes of the Unicode character set: punctuation characters.

The other six properties are:

L: letter;

M: Marking symbol (usually does not appear alone);

Z: separator (such as space, line break, etc.);

S: symbol (such as mathematical symbols, currency symbols, etc.);

N: Numbers (such as Arabic numerals, Roman numerals, etc.);

C: Other characters.

Note: This syntax is not supported by every language or engine. JavaScript, for example, supports it only from ES2018 onward and only when the u flag is set.

\<

\>

Match the beginning (\<) and end (\>) of a word. For example, the regular expression "\<the\>" matches "the" in the string "for the wise", but not "the" in the string "otherwise". Note: These metacharacters are not supported by all software.

( )

Defines the expression between ( and ) as a "group" and saves the characters matched by it in a temporary area (a regular expression can save up to 9 groups), which can be referenced with \1 through \9.

|

Performs a logical "OR" between two match conditions. For example, the regular expression "(him|her)" matches "it belongs to him" and "it belongs to her", but not "it belongs to them.". Note: This metacharacter is not supported by all software.
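
Several of the entries in the table above can be checked quickly in Python. This is only a sketch using Python's re module as a stand-in, so details may differ from PCRE as used by Nginx; the variable names are illustrative.

```python
import re

# Quantifier {n,m} is greedy: six o's are consumed as two runs of three.
runs = re.findall(r"o{1,3}", "fooooood")

# Greedy versus non-greedy matching on the same input.
greedy = re.findall(r"o+", "oooo")
lazy = re.findall(r"o+?", "oooo")

# \b matches at a word boundary, \B at any position that is not one.
boundary_never = re.search(r"er\b", "never") is not None
boundary_verb = re.search(r"er\b", "verb") is not None
non_boundary_verb = re.search(r"er\B", "verb") is not None

# Backreference: (.)\1 finds two identical consecutive characters.
doubled = re.search(r"(.)\1", "food").group(0)

# Alternation covers the whole expression on each side of "|".
z_or_food = re.fullmatch(r"z|food", "zood") is not None
zood_or_food = re.fullmatch(r"[zf]ood", "zood") is not None

print(runs)     # ['ooo', 'ooo']
print(greedy)   # ['oooo']
print(lazy)     # ['o', 'o', 'o', 'o']
print(boundary_never, boundary_verb, non_boundary_verb)  # True False True
print(doubled)  # oo
print(z_or_food, zood_or_food)  # False True
```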

Example 1: Automatically redirecting mobile (WAP) visitors from the PC domain name to the mobile domain name

The requirement in this case is that when a mobile phone accesses www.baidu.com, the request is automatically rewritten to m.baidu.com, and when it accesses www.souhu.com, the request is rewritten to m.souhu.com. (The configuration below redirects to an "app." host; substitute whatever prefix your mobile site uses.)

 # Extract the main domain name
 if ( $server_name ~ ((www.|)([\S\s]*)) ) {
  set $domain $3;
 }

 # Set the initial value
 set $temp 0;

 # Check whether this is a payment domain name
 if ( $host ~* (pay|zf) ) {
  set $temp "${temp}1";
 }

 # Check whether the client is a mobile device
 if ($http_user_agent ~* (mobile|nokia|iphone|ipad|android|samsung|htc|blackberry)) {
  set $temp "${temp}2";
 }

 # Redirect only when the flags are exactly "02"
 if ( $temp = "02" ) {
  rewrite ^(.*) https://app.$domain permanent;
 }

Script logic analysis:

First, we need the main domain name, so we match it with a regular expression. Taking www.baidu.com as an example, the first thing we see is the "www." prefix, but users may also enter baidu.com directly, so we use (www.|) to match either the prefix or nothing, and the group after it matches the rest of the name. (The unescaped dot in "www." matches any character; "www\." would be stricter.) $3 takes the value captured by the third pair of parentheses, which is then assigned to the variable $domain. Next, the variable $temp collects flags: "1" is appended when $host looks like a payment domain name, and "2" is appended when the built-in variable $http_user_agent indicates a mobile device. The redirect is performed only when $temp equals "02", that is, for a mobile client on a non-payment domain name.
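
To make the flag logic easier to follow, here is a rough Python sketch of what the configuration does. The function `redirect_target` is a hypothetical helper written for this illustration, the regular expressions are copied from the configuration, and Python's re module stands in for the PCRE engine that Nginx uses.

```python
import re

# Patterns copied from the Nginx configuration in this example.
DOMAIN_RE = re.compile(r"((www.|)([\S\s]*))")
PAY_RE = re.compile(r"(pay|zf)", re.IGNORECASE)
MOBILE_RE = re.compile(
    r"(mobile|nokia|iphone|ipad|android|samsung|htc|blackberry)",
    re.IGNORECASE,
)


def redirect_target(server_name, host, user_agent):
    """Return the redirect URL, or None when no redirect would happen."""
    # Equivalent of `set $domain $3`: the third group is the main domain name.
    domain = DOMAIN_RE.search(server_name).group(3)

    temp = "0"                        # set $temp 0
    if PAY_RE.search(host):           # payment domain name: append "1"
        temp += "1"
    if MOBILE_RE.search(user_agent):  # mobile device: append "2"
        temp += "2"

    # Only "02" (mobile, not a payment domain name) triggers the redirect.
    if temp == "02":
        return "https://app." + domain
    return None


print(redirect_target("www.baidu.com", "www.baidu.com", "Mozilla/5.0 (iPhone)"))
print(redirect_target("www.baidu.com", "pay.baidu.com", "Android"))
```

The first call prints https://app.baidu.com; the second prints None, because the flags become "012" for a mobile client on a payment domain name.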

Example 2: Nginx IP whitelist

The requirement in this case is that only specific IP addresses may access our backend. Requests from any other IP address are sent to an error page or straight back to the home page.

# Define the initial value
set $my_ip 0;

# Check whether the client is on the whitelist
if ( $http_x_forwarded_for ~* "10.0.0.1|172.16.0.1" ){
 set $my_ip 1;
}

# Redirect IP addresses that are not on the whitelist
if ( $my_ip = 0 ) {
 rewrite ^/$ /40x.html;
}

Script logic analysis:

This works the same way as the check above for computer versus mobile access; the only difference is the built-in variable. In Nginx, $http_x_forwarded_for holds the value of the X-Forwarded-For request header, which carries the client's real IP address when a proxy in front of Nginx sets it, so we test this variable and set an initial value beforehand. Because the header arrives with the request, it should be relied on only when a trusted proxy sets it.
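
One detail worth checking is how loosely the whitelist pattern matches. The sketch below uses Python's re module to compare the pattern as written with a stricter variant. The stricter pattern and the sample addresses are assumptions added for illustration, not part of the original configuration, and the stricter pattern assumes the header holds a single address rather than a comma-separated list.

```python
import re

# Pattern exactly as written in the configuration: dots unescaped, no anchors.
LOOSE = re.compile(r"10.0.0.1|172.16.0.1", re.IGNORECASE)

# A stricter variant (assumption): escaped dots, anchored to the whole value.
STRICT = re.compile(r"^(10\.0\.0\.1|172\.16\.0\.1)$")

candidates = ["10.0.0.1", "172.16.0.1", "110.0.0.10", "10.0.0.100", "10a0b0c1"]

loose_hits = [ip for ip in candidates if LOOSE.search(ip)]
strict_hits = [ip for ip in candidates if STRICT.search(ip)]

print(loose_hits)   # all five candidates match the loose pattern
print(strict_hits)  # ['10.0.0.1', '172.16.0.1']
```

An unescaped dot matches any character and an unanchored pattern matches anywhere in the value, which is why addresses such as 110.0.0.10 pass the loose check.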

Example 3: Rewrite the URL address and hide the submitted content

The requirement in this case is that after a form is submitted, the URL shows the submitted parameters, such as http://baidu.com/index.php?user=admin&pass=123, and we need to rewrite the URL to http://baidu.com/index

rewrite ^/(\w+)/(\w+)/z(\d+) /$1/$2/$3/$arg_x/$arg_y? permanent;
rewrite ^/(\w+)/(\w+)/(\d+)/(\d+)/(\d+) /$1/$2/$3/$4_$5.png permanent;

Script logic analysis:

First, consider how the URL evolves: http://baidu.com/index.php?user=admin&pass=123 => http://baidu.com/index.php/user/admin/pass/123 => http://baidu.com/index. We then proceed step by step along this path. The regular expression in an nginx rewrite is matched against the path only, not against the parameters after the question mark, so use $arg_{parameter name} to carry a parameter into the new URL, and end the replacement string with a question mark so that nginx does not append the original query string again. Finally, match the remaining parts to be replaced, and the rewrite is complete.
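
The two rewrite rules can be imitated in Python to see which captures end up where. This is a sketch only: both function names and the sample paths are invented for illustration, a plain dictionary stands in for the query string behind $arg_x and $arg_y, and Python's re module stands in for PCRE.

```python
import re


def rewrite_with_args(path, args):
    """Imitate the first rule: captures plus $arg_x and $arg_y."""
    match = re.match(r"^/(\w+)/(\w+)/z(\d+)", path)
    if match is None:
        return path  # the rule does not apply
    first, second, number = match.groups()
    # A missing argument becomes an empty string, as with $arg_ variables.
    return "/{}/{}/{}/{}/{}".format(
        first, second, number, args.get("x", ""), args.get("y", "")
    )


def rewrite_image_path(path):
    """Imitate the second rule: the last two numbers become NN_NN.png."""
    match = re.match(r"^/(\w+)/(\w+)/(\d+)/(\d+)/(\d+)", path)
    if match is None:
        return path  # the rule does not apply
    # Like nginx, replace the whole path with the expanded template.
    return match.expand(r"/\1/\2/\3/\4_\5.png")


print(rewrite_with_args("/a/b/z7", {"x": "1", "y": "2"}))  # /a/b/7/1/2
print(rewrite_image_path("/img/thumb/12/34/56"))           # /img/thumb/12/34_56.png
```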

Summary

Regular expressions are not difficult. Only a handful of metacharacters are in common use. You could say that regular expressions are a game of combining a few simple pieces, but it is a game with a very wide range of uses.
