How to block and prohibit web crawlers in Nginx server

Almost every website is visited by crawlers that do not belong to search engines. Most of them are content scrapers or beginner-written bots. Unlike search engine spiders, they apply no frequency control, so they often consume a lot of server resources and waste bandwidth.

In fact, Nginx can easily filter requests by User-Agent. A simple regular expression at the relevant URL entry point is enough to reject unwanted crawler requests:

location / {
  if ($http_user_agent ~* "python|curl|java|wget|httpclient|okhttp") {
    return 503;
  }
  # Other normal configuration...
}

Note: $http_user_agent is a built-in Nginx variable that can be referenced directly inside a location block. The ~* operator performs a case-insensitive regular-expression match, so this rule catches the default User-Agent strings that most Python crawlers send (for example python-requests). Crawlers that spoof a browser User-Agent will not be caught this way.
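For a longer blocklist, a map defined in the http context is often easier to maintain than a regex inside an if block, and the resulting flag can be reused across several server blocks. A minimal sketch using the same User-Agent pattern (the zone of applicability and the listen port are illustrative, not from this article):

```nginx
# http context: evaluate the User-Agent once and store a flag
map $http_user_agent $is_bad_bot {
    default                                      0;
    ~*(python|curl|java|wget|httpclient|okhttp)  1;
}

server {
    listen 80;

    # reject flagged clients before any other processing
    if ($is_bad_bot) {
        return 503;
    }

    # Other normal configuration...
}
```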

Blocking web crawlers in Nginx

server {
    listen 80;
    server_name www.xxx.com;

    #charset koi8-r;
    #access_log logs/host.access.log main;

    #location / {
    #    root html;
    #    index index.html index.htm;
    #}

    if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
        return 403;
    }

    location ~ ^/(.*)$ {
        proxy_pass http://localhost:8080;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        client_max_body_size 10m;
        client_body_buffer_size 128k;
        proxy_connect_timeout 90;
        proxy_send_timeout 90;
        proxy_read_timeout 90;
        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
    }

    #error_page 404 /404.html;

    # redirect server error pages to the static page /50x.html
    #
    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root html;
    }

    # proxy the PHP scripts to Apache listening on 127.0.0.1:80
    #
    #location ~ \.php$ {
    #    proxy_pass http://127.0.0.1;
    #}

    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
    #
    #location ~ \.php$ {
    #    root html;
    #    fastcgi_pass 127.0.0.1:9000;
    #    fastcgi_index index.php;
    #    fastcgi_param SCRIPT_FILENAME /scripts$fastcgi_script_name;
    #    include fastcgi_params;
    #}

    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    #
    #location ~ /\.ht {
    #    deny all;
    #}
}
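User-Agent checks only catch crawlers that identify themselves. To address the frequency problem mentioned at the start of this article, they can be combined with Nginx's built-in per-IP rate limiting. A minimal sketch (the zone name, size, and rates are arbitrary examples, not values from this article):

```nginx
# http context: track clients by IP, allowing 10 requests per second each
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # allow short bursts of up to 20 extra requests, reject the rest
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 503;
    }
}
```

Unlike the User-Agent blocklist, this also slows down crawlers that spoof a browser User-Agent, at the cost of possibly throttling many users behind a shared IP.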

You can test the rule with curl by spoofing a blocked User-Agent; the request should return HTTP 403, while a normal request passes through:

curl -I -A "qihoobot" www.xxx.com   # expect HTTP/1.1 403 Forbidden
curl -I www.xxx.com

Summary

The above is the full content of this article. I hope it offers some reference value for your study or work. Thank you for your support of 123WORDPRESS.COM.

