Robots.txt detailed introduction

Basic introduction to robots.txt

Robots.txt is a plain text file in which website administrators can declare which parts of the website they do not want robots to access, or specify that search engines should only index certain content.
When a search robot (sometimes called a search spider) visits a site, it first checks whether a robots.txt file exists in the root directory of the site. If it does, the robot determines the scope of its access according to the contents of that file; if the file does not exist, the robot simply crawls along the site's links.
In addition, robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
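To make this check concrete, here is a minimal sketch (not part of the original article) of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module; the site URL is simply the example used below, and somepage.html is a made-up path.

from urllib import robotparser

# Point the parser at the robots.txt in the site root and download it.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.csswebs.org/robots.txt")
rp.read()

# Ask whether a given user agent may fetch a given URL.
# If the file does not exist (404), can_fetch() returns True,
# matching the "crawl along the links" behaviour described above.
allowed = rp.can_fetch("*", "http://www.csswebs.org/somepage.html")
print(allowed)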
Robots.txt writing syntax

First, let's look at a robots.txt example: http://www.csswebs.org/robots.txt
Visiting the address above, we can see the contents of that robots.txt file:
# Robots.txt file from http://www.csswebs.org
# All robots will spider the domain
User-agent: *
Disallow:
The above text means that all search robots are allowed to access all files under the www.csswebs.org site.
Specific syntax analysis: text after # is a comment; User-agent: is followed by the name of a search robot (if it is followed by *, the rule applies to all search robots); Disallow: is followed by a directory or file that the robot is not allowed to access.
Below, I will list some specific uses of robots.txt:
Allow all robots to access the entire site
User-agent: *
Disallow:
Alternatively, you can create an empty "/robots.txt" file.
Block all search engines from accessing any part of the site
User-agent: *
Disallow: /
Block all search engines from accessing several sections of the site (directories 01, 02, 03 in the example below)
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
Block access by a particular search engine (BadBot in the example below)
User-agent: BadBot
Disallow: /
Only allow access from a certain search engine (Crawler in the example below)
User-agent: Crawler
Disallow:
User-agent: *
Disallow: /
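For completeness, these rules can also be combined in a single robots.txt file, with one group of Disallow lines per User-agent line. A minimal combined sketch, reusing the BadBot name and numbered directories from the examples above:

# Keep BadBot out of the entire site
User-agent: BadBot
Disallow: /

# Every other robot may crawl everything except directories 01 and 02
User-agent: *
Disallow: /01/
Disallow: /02/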
In addition, it is worth expanding on this and introducing the Robots META tag:
The Robots META tag mainly targets individual pages. Like other META tags (such as the language used, the page description, keywords, etc.), it is placed in the <head> </head> section of the page and is specifically used to tell search engine robots how to crawl the content of that page.
How to write the Robots META tag:
The Robots META tag is not case sensitive. name="Robots" addresses all search engines; to target a specific robot, write its name instead, for example name="BaiduSpider". The content part has four directive options: index, noindex, follow, and nofollow, separated by ",".
The INDEX directive tells the search robot that the page may be indexed;
The FOLLOW directive indicates that the search robot may continue crawling along the links on the page;
The default values for the Robots META tag are INDEX and FOLLOW, except for Inktomi, for which the default is INDEX, NOFOLLOW.
Thus, there are four combinations:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Among them, <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">;
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.
At present, the vast majority of search engine robots abide by the rules of robots.txt. Support for the Robots META tag is less widespread, but it is gradually increasing. For example, Google fully supports it, and Google has also added the "noarchive" directive, which can prevent Google from keeping a snapshot of the page. For example:
<META NAME="googlebot" CONTENT="index,follow,noarchive">
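For context, here is a minimal sketch (not from the original article) of how such tags sit inside a page's <head>; the title, charset, and body content are just placeholders:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Example page</title>
<!-- Index this page and follow its links, but ask Google
     not to keep a cached snapshot of it. -->
<meta name="robots" content="index,follow">
<meta name="googlebot" content="noarchive">
</head>
<body>
<p>Page content goes here.</p>
</body>
</html>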
