How to prevent website content from being included in search engines

A website is usually built to be indexed by search engines so that it reaches a wider audience. But if your site contains private or confidential pages that are not meant for the public, you may need to stop search engines from crawling and indexing it. Taobao is a well-known example of a site that restricts search engine indexing. This article covers several ways to block search engines from crawling and indexing website content.

Search engine spiders crawl the Internet constantly. Unless a website takes steps to block them, its pages will readily end up in search engine indexes. The methods below show how to prevent that.

First, the robots.txt method

Mainstream search engines honor the robots.txt protocol by default, although some rogue crawlers ignore it. Create a plain-text file named robots.txt, place it in the root directory of the website, and give it the following content:

User-agent: *
Disallow: /

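The code above tells search engines not to crawl this website. Use it with care: it blocks every compliant search engine from accessing any part of the site. Keep in mind that robots.txt controls crawling, so a blocked URL may still show up in search results (without a description) if other sites link to it.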
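Before deploying a rule like this, it can help to check locally how a standards-following parser reads it. The following is a minimal sketch using Python's standard urllib.robotparser module; the function name is_allowed, the crawler names in the loop, and the example.com URL are illustrative assumptions, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules shown above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler name used only for illustration.
    for agent in ("Baiduspider", "Googlebot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/any/page.html")
        print(agent, "allowed" if allowed else "blocked")
```

With the wildcard rule, every crawler name in the loop should be reported as blocked.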

To block only the Baidu search engine from crawling and indexing your pages, edit robots.txt so that it reads:

User-agent: Baiduspider
Disallow: /

The above robots file will prohibit all crawling from Baidu.
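To confirm that a rule like this targets only Baidu, the same kind of local check can be applied. This is a rough sketch with Python's standard urllib.robotparser; the helper name crawl_report, the path /index.html, and the crawler name "SomeOtherBot" are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The Baidu-only rules shown above.
BAIDU_ONLY = """\
User-agent: Baiduspider
Disallow: /
"""


def crawl_report(robots_txt: str, agents, path: str = "/index.html") -> dict:
    """Map each crawler name to True (may crawl path) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler name.
    print(crawl_report(BAIDU_ONLY, ["Baiduspider", "Googlebot", "SomeOtherBot"]))
```

Only Baiduspider should come back as blocked, since no other group and no wildcard group exists in the file.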

Let’s talk about Baidu’s user-agent here. What is Baiduspider’s user-agent?

Baidu uses different user-agents for various products:

Product name              User-agent
Wireless search           Baiduspider
Image search              Baiduspider-image
Video search              Baiduspider-video
News search               Baiduspider-news
Baidu Collection          Baiduspider-favo
Baidu Alliance            Baiduspider-cpro
Business search           Baiduspider-ads
Web and other searches    Baiduspider
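When reading server logs, the table above can be turned into a small lookup that guesses which Baidu product sent a request. This is only a sketch: the function name baidu_product is hypothetical, the sample header strings in the usage section are illustrative, and because User-Agent headers can be forged, a match does not prove the request really came from Baidu.

```python
# Tokens from the table above, mapped to product names.
BAIDU_PRODUCTS = {
    "Baiduspider-image": "Image search",
    "Baiduspider-video": "Video search",
    "Baiduspider-news": "News search",
    "Baiduspider-favo": "Baidu Collection",
    "Baiduspider-cpro": "Baidu Alliance",
    "Baiduspider-ads": "Business search",
    "Baiduspider": "Web, wireless, and other searches",
}


def baidu_product(user_agent_header: str):
    """Return the Baidu product name for a User-Agent header, or None."""
    ua = user_agent_header.lower()
    # Try longer tokens first so "Baiduspider-image" is not mistaken
    # for the plain "Baiduspider" token it starts with.
    for token in sorted(BAIDU_PRODUCTS, key=len, reverse=True):
        if token.lower() in ua:
            return BAIDU_PRODUCTS[token]
    return None


if __name__ == "__main__":
    # Sample headers for illustration only.
    samples = [
        "Mozilla/5.0 (compatible; Baiduspider/2.0)",
        "Baiduspider-image",
        "Mozilla/5.0 (Windows NT 10.0) SomeBrowser/1.0",
    ]
    for header in samples:
        print(header, "->", baidu_product(header))
```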

You can set different crawling rules for each product based on its user-agent. The following robots.txt blocks all crawling by Baidu but allows image search to crawl the /image/ directory:

User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/

Please note: pages crawled by Baiduspider-cpro and Baiduspider-ads are not indexed. These crawlers only carry out operations agreed with customers, so they do not follow the robots protocol. If this causes problems, the only remedy is to contact Baidu.

To block only the Google search engine from crawling and indexing your pages, edit robots.txt so that it reads:

User-agent: googlebot
Disallow: /
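Since the Baidu and Google rules follow the same pattern, a robots.txt that blocks several named crawlers can be generated instead of written by hand. The sketch below assumes a hypothetical helper named build_robots_txt and verifies its output with Python's standard urllib.robotparser; "SomeOtherBot" is again a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents) -> str:
    """Build robots.txt text that disallows the whole site for each listed crawler."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # A blank line between groups keeps the file easy to read.
    return "\n".join(groups)


def may_crawl(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Check the generated rules with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    text = build_robots_txt(["Googlebot", "Baiduspider"])
    print(text)
    for agent in ("Googlebot", "Baiduspider", "SomeOtherBot"):
        print(agent, may_crawl(text, agent))
```

Crawlers that are not listed stay unrestricted, because the generated file contains no wildcard group.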

Second, the web page code method

Add <meta name="robots" content="noarchive"> between the <head> and </head> tags of a page. This tag tells all search engines not to display a cached snapshot of that page. The tag works per page, so add it to every page you want covered, not only the homepage. Note that noarchive by itself does not stop crawling or indexing.

Add <meta name="Baiduspider" content="noarchive"> between the <head> and </head> tags to stop only the Baidu search engine from displaying a snapshot of the page.

Add <meta name="googlebot" content="noarchive"> between the <head> and </head> tags to stop only the Google search engine from displaying a snapshot of the page.
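To check which of these tags a page actually carries, the HTML can be scanned for them. The following is a minimal sketch using Python's standard html.parser module; the class name RobotsMetaParser, the function robots_meta, and the sample HTML are assumptions for illustration, and the sketch only looks at the three meta names used in this article.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots-style <meta> tags, keyed by lowercase name."""

    CRAWLER_NAMES = {"robots", "baiduspider", "googlebot"}

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        if name in self.CRAWLER_NAMES:
            content = attr_map.get("content", "")
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def robots_meta(html: str) -> dict:
    """Return a dict such as {"robots": {"noarchive"}} for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    sample = (
        "<html><head>"
        '<meta name="robots" content="noarchive">'
        '<meta name="Baiduspider" content="noarchive">'
        "</head><body></body></html>"
    )
    print(robots_meta(sample))
```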

Some common questions and special cases follow:

1. The website has a robots.txt file, so why can it still be found in Baidu search?

Updating the search engine's index database takes time. Although Baiduspider has stopped accessing the pages on your website, it may take several months for the index information already stored in the Baidu search engine database to be cleared. Also check that your robots.txt configuration is correct. If removal is urgent, you can submit a request through the complaint platform.

2. I want my website content to be indexed by Baidu but not saved as snapshots. What should I do?

Baiduspider complies with the Internet meta robots protocol. You can use a page's meta settings so that Baidu indexes the page but does not display a snapshot of it in search results. As with robots.txt changes, updating the search engine's index database takes time. So even after you have used a meta tag to stop Baidu from displaying a snapshot, if index information for the page already exists in the Baidu search engine database, the change may take two to four weeks to take effect online.

3. To have your pages indexed by Baidu without snapshots being saved, use the following code:

<meta name="Baiduspider" content="noarchive">

4. To stop all search engines from saving snapshots of your web pages, use the following code:

<meta name="robots" content="noarchive">

Here are some commonly used code combinations:

<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">: This page may be indexed, and the links on it may be followed
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">: Do not index this page, but the links on it may be followed
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">: This page may be indexed, but the links on it must not be followed
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">: Do not index this page, and do not follow the links on it
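The combinations above can be read mechanically: each directive switches one permission on or off. The sketch below assumes a hypothetical function named interpret_robots_content and treats a missing directive as permission granted, which is the usual default but is stated here as an assumption. It only handles the directives mentioned in this article.

```python
def interpret_robots_content(content: str) -> dict:
    """Translate a robots meta CONTENT string into index/follow/archive permissions."""
    # Directives are comma-separated and case-insensitive.
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
    return {
        # Assumption: anything not explicitly forbidden is allowed.
        "index": "noindex" not in tokens,
        "follow": "nofollow" not in tokens,
        "archive": "noarchive" not in tokens,
    }


if __name__ == "__main__":
    for value in ("INDEX,FOLLOW", "NOINDEX,FOLLOW", "INDEX,NOFOLLOW",
                  "NOINDEX,NOFOLLOW", "noarchive"):
        print(value, "->", interpret_robots_content(value))
```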

Summary

That is all for this article. I hope it serves as a useful reference for your study or work. Thank you for your support of 123WORDPRESS.COM.