How to prevent website content from being included in search engines

Usually, the goal of building a website is to get it indexed by search engines and reach a wider audience. But if your site contains private or confidential pages that should not be public, you need to stop search engines from crawling and indexing them. Taobao, for example, is a site that blocks search engine indexing. This article describes several ways to block search engines from crawling and indexing website content.

Search engine spiders crawl the Internet continuously. If a website takes no steps to block them, its pages can easily end up in search engine indexes. The methods below show how to prevent that.

First, the robots.txt method

Mainstream search engines honor the robots.txt protocol by default (although some rogue crawlers ignore it). Create a plain-text file named robots.txt, put it in the root directory of the website, and add the following:

User-agent: *
Disallow: /

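The code above tells every search engine not to crawl the website. Use it with care: it blocks all compliant search engines from accessing any part of the site. Note that robots.txt controls crawling, so some search engines may still list a blocked URL, without a description, if other sites link to it.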
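Before deploying a rule like this, it can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_allowed and the example.com URL are assumptions made for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The block-everything rule set from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real deployment.
    for agent in ("Baiduspider", "Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/any/page.html"))
```

Every agent in the loop is reported as not allowed, which matches the intent of blocking the whole site.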

To block only the Baidu search engine from crawling and indexing the site, edit robots.txt as follows:

User-agent: Baiduspider
Disallow: /

The above robots file will prohibit all crawling from Baidu.
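To confirm that a per-engine rule affects only the intended crawler, a quick local check along these lines may help. This is a sketch that assumes Python's standard urllib.robotparser is an acceptable stand-in for a real crawler; the helper name blocked_agents and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The Baidu-only rule set from the article.
BAIDU_ONLY = """\
User-agent: Baiduspider
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/index.html"):
    """Return the agents from the list that the rules forbid from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(BAIDU_ONLY, ["Baiduspider", "Googlebot", "bingbot"]))
    # prints ['Baiduspider']
```

Crawlers with no matching group and no "*" group fall back to being allowed, which is why only Baiduspider appears in the result.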

Which user-agents does Baiduspider use?

Baidu uses a different user-agent for each of its products:

  • Product name: user-agent
  • Wireless search: Baiduspider
  • Image Search: Baiduspider-image
  • Video Search: Baiduspider-video
  • News Search: Baiduspider-news
  • Baidu Collection: Baiduspider-favo
  • Baidu Alliance: Baiduspider-cpro
  • Business Search: Baiduspider-ads
  • Web and other searches: Baiduspider

You can set different crawling rules for each product based on its user-agent. The following robots.txt blocks all crawling by Baidu but allows image search to crawl the /image/ directory:

User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/
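How two overlapping groups are resolved depends on the crawler's parser, so it can be worth testing a rule set before relying on it. The sketch below uses Python's standard urllib.robotparser, which is only a stand-in for Baidu's own parser and may resolve groups differently. The rule set is a more explicit variant written for illustration, not Baidu's documented form: the specific group is listed first and carries its own Disallow line. The helper name can_crawl and the example.com host are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical explicit variant: the image crawler gets its own complete group
# (one allowed directory, everything else disallowed), listed before the
# general Baiduspider group.
EXPLICIT_RULES = """\
User-agent: Baiduspider-image
Allow: /image/
Disallow: /

User-agent: Baiduspider
Disallow: /
"""


def can_crawl(user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path on a placeholder host."""
    parser = RobotFileParser()
    parser.parse(EXPLICIT_RULES.splitlines())
    return parser.can_fetch(user_agent, "https://example.com" + path)


if __name__ == "__main__":
    print(can_crawl("Baiduspider-image", "/image/a.jpg"))  # True
    print(can_crawl("Baiduspider-image", "/news/1.html"))  # False
    print(can_crawl("Baiduspider", "/image/a.jpg"))        # False
```

Under this parser, the image crawler reaches only /image/, while the general crawler is blocked everywhere.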

Please note: pages crawled by Baiduspider-cpro and Baiduspider-ads are not indexed. These crawlers only carry out operations agreed with customers, so they do not follow the robots protocol. If they cause problems, the only remedy is to contact Baidu.

To block only the Google search engine from crawling and indexing the site, edit robots.txt as follows:

User-agent: googlebot
Disallow: /

Second, the web page meta tag method

Add <meta name="robots" content="noarchive"> between the <head> and </head> tags of a page. This tag tells all search engines not to display a cached snapshot of that page. It does not stop the page from being crawled or indexed.

Add <meta name="Baiduspider" content="noarchive"> between the <head> and </head> tags of a page to prevent the Baidu search engine from displaying a snapshot of that page.

Add <meta name="googlebot" content="noarchive"> between the <head> and </head> tags of a page to prevent the Google search engine from displaying a snapshot of that page.

Some special cases come up often:

1. My website has a robots.txt file. Why can it still be found in Baidu search?

Updating a search engine's index database takes time. Even after Baiduspider has stopped visiting the pages on your website, it may take several months for the index information already stored in Baidu's database to be cleared. Also check that your robots.txt is configured correctly. If removal is urgent, you can submit a request through the complaint platform.

2. I want my website content to be indexed by Baidu but not saved as snapshots. What should I do?

Baiduspider honors the meta robots protocol. You can use a page's meta settings to let Baidu index the page without displaying a snapshot of it in search results. As with robots.txt changes, updating the search engine index database takes time. Even after you have used a meta tag to stop Baidu from displaying a snapshot, if index information for the page already exists in Baidu's database, it may take two to four weeks for the change to take effect.

3. To be indexed by Baidu without having page snapshots saved, use the following code:

<meta name="Baiduspider" content="noarchive">

4. To prevent all search engines from saving snapshots of your web pages, use the following code:

<meta name="robots" content="noarchive">
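To verify that a page actually carries the intended tag, its HTML can be scanned for robots meta tags. The following is a rough sketch using Python's standard html.parser; the class RobotsMetaParser, the function snapshot_blocked, and the sample page are all hypothetical names made up for this example, and the set of crawler names checked is limited to the three used in this article.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="..." content="..."> robots tags."""

    # Only the meta names discussed in this article are recognized.
    CRAWLER_NAMES = {"robots", "baiduspider", "googlebot"}

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name (lowercase) -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in self.CRAWLER_NAMES:
            values = {d.strip().lower() for d in attr.get("content", "").split(",") if d.strip()}
            self.directives.setdefault(name, set()).update(values)


def snapshot_blocked(html: str, crawler: str) -> bool:
    """Return True if noarchive applies to crawler, via its own tag or the generic one."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives.get("robots", set()) | parser.directives.get(crawler.lower(), set())
    return "noarchive" in found


if __name__ == "__main__":
    page = '<html><head><meta name="Baiduspider" content="noarchive"></head><body></body></html>'
    print(snapshot_blocked(page, "Baiduspider"))  # True
    print(snapshot_blocked(page, "googlebot"))    # False
```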

Here are some commonly used combinations:

  • <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">: This page can be indexed, and the links on it can be followed
  • <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">: Do not index this page, but the links on it can be followed
  • <META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">: This page can be indexed, but the links on it should not be followed
  • <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">: Do not index this page, and do not follow the links on it
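The four combinations above reduce to two independent switches, one for indexing and one for following links. A small sketch of that reading follows; the function name parse_robots_content is hypothetical, and it handles only the four directives listed here, treating a missing directive as permission, which is the usual default.

```python
def parse_robots_content(content: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for a robots meta content value.

    Matching is case-insensitive; both flags default to True when the
    corresponding "no..." directive is absent.
    """
    tokens = {token.strip().lower() for token in content.split(",")}
    may_index = "noindex" not in tokens
    may_follow = "nofollow" not in tokens
    return may_index, may_follow


if __name__ == "__main__":
    for value in ("INDEX,FOLLOW", "NOINDEX,FOLLOW", "INDEX,NOFOLLOW", "NOINDEX,NOFOLLOW"):
        print(value, parse_robots_content(value))
```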
