How to prevent website content from being included in search engines

Usually the goal of building a website is to have it indexed by search engines so it can attract more visitors. But what should you do if a site contains private or confidential, non-public pages that must be kept out of search engines? Taobao, for example, is a well-known site that blocks search engines from indexing it. This article covers several ways to block or prohibit search engines from indexing and crawling website content.

Search engine spiders crawl the Internet constantly. If a website takes no steps to keep them out, it will sooner or later be indexed. Here is how to prevent search engines from indexing website content.

First, the robots.txt method

Mainstream search engines comply with the robots.txt protocol by default (some rogue crawlers do not). Create a plain-text file named robots.txt, place it in the root directory of the website, and put the following rules in it:

User-agent: *
Disallow: /

The rules above tell search engines not to crawl or index this website. Be careful when using them: they block every search engine from accessing any part of the site.
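If you want to confirm that such a file really locks a crawler out, you can test the rules locally. Below is a minimal sketch using Python's standard urllib.robotparser module; the example.com URL is only a placeholder.

from urllib.robotparser import RobotFileParser

# The block-all rules from above, fed to the parser as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Every well-behaved crawler should now be refused, whatever its name.
print(rp.can_fetch("Baiduspider", "https://example.com/index.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))    # False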

If you only want to prohibit the Baidu search engine from indexing and crawling web pages:

Edit the robots.txt file as follows:

User-agent: Baiduspider
Disallow: /

The robots file above blocks all crawling by Baidu while leaving other search engines unaffected.

A note on Baidu's user-agents is in order here: what exactly is Baiduspider's user-agent?

Baidu uses a different user-agent for each of its products:

  • Web search (and most others): Baiduspider
  • Wireless search: Baiduspider
  • Image search: Baiduspider-image
  • Video search: Baiduspider-video
  • News search: Baiduspider-news
  • Baidu Collection: Baiduspider-favo
  • Baidu Alliance: Baiduspider-cpro
  • Business search: Baiduspider-ads

You can set different crawling rules for each product's user-agent. The following robots.txt blocks all crawling by Baidu but allows the image spider to crawl the /image/ directory and nothing else (the extra Disallow line keeps Baiduspider-image out of every other path):

User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/
Disallow: /
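
As a rough check of per-agent rules like these, here is another urllib.robotparser sketch. One caveat: urllib.robotparser applies the first group whose user-agent token appears in the crawler's name, so the more specific Baiduspider-image group is listed first in this test; real crawlers are expected to choose the most specific matching group regardless of order.

from urllib.robotparser import RobotFileParser

rules = """User-agent: Baiduspider-image
Allow: /image/
Disallow: /

User-agent: Baiduspider
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

site = "https://example.com"  # placeholder domain
print(rp.can_fetch("Baiduspider", site + "/page.html"))          # False: blocked everywhere
print(rp.can_fetch("Baiduspider-image", site + "/image/a.jpg"))  # True: /image/ is allowed
print(rp.can_fetch("Baiduspider-image", site + "/page.html"))    # False: everything else is blocked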

Please note: pages crawled by Baiduspider-cpro and Baiduspider-ads are not indexed; those spiders only carry out operations agreed upon with customers, so they do not obey the robots protocol. Blocking them can only be arranged by contacting Baidu.

To prohibit only the Google search engine from indexing and crawling web pages, the method is similar:

Edit the robots.txt file as follows:

User-agent: Googlebot
Disallow: /
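
The same pattern extends to blocking several engines at once while leaving everyone else alone. A sketch (the group for * is optional, since crawlers without a group of their own are allowed by default; an empty Disallow means "allow everything"):

User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: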

Second, the web page code method

Add the code <meta name="robots" content="noarchive"> between the <head> and </head> of a page to tell all search engines not to display a cached snapshot of that page. Note that noarchive only suppresses the snapshot; by itself it does not stop crawling or indexing (use noindex for that, as shown in the combinations below).

Add <meta name="Baiduspider" content="noarchive"> between <head> and </head> to prevent only the Baidu search engine from displaying a snapshot of the page.

Add <meta name="googlebot" content="noarchive"> between <head> and </head> to prevent only the Google search engine from displaying a snapshot of the page.
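To verify that such a tag is actually present in the HTML your server sends out, you can fetch a page and list its robots-related meta tags. Here is a minimal sketch using only Python's standard library; the URL is a placeholder for a page on your own site.

from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaFinder(HTMLParser):
    # Collects <meta> tags whose name attribute targets crawlers.
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name = (d.get("name") or "").lower()
        # "robots" applies to all engines; "baiduspider"/"googlebot" target one.
        if name in ("robots", "baiduspider", "googlebot"):
            self.found.append((name, d.get("content", "")))

html = urlopen("https://example.com/").read().decode("utf-8", "replace")
finder = RobotsMetaFinder()
finder.feed(html)
for name, content in finder.found:
    print('<meta name="%s" content="%s">' % (name, content))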

In addition, some special situations come up, such as the following:

1. The website has added robots.txt; why can it still be found in Baidu search?

Because it takes time to update a search engine's index database. Although Baiduspider has stopped accessing the pages on your website, it may take several months for page index entries already established in Baidu's database to be cleared. Also check whether your robots.txt configuration is correct. If your need to be de-indexed is urgent, you can also submit a removal request through Baidu's complaint platform.

2. I want my website content to be indexed by Baidu, but without saved snapshots. What should I do?

Baiduspider complies with the Internet meta robots protocol, so you can use a page's meta settings to have Baidu index the page without displaying a snapshot of it in the search results. As with robots.txt changes, updating the index database takes time: even after you have prohibited snapshots through the page's meta tag, if the page's index entry already exists in Baidu's database, it may take two to four weeks for the change to take effect online.

3. If you want pages to be indexed by Baidu but without saved snapshots, the following code solves the problem:

<meta name="Baiduspider" content="noarchive">

4. If you want to prohibit all search engines from saving snapshots of your web pages, the code is as follows:

<meta name="robots" content="noarchive">

Here are some commonly used code combinations:

  • <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">: index this page and follow the links on it
  • <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">: do not index this page, but follow the links on it
  • <META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">: index this page, but do not follow the links on it
  • <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">: do not index this page and do not follow the links on it
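
For placement, here is a sketch of a hypothetical private page that should stay out of search results while its links may still be followed:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<!-- keep this page out of the index, but let spiders follow its links -->
<meta name="robots" content="noindex,follow">
<title>Private page</title>
</head>
<body>
...
</body>
</html>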

Summary

This is the full content of this article. I hope it provides some reference value for your study or work. Thank you for your support of 123WORDPRESS.COM.
