Control Search Engine Crawling

Most websites want to be visible to the public, and search engines help them achieve this. To make a website searchable, a search engine must first index it. Search engine crawlers, also known as spiders or bots, crawl a website page by page from time to time to keep their index of its content up to date. Loading that many pages in a short period can cause high resource usage on the server, so website owners sometimes want to limit or prevent this crawling.

Controlling search engine crawlers with a robots.txt file

Website owners can instruct search engines on how they should crawl a website by using a robots.txt file.

When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.

It’s important to know that robots.txt rules are only a guideline and bots are not required to follow them. For instance, Google does not obey the Crawl-delay directive; Google’s crawl rate must be set in the Google Webmaster Tools instead.

For bad bots that abuse your site, you should instead look at how to block bad users by User-agent in .htaccess.

Rules:

  • The robots.txt file needs to be at the root of your site: http://domain.com/robots.txt
  • User-agent: specifies which User-agent the rules that follow apply to; * is a wildcard that matches any User-agent.
  • Disallow: sets the files or folders that are not allowed to be crawled.
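
To see how these rules are applied in practice, here is a minimal sketch of a polite crawler checking robots.txt before fetching a page. It uses Python's standard urllib.robotparser module; the bot name MyCrawler and the URLs are placeholders.

import urllib.robotparser

# Fetch and parse the site's robots.txt file (placeholder domain)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://domain.com/robots.txt")
rp.read()

# can_fetch() applies the User-agent and Disallow rules for us
if rp.can_fetch("MyCrawler", "http://domain.com/private/page.html"):
    print("Allowed to crawl this URL")
else:
    print("Disallowed by robots.txt")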

Search engine crawlers use a User-agent string to identify themselves when crawling. Here are some common examples:

Top 3 US search engine User-agents:

Googlebot
Yahoo! Slurp
bingbot

Commonly blocked crawler User-agents:

AhrefsBot
Baiduspider
Ezooms
MJ12bot
YandexBot

Here are some of the most common uses of the robots.txt file:

Set a crawl delay for all search engines:

A Crawl-delay of 30 seconds would still allow a crawler to index an entire 1,000-page website in about 8.3 hours (1,000 pages × 30 seconds), while greatly reducing the load on the server.

User-agent: *
Crawl-delay: 30
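
As a rough sketch, a well-behaved crawler written in Python could honour this delay by pausing between requests. crawl_delay() is available in urllib.robotparser from Python 3.6; the bot name and page URLs below are placeholders.

import time
import urllib.robotparser

# Parse the ruleset above directly, without a network request
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 30"])

delay = rp.crawl_delay("MyCrawler") or 0   # 30 seconds for this ruleset
for url in ["http://domain.com/page1.htm", "http://domain.com/page2.htm"]:
    # ... fetch and index the page here ...
    time.sleep(delay)  # wait before requesting the next page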

Allow all search engines to crawl the website:

User-agent: *
Disallow:

Disallow all search engines from crawling the website:

User-agent: *
Disallow: /

Disallow one particular search engine from crawling the website:

User-agent: Baiduspider
Disallow: /

Disallow all search engines from particular folders:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/

Disallow all search engines from particular files:

User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
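
Disallow paths are matched as prefixes of the requested path. As an illustrative sketch, Python's urllib.robotparser shows the effect; the rules below combine the folder and file examples above, and the bot name is a placeholder.

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /private/
Disallow: /contactus.htm
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyCrawler", "/private/report.html"))  # False: under /private/
print(rp.can_fetch("MyCrawler", "/contactus.html"))        # False: "/contactus.htm" is a prefix of this path
print(rp.can_fetch("MyCrawler", "/about.htm"))             # True: no Disallow rule matches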

Disallow all search engines but one:

If we only wanted to allow Googlebot access to our /private/ directory and disallow all other bots, we could use:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:

When Googlebot reads our robots.txt file, it will see that its own group does not disallow it from crawling any directories, while every other bot falls back to the * group.
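
As a quick check of that behaviour, the same two groups can be fed straight to Python's urllib.robotparser with no network request needed; AhrefsBot stands in here for "any other bot".

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/private/page.html"))  # True: its own group has no Disallow
print(rp.can_fetch("AhrefsBot", "/private/page.html"))  # False: falls back to the * group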

Author: Albert
