Normally, websites want to be exposed to the public, and search engines help them achieve this goal. Before a website can appear in search results, however, a search engine must index it. Search engine crawlers, also known as spiders or bots, crawl a website page by page from time to time to keep their index of its content up to date. Loading that many pages in a short period can cause high system resource usage, so website owners sometimes want to limit this crawling.
Controlling search engine crawlers with a robots.txt file
Website owners can instruct search engines on how they should crawl a website by using a robots.txt file.
When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.
For bad bots that abuse your site, see how to block bad users by User-agent in .htaccess instead, since bad bots typically ignore robots.txt.
- The robots.txt file needs to be at the root of your site: http://domain.com/robots.txt
- User-agent: specifies which crawler the rules that follow apply to; * is a wildcard matching any User-agent.
- Disallow: sets the files or folders that are not allowed to be crawled.
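To illustrate how a crawler interprets these two directives, here is a minimal sketch using Python's built-in urllib.robotparser module (the example.com URLs and the /secret/ path are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every crawler from the /secret/ folder
rules = [
    "User-agent: *",
    "Disallow: /secret/",
]

rp = RobotFileParser()
rp.parse(rules)

# The * wildcard applies to any crawler name
print(rp.can_fetch("Googlebot", "http://example.com/secret/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/public/page.html"))  # True
```

A well-behaved crawler performs exactly this check before requesting a page.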
Search engine crawlers use a User-agent to identify themselves when crawling. Here are some common examples:
Top 3 US search engine User-agents:
- Googlebot
- Yahoo! Slurp
- bingbot
Search engine User-agents that are commonly blocked:
- AhrefsBot
- Baiduspider
- Ezooms
- MJ12bot
- YandexBot
Here are some of the most common uses of the robots.txt file:
Set a crawl delay for all search engines:
A Crawl-delay: of 30 seconds limits crawlers to one page every 30 seconds, which still lets them index an entire 1,000-page website in about 8.3 hours while greatly reducing server load.
User-agent: *
Crawl-delay: 30
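The arithmetic behind that estimate can be checked with urllib.robotparser, which also exposes the delay a compliant crawler would read from these rules (page count and crawler name are illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 30"])

# The * group applies to any crawler, so Googlebot reads a 30-second delay
delay = rp.crawl_delay("Googlebot")
print(delay)  # 30

# One request every 30 seconds over 1,000 pages:
pages = 1000
hours = pages * delay / 3600
print(round(hours, 1))  # 8.3
```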
Allow all search engines to crawl the website:
User-agent: *
Disallow:
Disallow all search engines from crawling the website:
User-agent: *
Disallow: /
Disallow one particular search engine from crawling the website:
User-agent: Baiduspider
Disallow: /
Disallow all search engines from particular folders:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/
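Note that Disallow rules match by path prefix, so a rule for a folder blocks everything beneath it. A quick check with urllib.robotparser (example.com and the file names are hypothetical):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /private/",
    "Disallow: /tmp/",
])

# Any path under a disallowed folder is blocked...
print(rp.can_fetch("Googlebot", "http://example.com/private/notes.html"))  # False
# ...while paths outside those folders remain crawlable
print(rp.can_fetch("Googlebot", "http://example.com/about.html"))          # True
```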
Disallow all search engines from particular files:
User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
Disallow all search engines but one:
If we wanted to allow only Googlebot access to our /private/ directory and disallow all other bots, we could use:
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:
When Googlebot reads our robots.txt file, it matches its own User-agent group rather than the * group, and will see that it is not disallowed from crawling any directories.
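This two-group behavior can be verified with urllib.robotparser, which picks the most specific matching group just as real crawlers do (AhrefsBot stands in for "any other bot"):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "",
    "User-agent: Googlebot",
    "Disallow:",
])

# Googlebot matches its own group, whose empty Disallow: permits everything
print(rp.can_fetch("Googlebot", "http://example.com/private/page.html"))  # True
# Every other crawler falls back to the * group and is blocked
print(rp.can_fetch("AhrefsBot", "http://example.com/private/page.html"))  # False
```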