What is Robots.txt?

Robots.txt is a file that tells robots and crawlers which parts of your site they should not crawl. It implements the Robots Exclusion Protocol, a method that lets Web site administrators indicate to visiting robots which parts of the site they should not visit. When a robot visits a website, it first checks for a robots.txt file in the root directory.

If the robot finds this file, it reads the contents to determine which documents (files) it may retrieve. You can customize the robots.txt file so that rules apply only to specific robots, and you can deny access to certain folders or files. Here is a sample robots.txt that prevents all robots from visiting the entire site:

# Tells Scanning Robots Where They Are and Are Not Welcome
# User-agent:    can also specify by name; "*" is for everyone
# Disallow:    if this matches first part of requested path,
#            forget it

User-agent: *    # applies to all robots
Disallow: /      # disallow crawling of all pages
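
To see how a well-behaved robot might interpret rules like these, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name can_visit, the robot name "ExampleBot", and the example.com URLs are assumptions made up for illustration; they are not part of any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# The same two rules as the sample, without the comments.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# A hypothetical variant that only blocks one folder.
BLOCK_ONE_FOLDER = """\
User-agent: *
Disallow: /private/
"""


def can_visit(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file contents as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_visit(BLOCK_EVERYTHING, "ExampleBot", "http://www.example.com/index.html"))
    print(can_visit(BLOCK_ONE_FOLDER, "ExampleBot", "http://www.example.com/public/page.html"))
    print(can_visit(BLOCK_ONE_FOLDER, "ExampleBot", "http://www.example.com/private/page.html"))
```

With the first rule set every URL is refused; with the second, only URLs under /private/ are refused. Keep in mind that robots.txt is a request, and only robots that choose to honor it will stay away.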

A record begins with one or more User-agent lines, stating which robots the record applies to, followed by "Disallow" and "Allow" instructions for those robots. To assess whether access to a URL is allowed, a robot must try to match the paths in the Allow and Disallow lines against the URL, in the order they appear in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.
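
The first-match rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is_allowed and the (directive, path prefix) tuple layout are invented for this example, matching is a plain prefix test, and an empty value is treated as matching nothing. Some current crawlers resolve conflicting lines differently (for example by preferring the longest matching path), so this reflects the rule as described here and not every robot's behavior.

```python
from typing import List, Tuple

# One rule per Allow/Disallow line, kept in file order.
# Example: ("Disallow", "/private/")
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Apply the rules in order; the first matching prefix decides."""
    for directive, prefix in rules:
        if not prefix:
            # An empty value matches nothing, so skip it.
            continue
        if path.startswith(prefix):
            return directive.lower() == "allow"
    # No line matched: the URL is allowed by default.
    return True


# Hypothetical record: a public subfolder inside a blocked folder.
EXAMPLE_RULES: List[Rule] = [
    ("Allow", "/docs/public/"),
    ("Disallow", "/docs/"),
]

if __name__ == "__main__":
    for p in ("/docs/public/a.html", "/docs/secret.html", "/index.html"):
        print(p, is_allowed(EXAMPLE_RULES, p))
```

Because the first match wins, the order of the lines matters: if the Disallow line came first in this example, /docs/public/a.html would be refused as well.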

Spiders and Robots Exclusion

Web robots are programs that automatically traverse the Web's hypertext structure by retrieving a document and then recursively retrieving all the documents it references. This page explains how you can control what these robots do while visiting your site.
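
The traversal just described can be illustrated with a small sketch that works on an in-memory map of pages instead of the real Web. Everything here is hypothetical: the crawl function, the SITE dictionary, and the rule that blocks /private/ are assumptions chosen for the example, and a real robot would fetch pages over HTTP and read the site's robots.txt.

```python
from collections import deque
from typing import Callable, Dict, List


def crawl(start: str,
          links: Dict[str, List[str]],
          allowed: Callable[[str], bool]) -> List[str]:
    """Visit pages reachable from start, skipping any page not allowed."""
    visited: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if not allowed(page):
            # A polite robot neither retrieves the page nor follows its links.
            continue
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/": ["/about.html", "/private/notes.html"],
    "/about.html": ["/", "/contact.html"],
    "/private/notes.html": ["/private/more.html"],
    "/contact.html": [],
}

if __name__ == "__main__":
    print(crawl("/", SITE, lambda path: not path.startswith("/private/")))
```

In this sketch the excluded folder is never retrieved, so pages that are linked only from inside it are never discovered either.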