Robots.txt File

Joran Hofman
March 6, 2021

What Is A Robots.txt File?

A robots.txt file is a plain text file that webmasters use to tell web robots (such as search engine crawlers) which pages of a website they may or may not crawl. Its main purpose is to keep crawler requests from overloading the site. The file follows the Robots Exclusion Standard.

Why Is A robots.txt File Important?

A robots.txt file regulates bots' access to the various areas of a site. Accidentally blocking bots from crawling an entire site can be very damaging, but there are certain cases where a robots.txt file helps:

  • Prevent duplicate content from showing on SERPs, although meta robots tags are generally a better option for these cases.
  • Keep entire sections of a website private.
  • Prevent internal search results pages from appearing in public SERPs.
  • Indicate the location of the site's sitemaps.
  • Prevent search engines from indexing certain files or file types on a website.
  • Set crawl delays so that servers are not overloaded when crawlers load multiple pieces of content at the same time.

You might not need a robots.txt file at all if there are no areas of your site where you want to control crawler access.

How Does A Robots.txt File Work?

Search engines have two main functions:

1. Search and crawl the web to find content;

2. Create indexes for that content to be able to serve it to information seekers.

Search engines crawl by following links from one page or site to another, eventually traversing billions of links, pages, and websites. This crawling behavior is sometimes known as "spidering."

After arriving at a website and before crawling it, the bot looks for a robots.txt file. If it finds one, the bot reads it before proceeding through the site. The robots.txt file tells the search engine how it should crawl the site, and the information in it governs the bot's further operations on that specific site. If the robots.txt file contains no directive that disallows the user agent's activity, or if the website has no robots.txt file at all, the bot goes on to crawl the rest of the website.
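The decision described above can be sketched in Python with the standard library's urllib.robotparser module. This is only an illustration, not how any particular search engine implements it: the helper name may_crawl, the bot name ExampleBot, and the example.com URLs are all assumptions made for the example. The rules are passed in as text so the sketch runs without network access.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Return True if user_agent may crawl url under the given robots.txt.

    robots_txt is the text of the file, or None when the site has no
    robots.txt at all (hypothetical convention for this sketch).
    """
    if robots_txt is None:
        # No robots.txt file: nothing restricts the bot.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # When no rule disallows the URL, can_fetch() returns True.
    return parser.can_fetch(user_agent, url)


rules = """\
User-agent: *
Disallow: /private/
"""

print(may_crawl(None, "ExampleBot", "https://example.com/private/a.html"))   # True
print(may_crawl(rules, "ExampleBot", "https://example.com/blog/post.html"))  # True
print(may_crawl(rules, "ExampleBot", "https://example.com/private/a.html"))  # False
```

A real crawler would first download the file from the site's root (for example with RobotFileParser.set_url() followed by read()) instead of receiving the text as an argument.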

How To Create A Robots.txt File?

Robots.txt technical syntax

Robots.txt syntax can be thought of as the "language" of these files. A new robots.txt file can be created with any plain text editor. Five terms commonly appear in a robots.txt file:

  • User-agent: the specific web crawler that you are giving crawling instructions to (usually a search engine's bot).
  • Disallow: the command that tells a user-agent not to crawl a specific URL. Each URL needs its own "Disallow:" line; a single line cannot list several URLs.
  • Allow: the directive that tells a bot it may access a page or sub-folder even though its parent page or sub-folder is disallowed (supported by Googlebot and by other major crawlers such as Bingbot).
  • Crawl-delay: the number of seconds a bot should wait before loading and crawling page content. (Googlebot does not recognize this directive, but the crawl frequency can be configured from the Google Search Console.)
  • Sitemap: specifies the location of any XML sitemap associated with the site (this directive is supported by Google, Bing, Yahoo, and Ask).
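To see the five terms working together, here is a small sketch that parses a sample robots.txt with Python's standard urllib.robotparser module. The file contents, the bot name ExampleBot, and the example.com URLs are made up for illustration. Note one assumption baked into the sample: Python's parser applies the first rule that matches, so the Allow line is placed before the broader Disallow line; other crawlers may resolve conflicts differently. The site_maps() method requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using all five terms.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
Disallow: /search
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://example.com"
print(parser.can_fetch("ExampleBot", base + "/private/press-kit.html"))  # True
print(parser.can_fetch("ExampleBot", base + "/private/notes.html"))      # False
print(parser.can_fetch("ExampleBot", base + "/search/shoes"))            # False
print(parser.can_fetch("ExampleBot", base + "/blog/"))                   # True
print(parser.crawl_delay("ExampleBot"))                                  # 10
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Each disallowed path gets its own Disallow line, and the single Allow line carves one page out of the blocked /private/ folder.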
