12 Apr 2019




What Is Robots.txt and How To Use It?


Frequent visits from search engine crawlers are a good sign for the success of your website and its content. However, the way search engines index your site may not be optimal. Robots meta tags exist to tell search engine robots where not to go on your site, but some crawlers cannot read meta tags at all, so those instructions can go unnoticed. That is why you need a robots.txt file to get the message across.

What Is Robots.txt?

The importance of the robots.txt file can't be ignored. Robots.txt is a simple text file that you place in the root directory of your website to guide search engines on how to crawl and index it. Search engines are not obliged to respect robots.txt, but the majority of them comply with it. Because crawlers are free to disobey its directives, you should not treat robots.txt as a firewall or rely on it to protect sensitive information.

Location of Robots.txt

It is essential to place robots.txt in the root directory so that crawlers can discover it. By default, they look in the root directory for a robots.txt file; if they don't find one there, they will index your entire website. If you put the file anywhere else, it may not work at all.

Structure of a Robots.txt File

A robots.txt file's structure is not complicated and can be very adaptable. It consists of a list of user agents and the files or directories they are disallowed from crawling. It basically looks like what you see below:

User-agent: *

Disallow: /cgi-bin/

Disallow: /tmp/

Disallow: /~different/

“User-agent” names a search engine crawler (or any other crawler), while “Disallow” lists the files and directories that should not be indexed. You can also add comment lines (starting with #) to document your rules. In the example above, three directories are blocked from being indexed. Note that you can't put two directories on one line; each must be on a separate line. The asterisk (*) after “User-agent” matches any crawler.

Furthermore, be careful not to type the wrong commands. Spell the directory names correctly and don't omit the colons. When your robots.txt file gets complex, you can check it with a validation tool, such as the one at http://tool.motoricerca.info/robots-checker.phtml.
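Besides online validators, you can test your rules programmatically. The sketch below uses Python's standard urllib.robotparser module to check the example rules from above against specific paths (the crawler name and paths are illustrative):

```python
from urllib import robotparser

# The example robots.txt rules from above, supplied as a list of lines.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse the rules directly, without fetching over HTTP

# can_fetch(useragent, path) reports whether a crawler may visit a path.
print(parser.can_fetch("AnyBot", "/tmp/cache.html"))  # False: matches Disallow: /tmp/
print(parser.can_fetch("AnyBot", "/blog/post.html"))  # True: no rule matches
```

This is handy for regression-testing a complex robots.txt before deploying it, since a single typo can silently block a whole section of your site.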

Here are some examples of its usage:

To block the entire site from being indexed by search engines:

User-agent: *

Disallow: /

To allow web crawlers to index your entire website (an empty Disallow matches nothing, so everything is allowed):

User-agent: *

Disallow:

To block several directories from being indexed (one Disallow line per directory):

User-agent: *

Disallow: /cgi-bin/

Disallow: /tmp/

To block a particular crawler from indexing your site:

User-agent: Bot1

Disallow: /

Robots.txt and SEO

Exclusion of Images

The robots.txt generated by some CMSs (e.g. WordPress) excludes your image folders by default. This is common with older CMS versions but not with newer ones. If you are using an older version, you should double-check to make sure your image folders are not excluded. Such an exclusion prevents your images from being indexed, which keeps them out of Google Image Search and can go a long way in negatively affecting your website's SEO ranking. To fix this, open your robots.txt file and get rid of the command below:

Disallow: /images/ 
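If you would rather keep a broader block in place but still expose your images, most major crawlers (including Googlebot) also understand the Allow directive, which re-opens a subdirectory inside a disallowed one. A sketch, assuming a typical WordPress layout where uploads live under /wp-content/uploads/:

```
User-agent: *
Disallow: /wp-content/
Allow: /wp-content/uploads/
```

Note that Allow is not part of the original robots.txt convention, so older or less common crawlers may ignore it.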

Additional Tips

It is important to avoid blocking your CSS, JavaScript, and other valuable resource files. Blocking them prevents Googlebot from rendering the page properly and from recognizing that your website is optimized for mobile devices.
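If an existing rule blocks a directory that also contains assets, you can explicitly re-allow the file types Googlebot needs to render pages. Google's crawler understands the * and $ wildcard patterns used below; the blocked directory name is illustrative:

```
User-agent: Googlebot
Disallow: /private/
Allow: /*.css$
Allow: /*.js$
```

The $ anchors the pattern to the end of the URL, so only files actually ending in .css or .js are re-allowed.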

Furthermore, you should note that adding disallow commands to a robots.txt file does not remove content from search results; it only blocks spiders from fetching it, and a disallowed page can still appear in the index if other sites link to it. To keep content out of the index entirely, use the meta noindex tag instead, and keep the page crawlable so that the tag can be seen.
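The noindex directive goes in the head of the page itself rather than in robots.txt. A minimal sketch:

```html
<head>
  <meta name="robots" content="noindex">
</head>
```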

You should not attempt to use the robots.txt file for handling duplicate content. It is better to use the rel="canonical" tag, which goes in your webpage's HTML head. This ensures that your site is not penalized for duplicate content.
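A minimal sketch of the canonical tag, assuming https://example.com/original-page/ is the preferred version of the page:

```html
<head>
  <link rel="canonical" href="https://example.com/original-page/">
</head>
```

Every duplicate version of the page should point its canonical tag at this one preferred URL.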

IntellyHost is a full-fledged web hosting service provider offering the best hosting plans for shared hosting, VPS hosting, enterprise hosting and dedicated hosting with 99.99% uptime.
