Robots Exclusion Standard

A robots.txt file, often mistakenly called a robot.txt file, is a plain text file that you can create in any text editor, such as Notepad. It tells search engine crawlers (robots) which parts of your website they may crawl, and it can be used to keep certain areas of your site out of the index or to give instructions to specific search engines.

The file should be placed in the root directory of your website, where your index.html or home page resides. Even if you do not need to exclude any area of your site from crawling, it is still worth having one, since all the top-ranked search engines now look for it.
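
Because the file always sits at the root of the site, its address can be worked out from any page URL. The following is a minimal sketch using only Python's standard library; the function name `robots_txt_url` and the `www.example.com` address are hypothetical, chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post.html?page=2"))
# https://www.example.com/robots.txt
```

A file placed in a subdirectory, such as /blog/robots.txt, would not be found this way, which is why the root directory matters.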

Some reasons you may need to exclude spiders from your site include:

1. There are some private directories or information that you do not want to be crawled.
2. You’re still fixing parts of the site and some areas may contain error pages.
3. You have optimized certain pages for specific search engines and want to exclude other search engine spiders from indexing them.
4. You want to prevent some search engine robots or email harvesting bots (Bad Bots) from crawling your pages altogether.

Syntax For File Creation

The basic instructions are placed in two lines of text.

User-agent: Spider Name
Disallow: File/Directory Name

Let’s look at some examples:

1. If you want to allow every spider to index everything on your site.

User-agent: *
Disallow:

The asterisk “*” stands for all search engine spiders, and leaving the Disallow line blank means that nothing is excluded.

2. If you want NO spider to index anything on your website.

User-agent: *
Disallow: /

This may be useful when you’re just starting to fix your entire site. Remember to change it back once the site is active.
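
One way to check rules like the two above before uploading them is Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the helper `can_crawl`, the bot name `ExampleBot`, and the `www.example.com` address are hypothetical, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Example 1: a blank Disallow line excludes nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# Example 2: a single slash excludes the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

page = "https://www.example.com/page.html"
print(can_crawl(ALLOW_ALL, "ExampleBot", page))  # True
print(can_crawl(BLOCK_ALL, "ExampleBot", page))  # False
```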

3. If you want to prevent all the bots from crawling a specific section of your site.

User-agent: *
Disallow: /specificsection/

The forward slashes at the beginning and end of the directory name mean that nothing inside that directory will be crawled. To disallow only a single page in that directory for all the search engines, give the full path to the page instead:

User-agent: *
Disallow: /specificsection/private1.html
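
The difference between blocking a directory and blocking a single page can be checked with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the helper `can_crawl`, the bot name `ExampleBot`, the host `www.example.com`, and the extra page names are hypothetical. It also shows why the trailing slash matters, since this parser matches a Disallow value as a path prefix.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, path: str) -> bool:
    """Report whether agent may fetch the given path on a hypothetical site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://www.example.com" + path)


WHOLE_DIRECTORY = "User-agent: *\nDisallow: /specificsection/\n"
SINGLE_PAGE = "User-agent: *\nDisallow: /specificsection/private1.html\n"
NO_TRAILING_SLASH = "User-agent: *\nDisallow: /specificsection\n"

# Directory rule: everything under /specificsection/ is blocked.
print(can_crawl(WHOLE_DIRECTORY, "ExampleBot", "/specificsection/other.html"))  # False
print(can_crawl(WHOLE_DIRECTORY, "ExampleBot", "/index.html"))                  # True

# Page rule: only private1.html is blocked, its neighbours are not.
print(can_crawl(SINGLE_PAGE, "ExampleBot", "/specificsection/private1.html"))   # False
print(can_crawl(SINGLE_PAGE, "ExampleBot", "/specificsection/other.html"))      # True

# Without the trailing slash, the prefix also catches similarly named files.
print(can_crawl(NO_TRAILING_SLASH, "ExampleBot", "/specificsection-old.html"))  # False
print(can_crawl(WHOLE_DIRECTORY, "ExampleBot", "/specificsection-old.html"))    # True
```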

4. Finally, you can prevent specific robots from crawling your site. Some examples are: Google - Googlebot, MSN - MSNBot, AltaVista - Scooter, Ask/Teoma - AskJeeves, Inktomi/HotBot - Inktomi Slurp.

Google even has a separate bot, called Googlebot-Image, for indexing the images on your website. If you are disallowing a specific bot, place its rules first in the file, before the rules that apply to all bots.

Most people prevent bots from indexing their cgi-bin files, private files, images and newly constructed pages. A common robots.txt file might look like this:

User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /temp/
Disallow: /newarticles/
Disallow: /images/
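
To see how the file above treats different robots, it can be fed to Python's standard `urllib.robotparser`. This is a sketch only: `ExampleBot`, the host `www.example.com`, and the sample paths are hypothetical, and each search engine's own crawler may resolve rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /temp/
Disallow: /newarticles/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check a path on a hypothetical host against the rules above."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


# Googlebot-Image has its own group and is shut out of the whole site.
print(allowed("Googlebot-Image", "/index.html"))   # False

# Every other bot falls back to the "*" group.
print(allowed("ExampleBot", "/index.html"))        # True
print(allowed("ExampleBot", "/cgi-bin/form.cgi"))  # False
print(allowed("ExampleBot", "/images/logo.png"))   # False
```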

Alternatives

HTML meta tags can also be used to keep robots from indexing certain pages. The HTML code

<meta name="robots" content="noindex, nofollow" />
can be placed in the HEAD section of an HTML document to exclude the page from the search engine index and to tell robots not to follow any links on the page for further indexing.
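
To confirm that a page really carries this tag, its HTML can be scanned with Python's standard `html.parser` module. The sketch below is illustrative: the class `RobotsMetaParser`, the function `robots_directives`, and the sample page are hypothetical names and data, not part of any search engine's tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated, e.g. "noindex, nofollow".
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


SAMPLE_PAGE = """<html><head>
<title>Private page</title>
<meta name="robots" content="noindex, nofollow" />
</head><body>Hello</body></html>"""

print(sorted(robots_directives(SAMPLE_PAGE)))  # ['nofollow', 'noindex']
```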

