Spiders and Robots

The search engines have automated programmes ("spiders" or "robots") that crawl the web and record details of the websites they find. You want them to visit your site and take details of all its pages to put in their indexes.

Spiders see only the text in your code, and that is what they index. To see your page as a spider sees it, try the Spider Simulator tool:

http://www.searchengineworld.com/cgi-bin/sim_spider.cgi
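If you want to experiment yourself, here is a rough sketch in Python of what such a simulator does: fetch a page and keep only the text a spider would index, stripping out tags, scripts and styles. This is an illustrative sketch, not the tool's actual code, and the URL is just an example - substitute your own.

# Sketch of a spider simulator: fetch a page and keep only the text
# content, discarding markup, scripts and styles.
from html.parser import HTMLParser
import urllib.request

class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

html = urllib.request.urlopen("http://www.example.com/").read().decode("utf-8", "replace")
parser = TextOnly()
parser.feed(html)
print(" ".join(parser.parts))  # the page as a spider "sees" it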

Here is a good list of spider names and the search engines they belong to:

http://www.searchenginedictionary.com/spider-names.shtml

Here is another big list of robots and user agents that you may see in your site logs - and you can find out who their parents are!

http://www.psychedelix.com/agents.html

See below for information on using the robots.txt file to control what is spidered.

Controlling what is spidered - robots.txt

There is a standard way of telling the spiders which parts of your site they can and cannot index. Each time a spider visits your domain, it looks first for a file called robots.txt in the root of the domain. If it cannot find one, it will usually just go ahead and spider whatever it wants to within your site. You do not have to have a robots.txt, but we recommend having one to avoid 404 errors and possible problems with some spiders.

If you use a custom 404 page to redirect people to your home page when they request a page that does not exist, it may cause problems for the spiders: they requested a robots.txt file, get an HTML page instead, and may just get confused and go away.
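One quick way to check whether this is happening is to request robots.txt directly and look at the status code and content type that come back. Here is a small sketch using Python's standard library (the domain is an example - use your own):

import urllib.request, urllib.error

url = "http://www.example.com/robots.txt"  # substitute your own domain
try:
    with urllib.request.urlopen(url) as resp:
        # A healthy robots.txt comes back as 200 with a text/plain type;
        # a 200 HTML page here is the problem described above
        print(resp.status, resp.headers.get("Content-Type"))
except urllib.error.HTTPError as err:
    # A plain 404 is harmless if you have chosen not to have the file
    print(err.code, err.headers.get("Content-Type"))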

If you can see the activity logs for your site, look for visits that requested the robots.txt file - those visitors are spiders, and you can follow what they went on to crawl.
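For example, if your host gives you a combined-format Apache access log, a few lines of Python will list which spiders asked for the file. The log file name and format here are assumptions - adjust them for your own server:

# Sketch: print the IP address and user agent of every visitor
# that requested robots.txt from a combined-format access log.
with open("access.log") as log:
    for line in log:
        if '"GET /robots.txt' in line:
            ip = line.split()[0]                      # first field: client IP
            agent = line.rstrip().rsplit('"', 2)[1]   # last quoted field: user agent
            print(ip, agent)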

To create a robots.txt file, use a simple text editor like Notepad - not something like Word or Dreamweaver, which will add extra formatting to it.

The format for a robots.txt file is to specify the user agents (robot names, such as Googlebot for Google) and the areas that are disallowed for them. A simple one allowing all agents to crawl the whole site looks like this:

User-agent: *
Disallow:

That is: for all agents, disallow nothing (i.e. allow everything).

If you do not want certain directories crawled, you disallow them. This allows all robots to crawl all files except the images directory:

User-agent: *
Disallow: /images/

If you want to exclude a single spider from a certain directory, name it specifically. This denies Googlebot-Image access to your images directory:

User-agent: Googlebot-Image
Disallow: /images/
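Incidentally, Python's standard library includes a parser for this format (urllib.robotparser), which gives you a quick, self-contained way to test rules like the ones above - a sketch:

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot-Image",
    "Disallow: /images/",
]
rp = RobotFileParser()
rp.parse(rules)
# Googlebot-Image is shut out of /images/; other robots are unaffected
print(rp.can_fetch("Googlebot-Image", "/images/logo.gif"))  # False
print(rp.can_fetch("SomeOtherBot", "/images/logo.gif"))     # True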

To see what other sites have done with their robots.txt files, just type in the full URL for the file; for example, ours is here:

http://www.anythingleft-handed.co.uk/robots.txt

We have disallowed all robots from crawling our cgi-bin, as our search programmes and other functionality just confuse them - and if they are confused they tend to go away. We have also specifically allowed the main engine robots to see everything. This is technically redundant code, but it might just make them feel welcome!
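In outline, a file following that pattern looks something like this (a simplified sketch, not our exact file - the link above shows the real one, and Googlebot here stands in for the various engine robots):

# Keep all robots out of the scripts directory
User-agent: *
Disallow: /cgi-bin/

# The redundant welcome: the same rule, with a main engine named explicitly
User-agent: Googlebot
Disallow: /cgi-bin/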

VALIDATING YOUR FILE

Once you have created and uploaded your robots.txt file, you can use a validation program to check it and make sure it is OK:

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

FIND OUT MORE

To find out more about robots.txt, try these links:

http://www.robotstxt.org/wc/norobots.html

http://javascriptkit.com/howto/robots.shtml - a useful tutorial and links

© Copyright 2000-09 Anything Left-Handed