Controlling what is spidered - robots.txt
There is a standard way of telling spiders which parts of your site they may and may not crawl. Each time a spider visits your domain, it first looks for a file called robots.txt in the root directory of the domain. If it cannot find one, it will usually just go ahead and spider whatever it wants within your site. You do not have to have a robots.txt file, but we recommend having one to avoid 404 errors in your logs and possible problems with some spiders.
If you use a custom 404 page that sends people to your home page when they request a page that does not exist, it may cause problems for spiders: they receive an HTML page when they asked for a robots.txt file, and may just get confused and go away.
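As a rough illustration of the problem, the check below decides whether a response to a request for /robots.txt plausibly is a real robots.txt file. It is only a heuristic sketch: the function name and the three tests it applies are our own assumptions, not part of any standard, and you would supply the status code, content type and body from whatever tool you use to fetch the file.

```python
def looks_like_robots_txt(status, content_type, body):
    """Heuristic: True if a response plausibly is a real robots.txt file."""
    # Anything other than a plain "200 OK" means no file was served.
    if status != 200:
        return False
    # A robots.txt file is plain text; a custom error page is usually HTML.
    if not content_type.lower().startswith("text/plain"):
        return False
    # An HTML page served in place of the file is a warning sign.
    return "<html" not in body.lower()


print(looks_like_robots_txt(200, "text/plain", "User-agent: *\nDisallow:\n"))
print(looks_like_robots_txt(200, "text/html", "<html><body>Welcome</body></html>"))
```

The first call describes a genuine file, the second describes a home page served in its place.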
If you can see the activity logs for your site, look for visits that requested the robots.txt file. Those visitors are almost certainly spiders, and you can then see which pages they went on to crawl.
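If your logs are plain text, a short script can pick out these visits for you. The sketch below assumes the Apache/NCSA "combined" log format; the sample lines, the addresses and the ExampleBot name are made up for illustration, and the pattern would need adjusting for any other log format.

```python
import re

# Assumes the Apache/NCSA "combined" log format; adjust for other formats.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_visitors(lines):
    """Return {host: user_agent} for every client that requested /robots.txt."""
    visitors = {}
    for line in lines:
        match = LOG_LINE.match(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            visitors[match.group("host")] = match.group("agent")
    return visitors


# Hypothetical log lines, for illustration only.
sample_log = [
    '192.0.2.10 - - [10/Oct/2005:13:55:36 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [10/Oct/2005:13:55:40 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2005:14:01:02 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/4.0"',
]

print(robots_txt_visitors(sample_log))
```

Once you have the list of hosts, you can search the same log for their other requests to see what they crawled.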
To create a robots.txt file, use a simple text editor like Notepad, not something like Word or Dreamweaver that will add extra formatting to it.
The format for a robots.txt file is to specify the user agents (robot names, such as Googlebot for Google) and the areas that are disallowed for them. A simple one that allows all agents to crawl the whole site looks like this:
User-agent: *
Disallow:
That is, for all agents, disallow nothing (i.e. allow everything).
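If you would like to confirm how a well-behaved robot reads this file, Python's standard urllib.robotparser module can interpret the rules without going near your live site. This is only a sketch: ExampleBot and www.example.com are placeholder names.

```python
from urllib.robotparser import RobotFileParser

# The "allow everything" file, exactly as it would appear in robots.txt.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# Placeholder robot name and URLs, for illustration only.
print(parser.can_fetch("ExampleBot", "http://www.example.com/"))
print(parser.can_fetch("ExampleBot", "http://www.example.com/images/logo.gif"))
```

Both checks report that the robot may fetch the page, because an empty Disallow line blocks nothing.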
If you do not want certain directories crawled, you disallow them like this.
The following allows all robots to crawl all files except those in the images directory:
User-agent: *
Disallow: /images/
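Under the original standard, a Disallow value is matched as a simple prefix of the requested path. The sketch below shows that idea in a few lines; is_allowed is a hypothetical helper written for this example, not how any particular spider is implemented.

```python
def is_allowed(path, disallowed_prefixes):
    """True unless path starts with one of the disallowed prefixes."""
    for prefix in disallowed_prefixes:
        # An empty Disallow value blocks nothing, so skip it.
        if prefix and path.startswith(prefix):
            return False
    return True


rules = ["/images/"]

print(is_allowed("/images/logo.gif", rules))  # inside the directory: blocked
print(is_allowed("/index.html", rules))       # outside it: allowed
print(is_allowed("/images", rules))           # no trailing slash: not matched
```

The last case shows why the trailing slash matters: the rule /images/ covers everything inside the directory, but not a path that is just /images.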
If you want to exclude a single spider from a certain directory, do it like this.
The following specifically denies Googlebot-Image access to your images directory:
User-agent: Googlebot-Image
Disallow: /images/
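To check that a rule like this affects only the named spider, you can again run it through Python's standard urllib.robotparser module. ExampleBot and www.example.com are placeholders, and a real spider may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The per-spider rule, exactly as it would appear in robots.txt.
RULES = """\
User-agent: Googlebot-Image
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

image_url = "http://www.example.com/images/logo.gif"
home_url = "http://www.example.com/index.html"

print(parser.can_fetch("Googlebot-Image", image_url))  # named spider, blocked directory
print(parser.can_fetch("Googlebot-Image", home_url))   # named spider, other pages
print(parser.can_fetch("ExampleBot", image_url))       # any other spider
```

Only the first check is refused: the named spider is kept out of the images directory, while other pages and other spiders are unaffected.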
To see what other sites have done with their robots.txt files, just type in the whole URL for the file, e.g. our own one is here:
http://www.anythingleft-handed.co.uk/robots.txt
We have disallowed all robots from crawling our cgi-bin as our search programmes and other functionality just confuse them, and if they are confused they tend to go away. We have also specifically allowed the main engine robots to see everything - this is technically redundant code but it might just make them feel welcome!
VALIDATING YOUR FILE
Once you have created and uploaded your robots.txt file, you can use a validation program to check it and make sure it is OK:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
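For a quick check before you upload, a few lines of script can catch the most common slips. The sketch below is not a substitute for a full validator: lint_robots_txt is a hypothetical helper that knows only the two fields described in this article, so extensions that some engines support would be flagged unless you add them to KNOWN_FIELDS.

```python
# Only the fields covered in this article; extend the set if you use others.
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, separator, value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unrecognised field '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
            if not value.strip():
                problems.append((number, "empty User-agent value"))
        elif not seen_user_agent:
            problems.append((number, "rule appears before any User-agent line"))
    return problems


print(lint_robots_txt("User-agent: *\nDisallow: /images/\n"))
print(lint_robots_txt("Disallow: /images/\nUser-agent *\n"))
```

An empty list means nothing was flagged. The second call reports a rule that has no User-agent line above it and a line with its colon missing.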
FIND OUT MORE
To find out more about robots.txt try these links:
http://www.robotstxt.org/wc/norobots.html
http://javascriptkit.com/howto/robots.shtml - a useful tutorial and links