Have you ever wondered why your server logs show 404 errors for a file named robots.txt when you've never linked to or created any such file? The answer is that well-trained web crawlers always look for a file named robots.txt, which ostensibly tells them what to index and what to leave alone.

If you've got 404 errors, that means your site is missing a robots.txt file and the bots are just winging it. Why not help them out and gain a little control over what gets indexed in the process?

A robots.txt file helps direct the bots to the content you want them to know about and prevents them from indexing pages (like your admin section, for instance) that you don't want them to crawl. When used in conjunction with a sitemap, it might even help improve your search engine ranking.
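
You can even advertise that sitemap right in robots.txt: the major search engines support a Sitemap directive that points crawlers at your sitemap file. Here's a minimal sketch, assuming your sitemap lives at the hypothetical URL http://www.mysite.com/sitemap.xml (the directive requires the full URL):

<pre>
Sitemap: http://www.mysite.com/sitemap.xml
</pre>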

== What it is ==

As the name implies, robots.txt is simply a flat text file with a few simple directives that tell all robots, or even just specific crawlers, which parts of your site to index and which to skip.

To get started, let's use a simple example. Imagine you have a site at http://www.mysite.com and you use WordPress, which you access at the URL http://www.mysite.com/wp-admin/. Now you don't want the robots to index your admin login page because it's private, so create a new file, robots.txt, at the root level of your site and add these lines:

<pre>
User-agent: *
Disallow: /wp-admin
</pre>

That tells all bots (the * is a wildcard that will match any user agent) to ignore the <code>wp-admin</code> directory and everything below it.

The basic format for all robots.txt rules is:

<pre>
User-agent: [Bot name]
Disallow: [Directory or File Name]
</pre>
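
The Disallow value can also be left empty, which tells the named bot that nothing is off-limits. So a minimal file that explicitly welcomes every crawler to the whole site looks like this:

<pre>
User-agent: *
Disallow:
</pre>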

So let's modify our wp-admin example so that only Googlebot is excluded (there's no good reason to do that in this case, but it works for the sake of example):

<pre>
User-agent: Googlebot
Disallow: /wp-admin
</pre>

Here's a more practical example that will prevent Googlebot-Image, Google's image crawler, from indexing your images folder:

<pre>
User-agent: Googlebot-Image
Disallow: /images
</pre>
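
Disallow isn't limited to directories, either; per the format above, it can name a single file. This sketch (the file name is hypothetical) hides one page while leaving the rest of the site crawlable:

<pre>
User-agent: *
Disallow: /secret-page.html
</pre>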

Let's say you really hate the Lycos web crawler. Well, just disallow it from your whole site:

<pre>
User-agent: T-Rex
Disallow: /
</pre>

Not so obviously, the Lycos user agent is "T-Rex," which raises the question: where do you find out the names of all the various crawlers and their user agent signatures?

The answer is to head over to the Robotstxt.org website and check out the [http://www.robotstxt.org/db.html list of bots in the wild]. You'll note that there are over 300 different bots listed there, most of which you've probably never heard of. Don't worry, neither have we.

In most cases you can get by with rules that just use the * wildcard, but should you ever need to target a specific bot, now you know how.

== More complex scenarios ==

So far we've just created very basic rules, but you can actually get quite complex. Let's say, for example, that we want all bots to ignore our WordPress admin, and then we want every bot except Google's image crawler, Googlebot-Image, to ignore our images directory.

Here's what that would look like:

<pre>
User-agent: *
Disallow: /wp-admin
Disallow: /images

User-agent: Googlebot-Image
Disallow: /wp-admin
</pre>

First we address all bots and tell them to ignore both of the directories we want to keep hidden. Then we specifically address Googlebot-Image and tell it to ignore only the wp-admin directory. A bot follows the most specific User-agent group that matches it and ignores the rest, so Googlebot-Image will go ahead and crawl the images directory, since we never told it not to.
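
As files like this grow, it helps to annotate them. Lines beginning with # are comments and are ignored by crawlers, so the example above could be documented like so:

<pre>
# All bots: stay out of the admin and images directories
User-agent: *
Disallow: /wp-admin
Disallow: /images

# Exception: Google's image bot only needs to skip the admin
User-agent: Googlebot-Image
Disallow: /wp-admin
</pre>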

== Caveats ==

Most well-behaved bots will obey your robots.txt rules. However, it's important to note that this isn't a security measure. Just because you tell the bots to ignore your private files doesn't mean a) that they will (there are badly behaved bots out there) or b) that anyone else will.

Robots.txt files are merely guides, not a way to make sure no one sees your pages. If you're looking to secure your files, use something like a password-protected directory. That way you'll stop the bots and the humans.
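
If your server runs Apache, for instance, HTTP basic authentication will do the trick. Here's a minimal .htaccess sketch, assuming you've already created a password file with the htpasswd utility at the hypothetical path /home/mysite/.htpasswd:

<pre>
# Prompt for a username and password before serving anything in this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /home/mysite/.htpasswd
Require valid-user
</pre>

Drop that into the directory you want to protect, and both bots and humans will have to log in before they see anything.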

== Conclusion ==

That's really all there is to robots.txt. If you'd like to learn more about robots and see some other examples, head over to the [http://www.robotstxt.org/ Robotstxt website], which is the web's most comprehensive source for all things related to web crawlers.