Ever wonder why Google and other search engines are ignoring portions of your website? It could be that the big search engines just don't like you, but a simpler answer is more likely -- they don't know where all your pages are.
And if search engines can't find your site's pages, then there's no way for them to be indexed, which means you miss out on money-earning traffic. That's no good, so how can you explicitly tell a search engine spider where your pages are?
The answer is to use a sitemap.
== What is a Sitemap? ==
A sitemap is essentially a table of contents for your website. But the sitemaps we're talking about here are not designed for human viewing, unlike the sitemaps you might offer visitors looking for a quick way to navigate your site. Instead, sitemap.xml files serve the same information in a format that search engine spiders can easily understand.
A sitemap is a simple XML file (named, fittingly, sitemap.xml) that gives the location, last-modified date and some other metadata for every page in your site.
When a search engine bot comes to your site and finds a sitemap, it will follow all the specified URLs, indexing the content and including whatever metadata your sitemap specifies.
== The Sitemap Protocol ==
The sitemap protocol is pretty simple. The basic format looks like this:
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://webmonkey.com/</loc>
<lastmod>2008-10-13T04:20:36Z</lastmod>
<changefreq>always</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>http://webmonkey.com/new-post/</loc>
<lastmod>2008-10-13T20:20:36Z</lastmod>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
</urlset>
</pre>
As you can see, we start with a basic XML declaration. Make sure you specify the UTF-8 encoding: Google requires that sitemaps be UTF-8 encoded or it will ignore them. The next line opens our <code>urlset</code> tag, which is the container tag that will hold all our URLs.
Note that we're pointing to the schema defined on [http://www.sitemaps.org/protocol.php sitemaps.org]. As of this writing, version 0.9 is the latest official schema.
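To see why the <code>xmlns</code> declaration matters to anything reading the file, here is a minimal sketch of parsing a sitemap with Python's standard library. The <code>list_urls</code> function and the <code>sm</code> prefix are names made up for this illustration. This is not how any particular search engine reads sitemaps, just one way to pull the URLs back out.

```python
import xml.etree.ElementTree as ET

# Every tag in a sitemap lives in this namespace, so lookups must use it.
# The "sm" prefix is an arbitrary label chosen for this sketch.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://webmonkey.com/</loc>
<lastmod>2008-10-13T04:20:36Z</lastmod>
</url>
</urlset>
"""


def list_urls(xml_bytes):
    """Return (loc, lastmod) pairs for every <url> in a sitemap."""
    root = ET.fromstring(xml_bytes)
    return [
        (url.findtext("sm:loc", namespaces=NS),
         url.findtext("sm:lastmod", namespaces=NS))
        for url in root.findall("sm:url", NS)
    ]


if __name__ == "__main__":
    for loc, lastmod in list_urls(SAMPLE):
        print(loc, lastmod)
```

Looking up a plain <code>url</code> tag without the namespace would find nothing, which is why the declaration on <code>urlset</code> can't be left out.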
The next tag is the <code>url</code> tag which is just a container for all the bits of information we can tell the search engines about for each page on our site. Those options are:
# '''loc''' (required) -- the URL of the page. This URL must begin with the protocol (generally http) and end with a trailing slash, if your web server requires it.
# '''lastmod''' (optional) -- the date you last modified the page. Should be in [http://www.w3.org/TR/NOTE-datetime W3C Datetime format], but you can omit the time portion.
# '''changefreq''' (optional) -- How often the page is likely to change. Ostensibly this helps search engines figure out how often to crawl the page, but just because you put "hourly", don't expect the Google bot to stop by that often. The possible values are: always, hourly, daily, weekly, monthly, yearly and never. Note that you'll probably only want to use "never" for permalink archive pages.
# '''priority''' (optional) -- The priority of this URL relative to other URLs on your site. In other words, how important is this particular URL in the grand scheme of your site? Possible values range from 0.0 to 1.0. If you don't specify a priority, the URL will receive a default value of 0.5.
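The rules in the list above are easy to check before you write anything to disk. What follows is a rough sketch in Python, not part of the protocol or of any sitemap tool. The function names <code>format_lastmod</code> and <code>check_entry</code> are invented for this example, and treating timezone-less datetimes as UTC is an assumption.

```python
from datetime import date, datetime, timezone

# Allowed <changefreq> values, as listed above.
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def format_lastmod(value):
    """Format a date or datetime as a W3C Datetime string."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption for this sketch: naive datetimes are already UTC.
            value = value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Plain dates: the time portion is simply omitted.
    return value.isoformat()


def check_entry(loc, changefreq=None, priority=None):
    """Return a list of problems found in one <url> entry (empty if none)."""
    problems = []
    if not loc.startswith(("http://", "https://")):
        problems.append("loc must begin with the protocol")
    if changefreq is not None and changefreq not in CHANGEFREQ_VALUES:
        problems.append("unknown changefreq: %r" % changefreq)
    if priority is not None and not 0.0 <= priority <= 1.0:
        problems.append("priority must be between 0.0 and 1.0")
    return problems


if __name__ == "__main__":
    print(format_lastmod(date(2008, 10, 13)))
    print(check_entry("webmonkey.com/new-post/", "sometimes", 1.5))
```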
Two things to keep in mind. First, only the <code>loc</code> node is actually required, though, as we'll see below, most out-of-the-box sitemap creators make it easy to give out more info than just the URL.
The other thing to keep in mind is that most of the time search engine bots expect your sitemap to live at <code>http://mysite.com/sitemap.xml</code> -- the root level of the site.
Of course, that doesn't mean you can't have a simple pointer file at the root level and then the actual sitemap files somewhere else. In fact, since your sitemap.xml file cannot exceed 10 megabytes in size, and should have no more than 50,000 URLs per file, if you've got a very large site, you'll need to use a pointer and several separate sitemap.xml files.
To do that, create a root sitemap.xml file with content like this:
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://mysite.com/sitemap1.xml</loc>
<lastmod>2008-10-13T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>http://mysite.com/sitemap2.xml</loc>
<lastmod>2005-01-01</lastmod>
</sitemap>
</sitemapindex>
</pre>
Then at the URLs sitemap1.xml and sitemap2.xml you'd define the different parts of your sitemap using the same scheme we saw above.
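If you're generating these files with a script, the splitting can be automated. The sketch below, in Python, chunks a list of URLs by the 50,000-per-file limit and builds a matching index file. The helper names (<code>chunk_urls</code>, <code>build_index</code>) and the <code>sitemapN.xml</code> naming pattern are assumptions for illustration. It only enforces the URL count, not the file size limit, and it leaves out the optional <code>lastmod</code> entries.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit mentioned above


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(base_url, count):
    """Build a sitemap index pointing at sitemap1.xml .. sitemap<count>.xml."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(1, count + 1):
        sitemap = ET.SubElement(index, "sitemap")
        loc = "%s/sitemap%d.xml" % (base_url.rstrip("/"), n)
        ET.SubElement(sitemap, "loc").text = loc
    # xml_declaration needs Python 3.8 or later.
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical site with 120,001 pages: needs three sitemap files.
    urls = ["http://mysite.com/page%d/" % i for i in range(120001)]
    chunks = chunk_urls(urls)
    print(build_index("http://mysite.com", len(chunks)).decode("utf-8"))
```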
== Creating a Sitemap ==
Okay, now that you know what a sitemap file is, how do you go about creating one? Well, the thing about sitemaps is that they need to be dynamic. That is, whenever you add a new post or URL to your site, you need to update the sitemap.
For small sites, hand coding might be an option, but even the simplest of sites gets pretty complex pretty quickly.
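If you'd rather not type the XML by hand, a few lines of script can write it for you. Here is a minimal sketch using Python's standard library. The <code>build_sitemap</code> function and the shape of the <code>pages</code> list are assumptions made up for this example. A real site would pull that list from its database or file system.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML (as UTF-8 bytes) from a list of page dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional, mirroring the protocol.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    # ElementTree escapes characters like & in URLs for us.
    # xml_declaration needs Python 3.8 or later.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical page list for illustration.
    pages = [
        {"loc": "http://webmonkey.com/", "changefreq": "always", "priority": 1.0},
        {"loc": "http://webmonkey.com/new-post/", "lastmod": "2008-10-13"},
    ]
    with open("sitemap.xml", "wb") as f:
        f.write(build_sitemap(pages))
```

You would still need to rerun the script whenever the site changes, which is the problem the tools below try to solve.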
Fortunately there are some tools that can make the task easier. For instance, you can use the [https://www.google.com/webmasters/tools/docs/en/sitemap-generator.html Google Sitemap Generator], which is a Python script that can create a sitemap for you. The sitemap generator even comes with instructions on how to [https://www.google.com/webmasters/tools/docs/en/sitemap-generator.html#recur set up a cronjob] so that your sitemap stays up to date.
But even using cron isn't ideal in most cases -- especially if you have a site that adds dozens of new pages every day. Luckily, most of the major publishing systems and web frameworks offer ways to create dynamically updated sitemaps. Here are a few links to get you started:
# '''Movable Type''': Movable Type allows you to create as many templates as you'd like, so just create a new sitemaps template and make sure it gets served at the root level of http://mysite.com. To help you get started, check out Niall Kennedy's somewhat dated, but still helpful, tutorial on [http://www.niallkennedy.com/blog/2005/06/google-sitemaps.html Sitemaps in Movable Type]. Also check out the Movable Type wiki, which has some more [http://wiki.movabletype.org/Canonical_Google_Sitemap_template sitemap examples].
# '''WordPress''': To generate sitemaps in WordPress just install the [http://wordpress.org/extend/plugins/google-sitemap-generator/ Google XML Sitemaps] plugin. It will handle all the dirty work, automatically updating your sitemap every time you edit or create a post.
# '''Django''': The Django web development framework ships with a built-in sitemap generator. For more details read through [http://docs.djangoproject.com/en/dev/ref/contrib/sitemaps/ the official documentation].
# '''Drupal''': Unlike Django, Drupal doesn't ship with a sitemap tool, but the contributed XML sitemap module adds one. Head over to the [http://drupal.org/project/xmlsitemap project page] for more details.
== Conclusion ==
Sitemaps aren't particularly difficult to use and they can work wonders for search engine ranking. They're no substitute for quality content and inbound links, but if Google and the rest see your site as a black hole on the web, sending out an invite and offering up a sitemap is one of the best ways to make friends with search engine spiders.