There are typically two kinds of sitemaps when discussing websites. The HTML sitemap is usually a nice-looking web page with a list of neatly organized page titles, and then there is the XML sitemap. It looks funky, like some type of computer code, and most people don’t even know it exists. That’s the one I’m talking about today, the funky one, which happens to be one of the most important pages to understand for good SEO.
If you hear someone ask about your XML sitemap, just think of it as the table of contents for search engines. It tells the “bots” what pages they should crawl, index, and rank you for.
Ultimately, it’s just a suggestion and Google has the right to ignore it and crawl your site willy-nilly, but most of the time it abides.
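For reference, a bare-bones XML sitemap looks something like this (example.com and the dates here are just placeholders; a real file has an entry for every page you want indexed):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per page you want crawled and indexed -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2018-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/sample-post/</loc>
    <lastmod>2018-01-10</lastmod>
  </url>
</urlset>

Most SEO plugins and content management systems generate this file for you, so you rarely have to write it by hand.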
If you make a lot of content changes to a website, there is a way to manually ask Google to crawl and re-index it using the Search Console platform. Using the Fetch option, you can tell Google to crawl a specific page and all pages linked from it, then submit them to the index. Because the XML sitemap contains a link to every page on your website, it makes re-indexing the whole site super easy.
You can use this function to index a new page or blog post in record time too. I’ve had blog posts ranking on Google within 20 minutes of publishing them using this trick.
If you want to explore the nuances of the XML sitemap further check out this in-depth guest post on Moz by Michael Cottam.
What to be careful of…
Use the preferred version of the domain that’s been established in Google Search Console. You’ll be dinged with errors if the sitemap uses the wrong HTTP(S) or www vs. non-www version. Every URL must be uniform and match the primary domain settings.
Google sees each version of a website as a unique entity and will index them separately if given the chance, and indexing duplicate content will not help you rank any better. It puts you in competition with yourself and creates just enough confusion for Google to hand a competitor a major opportunity to outrank you.
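For example, if the property verified in Search Console is https://www.example.com (a placeholder domain), every <loc> entry in the sitemap should use that exact form:

<!-- Consistent: matches the https + www version set up in Search Console -->
<loc>https://www.example.com/services/</loc>

<!-- Inconsistent: mixing versions like these gets flagged and splits your indexing -->
<loc>http://www.example.com/services/</loc>
<loc>https://example.com/services/</loc>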
If you’re using the Yoast SEO WordPress plugin, the default sitemap will often include custom post types and archive pages that are basically just shells with no real content. These low-quality pages only hurt your overall domain authority and should be left out of the sitemap.
Think about it. If a person lands on a page with no content on it, they’re more likely to just leave your website than click through to another page.
The robots.txt file is essentially the opposite of the XML sitemap. It tells search engines which pages you DO NOT want crawled and indexed. Think of confirmation pages, gated content, admin pages, etc. These are pages people should only reach after taking a specific action on the website, not from a Google search. E.g. opting in to an email list, making a purchase, filling out a contact form, logging in to a portal, etc.
One thing most websites are missing is a line in the robots.txt file that specifies where the XML sitemap is located. The two files work together by telling search engines, “here are the pages I want you to ignore, and here is the URL for the pages I want indexed.”
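A minimal robots.txt that does both jobs might look like this (the paths and domain are placeholders; swap in your own):

# Applies to all crawlers
User-agent: *

# Pages people should only reach after taking an action on the site
Disallow: /wp-admin/
Disallow: /thank-you/
Disallow: /members/

# Tell crawlers where the XML sitemap lives
# (Yoast typically names it sitemap_index.xml; plain sitemap.xml is also common)
Sitemap: https://www.example.com/sitemap_index.xml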
Sadly, just like with the XML sitemap, these directives are only suggestions to search bots and can be ignored at any time. To make sure private pages stay out of search results, take the extra step of adding a robots meta tag to each page you don’t want indexed: <meta name='robots' content='noindex,nofollow' />
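The tag goes in the <head> of the page itself; for example, on a hypothetical thank-you page:

<head>
  <title>Thanks for subscribing</title>
  <!-- Tells compliant bots not to index this page or follow its links -->
  <meta name='robots' content='noindex,nofollow' />
</head>

One caveat: a bot has to be able to crawl the page to see this tag, so if the same page is also blocked in robots.txt, the noindex may never get read.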