|
The extent of the duplicate URL problem had not quite hit home, although I knew it was a significant issue, until I got a spider to crawl our site.
The auto-generated Google sitemap is not working for us at the moment, so I decided to use a sitemap generation service freely available on the internet to generate our XML sitemap on a temporary basis until the Magento bug is resolved.
The tool I use is pretty good – you can find it at: http://www.xml-sitemaps.com/
This tool goes off and crawls your site, finding URLs and then producing a sitemap. It obeys the robots.txt file, so the crawl gives a reasonable indication of what a search engine spider would find.
I let the sitemap tool do its stuff and to my amazement it produced the following results:
• On this particular site we have about 40 products in 11 categories and a handful of products with size / colour variations – so a maximum of 70 SKUs – a small site.
• We had restricted some pages via robots.txt. Our robots.txt file looked like this:
Disallow: /index.php/
Disallow: /*?
Disallow: /*.js$
Disallow: /*.css$
Disallow: /checkout/
Disallow: /tag/
Disallow: /catalogsearch/
Disallow: /review/
Disallow: /app/
Disallow: /downloader/
Disallow: /js/
Disallow: /lib/
Disallow: /media/
Disallow: /*.php$
Disallow: /pkginfo/
Disallow: /report/
Disallow: /skin/
Disallow: /var/
• The sitemap tool only went 3 levels deep and added 500 URLs to the sitemap – at which point it stopped spidering, as the free tool has a 500 URL limit.
• The sitemap tool found a significant number of URLs for each product and category, created by the different views of categories, prices, colours etc.
• So from 40 products and 11 categories we had over 500 URLs spidered – this is a huge issue.
Looking through the URLs in the generated sitemap, I noticed that all of the URLs generated from the different views started with the /catalog/ directory.
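If you would rather not eyeball a 500 line sitemap, a rough way to spot this kind of pattern is to count the URLs by their first path segment. The Python sketch below is only an illustration: the function name `count_by_first_segment` and the sample URLs are made up for the example, and it assumes the tool produces a standard sitemaps.org XML file.

```python
from collections import Counter
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def count_by_first_segment(sitemap_xml):
    """Count sitemap URLs by the first segment of their path."""
    root = ET.fromstring(sitemap_xml)
    counts = Counter()
    for loc in root.iter(SITEMAP_NS + "loc"):
        path = urlparse(loc.text.strip()).path
        segments = [s for s in path.split("/") if s]
        # The home page has no segments, so it is counted under "/".
        first = segments[0] if segments else ""
        counts["/" + first] += 1
    return counts


# Hypothetical sample; a real file would come from the sitemap tool.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.domain.co.uk/catalog/product/view/id/1</loc></url>
  <url><loc>http://www.domain.co.uk/catalog/category/view/id/3</loc></url>
  <url><loc>http://www.domain.co.uk/product.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    for segment, total in count_by_first_segment(SAMPLE).most_common():
        print(segment, total)
```

A directory that accounts for most of the entries is a good candidate for a closer look.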
So just by adding a couple of lines to the robots.txt I was able to solve the majority of the problems – so our robots.txt file is now like this:
User-agent: *
Disallow: /index.php/
Disallow: /*?
Disallow: /*.js$
Disallow: /*.css$
Disallow: /checkout/
Disallow: /tag/
Disallow: /catalogsearch/
Disallow: /review/
Disallow: /app/
Disallow: /downloader/
Disallow: /js/
Disallow: /lib/
Disallow: /media/
Disallow: /*.php$
Disallow: /pkginfo/
Disallow: /report/
Disallow: /skin/
Disallow: /var/
Disallow: /catalog/
Disallow: /customer/
Sitemap: http://www.domain.co.uk/sitemap.xml
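Before re-running the crawl, the new rules can be sanity-checked with a few lines of Python. This is a minimal sketch using the standard library's `urllib.robotparser`. As far as I know that parser only does plain prefix matching, so the wildcard lines (`/*?`, `/*.js$` and so on) are left out here and only a few of the directory rules are tested. The helper name `is_allowed` and the test URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules above: plain directory prefixes only.
# Wildcard rules are omitted because this parser is assumed not to support them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /catalog/
Disallow: /customer/
"""


def is_allowed(robots_txt, url, agent="*"):
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Illustrative URLs, not taken from a real crawl.
    for url in (
        "http://www.domain.co.uk/catalog/product/view/id/1",
        "http://www.domain.co.uk/category/product.html",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

This only shows how one parser reads the file. Each search engine applies its own interpretation, so the crawl with the sitemap tool is still the better test.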
So I ran the free sitemap tool again, and now it only finds 74 URLs. Great: the majority of the issue is temporarily resolved and I am seeing the proper rewritten URLs for each product / category.
BUT
I still have duplicate URLs for each product, as each product can still be reached from several different URLs, such as:
www.domain.co.uk/product.html
www.domain.co.uk/category/product.html
and if the product is in multiple categories:
www.domain.co.uk/category1/product.html
www.domain.co.uk/category2/product.html
etc
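To get a feel for how many products are affected, the URLs from the sitemap can be grouped by their last path segment. The sketch below is a rough illustration only. It assumes the final segment (for example `product.html`) identifies the product, which matches the examples above but may not hold on every store. The function name `find_duplicate_pages` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def find_duplicate_pages(urls):
    """Group URLs by last path segment and keep groups with more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        # urlparse only recognises the host when a scheme is present.
        if "://" not in url:
            url = "http://" + url
        path = urlparse(url).path
        page = path.rstrip("/").rsplit("/", 1)[-1]
        groups[page].append(url)
    return {page: found for page, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    sample = [
        "www.domain.co.uk/product.html",
        "www.domain.co.uk/category1/product.html",
        "www.domain.co.uk/category2/product.html",
        "www.domain.co.uk/other.html",
    ]
    for page, found in find_duplicate_pages(sample).items():
        print(page, len(found))
```

Two different products that happen to share a file name would be grouped together, so treat the output as a list of candidates to check rather than a definitive count.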
I see the robots.txt workaround as a fudge. It is not a good long-term strategy to rely on robots.txt to stop spiders, but for now it is OK.
Long term this has to be fixed with better rewrite rules or different coding for URL generation, and by addressing the product-in-multiple-categories issue.
I urge you to run the free sitemap tool over your Magento site to see what you find.
It is really critical, both from a search engine perspective and for the effectiveness of the SEO on your site, that this is dealt with before a search engine spider finds and indexes your site.
The really scary part is that this is a small store of ours – we have others with 8000 SKUs that we are planning to migrate to Magento! We will not be embarking on this until the URL issue is resolved.
I am not putting this forward as an ultimate resolution, but one which we are testing and appears to HELP the issue, NOT resolve it.
Anyway hope this helps some of you.
|