Googlebot and RSS
Dave Winer is upset because he thinks Google’s Googlebot web crawler has started looking for Atom and RSS 1.0 files, while excluding RSS 2.0. However, a quick look at my logs reveals that Googlebot is crawling my RSS 2.0 feed just fine:
64.68.82.143 - - [22/Apr/2004:02:15:25 -0400] "GET /xml/rss2.xml HTTP/1.0" 200 12891 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
I don’t see any requests for /atom.xml in my logs. There have been a few requests for /index.rdf, but that’s not suspicious since that file did exist on my server until a couple of months ago, and was linked to from my MovableType-generated home page until I edited the template.
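For anyone who wants to run the same check on their own server, here is a minimal sketch. It assumes the access log is in the Apache combined format shown above; the function name `googlebot_feed_hits` and the list of feed paths are my own illustrative choices, not anything Googlebot or Apache defines.

```python
import re
from collections import Counter

# Apache combined log format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def googlebot_feed_hits(lines, feed_paths):
    """Count Googlebot requests for each path listed in feed_paths."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        if "Googlebot" not in match.group("agent"):
            continue  # only interested in Google's crawler
        path = match.group("path")
        if path in feed_paths:
            counts[path] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical feed locations; adjust to match your own site.
    feeds = {"/xml/rss2.xml", "/index.rdf", "/atom.xml"}
    sample = [
        '64.68.82.143 - - [22/Apr/2004:02:15:25 -0400] '
        '"GET /xml/rss2.xml HTTP/1.0" 200 12891 "-" '
        '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"'
    ]
    for path, hits in googlebot_feed_hits(sample, feeds).items():
        print(path, hits)
```

To run it against a real log, pass an open file object instead of the sample list. Note that matching on the user-agent string alone trusts whatever the client claims to be.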
It looks to me like Googlebot is just doing what a web crawler should do: crawling all files linked to from the main page. If Dave’s anonymous correspondent is seeing hits on /index.rdf and /atom.xml, it probably means that his pages contain links to those files. Googlebot isn’t going to guess that a file called /index.xml exists. If you want Googlebot to crawl it, link to it!
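One way to test the "it only follows links" theory is to list every path a page actually links to and see whether the feed files are among them. The sketch below uses only Python's standard library; the class name `LinkCollector`, the helper `linked_paths`, and the example URL are hypothetical, and this is in no way a description of how Googlebot itself discovers links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href targets from <a> and <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.add(urljoin(self.base_url, href))


def linked_paths(html, base_url):
    """Return the set of URL paths that the given HTML links to."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return {urlparse(url).path for url in collector.links}


if __name__ == "__main__":
    # Hypothetical home page markup, for illustration only.
    page = (
        '<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="/index.xml">'
        '</head><body><a href="archives.html">Archives</a></body></html>'
    )
    paths = linked_paths(page, "http://example.com/")
    for feed in ("/index.xml", "/index.rdf", "/atom.xml"):
        print(feed, "linked" if feed in paths else "not linked")
```

If a feed path shows up in the server log but not in this set for any page on the site, the crawler found it some other way.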
One thing I don’t understand is why Dave’s correspondent says, “It’s the first time I’ve seen googlebots looking for these files.” Possible explanations:
- Googlebot did look for them before, but he never noticed until today.
- He recently added links to these files.
Update, 1:00 PM
From the comments below, it seems clear that Googlebot is indeed asking some sites for /index.rdf and /atom.xml, even though it hasn’t seen any links to those files, and even when the site itself links to an /index.xml file. Interesting.
Out of curiosity, I ran a few queries to estimate how many feeds with common names Google has indexed:
| Filename | Query | Hits |
|---|---|---|
| index.rdf | filetype:rdf index | 188,000 |
| rss.xml | filetype:xml rss | 323,000 |
| atom.xml | filetype:xml atom | 11,100 |

