Googlebot and RSS
Dave Winer is upset because he thinks Google’s Googlebot web crawler has started looking for Atom and RSS 1.0 files, while excluding RSS 2.0. However, a quick look at my logs reveals that Googlebot is crawling my RSS 2 feed just fine:
64.68.82.143 - - [22/Apr/2004:02:15:25 -0400] "GET /xml/rss2.xml HTTP/1.0" 200 12891 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
I don’t see any requests for /atom.xml in my logs. There have been a few requests for /index.rdf, but that’s not suspicious since that file did exist on my server until a couple of months ago, and was linked to from my MovableType-generated home page until I edited the template.
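If you want to run the same check against your own logs, here is a minimal sketch in Python. It assumes your server writes the Apache "combined" log format (the format of the line quoted above). The list of feed paths and the function name `googlebot_feed_hits` are my own choices for illustration, so adjust them to your site.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical list of feed locations worth watching; edit to match your site.
FEED_PATHS = {"/xml/rss2.xml", "/rss.xml", "/index.xml", "/index.rdf", "/atom.xml"}


def googlebot_feed_hits(lines, feed_paths=FEED_PATHS):
    """Map each requested feed path to the status codes Googlebot received."""
    hits = {}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # not a combined-format line
        if "Googlebot" not in match.group("agent"):
            continue  # some other client
        path = match.group("path")
        if path in feed_paths:
            hits.setdefault(path, []).append(int(match.group("status")))
    return hits


if __name__ == "__main__":
    sample = [
        '64.68.82.143 - - [22/Apr/2004:02:15:25 -0400] '
        '"GET /xml/rss2.xml HTTP/1.0" 200 12891 "-" '
        '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"'
    ]
    print(googlebot_feed_hits(sample))  # {'/xml/rss2.xml': [200]}
```

A 404 in the result for a path you never published (say, /atom.xml) would suggest the crawler is requesting a file it was not linked to from your own pages, though it cannot tell you where the crawler got the URL.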
It looks to me like Googlebot is just doing what a web crawler should do: crawling all files linked to from the main page. If Dave’s anonymous correspondent is seeing hits on /index.rdf and /atom.xml, it probably means that his pages contain links to those files. Googlebot isn’t going to guess that a file called /index.xml exists - if you want Googlebot to crawl it, link to it!
One thing I don’t understand is that Dave’s correspondent says “It’s the first time I’ve seen googlebots looking for these files”. Possible explanations:
- Googlebot did look for them before, but he never noticed until today.
- He recently added links to these files.
Update, 1:00 PM
From the comments below, it seems clear that Googlebot is indeed asking some sites for /index.rdf and /atom.xml, even though it hasn’t seen any links to those files, and even when the site itself links to an /index.xml file. Interesting.
Out of curiosity, I ran a few queries to try and figure out how many feeds with common names Google has indexed:
| Filename | Query | Hits |
|---|---|---|
| index.rdf | filetype:rdf index | 188,000 |
| rss.xml | filetype:xml rss | 323,000 |
| atom.xml | filetype:xml atom | 11,100 |
Dave’s at least partly right. Googlebot has started actively looking for index.rdf and atom.xml. I have neither of those files on any of my sites, but the files are being requested. As for whether Googlebot is automatically requesting rss.xml, I couldn’t tell you because all of my sites have links to their rss.xml files.
Comment by Matt — April 22, 2004 @ 11:03 am
Thanks, Matt, that’s interesting. I still don’t see any /atom.xml requests in my logs, though. It could be that in cases where there is no RSS feed link, Googlebot is trying a couple of default locations for the most popular weblogging tools - /index.rdf for Movable Type, and /atom.xml for Blogger. That would seem like a sensible approach.
I wonder, then - is Googlebot doing this on all sites, or only those it determines to be blogs? And if only blogs, how is it making the distinction?
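To make the theory concrete, here is a purely illustrative sketch in Python of what such a fallback could look like. Nothing here is known about how Googlebot actually works: the use of `<link rel="alternate">` tags to detect an advertised feed, the list of default guesses, and the name `candidate_feed_urls` are all my assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed MIME types that mark a <link> tag as pointing to a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/rdf+xml"}

# The two default locations guessed at above (Movable Type, Blogger).
DEFAULT_GUESSES = ["/index.rdf", "/atom.xml"]


class FeedLinkFinder(HTMLParser):
    """Collect href values from <link rel="alternate"> tags with a feed type."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_values = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel_values and is_feed and attr.get("href"):
            self.feeds.append(attr["href"])


def candidate_feed_urls(page_url, html):
    """Return advertised feed URLs, or the default guesses if none are advertised."""
    finder = FeedLinkFinder()
    finder.feed(html)
    if finder.feeds:
        return [urljoin(page_url, href) for href in finder.feeds]
    return [urljoin(page_url, guess) for guess in DEFAULT_GUESSES]
```

A page with no feed link at `http://example.com/blog/` would yield `http://example.com/index.rdf` and `http://example.com/atom.xml` under this sketch, which only covers the site root and not subdirectories.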
Comment by Abe — April 22, 2004 @ 11:26 am
Perhaps there is a third site mistakenly linking to atom.xml and index.rdf on all sorts of blogs.
Comment by Aaron Swartz — April 22, 2004 @ 11:50 am
That could be true, as well, Aaron.
Comment by Matt — April 22, 2004 @ 11:52 am
WordPress uses atom.php and rss.php, etc. :-)
Comment by Stoffer — April 22, 2004 @ 11:54 am
[It looks to me like Googlebot is just doing what a web crawler should do: crawling all files linked to from the main page. If Dave’s anonymous correspondent is seeing hits on /index.rdf and /atom.xml, it probably means that his pages contain links to those files.]
I have links to rss.xml on my site.
I’ve never had links to atom.xml or index.rdf.
My blog is hand-rolled using perl.
Google only started requesting atom.xml and index.rdf from my site today.
I’m not religious about syndication formats but I figure if google starts crawling feeds it should at least request rss.xml.
walter
Comment by walter — April 22, 2004 @ 12:29 pm
Walter,
Thanks for the details. It seems clear that my theory was incorrect. I’ll update the original post.
Comment by Abe — April 22, 2004 @ 12:55 pm
I bet you $10 that Aaron’s theory is correct. Somewhere on the net, there’s a page that’s guessing feed URLs. Google’s just recently indexed it, and is following those links.
It’s not necessarily on your own site. It could be anywhere. But somewhere, that link exists.
Where this page is, or if it even still exists, is unknown. But I’m sure that’s what’s causing these scans.
And incidentally, Dave/Walter’s conspiracy theory about Google choosing which “versions” of feeds to scan for is bullshitty FUD of the first degree. Adjust your tin-foil hats and move on, folks.
Rod.
Comment by Rod — April 22, 2004 @ 1:11 pm
Just for fun…
http://www.scripting.com/Default.asp
http://www.xanadb.com/Default.asp
http://www.photomatt.net/Default.asp
http://slashdot.org/Default.asp
See, now Google’s going to start hitting /Default.asp on those servers. It’s a conspiracy, chaps! Google’s looking for servers that run Microsoft IIS and giving them priority!
Have I made my point yet?
Rod.
Comment by Rod — April 22, 2004 @ 1:15 pm
Rod, good way to express that point. That was my thought as well. The real story is why do experienced bloggers (not Abe) disregard polite inquiry and jump straight to accusations?
Comment by At Work — April 22, 2004 @ 1:35 pm
Because that’s what being part of the A-list is about. Don’t you know? I’m leaning toward Dave’s position, however. When Googlebot is making these requests, it’s only requesting robots.txt, atom.xml, and index.rdf, in that order. Every request has been exactly the same. I don’t think this would be happening if it were simply due to some page having all these bad links.
Comment by Matt — April 22, 2004 @ 1:39 pm
Position doesn’t really make sense in my comment. Interpretation would be a better word.
Comment by Matt — April 22, 2004 @ 1:54 pm
I get crawled by Google pretty much every day lately (I have no idea why), and so far there have been no atom or rdf requests. But Google does always spider my RSS file.
Here’s my log:
Host: xxx.xxx.xxx.xxx
Url: /rss.xml Http Code : 200
Date: Apr 22 06:54:22
Http Version: HTTP/1.0 Size in Bytes: 12073
Referer: - Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Not sure what, if anything this shows, but there you go.
Comment by Pat Rock — April 22, 2004 @ 2:32 pm
Abe wrote: “It could be that in cases where there is no RSS feed link, Googlebot is trying a couple of default locations for the most popular weblogging tools - /index.rdf for Movable Type, and /atom.xml for Blogger.”
That’s my theory too, supported by the behavior mentioned by PhotoMatt: “I’ve been getting random requests from Googlebot for atom.xml and index.rdf files on this site and others. It’s always in the root or in relevant subdirectories (usually /blog or similar).” ( http://photomatt.net/archives/2004/04/20/google-cooking/ )
Comment by Greg R. — April 22, 2004 @ 3:17 pm
“The real story is why do experienced bloggers (not Abe) disregard polite inquiry and jump straight to accusations?”
It’s just a continuation of the campaign against Atom. Atom’s got them running scared, and the only way they can slow it down is by launching a concerted Stop Energy campaign using everything from flaming and misdirection through to scare-mongering.
Thankfully, Atom collaborators seem to have stepped above being distracted by such tactics.
If I weren’t so interested in Atom, I’d have a great laugh compiling and detailing the list of Stop Energy tactics these guys have used.
Comment by Isofarro — April 23, 2004 @ 6:14 am
Some people might have an anti-Atom agenda, Isofarro. I do not. My concern is that Google is at some point going to use its dominant position to manipulate what technologies people use. I want to see Atom succeed, but not through dirty tactics. I’m glad that this turned out to be a false alarm.
Comment by Matt — April 23, 2004 @ 7:25 am
Isofarro,
I don’t have an anti-Atom agenda either. I’ll switch to Atom if/when it becomes the de facto standard. It’s not the standard today, though it may become the de facto standard if this is indicative of Googlebot’s future behaviour. All it takes is a few gentle nudges from a very powerful force. I’m not opposed to Atom; I am opposed to the means by which Google may be enforcing a new standard.
Then again - brace yourself, you’re unlikely to hear this from any other comment poster - I may be 100% wrong.
Walter
Comment by walter — April 23, 2004 @ 9:36 am
Matt, walter,
My unreserved apologies to you both. My response is directed at the invective here: http://blogs.law.harvard.edu/crimson1/2004/04/22#a1479 where an experienced blogger truly does jump straight into accusations instead of a polite enquiry (and he does have contacts with people inside Google). This isn’t an isolated incident in the barrage of negativity against Atom.
I’m heartened that both seem willing to treat Atom on its own merits. Thank you.
Comment by Isofarro — April 23, 2004 @ 4:11 pm