BigDaddy using new Googlebot???? [Archive] - Search Engine Roundtable Forums



pk_synths
02-02-2006, 04:15 PM
I have a huge site, close to 2 million pages, and I've never had an issue with Google indexing my pages fairly quickly. Recently I uploaded 2k new pages, and for some reason Googlebot has decided to index those pages with the session id attached.

The strange thing is that this is only happening on BigDaddy!! Instead of showing 2k new pages indexed, BigDaddy is showing 13k, because most pages are indexed numerous times with only the session id being different.

I checked Google "classic" and it has 700 of the 2k pages indexed, and NONE have session ids, which looks and sounds normal. But this BigDaddy thing is very strange. My system is set up to not serve Googlebot any session ids: it looks for "Googlebot" in the request string, and any other request gets a session id. So I'm guessing that whatever Google is using to spider my site now either isn't identifying itself as "Googlebot", and so is getting the session id served to it, or is dropping the "Googlebot" identification after the first request.
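A check of the kind described above might look roughly like the following. This is only a sketch, not the poster's actual code: the function name `should_issue_session_id` and the sample strings are made up for illustration, and it assumes the check is a case-insensitive substring match on the User-Agent header.

```python
def should_issue_session_id(user_agent: str) -> bool:
    """Return True when the visitor should get a session id.

    Hypothetical helper: any request whose User-Agent mentions
    "Googlebot" is treated as the crawler and gets no session id.
    """
    return "googlebot" not in (user_agent or "").lower()


# Sample User-Agent strings, for illustration only.
crawler = "Googlebot/2.1 (+http://www.google.com/bot.html)"
browser = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(should_issue_session_id(crawler))  # False: crawler gets clean URLs
print(should_issue_session_id(browser))  # True: normal visitor gets a session id
```

Any crawler that does not send "Googlebot" in that header would fall into the second case and be handed session ids like a normal visitor.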

If anyone out there has a site specifically set up to not serve Googlebot session ids, can you please check BigDaddy and confirm whether your pages are getting indexed with the session id too?

I'd hate to get smacked with a dup penalty because Google decided to change their spider protocol.

Thanks,

rustybrick
02-02-2006, 05:30 PM
Hmmm... not sure whether it is a crawling issue or just the way BigDaddy handles URLs and duplicate content.

It's very new, so it is hard to tell.

They may just weed out those pages in the SERPs. Did you try searching for the keywords some of those pages are targeting?

! search-engines-web
02-02-2006, 05:48 PM
This has happened to a number of sites, on both the "Old" & "New" Google... What has also occasionally happened is having both URLs cached & in the SERPs, sometimes pages apart, sometimes clustered right next to one another.

There are several likely possibilities for this, but none of the theories have been tested yet.

In some severe cases this has caused long-term damage to a site, but many times Google eventually "repaired" itself.

pk_synths
02-03-2006, 05:54 PM
OK, after doing some research I think I've found out what the issue is. Google has had BigDaddy in mind for a long time. For over a year now there have been rumours about 2 Googlebots, and that part is true. They introduced a new bot with the user agent Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) that's basically collecting data for BigDaddy, and has been for over a year now.
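Whether a server recognises this user agent depends on how its check is written. The sketch below compares two common styles of check against the string quoted above; the function names are hypothetical and this is an assumption about typical setups, not a description of any particular site's code.

```python
# User agent string as quoted in the post above.
NEW_BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)"


def matches_by_prefix(user_agent: str) -> bool:
    """Check that expects the User-Agent to begin with 'Googlebot'."""
    return user_agent.startswith("Googlebot")


def matches_by_substring(user_agent: str) -> bool:
    """Check that looks for 'Googlebot' anywhere in the User-Agent."""
    return "Googlebot" in user_agent


print(matches_by_prefix(NEW_BOT_UA))     # False: treated as a normal visitor
print(matches_by_substring(NEW_BOT_UA))  # True: still recognised as the crawler
```

A site using the prefix style would treat the new bot as an ordinary browser, while a site using the substring style would still recognise it.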

What's really clever about this is that Google realized they can't simply put their existing index onto BigDaddy's new infrastructure and hope to eliminate spam. There are probably millions of cloaked pages that were served to Googlebot over the last 5 years, and those would taint the index. So Google decided to reindex the entire web using their new bot. What this bot does, though, is disguise itself as a user so servers won't push it Googlebot-only cloaked pages, spider redirects, etc.

This would explain why there are so many session ids in the BigDaddy index for my site. Whenever the new bot came by, my system thought it was a regular user and served it URLs with a session id attached, and since Google hasn't canonicalized those URLs, it accepted every session id as a separate page.
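To illustrate how a couple of thousand pages can show up as many more indexed URLs, here is a rough sketch that strips a session parameter and counts what is left. The parameter names in `SESSION_PARAMS` and the example.com URLs are assumptions for illustration; the thread does not say what the real parameter is called.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to whatever the site really uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session id query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Made-up URLs standing in for what the index might contain.
indexed = [
    "http://www.example.com/page.php?id=7&sid=aaa111",
    "http://www.example.com/page.php?id=7&sid=bbb222",
    "http://www.example.com/page.php?id=8&sid=ccc333",
]

distinct = {strip_session_id(url) for url in indexed}
print(len(indexed), "indexed URLs ->", len(distinct), "distinct pages")
# prints: 3 indexed URLs -> 2 distinct pages
```

Running something like this over a list of the URLs BigDaddy reports would show how many real pages sit behind the inflated count.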

This is somewhat of a hunch, but it seems pretty feasible to me.

pk_synths
02-23-2006, 02:08 PM
Looks like I was right that Googlebot is now using a Mozilla user agent.

http://www.adsensebits.com/node/24

That explains all the reports of Google not following the robots.txt file, and why it's indexing session ids.

dazzlindonna
02-23-2006, 02:44 PM
Which also means that everyone's external CSS and JS files could now be read and understood by Google. (Note: that is a theory, not a statement of fact.)