Google drop large numbers of pages: "The Madness of King Google"

When Google arrived on the scene in the late 1990s, they came in with a new idea of how to rank pages. Until then, search engines had ranked each page according to what was in the page – its content – but it was easy for people to manipulate a page’s content and move it up the rankings. Google’s new idea was to rank pages largely by what was in the links that pointed to them – the clickable link text – which made it a little more difficult for page owners to manipulate the page’s rankings.

Changing the focus from what is in a page to what other websites and pages say about a page (the link text) produced much more relevant search results than the other engines were able to produce at the time.
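To make the difference concrete, here is a toy sketch in Python. It is not Google's algorithm, and every name, page and link in it is made up for illustration. It assumes the simplest possible versions of the two ideas: a content score that counts query words in the page itself, and a link text score that counts inbound links whose clickable text contains the query.

```python
from collections import Counter


def content_score(page_text: str, query: str) -> int:
    """Count how often the query terms occur in the page's own text."""
    words = Counter(page_text.lower().split())
    return sum(words[term] for term in query.lower().split())


def link_text_score(inbound_link_texts: list[str], query: str) -> int:
    """Count inbound links whose clickable text contains every query term."""
    terms = query.lower().split()
    return sum(
        1
        for text in inbound_link_texts
        if all(term in text.lower().split() for term in terms)
    )


# Hypothetical pages: one stuffed with keywords, one genuinely useful.
stuffed_page = "cheap flights cheap flights cheap flights cheap flights"
useful_page = "compare fares from every airline and book cheap flights"

# Hypothetical link text that other sites use when pointing at each page.
links_to_stuffed = ["click here"]
links_to_useful = [
    "cheap flights comparison",
    "best site for cheap flights",
    "cheap flights",
]

query = "cheap flights"

# Content-only ranking rewards the stuffed page.
print(content_score(stuffed_page, query), content_score(useful_page, query))

# Link-text ranking rewards the page that other sites describe with the query.
print(
    link_text_score(links_to_stuffed, query),
    link_text_score(links_to_useful, query),
)
```

The page owner controls the first score completely, which is why content-only ranking was so easy to game. The second score depends on what other sites write, so it takes more effort to manipulate.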

The idea worked very well, but it could only work well as long as it was never actually used in the real world. As soon as people realised that Google were largely basing their rankings on link text, webmasters and search engine optimizers started to find ways of manipulating the links and link text, and therefore the rankings. From that point on, Google’s results deteriorated, and their fight against link manipulations has continued. We’ve had link exchange schemes for a long time now, and they are all about improving the rankings in Google – and in the other engines that copied Google’s idea.

In the first few months of this year (2006), Google rolled out a new infrastructure for their servers. The infrastructure update was called “Big Daddy”. As the update was completed, people started to notice that Google was dropping their sites’ pages from the index – their pages were being dumped. Many sites that had been fully indexed for a long time were having their pages removed from Google’s index, which caused traffic to deteriorate, and business to be lost. It caused a great deal of frustration, because Google kept quiet about what was happening. Speculation about what was causing it was rife, but nobody outside Google knew exactly why the pages were being dropped.

Then on the 16th May 2006, Matt Cutts, a senior Google software engineer, finally explained something about what was going on. He said that the dropping of pages is caused by the improved crawling and indexing functions in the new Big Daddy infrastructure, and he gave some examples of sites that had had their pages dropped.

Here is what Matt said about one of the sites:

“Someone sent in a health care directory domain. It seems like a fine site, and it’s not linking to anything junky. But it only has six links to the entire domain. With that few links, I can believe that out toward the edge of the crawl, we would index fewer pages.”

And about the same site, he went on to say:

“A few more relevant links would help us know to crawl more pages from your site.”

Because the site hasn’t attracted enough relevant links to it, it won’t have all of its pages included in Google’s index, in spite of the fact that, in Matt’s words, “it seems like a fine site”. He also said the same about another of the examples that he gave.

Let me repeat one of the things that he said about that site. “A few more relevant links would help us know to crawl more pages from your site.” What??? They know that the site is there! They know that the site has more pages that they haven’t crawled and indexed! They don’t need any additional help to know to crawl more pages from the site! If the site has “fine” pages then index them, dammit. That’s what a search engine is supposed to do. That’s what Google’s users expect them to do.

Google never did crawl all sites equally. The amount of PageRank in a site has always affected how often a site is crawled. But they’ve now added links to the criteria, and for the first time they are dumping a site’s pages OUT of the index if it doesn’t have a good enough score. What sense is there in dumping perfectly good and useful pages out of the index? If they are in, leave them in. Why remove them? What difference does it make if a site has only one link pointing to it or a thousand links pointing to it? Does having only one link make it a bad site that people would rather not see? If it does, why index ANY of its pages? Nothing makes any sort of sense.

So we now have the situation where Google intentionally leaves “fine” and useful pages out of their index, simply because the sites haven’t attracted enough links to them. It is grossly unfair to website owners, especially to the owners of small websites, most of whom won’t even know that they are being treated so unfairly, and it short-changes Google’s users, since they are being deprived of the opportunity to find many useful pages and resources.

So what now? Google has always talked against doing things to websites and pages solely because search engines exist. But what can website owners do? Those who aren’t aware of what’s happening to their sites simply lose – end of story. Those who are aware of it are forced into doing something solely because search engines exist. They are forced to contrive unnatural links to their sites – something that Google is actually fighting against – just so that Google will treat them fairly.

Incidentally, link exchanges are no good, because Matt also said that too many reciprocal links cause the same negative effect: the site isn’t crawled as often, and fewer pages from the site are indexed.

It’s a penalty. There is no other way to see it. If a site is put on the Web, and the owner doesn’t go in for search engine manipulation by doing unnatural link-building, the site gets penalised by not having all of its pages indexed. It can’t be seen as anything other than a penalty.

Is that the way to run a decent search engine? Not in my opinion it isn’t. Do Google’s users want them to leave useful pages and resources out of the index, just because they haven’t got enough links pointing to them? I don’t think so. As a Google user, I certainly don’t want to be short-changed like that. It is sheer madness to do it. The only winners are those who manipulate Google by contriving unnatural links to their sites. The filthy link-rich get richer, and the link-poor get poorer – and pushed by Google towards spam methods.

Google’s new crawling/indexing system is lunacy. It is grossly unfair to many websites that have never even tried to manipulate the engine by building unnatural links to their sites, and it is very bad for Google’s users, who are intentionally deprived of the opportunity to find many useful pages and resources. Google people always talk about improving the user’s experience, but now they are intentionally depriving their users. It is sheer madness!

What’s wrong with Google indexing decent pages, just because they are there? Doesn’t Google want to index all the good pages for their users any more? It’s what a search engine is supposed to do, it’s what Google’s users expect it to do, and it’s what Google’s users trust it to do, but it’s not what Google is doing.

At the time of writing, the dropping of pages is continuing with a vengeance, and more and more perfectly good sites are being affected.

A word about Matt Cutts

Matt is a senior software engineer at Google, who currently works on the spam side of things. He is Google’s main spam man. He communicates with the outside world through his blog, in which he is often very helpful and informative. Personally, I believe that he is an honest person. I have a great deal of respect for him, and I don’t doubt anything that he says, but I accept that he frequently has to be economical with the truth. He may agree or disagree with some or all of the overwhelming outside opinion concerning Google’s new crawl/index function, but if he agrees with any of it, he cannot voice it publicly. This article isn’t about Matt Cutts, or his views and opinions; it is about what Google is doing.

The thread in Matt’s blog where all of this came to light is here.

Update:

Since writing this article, it has occurred to me that I may have jumped to the wrong conclusion as to what Google is actually doing with the Big Daddy update. What I haven’t been able to understand is the reason for attacking certain types of links at the point of indexing pages, instead of attacking them in the index itself, where they boost rankings. But attacking certain types of links may not be Big Daddy’s primary purpose.

The growth of the Web continues at a great pace, and no search engine can possibly keep up with it. Index space has to be an issue for the engines sooner or later, and it may be that Big Daddy is Google’s way of addressing the issue now. Search engines have normally tried to index as much of the Web as possible, but, since they can’t keep pace with it, it may be that Google has made a fundamental change to the way they intend to index the Web. Instead of trying to index all pages from as many websites as possible, they may have decided to allow all sites to be represented in the index, but not necessarily to be fully indexed. In that way, they can index pages from more sites, and their index could be said to be more comprehensive.
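The trade-off can be shown with a little arithmetic. The Python sketch below is purely illustrative: the site names, page counts and capacity are invented, and giving every site an equal share of the space is my own simplifying assumption, not anything Google has described. It compares filling a fixed amount of index space with whole sites against giving every site some representation.

```python
def full_indexing(site_pages: dict[str, int], capacity: int) -> dict[str, int]:
    """Index each site completely, in order, skipping sites that no longer fit."""
    indexed = {}
    remaining = capacity
    for site, pages in site_pages.items():
        if pages <= remaining:
            indexed[site] = pages
            remaining -= pages
    return indexed


def partial_indexing(site_pages: dict[str, int], capacity: int) -> dict[str, int]:
    """Give every site an equal share of the space, capped at its page count."""
    share = capacity // len(site_pages)
    return {site: min(pages, share) for site, pages in site_pages.items()}


# Hypothetical sites and their page counts; hypothetical index capacity.
sites = {"a.example": 600, "b.example": 400, "c.example": 300, "d.example": 200}
capacity = 1000

full = full_indexing(sites, capacity)
partial = partial_indexing(sites, capacity)

print(len(full), "sites indexed in full:", full)
print(len(partial), "sites indexed in part:", partial)
```

With these made-up numbers, full indexing fits only two of the four sites, while partial indexing represents all four within the same space, at the cost of dropping pages from the larger sites.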

Matt Cutts has stated that, with Big Daddy, they are now indexing more sites than before, and also that the index is now more comprehensive than before.

If that’s what Big Daddy is about, then I would have to say that it is fair, because it may be that Google had to leave many sites out of the index due to space restrictions, and the new way would allow pages from more sites to be included in the index.
