While doing the research for an article, I discovered an oddity: there are no cached pages of groklaw.net at Google. I tried to go back and see if some comments on Groklaw were in the cache, but it (the Google cache) contains no data, even though 286 sources link directly to groklaw.net. That would give Groklaw serious Google crawler activity, one would assume… after all, this site has only 4 links back to it and it has been cached. One has to wonder why Groklaw isn't being cached. Perhaps it is the Geeklog CMS (Content Management System) the site runs on? Perhaps it is… but if so, the same behavior isn't showing up at geeklog.net.

The question once again is: why is Groklaw not cached? Some could point to a blocking technology (like a robots.txt file or a crawler controller) that could be keeping the crawler from accessing pages at Groklaw. Once again, the evidence points to that not being the case: I did a search for "site:groklaw.net microsoft" and came up with 36 hits… so somehow Google is crawling Groklaw without caching it. Odd behavior, to say the least. Has anyone else ever seen this with other sites? I'm currently attempting to contact Google about this oddity… I'll let everyone know what develops. Hopefully, the conspiracy theorists will stay quiet 🙂
UPDATE: This has been explained by those commenting below: it's well within the abilities of a standard robots.txt file. So it looks like I just underestimated the ability to NOT have your site indexed or cached. Thanks go out to Rob and asdf.
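For anyone curious, shutting every crawler out of an entire site only takes two lines of robots.txt. This is the general form (see Rob's link below for Groklaw's actual file, which may differ):

    User-agent: *
    Disallow: /

The first line matches every crawler, and the second tells them to stay away from everything under the site root — no crawling, which in turn means nothing to cache.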
Check out http://www.groklaw.net/robots.txt and http://www.google.com/remove.html#exclude_website
Groklaw prevents all search engines from indexing their stuff.
-Rob
NO CARRIER
See… I could see that, but you can still search 'site:groklaw.net keyword', which means the search engine is still indexing their stuff. So whether they want it to or not, it indexes it.
The only way that Groklaw wouldn't be cached is if they added a META NAME="ROBOTS" CONTENT="NOARCHIVE" or META NAME="GOOGLEBOT" CONTENT="NOARCHIVE" tag, which would have to be in the source of their front page… but of course, since they deny robots from entering in the first place, they can't use a meta tag to tell them not to archive. So the question still remains: why aren't they cached? We can still search the site using Google, as stated in the post… so that means they're still being indexed. And if so, why isn't the site cached?
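For reference, either tag would sit in the head of the front page. Something like this sketch of the general form (not Groklaw's actual markup):

    <head>
      <title>Front page</title>
      <!-- tell all robots not to keep a cached copy of this page -->
      <meta name="robots" content="noarchive">
      <!-- or aim the same instruction at Google's crawler only -->
      <meta name="googlebot" content="noarchive">
    </head>

But again, the crawler has to be allowed to fetch the page before it can read a tag like that.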
Google has been emailed on this… I asked a hypothetical, website-based question about it. I should get a reply in a couple of years.
PJ does not allow Google to cache… see this article and the comments that follow.
http://www.groklaw.net/article.php?story=20050120012343778#c265908
Groklaw mirrors set up at http://gl.scofacts.org/
http://www.korgwal.com/gl-mirror/
to preserve the information per the Creative Commons license
No. You are incorrect.
robots.txt blocks everything. No crawling, no indexing, no caching.
Google is picking up the links from other sites. It is "unable" to crawl Groklaw itself.