While doing the research for an article, I discovered an oddity. There are no cached pages of groklaw.net at Google. I tried to go back and see if some comments on groklaw were in cache, but the Google cache contains no data, even though 286 sources link directly to groklaw.net. One would assume that many links would give groklaw serious Google crawler activity. After all, this site has only 4 links back to it and it has been cached.

One has to wonder why groklaw isn't being cached. Perhaps it is a feature of the Geeklog CMS (Content Management System)? Perhaps, but if so, it isn't being used at geeklog.net. So the question once again is: why is groklaw not cached? Some could point to a blocking technology (like a robots.txt or crawler controller) that could be keeping the crawler from accessing pages at groklaw. Once again, the evidence suggests that isn't it. I did a search for "site:groklaw.net microsoft" and came up with 36 hits, so somehow Google is crawling groklaw without caching it. Odd behavior, to say the least.

Has anyone else ever seen this with other sites? I'm currently attempting to contact Google about this oddity, and I'll let everyone know what develops. Hopefully, the conspiracy theorists will stay quiet 🙂
UPDATE: This has been explained in further detail by those commenting, and it turns out to be well within the abilities of a standard robots.txt file. So it looks like I just underestimated the ability to NOT have your site indexed or cached. Thanks go out to Rob and asdf.
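For anyone who wants to check this sort of thing on another site, here is a minimal Python sketch of the two standard crawler controls. It is an illustration under stated assumptions, not a record of what groklaw.net actually serves: the robots.txt rules, the HTML page, the example.com URLs, and the helper names `allowed_by_robots` and `robots_meta_directives` are all made up for the example. The `noarchive` robots meta directive is included as one possible way a page can be indexed yet not cached; the comments above only credit robots.txt, so treat that part as a guess.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, NOT groklaw.net's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical page asking crawlers not to keep a cached copy.
SAMPLE_HTML = """\
<html><head>
<meta name="robots" content="noarchive">
<title>Example</title>
</head><body>Hello</body></html>
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, noarchive".
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_meta_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(allowed_by_robots(SAMPLE_ROBOTS_TXT, "Googlebot",
                            "http://example.com/index.html"))
    print(allowed_by_robots(SAMPLE_ROBOTS_TXT, "Googlebot",
                            "http://example.com/private/page.html"))
    print(robots_meta_directives(SAMPLE_HTML))
```

To try it against a real site, you would fetch that site's own robots.txt and page source and pass them in as strings; the sketch deliberately makes no network requests.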
This content is published under the Attribution-Noncommercial-Share Alike 3.0 Unported license.