Friday, 18 April 2008

It's a wonderful, wonderful web

First, the news that Google are starting to crawl the deep or invisible web via HTML forms on a sample of 'high quality' sites (via The Walker Art Center's New Media Initiatives blog):
This experiment is part of Google's broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience.
You're probably already well indexed if you have a browsable interface that leads to every single one of your collection records and images and whatever; but if you've got any content that is hidden behind a search form (and I know we have some in older sites), this could give it much greater visibility.
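Since Google say the form crawling abides by robots.txt, it may be worth checking whether your search results pages are excluded before expecting any of this to help. Here's a minimal sketch using Python's standard urllib.robotparser; the robots.txt content, the example.org URLs and the is_crawlable helper are all made up for illustration, and real crawlers may interpret rules slightly differently.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt for an older collections site that blocks its search form.
ROBOTS = """\
User-agent: *
Disallow: /search
"""

# Hypothetical URLs: a search results page and a browsable record page.
print(is_crawlable(ROBOTS, "Googlebot", "http://example.org/search?q=teapot"))
print(is_crawlable(ROBOTS, "Googlebot", "http://example.org/objects/123"))
```

In this made-up case the record page is allowed and the search results page is not, so anything reachable only through the form would stay invisible.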

Secondly, Mike Ellis has done a sterling job synthesising some of the official, backchannel and informal conversations about the semantic web at MW2008 and adding his own perspective on his blog.

Talking about Flickr's 20 gazillion tags:

To take an example: at the individual tag level, the flaws of misspellings and inaccuracies are annoying and troublesome, but at a meta level these inaccuracies are ironed out; flattened by sheer mass: a kind of bell-curve peak of correctness. At the same time, inferences can be drawn from the connections and proximity of tags. If the word “cat” appears consistently - in millions and millions of data items - next to the word “kitten” then the system can start to make some assumptions about the related meaning of those words. Out of the apparent chaos of the folksonomy - the lack of formal vocabulary, the anti-taxonomy - comes a higher-level order. Seb put it the other way round by talking about the “shanty towns” of museum data: “examine order and you see chaos”.

The total “value” of the data, in other words, really is way, way greater than the sum of the parts.

So far, so ace. We've been excited before about the implicit links created between data, whether people record them consciously with tags or unconsciously through their paths between data, and about using those links to create 'small ontologies, loosely joined'. Then there are the possibilities of multilingual tagging, and so on. Tags are cool.
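The 'cat' next to 'kitten' inference can be sketched very roughly as counting how often pairs of tags turn up on the same item. This is only an illustration of the idea, not how Flickr or anyone else actually does it; the sample photos and the function names are invented, and a real system would need millions of items and some normalisation for very common tags.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tagged_items):
    """Count how often each unordered pair of tags appears on the same item."""
    pair_counts = Counter()
    for tags in tagged_items:
        # Lower-case and de-duplicate so each pair counts once per item.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        pair_counts.update(combinations(unique, 2))
    return pair_counts


def related_tags(tag, pair_counts, top_n=3):
    """Return the tags that most often share an item with the given tag."""
    tag = tag.lower()
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == tag:
            scores[b] += n
        elif b == tag:
            scores[a] += n
    return scores.most_common(top_n)


# Invented sample data: each list is the set of tags on one photo.
photos = [
    ["cat", "kitten", "cute"],
    ["cat", "kitten"],
    ["Cat", "garden"],
    ["kitten", "cat", "sofa"],
    ["dog", "garden"],
]

counts = cooccurrence_counts(photos)
print(related_tags("cat", counts, top_n=1))  # [('kitten', 3)]
```

Even with five photos, 'kitten' comes out as the closest neighbour of 'cat'; the point in the quote is that at Flickr's scale the misspellings and one-off tags get drowned out by pairs like this.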

But the applications of this could go further:
I got thinking about how this can all be applied to the Semantic Web. It increasingly strikes me that the distributed nature of the machine processable, API-accessible web carries many similar hallmarks. Each of those distributed systems - the Yahoo! Content Analysis API, the Google postcode lookup, Open Calais - are essentially dumb systems. But hook them together; start to patch the entire thing into a distributed framework, and things take on an entirely different complexion.
...
Here’s what I’m starting to gnaw at: maybe it’s here. Maybe if it quacks like a duck, walks like a duck (as per the recent Becta report by Emma Tonkin at UKOLN) then it really is a duck. Maybe the machine-processable web that we see in mashups, API’s, RSS, microformats - the so-called “lightweight” stuff that I’m forever writing about - maybe that’s all we need. Like the widely accepted notion of scale and we-ness in the social and tagged web, perhaps these dumb synapses when put together are enough to give us the collective intelligence - the Semantic Web - that we have talked and written about for so long.
I'd say those capital letters in 'Semantic Web' might scare some of the hardcore SW crowd, but that's ok, isn't it? Semantics (sorry) aside, we're all working towards the same goal - the machine-processable web.

And in the meantime, if we can put our data out there so others can tag it, and so that we're exposing our internal 'tags' (even if they have fancier names in our collections management systems), we're moving in the right direction.

(Now I've got Black's "Wonderful Life" stuck in my head, doh. Luckily it's the cover version without the cheesy synths).

Right, now I'm off to the Museum in Docklands to talk about MultiMimsy database extractions and repositories. Rock.

2 comments:

  1. Thanks for the post and the link..

    The deep web stuff is very interesting - will be watching to see what difference it makes, if any. My initial thought is "if you want it indexed, you should have provided links, not forms" (as per comments on the Google blog) but I guess lots of people don't think SEO in the retentive way that techies do...

    The caps S and W were deliberately provocative :-)

    And now I too have "wonderful world" on my brain. It's like a terrible audio virus...

    Mike

  2. Oh my gawd, Google doing the brute force thing again... How do they know when a query returns a negative without standards such as a 'results' microformat in place? Jeez. So I commented on the post - why not let us set up 'deep web sitemaps' when we can't create browses...
