Sunday, 20 April 2008

Crowdsourcing metadata cleaning?

If you're interested in another perspective on dealing with user-generated tags or metadata, this blog post, Fingerprinting and Metadata Progress Report, talks about how they're trying to create 'order from chaos':
So far our fingerprint server identified 23 million unique tracks, from the 650 million fingerprint requests you’ve thrown at it. Who knows how many unique tracks there are out there.. We have a couple of hundred million tracks based on spelling alone – but not all of them are spelt correctly.

They have some interesting issues to deal with in cleaning up their data (i.e. your data, if you're a user), especially when 'the most popular spelling is not necessarily the correct one'. And what about bands that change their name (but are essentially the same band) or their line-up (are they still the same band?) - when do you decide to create a new identifier?

They're letting users who are logged in vote on potential corrections to an artist name, effectively testing crowdsourcing metadata corrections as well as the original data creation process. This model could work for museums - depending on the collection, some museums already get a lot of corrections when parts of their collections are published online. What would happen if we made that process transparent?
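To make the voting idea concrete, here is a minimal sketch of how votes on a proposed correction might be tallied. It is not taken from the blog post: the function name `pick_correction` and the `min_votes` and `min_share` thresholds are assumptions for illustration only. The point is that a correction is only accepted when enough logged-in users agree, so a handful of votes for a popular misspelling doesn't overwrite the existing name.

```python
from collections import Counter


def pick_correction(current_name, votes, min_votes=5, min_share=0.6):
    """Return the spelling to use after a round of voting.

    votes is a list of spellings, one per vote cast by a logged-in user.
    The current name is kept unless a single candidate receives at least
    min_votes votes and at least min_share of all votes cast.
    (Both thresholds are hypothetical values chosen for illustration.)
    """
    if not votes:
        return current_name
    tally = Counter(votes)
    candidate, count = tally.most_common(1)[0]
    if count >= min_votes and count / len(votes) >= min_share:
        return candidate
    return current_name
```

A museum could apply the same pattern to corrections submitted against an object record, with the tally (and the thresholds) shown publicly if the process is meant to be transparent.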


  1. Sounds pretty good, the work they're doing.

    I can't help feeling that people generally care much more about music than museum objects (and are thus more likely to want to participate in such a project). However, I think we can probably use some aspects of this kind of work.

  2. The thing is, most people aren't setting out to 'participate in a metadata cleaning project'; they're just voting on the correct artist name, or organising their iTunes, or whatever.

    Yeah, the user base is different and much bigger, but there are still lots of people organising, annotating and labelling museum objects - students, researchers, curators, collectors.

    If we could somehow capture the knowledge created when people use or interact with our objects, we'd end up with much better metadata. In reality, there are lots of issues to resolve but it's an interesting model.

  3. I agree, it could really work...