Tagging, tagging everywhere, but not a drop of connotation to be found.
Seems like everyone is tagging their photos, blogs and music with extra metadata to try to help describe the content. Just check out sites like Flickr (for photos), Technorati and Del.icio.us (for blogs and websites), and MusicStrands and Last.fm (for music) to see what I mean. This type of open tagging of data has been given the name Folksonomy (aka social classification). The idea behind tagging is to make it easier to search documents (binary or text).
Searching the old-fashioned way: The way people normally add context to their searches is to add more words. Take the word “rock”. To a geologist it means one thing, and to someone talking about music it means a whole other thing, but a document on geology can contain the word “rock” and so can a document on music.
So, if you want to search for rock in the music sense, you search for “rock music”, and for geology you would search for “rock geology”. The problem with this way of searching is that it relies either on all of the search words appearing in the document, or on an AI engine to derive the semantic meaning of the document. Neither method is 100% accurate, since only the author really knows the semantic meaning behind the words they are using, which brings us to the concept of tagging documents.
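To make that limitation concrete, here is a minimal Python sketch of all-words keyword matching. The two documents and the `matches` helper are made up for illustration; a real search engine does far more than this.

```python
# Hypothetical documents; the ids and texts are invented for this example.
docs = {
    "geology-notes": "Granite is an igneous rock formed from cooled magma.",
    "concert-review": "The band played loud, fast rock all night.",
}


def matches(query: str, text: str) -> bool:
    """Return True only if every word of the query appears in the text."""
    words = {w.strip(".,").lower() for w in text.split()}
    return all(q.lower() in words for q in query.split())


# Searching for "rock" alone cannot tell the two senses apart.
rock_hits = [doc_id for doc_id, text in docs.items() if matches("rock", text)]

# Adding "music" for context drops the concert review, because the
# author never actually wrote the word "music".
music_hits = [doc_id for doc_id, text in docs.items() if matches("rock music", text)]
```

The bare query returns both documents, and the "contextual" query returns nothing at all, even though one document is clearly about rock music.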
Because the author is the only one really capable of describing the semantic meaning of the document, folksonomy came to be, and it lets people tag their documents with keywords or phrases. The problem is that although you can now search on words or phrases that may not actually be in the document, you still have no idea what semantic meaning the word or phrase has for the author. Why? Because the tag “rock music” can be used by anyone, and everyone’s definition of “rock music” differs at least slightly. So, if I want to distinguish my definition of rock music from everyone else’s, I would have to create a tag like “Rock Music – DonXML”, which is pretty lame.
A much better, and cleaner, way of tagging would be to add the concept of a Controlled Vocabulary (CV), so the person tagging the document can use tags that they define themselves. The problem here is that each CV sits in its own world, with no relationship to other CVs. Global tags don't have that problem, since, in effect, there is really only one CV. So you would need the ability to map one CV to another, which then creates a Thesaurus. The upside is that you could “publish” one or more CVs for your site, use them on your site, and maintain total control. Think of it as an extension of the RSS and Atom concepts of Categories, except that there would be a standard method to publish and define your categories and their relationships.
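As a rough sketch of what a personal CV buys you, the Python below qualifies each tag with the vocabulary that owns it. Everything here is an assumption for illustration: the `Term` and `ControlledVocabulary` names, the "donxml" and "musicmatch" identifiers, and the definitions are all made up, and none of it comes from an existing spec.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A tag qualified by the vocabulary that defines it."""
    vocabulary: str  # hypothetical identifier of the owning CV
    name: str


@dataclass
class ControlledVocabulary:
    """A personal CV: term names mapped to the owner's own definitions."""
    identifier: str
    definitions: dict = field(default_factory=dict)

    def define(self, name: str, definition: str) -> Term:
        self.definitions[name] = definition
        return Term(self.identifier, name)


# Two owners each define "Rock" in their own vocabulary (invented definitions).
donxml = ControlledVocabulary("donxml")
musicmatch = ControlledVocabulary("musicmatch")
my_rock = donxml.define("Rock", "Guitar-driven music, as I use the word")
their_rock = musicmatch.define("Rock", "A broad catalog genre")

# The same word stays distinguishable without a hack like "Rock Music - DonXML".
document_tags = {my_rock}
```

Because the two terms compare as different, tagging a document with `my_rock` says nothing about `their_rock` until someone publishes a mapping between them.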
Some contend that adding CVs to folksonomy can’t work, since CVs are expensive to develop and maintain, but I’m not talking about creating massive CVs that everyone shares. I’d rather see every person/site create their own CVs, since each person/site can use their own definitions (just like they can do now with folksonomy tags). But personal CVs require the following additional items to work:
- A common way to publish the CVs.
- A common way to publish Thesauri (mapping personal CVs to other CVs).
Fortunately, the publishing industry (via PRISM) already has a public standard for Controlled Vocabularies. The problem is that there is no standard API built around retrieving or maintaining CVs; the spec is all about the CV document. The PRISM CV spec could also be used to map between CVs. One of the mapping problems I’ve been struggling with is what I call relativistic thesauri. Take two similar terms, each defined in a separate CV by a separate author. Only the author of a term should be able to define how the other term relates to it, relative to their own term. So, if MusicMatch has a term Rock, and I have a term Rock, I can mark the MusicMatch term Rock as a broaderTerm of my term Rock. MusicMatch may not see it the same way, and they could map my term as a relatedTerm (or even a synonym). This type of information needs to be part of a standard thesaurus publishing API.
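Here is one way the relativistic idea could be modeled, as a minimal in-memory Python sketch. It is not the PRISM format or any existing API: the `Thesaurus` class, its methods, and the vocabulary identifiers are hypothetical. Only the relation names broaderTerm and relatedTerm come from the discussion above.

```python
class Thesaurus:
    """Relations published by one owner, always relative to their own terms."""

    def __init__(self, owner: str):
        self.owner = owner
        # own term name -> list of (relation, other vocabulary, other term)
        self.relations = {}

    def relate(self, own_term: str, relation: str, other_vocab: str, other_term: str) -> None:
        """Record how a foreign term relates to one of the owner's terms."""
        self.relations.setdefault(own_term, []).append((relation, other_vocab, other_term))

    def related(self, own_term: str, relation: str) -> list:
        """Return the foreign terms that stand in `relation` to `own_term`."""
        return [(v, t) for r, v, t in self.relations.get(own_term, []) if r == relation]


# My view: MusicMatch's Rock is a broader term than my Rock.
mine = Thesaurus("donxml")
mine.relate("Rock", "broaderTerm", "musicmatch", "Rock")

# MusicMatch's view of the same pair: my Rock is merely related to theirs.
theirs = Thesaurus("musicmatch")
theirs.relate("Rock", "relatedTerm", "donxml", "Rock")
```

The two thesauri disagree about the same pair of terms, and that is fine: each statement is only authoritative for the owner who published it, so a query has to say whose view it is asking for.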
There’s a lot more stuff to go over (I didn’t even touch on how to make use of CVs when querying, which, in the end, is the whole reason to go thru this effort).
Sites like The Working Network, GotDotNet, and CodeZone (along with all the blog sites, and a lot more), could really make use of this, without having to wait for things like Web 2.0 and the Semantic Web. Without something like this, you are just creating random acts of senseless tagging.