Don Xml's Grok This

The home of Don Demsak

DonXml's All Things Techie

Random Acts of Senseless Tagging

Tagging, tagging everywhere, but not a drop of connotation to be found.

Seems like everyone is tagging their photos, blogs and music with extra meta-data to try to help describe the content. Just check out sites like Flickr (for photos), Technorati and Del.icio.us (for blogs and websites), and MusicStrands and Last.Fm (for music) to see what I mean. This type of open tagging of data has been given the name Folksonomy (aka social classification). The idea behind tagging is to make it easier to search documents (binary or text).

Searching the old-fashioned way: The way people normally add context to their searches is to add additional words. Take the word “rock”. To a geologist it means one thing, and to someone talking about music it means a whole other thing, but a document on geology can contain the word “rock” and so can a document on music.

So, if you want to search for rock in the music sense, you search for “rock music”, and for geology, you would search for “rock geology”. The problem with this way of searching is that it relies either on all the words appearing in the document, or on an AI engine deriving the semantic meaning of the document. Neither method is 100% accurate, since only the author really knows the semantic meaning behind the words they are using, which brings us to the concept of tagging documents.
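To make the keyword problem concrete, here is a minimal sketch of an "all words must appear" search. The two documents and the `keyword_search` helper are made up for illustration; real search engines are far more sophisticated, but the failure mode is the same.

```python
import re

# Two hypothetical documents that both use the word "rock".
docs = {
    "granite.txt": "Granite is an igneous rock formed from magma.",
    "zeppelin.txt": "The band's fourth album defined hard rock for a decade.",
}


def keyword_search(query, documents):
    """Return the names of documents that contain every word of the query."""
    wanted = re.findall(r"[a-z']+", query.lower())
    hits = []
    for name, text in documents.items():
        words = set(re.findall(r"[a-z']+", text.lower()))
        if all(w in words for w in wanted):
            hits.append(name)
    return hits


print(keyword_search("rock", docs))        # both documents match
print(keyword_search("rock music", docs))  # nothing matches
```

Searching for “rock” alone returns both documents, and adding “music” returns nothing, because the music document never actually uses the word “music”.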

Because the author is the only one really capable of describing the semantic meaning of the document, folksonomy came to be and lets people tag their documents with keywords or phrases. The problem is that although you can now search on words or phrases that may not actually be in the document, you still have no idea what the word or phrase means to the author. Why? Because the tag “rock music” can be used by anyone, and everyone’s definition of “rock music” differs at least slightly. So, if I want to distinguish my definition of rock music from everyone else’s, I would have to create a tag “Rock Music – DonXML”, which is pretty lame.

A much better, and cleaner, way of tagging would be to add the concept of a Controlled Vocabulary (CV), so the person tagging the document can tag their documents with their own tags. The problem here is that each CV sits in its own world, with no relationship to other CVs. That is not the case with global tags since, in effect, there is really only one CV. So you would need the ability to map one CV to another, which then creates a Thesaurus. The upside is that you could “publish” one or more CVs for your site, use them on your site, and maintain total control. Think of it as an extension of the RSS and Atom concepts of Categories, except that there would be a standard method to publish and define your categories and their relationships.
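As a rough sketch of the idea (not any real spec), a personal CV can be modeled as a set of terms that are qualified by the vocabulary that defines them. The class names and the example URIs below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A tag qualified by the vocabulary that defines it."""
    vocab: str  # URI identifying the CV (hypothetical URIs are used below)
    label: str  # the human-readable tag text


@dataclass
class ControlledVocabulary:
    uri: str
    definitions: dict = field(default_factory=dict)  # label -> definition

    def define(self, label, definition):
        """Record what this CV's owner means by a label and return the term."""
        self.definitions[label] = definition
        return Term(self.uri, label)


# Two authors define the same label independently.
mine = ControlledVocabulary("http://example.org/donxml/cv")
theirs = ControlledVocabulary("http://example.com/musicsite/cv")

my_rock = mine.define("Rock", "Guitar-driven popular music, as I use the word")
their_rock = theirs.define("Rock", "A top-level genre in a music catalog")

# Same label, different terms, so there is no need for "Rock Music - DonXML".
print(my_rock == their_rock)
print(my_rock.label == their_rock.label)
```

The two terms share a label but compare as different because each carries its own vocabulary URI. That is also why the CVs end up isolated from each other until someone maps them.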

Some contend that adding CVs to folksonomy can’t work, since CVs are expensive to develop and maintain, but I’m not talking about creating massive CVs that everyone shares. I’d rather see every person/site create their own CVs, since each person/site can use their own definitions (just like they can do now with the folksonomy tags). But personal CVs require the following additional items to work:

  • A common way to publish the CVs.
  • A common way to publish Thesauri (mapping personal CVs to other CVs)

Fortunately, the Publishing Industry (aka PRISM) already has a public standard for Controlled Vocabularies. The problem is that there is no standard API built around retrieving or maintaining CVs; the spec is all about the CV document. The PRISM CV spec could also be used to map between CVs. One of the mapping problems I’ve been struggling with is what I call relativistic thesauri. Take 2 similar terms, each defined in a separate CV by a separate author. Only the author of a term should be able to define how the other term relates to it, relative to their own term. So, if MusicMatch has a term Rock, and I have a term Rock, I can mark the MusicMatch term Rock as a broaderTerm of my term called Rock. MusicMatch may not see it the same way, and they could map my term as a relatedTerm (or even a synonym). This type of information needs to be part of a standard thesaurus publishing API.
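Here is a minimal sketch of what I mean by a relativistic thesaurus, assuming each mapping is stored as a directed statement owned by the CV that asserted it. The `Thesaurus` class, its method names and the example URIs are hypothetical. Only the relation names (broaderTerm, relatedTerm, synonym) come from the discussion above, and narrowerTerm is added as the obvious counterpart.

```python
# Relation names as used in the text; narrowerTerm is an added counterpart.
RELATIONS = {"broaderTerm", "narrowerTerm", "relatedTerm", "synonym"}


class Thesaurus:
    """Stores directed mappings, each owned by the CV that asserted it."""

    def __init__(self):
        # (owner_vocab, own_label) -> list of (relation, other_vocab, other_label)
        self._mappings = {}

    def assert_mapping(self, owner, own_label, relation, other_vocab, other_label):
        """Record how `owner` relates a foreign term to one of its own terms."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        key = (owner, own_label)
        self._mappings.setdefault(key, []).append((relation, other_vocab, other_label))

    def view_from(self, owner, own_label):
        """Return only the mappings that `owner` asserted for its own term."""
        return list(self._mappings.get((owner, own_label), []))


# Hypothetical vocabulary URIs for the two parties in the example.
DONXML = "http://example.org/donxml/cv"
MUSICMATCH = "http://example.com/musicmatch/cv"

thesaurus = Thesaurus()
# My view: the MusicMatch term Rock is a broaderTerm of my term Rock.
thesaurus.assert_mapping(DONXML, "Rock", "broaderTerm", MUSICMATCH, "Rock")
# Their view: my term Rock is merely a relatedTerm of theirs.
thesaurus.assert_mapping(MUSICMATCH, "Rock", "relatedTerm", DONXML, "Rock")

print(thesaurus.view_from(DONXML, "Rock"))
print(thesaurus.view_from(MUSICMATCH, "Rock"))
```

The point of keying every mapping by its owner is that the two views never have to agree: each side answers only for the relationships it asserted itself.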


There’s a lot more stuff to go over (I didn’t even touch on how to make use of CVs when querying, which, in the end, is the whole reason to go through this effort).
Sites like The Working Network, GotDotNet, and CodeZone (along with all the blog sites, and a lot more), could really make use of this, without having to wait for things like Web 2.0 and the Semantic Web.  Without something like this, you are just creating random acts of senseless tagging.

 

Published Monday, November 14, 2005 2:51 PM by donxml

Comments

Haacked said:

The whole reason "folksonomies" have caught on is the decentralized management. They aren't perfect, but most centralized approaches just failed.

Have you looked at how Flickr approaches this problem? They use a technology they call "clustering". They look at how clusters of tags form together. So if you search for "rock" you'll see clusters around the geographical term as well as clusters around the rock n roll term based on other tags.
November 14, 2005 3:52 PM

Don Demsak said:

I totally agree about being decentralized, but stuff like Flickr still isn't decentralized enough. All the data is kept on Flickr (same thing for Technorati). I would prefer to decentralize to the max, and expose CV services on a web site. The problem is that everyone would need to use some sort of standard, and we have seen how well that actually works out in the real world (not very well). I'm working on a service to do just that, add a CV web service to a site, so that I can combine it with blogging engines like CommunityServer or DasBlog. The problem is that a CV service is just one part of the big picture. You still need a UI to maintain the CV, integration with RSS Writers, creating a Search Web Service, integration with that, and integration with Blog Readers (like SharpReader and RssBandit). Otherwise the CV is pretty much a waste of time.
November 14, 2005 4:04 PM

Korby Parnell said:

Couldn't agree with y'all more. Folksonomies are the way to go. This is an excellent post and the comments are great!
Korby Parnell
Product Manager
Gotdotnet.com
Microsoft.com Communities & Collaborative Development
November 15, 2005 6:44 PM

Dare Obasanjo said:

>The idea behind tagging is to make it easier to search documents (binary or text).

No it isn't. Neither del.icio.us nor Flickr, which made tagging popular, are about using tags for search.
November 15, 2005 9:41 PM

Don Demsak said:

Dare, I guess I'm missing something, since both sites you mention have searching by tags right on their home pages. How else does one search for pictures (in Flickr's case)?
November 16, 2005 7:38 AM

Sean Gerety said:

What about having a CV schema that other sites could consume and you could link to in an RSS post to help other sites determine the meaning of your tags?

Sean
November 16, 2005 2:26 PM

Don Demsak said:

Sean, that would be the PRISM CV spec. Since it is partly based on RDF (in version 1.2, even more so in the working draft of 1.3), there is no XML schema to validate against. But there still is a "contract", the spec. I've got to get some examples together to help explain the ideas.
November 16, 2005 4:03 PM

Jeff Atwood said:

> I would prefer to decentralize to the max

Well, Microsoft would prefer that everyone decentralize their shared sign on to Passport, too, but that hasn't happened.

So it's a question of "what would be nice" and "what actually works in the real world".

Also, I'd like a pony.
November 16, 2005 5:47 PM

--chaz said:

-
November 16, 2005 11:40 PM

you've been HAACKED said:

re: Categories vs Tags

September 28, 2006 4:32 AM

About donxml

I’m an independent consultant based out of New Jersey, specializing in .Net solutions architecture, who also doubles as an evangelist for XML, Domain Driven Design, enterprise architecture and .Net. I do not work for Microsoft, the W3C or any other big company that you may know of (at least not yet). I’ve been an indie for over ten years, and although I’ve been tempted a couple of times to take a job with companies like Microsoft, I haven’t found something better than my current situation. I work mostly with the large pharmaceuticals that are based here in New Jersey, and usually find myself on long-term contracts. Definitely not the prototypical indie consultant, but it lets me dedicate time to my non-income generating activities like the developer community stuff, plus financing open source projects like XPathmania and MVP-XML. If you would like to talk to me about doing some contract work, just contact me via the contact page. My rates vary widely, depending on lots of different variables, but mostly distance from Jersey, and type of work. Plus, I’ve been known to donate some of my code for various projects.