Beyond Bayesian Filtering: Ranking

Spam filters can be roughly categorised as rule-based, Bayesian, or a combination of the two. As far as I’m concerned, good Bayesian filters are so effective that rule-based ones are not interesting, particularly since wrong or obsolete rules can trigger false positives (a good message being classified as spam, which IMHO is worse than seeing the occasional spam message).

Basically, a Bayesian spam filter analyses a message using word (or word chain) occurrence and calculates the probability that it’s spam. Then, using thresholds, it decides whether the message is to be regarded as spam or not. If it is, it ends up in your spam folder. There is generally a feedback mechanism to enable learning, so you can toss things into the spam folder or grab stuff out, and that will adjust the corresponding probabilities. So in a nutshell, a Bayesian spam filtering system is probabilistic, dynamic, and self-learning. See also the Wikipedia article providing background info on Bayesian spam filtering.
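To make that concrete, here is a minimal sketch of the idea in Python. It is not how bogofilter, dspam, or any real filter is implemented: the class name `TinyBayes`, the tokeniser, the add-one smoothing, and the 0.5 threshold are all assumptions chosen for illustration. It takes a chunk of text and returns a probability, and `train` doubles as the feedback mechanism.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy two-class naive Bayes scorer over word occurrence (illustrative only)."""

    def __init__(self):
        # Number of messages per class that contained each word.
        self.words = {True: Counter(), False: Counter()}
        # Number of messages seen per class (True = spam, False = not spam).
        self.docs = {True: 0, False: 0}

    @staticmethod
    def tokens(text):
        # Hypothetical tokeniser: lowercase words, each counted once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, is_spam):
        """Feedback step: record a message the user filed as spam or not spam."""
        self.docs[is_spam] += 1
        self.words[is_spam].update(self.tokens(text))

    def probability(self, text):
        """Return the estimated probability (0..1) that the text is spam."""
        # Start from the smoothed prior odds, then add evidence per word.
        log_odds = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for word in self.tokens(text):
            p_spam = (self.words[True][word] + 1) / (self.docs[True] + 2)
            p_ham = (self.words[False][word] + 1) / (self.docs[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp to avoid overflow in exp() for very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))


spam_filter = TinyBayes()
spam_filter.train("cheap pills buy now", True)
spam_filter.train("meeting agenda for monday", False)

score = spam_filter.probability("buy cheap pills")
is_spam = score > 0.5  # the threshold is an arbitrary assumption
```

The threshold is applied only on the last line; everything before it works with the raw probability, which is the value a ranking system would want to keep.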

As mentioned, it does its job extremely well and most email apps use the system in some form, either implemented internally or by using an external tool through an API. Lovely.

What annoys me greatly is that the libraries and APIs appear to focus purely on (email) spam detection. In a way that’s lovely, as a lot of work has been done on decoding MIME and HTML, and that helps. On the downside, when you feed in a message all you get out is a yes/no (spam or not), although some have a “maybe/unsure” return value also. It does the specific job, but I actually want to know the original probability! Some do this in some form, but again presume that they’re dealing with emails, so they stick it in a header line. I just want to put in a chunk of text, and get out a probability. That’s all. I want less functionality, part of what’s getting done – not more… funny, huh?

I am now thinking I might need to hack/extend/fork an existing library like bogofilter or dspam, but if you know of a library/API that delivers what I want, please do let me know! Given the functionality is essentially in there and the extensive work already done, it makes no sense to write a completely new implementation.

Filtering makes sense for email spam: you periodically check your spam folder to aid the learning process, and that’s it. But ranking would enable us to deal with a different problem: information overload.

First, let’s define this problem: we connect to mailing lists, news groups, forums, RSS feeds, Facebook and Twitter feeds, and so on. We want info on a specific topic, but in the end we get too much info. Actually, each of those is somewhat selective, since you pick a specific source: with mailing lists, forums and news groups you are interested in the basic topic; with an RSS feed you’re interested in whoever writes the blog or delivers the web site content; and Facebook and Twitter friends may have similar interests to you, so chances are that what they notice, you might want to know about also.

It’s not entirely clear which bits of info in those feeds are of interest to you, but we already know that the total is too much to keep up with, so reading only part is the only way to go. With that in mind, I reckon it’d be great to rank information so that I see the more important stuff first, and then depending on time I can read more (towards the less interesting). Again, I can’t possibly read everything, so any help in selecting is good, right?
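As a rough sketch of what ranking (rather than filtering) could look like in Python: score every item, sort with the most interesting first, and read as far down the list as time allows. The function names and the keyword-based `toy_score` are hypothetical stand-ins; any function that returns a probability for a chunk of text, such as a trained Bayesian scorer, could take its place.

```python
from typing import Callable, Iterable, List, Tuple


def rank_items(items: Iterable[str],
               score: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Return (score, item) pairs, most interesting first."""
    scored = [(score(item), item) for item in items]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def reading_list(items: Iterable[str],
                 score: Callable[[str], float],
                 time_budget: int) -> List[str]:
    """Keep only as many top-ranked items as the reader has time for."""
    return [item for _, item in rank_items(items, score)[:time_budget]]


# Stand-in scorer (an assumption for this example, not a real interest model).
INTERESTING = {"bayesian": 0.9, "ranking": 0.8, "feeds": 0.7}


def toy_score(text: str) -> float:
    words = text.lower().split()
    return max((INTERESTING.get(word, 0.1) for word in words), default=0.1)


feed = [
    "Lunch photos",
    "New Bayesian library released",
    "Sorting RSS feeds by interest",
]
top_two = reading_list(feed, toy_score, time_budget=2)
```

Nothing is discarded here: the less interesting items are still in the ranked list, just further down, which is the difference from a yes/no filter.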

My current thoughts on what to do with this (and I’ve already been walking around with the idea for way too long):

  • Find library/API that does exactly what I want, or extend/adapt one – submit back the changes, of course. Open Source benefits.
  • Adapt feed readers to call the API and somehow apply or display the ranking, depending on their user interface. This could show as sort order, colour coding, etc, depending on context.
  • Adapt feed readers to use a feedback mechanism, enabling the backend system to learn my preferences.
  • Using known connections (Twitter/Facebook friends), adjust an otherwise possibly neutral probability to enhance/accelerate the learning process; implement these external interactions in such a way that individual privacy is not compromised.
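One possible way to do the last point, sketched in Python under several assumptions: the content-based probability is nudged in log-odds space by an anonymous count of friends who shared or liked the item. The function name, the `weight` value, and the choice of a plain count are all hypothetical; passing only a count, never identities, is one simple way to keep individual friends out of the ranking backend.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def adjust_with_friends(content_prob: float,
                        friend_count: int,
                        weight: float = 0.5) -> float:
    """Shift a content-based probability using an anonymous friend count.

    content_prob: probability from the text alone (0.5 means neutral).
    friend_count: how many known connections shared the item (no names passed).
    weight: log-odds added per friend; 0.5 is an arbitrary assumption.
    """
    # Keep the probability away from exactly 0 or 1 so logit() stays finite.
    p = min(max(content_prob, 1e-6), 1 - 1e-6)
    return sigmoid(logit(p) + weight * friend_count)


neutral = adjust_with_friends(0.5, friend_count=0)
boosted = adjust_with_friends(0.5, friend_count=3)
```

With no friend activity a neutral item stays neutral; each additional friend pushes it higher, so new items get a useful ranking before the learner has seen enough personal feedback.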

Quite possibly there are viable Upstarta-style business opportunities. We’ll see. Key factors are always incentive and opportunity, with the latter also dependent on capability. I have an incentive, as the idea comes from me thinking about how to resolve my own info overload “problem”.

Comments/suggestions welcome!

Pot Luck at Lentz on most Fridays – starting a new tradition?

I moved to my current house months ago but hadn’t yet done a house warming… instead of just holding a belated party, I started something different: (almost) every Friday is Pot Luck at Lentz. All the info is at that wiki page.

It’ll build over time as people get used to it happening. With the iCal feed it’s easy to track and see whether it’s on (some Fridays I might have something else to do), and one of the great things is that even I won’t know who might show up. I like that kind of surprise and it creates a nice mix of different people.

For now I’ve mainly invited some locals, but I have friends around Australia and elsewhere in the world – many will sometimes find themselves in Brisbane and of course it’d be fine for them to just drop in!

The one problem I haven’t yet solved is making sure that friends have my address details up-to-date. My electronic, phone and postal details are location-independent, but of course my street address is not… How can I convey that to a relatively large number of people safely? Yes, a mailing list of sorts could work, but it needs to be kept up to date, as new people need to be added and some do change their email. Ideas welcome – in the meantime, if you do need my address to have on record, just in case you might find yourself in Bris on a Fri in the coming months, just drop me a line. Then, no need to RSVP for the night, just check the iCal beforehand.