Spam filters can be roughly categorised as rule-based, Bayesian, or a combination of the two. As far as I’m concerned, good Bayesian filters are so effective that rule-based ones are not interesting, in particular since wrong or obsolete rules can trigger false positives (a good message being classified as spam, which IMHO is worse than seeing the occasional spam message).
Basically, a Bayesian spam filter analyses a message using word (or word chain) occurrence and calculates the probability that it’s spam. Then, using thresholds, it decides whether the message is to be regarded as spam or not. If it is, it ends up in your spam folder. There is generally a feedback mechanism to enable learning, so you can toss things into the spam folder or grab stuff out, and that will adjust the corresponding probabilities. So in a nutshell, a Bayesian spam filtering system is probabilistic, dynamic, and self-learning. See also the Wikipedia article providing background info on Bayesian spam filtering.
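To make that concrete, here is a minimal sketch of the idea in Python. It is a toy naive Bayes scorer, not how bogofilter, dspam or any other real tool does it: the class name `BayesScorer`, the add-one smoothing and the 0.9/0.1 thresholds are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter


class BayesScorer:
    """Toy two-class naive Bayes scorer based on word occurrence (illustrative only)."""

    def __init__(self):
        # Number of training messages per class that contained each word.
        self.words = {"spam": Counter(), "ham": Counter()}
        # Number of training messages seen per class.
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Crude tokeniser: lowercase words, each counted once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        """Feedback step: label is 'spam' or 'ham'."""
        self.words[label].update(self.tokens(text))
        self.docs[label] += 1

    def probability(self, text):
        """Return the estimated probability (0..1) that text is spam."""
        # Start from the prior odds, then add evidence per word, in log space.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in self.tokens(text):
            # Add-one smoothing (an assumption) so unseen words never give zero.
            p_spam = (self.words["spam"][word] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp to keep exp() well within floating point range.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text, spam_at=0.9, ham_at=0.1):
        """Threshold step: hypothetical cut-offs with an 'unsure' band in between."""
        p = self.probability(text)
        if p >= spam_at:
            return "spam"
        if p <= ham_at:
            return "ham"
        return "unsure"
```

Calling `train()` is the "toss it into the spam folder or grab it out" feedback step, `probability()` is the raw number, and `classify()` is the yes/no/unsure decision made on top of it.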
As mentioned, it does its job extremely well and most email apps use the system in some form, either implemented internally or by using an external tool through an API. Lovely.
What annoys me greatly is that the libraries and APIs appear to focus purely on (email) spam detection. In a way that’s lovely, as a lot of work has been done on decoding MIME and HTML, and that helps. On the downside, when you feed in a message all you get out is a yes/no (spam or not), although some also have a “maybe/unsure” return value. It does the specific job, but I actually want to know the original probability! Some do expose this in some form, but again presume that they’re dealing with emails, so they stick it in a header line. I just want to put in a chunk of text, and get out a probability. That’s all. I want less functionality, just part of what’s already getting done, not more... funny, huh?
I am now thinking I might need to hack/extend/fork an existing library like bogofilter or dspam, but if you know of a library/API that delivers what I want, please do let me know! Given the functionality is essentially in there and the extensive work already done, it makes no sense to write a completely new implementation.
Filtering makes sense for email spam: you periodically check your spam folder to aid the learning process, and that’s it. But ranking by probability, rather than a yes/no filter, would enable us to deal with a different problem: information overload.
First, let’s define this problem: we connect to mailing lists, news groups, forums, RSS feeds, Facebook and Twitter feeds, and so on. We want info on a specific topic, but in the end we get too much info. Actually, each of those is already somewhat selective, because you pick a specific source. With mailing lists, forums and news groups you are interested in the basic topic. With an RSS feed you’re interested in whoever writes the blog or delivers the web site content. And Facebook and Twitter friends may have similar interests to you, so chances are that what they notice, you might want to know about also.
It’s not entirely clear which bits of info in those feeds are of interest to you and which are not, but we already know that the total is too much to keep up with, so reading only part is the only way to go. With that in mind, I reckon it’d be great to rank information so that I see the more important stuff first, and then depending on time I can read more (towards the less interesting). Again, I can’t possibly read everything, so any help in selecting is good, right?
My current thoughts on what to do with this (and I’ve already been walking around with the idea for way too long):
- Find a library/API that does exactly what I want, or extend/adapt one and submit the changes back, of course. Open Source benefits.
- Adapt feed readers to call the API and somehow apply or display the ranking, depending on their user interface. This could show as sort order, colour coding, etc, depending on context.
- Adapt feed readers to use a feedback mechanism, enabling the backend system to learn my preferences.
- Using known connections (Twitter/Facebook friends), adjust an otherwise possibly neutral probability to enhance/accelerate the learning process; implement these external interactions in such a way that individual privacy is not compromised.
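The ranking and "nudge from friends" parts of that list could look roughly like the sketch below. It assumes some scorer has already produced a probability of interest per item; the field names `p_interest` and `friend_shares`, the functions `adjust` and `rank`, and the `weight` of 0.5 are all hypothetical, just to show the shape of it.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def adjust(p_text, friend_shares, weight=0.5):
    """Nudge a text-based probability upwards per friend who shared the item.

    Working in log-odds keeps the result between 0 and 1, and a neutral
    0.5 moves further than a score that is already near certain.
    The weight is an arbitrary assumption, not a tuned value.
    """
    p = min(max(p_text, 1e-6), 1.0 - 1e-6)  # avoid log(0)
    return sigmoid(logit(p) + weight * friend_shares)


def rank(items, weight=0.5):
    """Return items sorted from most to least interesting."""
    return sorted(
        items,
        key=lambda item: adjust(item["p_interest"], item["friend_shares"], weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up feed items, purely for demonstration.
    feed = [
        {"title": "Release notes", "p_interest": 0.5, "friend_shares": 0},
        {"title": "Conference talk", "p_interest": 0.5, "friend_shares": 2},
        {"title": "Security advisory", "p_interest": 0.9, "friend_shares": 0},
    ]
    for item in rank(feed):
        print(item["title"])
```

A feed reader could use the resulting order directly as the sort order, or map the adjusted probability onto colour coding. Only a count of shares is used here, not who shared what, which is one simple way to keep the privacy concern manageable.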
Quite possibly there are viable Upstarta style business opportunities. We’ll see. Key factors are always incentive and opportunity, with the latter also dependent on capability. I have an incentive, as the idea comes from me thinking about how to resolve my own info overload “problem”.
Comments/suggestions welcome!
Update: an RSS feed reader that does this is http://www.newsblur.com/
It’s open and has an API... Good start! Perhaps the devs can make the system generic so it can also be used for other feed source types.