http://confusedofcalcutta.com/2011/11/27/opensource-edible-landscapes-the-todmorden-story/
If you’re particularly into bad news, there are many places that will indulge your particular interest today. This is not one of them. Here, I want to spend a little time on things that give me hope for humanity, things that have an uplifting effect on me; things that remind me that I have much to be thankful for, things that make my heart sing with joy. Like Todmorden. Todmorden is an old Domesday-Book-mentioned market town that is in both Lancashire and Yorkshire (depending on which side of the Calder you’re standing), with about 15,000 people and almost as many ways to pronounce its name (though the locals apparently just call it Tod). I’ve never been there. But I will. Soon. This post will tell you why. Sometime in 2009, I’d seen coverage of something happening in Todmorden that intrigued me. Locals there had apparently agreed to work together to try and become self-sufficient from the perspective of food. Their initial focus was on fruit and […]
Category Archives: Upstarta
Beyond Bayesian Filtering: Ranking
Spam filters can be roughly categorised as rule-based, Bayesian, or a combination thereof. As far as I’m concerned, good Bayesian filters are so effective that rule-based ones are not interesting, in particular since wrong or obsolete rules can trigger false positives (a good message being classified as spam, which IMHO is worse than seeing the occasional spam message).
Basically, a Bayesian spam filter analyses a message using word (or word chain) occurrence and calculates the probability that it’s spam. Then, using thresholds, it decides whether the message is to be regarded as spam or not. If it is, it ends up in your spam folder. There is generally a feedback mechanism to enable learning, so you can toss things into the spam folder or grab stuff out, and that will adjust the corresponding probabilities. So in a nutshell, a Bayesian spam filtering system is probabilistic, dynamic, and self-learning. See also the Wikipedia article providing background info on Bayesian spam filtering.
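To make the mechanics concrete, here is a minimal sketch of the idea in Python. It is not how bogofilter, dspam or any other real filter is implemented; the class name `TextScorer`, the labels "spam" and "ham", and the tokenising rule are all assumptions made for illustration. It is a simplified naive Bayes that only looks at which words are present in a message, with add-one smoothing so that unseen words stay neutral.

```python
import math
import re
from collections import Counter


def sigmoid(x):
    """Numerically stable logistic function: maps log-odds to a probability."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class TextScorer:
    """Hypothetical, simplified naive Bayes scorer (illustration only)."""

    def __init__(self):
        # Per label: in how many learned messages did each word appear?
        self.word_docs = {"spam": Counter(), "ham": Counter()}
        # Per label: how many messages have been learned?
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Assumed tokeniser: lowercase runs of letters, digits and apostrophes.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        """Feedback mechanism: file a message under 'spam' or 'ham'."""
        self.word_docs[label].update(self.tokens(text))
        self.docs[label] += 1

    def probability(self, text):
        """Return the estimated probability that the text is spam."""
        # Start from the (smoothed) prior odds of spam versus ham.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in self.tokens(text):
            # Smoothed chance of seeing this word in a spam / ham message.
            p_spam = (self.word_docs["spam"][word] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.word_docs["ham"][word] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return sigmoid(log_odds)


scorer = TextScorer()
scorer.learn("cheap pills buy now", "spam")
scorer.learn("buy cheap watches now", "spam")
scorer.learn("meeting notes for monday", "ham")
scorer.learn("lunch on monday?", "ham")
print(scorer.probability("buy cheap pills"))
print(scorer.probability("monday meeting"))
```

Every call to `learn` shifts the word counts, which is the "dynamic, self-learning" part: moving a message between folders would simply be another `learn` call with the corrected label.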
As mentioned, it does its job extremely well and most email apps use the system in some form, either implemented internally or by using an external tool through an API. Lovely.
What annoys me greatly is that the libraries and APIs appear to focus purely on (email) spam detection. In a way that’s lovely, as a lot of work has been done on decoding MIME and HTML, and that helps. On the downside, when you feed in a message all you get out is a yes/no (spam or not), although some also have a “maybe/unsure” return value. It does the specific job, but I actually want to know the original probability! Some expose it in some form, but again they presume they’re dealing with emails, so they stick it in a header line. I just want to put in a chunk of text and get out a probability. That’s all. I want less functionality, a part of what’s already being done, not more… funny, huh?
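The yes/no/unsure answer is just a thin layer on top of the probability, which is why throwing the number away is frustrating. A rough sketch of that layer follows; the function name `verdict` and both cutoff values are made up for illustration and are not taken from any particular library.

```python
def verdict(p_spam, ham_cutoff=0.2, spam_cutoff=0.9):
    """Collapse a spam probability into the usual three-way answer.

    The cutoffs are hypothetical defaults chosen for this example.
    """
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be a probability between 0 and 1")
    if p_spam >= spam_cutoff:
        return "spam"
    if p_spam <= ham_cutoff:
        return "ham"
    return "unsure"


# Two very different probabilities collapse into the same answer,
# so the ordering between them is lost to the caller.
for p in (0.05, 0.21, 0.89, 0.95):
    print(p, verdict(p))
```

With only the verdict, 0.21 and 0.89 are indistinguishable. With the raw probability, the caller can pick its own cutoffs or, as discussed further on, sort by it.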
I am now thinking I might need to hack/extend/fork an existing library like bogofilter or dspam, but if you know of a library/API that delivers what I want, please do let me know! Given the functionality is essentially in there and the extensive work already done, it makes no sense to write a completely new implementation.
Filtering makes sense for email spam: you periodically check your spam folder to aid the learning process, and that’s it. But ranking would enable us to deal with a different problem: information overload.
First, let’s define this problem: we connect to mailing lists, news groups, forums, RSS feeds, Facebook and Twitter feeds, and so on. We want info on a specific topic, but in the end we get too much info. Actually, each of those is already somewhat selective, because you pick a specific source. With mailing lists, forums and news groups you are interested in the basic topic; with an RSS feed you’re interested in whoever writes the blog or delivers the web site content; and Facebook and Twitter friends may have similar interests to you, so chances are that what they notice, you might want to know about also.
It’s not entirely clear which bits of info in those feeds are of interest to you, but we already know that the total is too much to keep up with, so reading only part is the only way to go. With that in mind, I reckon it’d be great to rank information so that I see the more important stuff first, and then depending on time I can read more (towards the less interesting). Again, I can’t possibly read everything, so any help in selecting is good, right?
My current thoughts on what to do with this (and I’ve already been walking around with the idea for way too long):
- Find library/API that does exactly what I want, or extend/adapt one – submit back the changes, of course. Open Source benefits.
- Adapt feed readers to call the API and somehow apply or display the ranking, depending on their user interface. This could show as sort order, colour coding, etc, depending on context.
- Adapt feed readers to use feedback mechanism, enabling the backend system to learn my preferences.
- Using known connections (Twitter/Facebook friends), adjust an otherwise possibly neutral probability to enhance/accelerate the learning process; implement these external interactions in such a way that individual privacy is not compromised.
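The ranking and "known connections" steps in the list above could look roughly like this. It is a sketch under assumptions, not a design: `Item`, `rank`, the `shared_by_friend` flag and the `friend_boost` value are all hypothetical, and `score` stands for whatever text-to-probability function ends up being available (here a probability of interest rather than of spam).

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    text: str
    shared_by_friend: bool = False  # assumed signal from Twitter/Facebook contacts


def logit(p):
    """Probability to log-odds, clamped so 0 and 1 do not blow up."""
    p = min(max(p, 1e-9), 1 - 1e-9)
    return math.log(p / (1 - p))


def sigmoid(x):
    """Log-odds back to a probability (numerically stable form)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank(items, score, friend_boost=0.5):
    """Return (interest, item) pairs, most interesting first.

    score: callable taking text and returning a probability of interest.
    friend_boost: hypothetical nudge, in log-odds, for items a friend shared.
    """
    def interest(item):
        x = logit(score(item.text))
        if item.shared_by_friend:
            x += friend_boost  # shifts a neutral 0.5 upwards, never overrides
        return sigmoid(x)

    scored = [(interest(item), item) for item in items]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(text):
    # Stand-in for a trained scorer: likes anything mentioning "mysql".
    return 0.9 if "mysql" in text.lower() else 0.5


feed = [
    Item("Weather", "rain expected tomorrow"),
    Item("Cats", "ten funny cat pictures", shared_by_friend=True),
    Item("Replication", "MySQL replication tips"),
]
for p, item in rank(feed, toy_score):
    print(f"{p:.2f}  {item.title}")
```

Applying the friend signal as a small shift in log-odds means it mostly matters while the scorer is still neutral about an item, and fades in importance once my own feedback has produced a strong opinion. Only a boolean flag is passed in, which is one way to keep who-shared-what out of the scoring backend.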
Quite possibly there are viable Upstarta-style business opportunities. We’ll see. Key factors are always incentive and opportunity, with the latter also dependent on capability. I have an incentive, as the idea comes from me thinking about how to resolve my own info overload “problem”.
Comments/suggestions welcome!
Other people’s mail is costly to me
Before I lived at my current address, 4 students shared this location… and from what I can tell, before them some others. A lot of post still arrives in the mailbox for all these people, even after a few years and me doing a lot of return-to-sender efforts. The same companies just keep sending stuff anyway, not just once but ongoing. Also some post has no decent return address. So what do I do?
The range includes super funds and insurance companies; local, state and federal government; credit card companies and banks; universities. I would be an identity theft goldmine, so what do I do? I apparently can’t make it stop. So I try to shred. After all, leaving someone’s Medicare card lying around in the garbage is not nice, is it? Not that I caused it, but it feels wrong anyway.
Because I have a home office, I have a medium load shredder, and thank goodness for that because my word what businesses send out…. “highlight” today was a cosmetics company that apparently felt the need to put some sachet with some cosmetic cream in the envelope also. Aargh.
Ponderings
- does return-to-sender have any effect on company-client communications? I’m not talking addressed spam, but things like banks with their clients, etc. If so, how many RTS does it take to make it stick? If RTS doesn’t work, how do you make ’em stop?
- Companies sending me addressed unsolicited mail… I need to dispose of these items through shredding. The disposal process as a whole takes considerable time, as will asking them to stop mailing me (which apparently is not effective). Can I bill companies for this? Could I sue a company for aiding identity theft?
Update… someone has informed me of details from the Australian Commonwealth Postal Services Act of 1975. Essentially I can neither retain nor destroy the mail, on penalty of up to 2 years of imprisonment. So, no shredding then. The ponderings still apply (and it makes addressed unsolicited mail and unresponsive companies even more costly for an individual!).
And I suppose I’ll just have to hand in un-returnable post to the local post office or mail distribution centre… I can’t keep or destroy it, in those cases I am unable to address it back… so if the post gets me stuck in that way, I’ll have to hand back the responsibility to them. Best I can do?
Advertising – stupid sales
Below is an email exchange between someone trying to sell me a service (online advertising) and myself. It started with an unsolicited email (aka spam) but sometimes I’m just intrigued to see whether I can get some sense out of people. In this case, not.
Original mail:
I work for ***, a leading broker of online advertising, dealing with thousands of independent webmasters like yourself, worldwide.
I’ve had a look at your site and think Openquery.com would be a good match for our client, whose target demographic is similar to your own. We’re working on their behalf to acquire advertising from sites such as yours.
We would be interested in purchasing advertising in the form of a text-based advert on your site. We pay you a fixed annual fee for our advertisements.
My initial reply:
Can you please describe that target demographic to me?
thanks
Their reply:
Great hearing from you today. Thank you for your response.
I have two pricing options for you, depending on the type of client you would be interested in working with.
Option 1: We can offer you 250USD for a client in the gaming industry; or
Option 2: We can offer you 200USD for a client in industries including mobile phones, travel or insurance.
Let us know your preferred industry. Next, we’ll complete a quick assessment of your site and then advise you of the best client fit. In the meantime we can answer any questions you might have.
My reply:
I already asked you a question in my previous reply, and your response did not address it. In the above you are trying to sell me something, and we’re by no means at that stage. To refresh your memory, you wrote:
>> I’ve had a look at your site and think Openquery.com would be a good
>> match for our client, whose target demographic is similar to your own.
And I asked:
> Can you please describe that target demographic to me?
Feel free to answer the question; it was your own statement I am referring to, and I do hope you are not making unfounded statements.
thanks
Their response:
I apologise for that. The target demographic for our client’s advert would typically be anyone who has access to the internet, and it would also depend on the type of client you would choose (Option 1 or 2).
Let me know what you think.
My final reply:
I think that’s pathetic and useless.
“anyone with access to the Internet” has absolutely no reason to visit our site, as it’s highly specialised. Conclusion: you did not research your prospective client (me and my site) at all.
Go away.
The domain scam – repeat performances
This time it’s .CO. That’s actually Colombia’s domain, but some domain registrar smartypants have decided that it’s the new “truly global” domain for companies. This, like .biz, .mobi and all the others, is utter bollocks, and the main people profiting are the domain registrars, although in this case I hope that Colombia at least will get something out of it.
Why is it a scam? Let’s analyse.
- If you already have a .com etc you’d rather not have someone else own the .co, so there’s incentive for you to buy it. This incentive is extra strong if you own the trademark, because you are obliged to defend the trademark or lose it. So if someone else gets the domain you’d need to fight them – so it’s simpler to just register the domain yourself. Gain? Nothing. Just “no potential loss”, at a cost.
- If you can’t get the .com or national domain of a name you might want to get the .co. That is, if the name isn’t trademarked. But does it make sense? Will people be able to find you rather than whoever else owns domains using that name? Unless the others are just domain hogs or dormant sites, you choose to fight for google juice.
- If you grab the .co of a name that is trademarked, you’re just opening a can of worms and in the best case still fighting for google juice. Why bother?
- You’re a Colombian company mainly focused on the national market. Well, now you’re messed, because all these “globals” are eating up your namespace, and supply and demand predicts that the price you pay for your domain names will go up. Sure, some people will make a buck *selling* their existing .co, but in pure numbers most registrations will be new ones, so the money will flow out of the country rather than in.
It’s a bit late to stop this, but it’s so very very bad.