
2006 Spam Statistics

For 2006 I kept very detailed statistics on spam. Below is a summary of some quick analysis that I did. I have a bunch of other data, so if there is something else that you are interested in, let me know and I'll see if I can extract that information.

First, the high-level numbers: 95,706 total emails for 2006, of which 74,458 were spam, or roughly 78%. I should consider myself "lucky" since the average usually reported is around 90-95%. This translates to about 204 spam messages a day.
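For anyone who wants to check the arithmetic, here is a minimal sketch in Python. Only the two totals come from my statistics; the variable names are made up for illustration, and the per-day figure assumes a 365-day year.

```python
# Totals quoted above for 2006.
total_messages = 95_706
spam_messages = 74_458
days_in_year = 365  # assumption: 2006 was not a leap year

# Share of all mail that was spam, and the average spam count per day.
spam_share = spam_messages / total_messages
spam_per_day = spam_messages / days_in_year

# Whatever is not spam is counted as legitimate mail.
legit_messages = total_messages - spam_messages

print(f"spam share: {spam_share:.1%}")
print(f"spam per day: {spam_per_day:.0f}")
print(f"legit messages: {legit_messages}")
```

The same two totals also give the legitimate message count, which is handy when comparing the two lines in the charts.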

I whipped up some GD images to show these spam numbers in action. The spike in March is due to a high-volume mailing list that I temporarily signed up for.

Prior to the middle of August of this year my spam filtering setup was pretty bad. As a result, up to 60 spam messages a day were getting through and had to be manually flagged. After updating my spam filters, I'm now only manually flagging 5 messages a day. I don't keep statistics on the number of false positives since I don't want to, or care to, look at 200-plus messages a day.

This is a day-by-day breakdown of spam percentage and message counts. Notice the general trend: while the percentage is constant, the total number of spam messages per day is increasing.

Tags: spam


I find it very, very interesting that in your first image, the general shapes of the spam and legit lines are identical. Peaks in the same months, declines in the same months. Why? Does it imply massive misclassification of one or the other? Is it all artifacts similar to your March mailing list? If so, why does the spam volume of a list tend to correspond with its legit volume?
I suspect that being on the mailing list caused my email address to end up on more computers that were infected with some kind of virus that sends email. As a result, joining and then leaving the mailing list would account for the spike and fall during March. Looking over the messages that were flagged during that month, I don't see anything strange otherwise.
Yeah, you explained that month fine. It's all the rest of it I'm curious about. In almost every month the spam and legits move in the same direction; there's a significant degree of correlation.
I beg to differ about the other months. The blue legit line is almost flat the rest of the year while the green spam line is on an upward trend. The legit line also moves down in Sep, Nov, and Dec while the spam line is moving upwards. In this case I'm not seeing the correlation you are.
OK. I think the line shapes look very, very similar, but I'm not gonna actually crank them through Matlab or anything.