CollegeSpamChecker (2018-2019)

This project has a visual representation available on my Portfolio.

https://portfolio.owenthe.dev/college-spam-checker


CollegeSpamChecker is a simple but pretty cool project that I made to fiddle around with imaplib and, later, the threading libraries in Python. It looks through a folder on an email server, collects all the mail in it, and displays a table showing who’s sending you the most email.
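
To give a rough idea of what that looks like, here is a minimal sketch (not CSC's actual code) that tallies senders from the From header of each message. The function names, the read-only select, and the header-only fetch are all assumptions made for illustration; the IMAP part needs a real server and credentials, so only the counting half can be tried offline.

```python
import email
import imaplib
from collections import Counter
from email.utils import parseaddr


def sender_of(raw_headers: bytes) -> str:
    """Return the lowercased address found in a message's From header."""
    msg = email.message_from_bytes(raw_headers)
    _name, address = parseaddr(str(msg.get("From", "")))
    return address.lower()


def count_senders(raw_header_blocks) -> Counter:
    """Tally how many messages each sender address accounts for."""
    return Counter(sender_of(block) for block in raw_header_blocks)


def fetch_from_headers(host, user, password, folder):
    """Yield the raw From header of every message in `folder`.

    Hypothetical helper: requires a reachable IMAP server. A folder name
    containing spaces has to be passed already quoted, e.g. '"college spam"'.
    """
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select(folder, readonly=True)
        _status, data = conn.search(None, "ALL")
        for num in data[0].split():
            # BODY.PEEK avoids marking the message as read.
            _status, parts = conn.fetch(num, "(BODY.PEEK[HEADER.FIELDS (FROM)])")
            yield parts[0][1]
```

Calling count_senders(...).most_common() on the fetched headers gives the "who sends the most" ordering that the table is built from.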

 

CollegeSpamChecker (CSC for the rest of this article) started development in 2018 when, after taking some standardized tests, I began to get bombarded with college spam email. I have a folder called college spam on my personal email server, and any time I get college spam, it gets thrown into this folder. At first, CSC was great: it was able to process a few hundred emails relatively quickly, and I had no complaints. Aside from some formatting errors, CSC worked great…or so I thought.

 

Of course, as time went on, my college spam folder grew and grew. Instead of taking a minute, CSC took about 5-6 minutes to do a spam analysis. This is when I thought, “it’s time for some multi-threaded goodness!” Of course, multi-threaded programs are confusing for newcomers to Python. There’s threading and multiprocessing (and concurrent.futures, which is what CSC uses), which look like they do the same thing but are vastly different. Then you have to figure out which one your use case is better off with. It’s a nightmare.

 

For CSC’s multi-threaded goodness, I used concurrent. The basic gist is that once the list of emails is fetched, concurrent sets up a given number of threads, and each thread handles a certain section of the emails.

For instance, if I had 5 threads and 800 email references in an array (just the references, not the actual contents, which still need to be fetched):

Thread 0 would handle emails from index 0 to index 159

Thread 1 would handle emails from index 160 to index 319

And so on and so forth.
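
The split described above can be sketched roughly like this. It is not the Stack Overflow snippet CSC actually uses; chunk_ranges, process_chunk, and process_all are hypothetical names, and process_chunk is a stand-in for the real per-thread fetching.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(total, workers):
    """Split range(total) into `workers` contiguous (start, stop) slices.

    `stop` is exclusive, so (0, 160) covers index 0 through index 159.
    """
    size, extra = divmod(total, workers)
    ranges, start = [], 0
    for i in range(workers):
        # The first `extra` workers take one more item when the split is uneven.
        stop = start + size + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def process_chunk(refs, start, stop):
    """Placeholder for the per-thread work; here it just returns its slice."""
    return refs[start:stop]


def process_all(refs, workers=5):
    """Hand each thread one slice and collect the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(process_chunk, refs, start, stop)
            for start, stop in chunk_ranges(len(refs), workers)
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results
```

With 800 references and 5 workers, chunk_ranges returns five slices of 160 each, which matches the thread 0 and thread 1 ranges listed above.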

I also used concurrent because using anything else resulted in SSL errors from imaplib. I should also mention that the entire multi-threading part is basically a straight copy & paste from Stack Overflow.

 

After a few hours of coding, I did get multi-threaded functionality working for CSC, and oh boy is it fast! 30 seconds to process 3,000 emails with 20 threads. I did also encounter some lovely errors from mail servers complaining about too many concurrent connections, so I built in handling to catch these errors. I also baked in an option to set the number of threads to run the analysis with.
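
One way to deal with a server that refuses connections is to retry with a growing delay. The sketch below is an assumption about how that could look, not how CSC does it: connect_with_retry is a hypothetical helper, and exactly which exception a server raises when it hits its connection limit varies, so this catches both imaplib's own error class and OSError (which covers socket and SSL failures).

```python
import imaplib
import time


def connect_with_retry(connect, attempts=3, delay=1.0):
    """Call `connect()` and retry on failure, waiting longer each time.

    `connect` is any zero-argument callable that opens and returns a
    connection. The last failure is re-raised once attempts run out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except (imaplib.IMAP4.error, OSError):
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)
```

Each worker thread could wrap its own login in this, for example connect_with_retry(lambda: imaplib.IMAP4_SSL(host)), where host is whatever server name is in use. Lowering the thread count is the other obvious fix when a server keeps complaining.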

Lastly, because Gmail (and Yahoo Mail by extension) think that authenticating via username and password is insecure (okay very cool!), you have to allow insecure apps for CSC to work. To help with the situation, CSC can automatically fill out server names for Gmail, Yahoo Mail, and Outlook.com, and gives users links to turn the allow insecure apps setting on (and then off again).
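
Filling in the server name can be as simple as a lookup on the domain of the email address. This is only a sketch under assumptions: the table and guess_imap_host are hypothetical, and the hostnames are the commonly documented IMAP endpoints for each provider as far as I know, so check them against the provider's current documentation before relying on them.

```python
# Assumed IMAP hostnames per provider domain; verify before use.
KNOWN_IMAP_HOSTS = {
    "gmail.com": "imap.gmail.com",
    "yahoo.com": "imap.mail.yahoo.com",
    "outlook.com": "outlook.office365.com",
}


def guess_imap_host(address):
    """Return the IMAP host for a known provider, or None if unrecognized."""
    domain = address.rpartition("@")[2].lower()
    return KNOWN_IMAP_HOSTS.get(domain)
```

Anything that comes back as None would fall through to asking the user for the server name directly.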

(Also, CSC doesn’t work with Gmail! Still gotta fix that…)

 

And that’s CollegeSpamChecker! Of course, this can be used to analyze all sorts of other mail folders; it isn’t strictly for analyzing college spam. I’ve used it on my main inbox and found emails from services I forgot I signed up for, so there’s a benefit to that! You could use it to see which marketing campaigns are the loudest (or the quietest). There’s a lot of cool potential with mail analysis.

 


If you want to try out CollegeSpamChecker for yourself, check out the source code here:

https://gitlab.com/o355/CollegeSpamChecker