This project was part of the exam for the course Intelligent Data Analysis. It helped me understand how to work with a dataset, perform data wrangling, explore various machine learning models, understand how the algorithms work, optimize the models, and validate the results.
You have been hired by the IT department of a medium-sized company to train an email spam filter that marks the incoming emails of all employees as spam or non-spam. The emails are parsed by a module and converted into a bag-of-words representation. A total of 57,173 different words (features) are distinguished. The aim of the filter is to identify as many spam emails as possible while misclassifying at most 0.2% of all legitimate emails. In addition, the company wants to make a statement about the effectiveness of the filter on future emails, i.e., what percentage of incoming spam emails will be identified in the future.
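
The two requirements in the task can be read as a threshold-selection problem: choose the decision threshold so that at most 0.2% of legitimate emails are flagged, then report the share of spam caught at that threshold. The following is a minimal sketch of that idea, not the method used in this project. It assumes some classifier has already produced a spam score per email, with label 1 for spam and 0 for legitimate mail; the function names `pick_threshold` and `rates` are hypothetical.

```python
import math


def pick_threshold(scores, labels, max_fpr=0.002):
    """Return the lowest threshold that keeps the false-positive rate <= max_fpr.

    An email is flagged as spam when its score is strictly greater than
    the returned threshold. Labels: 1 = spam, 0 = legitimate (assumption).
    """
    # Scores of legitimate emails, highest first.
    ham = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not ham:
        raise ValueError("need at least one legitimate email")
    # Number of legitimate emails we may misclassify under the constraint.
    allowed = math.floor(max_fpr * len(ham))
    if allowed >= len(ham):
        return float("-inf")  # constraint is vacuous: everything may be flagged
    # At most `allowed` legitimate scores are strictly above ham[allowed].
    return ham[allowed]


def rates(scores, labels, threshold):
    """Return (recall on spam, false-positive rate on legitimate mail)."""
    spam = [s for s, y in zip(scores, labels) if y == 1]
    ham = [s for s, y in zip(scores, labels) if y == 0]
    recall = sum(s > threshold for s in spam) / len(spam)
    fpr = sum(s > threshold for s in ham) / len(ham)
    return recall, fpr
```

In this sketch the threshold would be chosen on a validation split and `rates` evaluated on a separate test split that played no part in training or threshold selection, so that the reported recall is a fair estimate of the percentage of future spam the filter will identify.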