Screening hundreds of emails and prioritizing your work for the day is a difficult job. One important kind of prioritization is dealing with deadlines mentioned in emails, such as a slide deck or report that is due. The first step in putting deadlines on a calendar is to identify that a deadline is indeed present. We tackled this deadline tagging as an email classification problem using a Bayesian approach, and the result is PMail. We also built an RNN-based model, which we compare with the Bayesian approach. Both approaches showed promising results, which we detail in this post.
- What does PMail do?
- How are we recognizing the Mails as Deadline?
- Design and Architecture of PMail
- Parsing the Gmail Content
- Building PMail Predictor
- Basic precautions taken
- Reference Links
PMail classifies each email as “contains a deadline” or “doesn’t contain a deadline” and tags the email when it finds that it talks about a deadline. So what are deadline mails? We consider them to be mails that carry deadline timestamps and need to be brought to the user’s attention immediately.
How is this any different from Gmail’s Important label? Gmail predicts important mails based on certain factors that don’t take the mail’s body text into account, whereas PMail predicts the tag based on timestamps or time-related content in the mail’s body.
We built a probabilistic model for categorizing the mail into Deadline and Non-Deadline based on the content present in it. The model reads through the content present in the mail to understand the context and categorize it.
In short, the system collects new emails, preprocesses them and uses the model to categorize them as “Deadline” or “Non-Deadline”. Once in a while it also collects mails it has already predicted and, based on the user’s modifications, updates the existing user model. This way, the system can learn and improve over time.
When a user creates an account, a user entry is created in an auth-enabled MongoDB instance, with standard data protection practices in place.
When a user adds a subscription, two asynchronous Python threads are created: one for fetching new mails and one for updating the subscription model. The first fetches new mails every minute and is also responsible for predicting and labeling each mail it fetches. The second is a scheduler that updates the individual subscription model after a certain amount of time by fetching the mails the user has modified and using them as new labeled data. Both threads stop when the user removes the subscription.
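The two-thread lifecycle described above can be sketched as follows. This is a minimal illustration, not the PMail code: the class name `SubscriptionWorkers`, the callbacks `fetch_fn` and `update_fn`, and the one-hour update interval are all assumptions (the post only states the one-minute fetch interval).

```python
import threading


class SubscriptionWorkers:
    """Runs a fetch loop and an update loop until the subscription is removed."""

    def __init__(self, fetch_fn, update_fn, fetch_interval=60.0, update_interval=3600.0):
        # update_interval is a placeholder; the post does not give the real value.
        self._stop = threading.Event()
        self._threads = [
            threading.Thread(target=self._loop, args=(fetch_fn, fetch_interval), daemon=True),
            threading.Thread(target=self._loop, args=(update_fn, update_interval), daemon=True),
        ]

    def _loop(self, job, interval):
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop runs the job after each interval and ends on removal.
        while not self._stop.wait(interval):
            job()

    def start(self):
        for thread in self._threads:
            thread.start()

    def stop(self):
        # Called when the user removes the subscription.
        self._stop.set()
        for thread in self._threads:
            thread.join()
```

Using a shared `threading.Event` lets both loops sleep between runs and still exit promptly when the subscription is removed.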
Before the first model update for a subscription, the subscription uses the base model for its predictions. We assume that every person is different and that the way one categorizes a deadline mail differs from person to person, so once the user provides input to the model, there is an individual model for every subscribed email id.
PMail provides an interface for a user to create an account. After registering, a user can add one or more subscriptions for their Gmail accounts. The user can choose to enable or disable model prediction for Gmail-defined categories such as Social, Updates and Forums, and can remove a subscription at any time.
Using the Google auth client, we allow users to subscribe their Gmail account to the PMail server. Users can prioritize their mails by category, such as Personal, Social or Primary, and can enable or disable prioritization per category, meaning they can add or remove the mail priority feature for one or more Gmail category labels.
The Gmail API is used to download mail threads and messages. Following the Gmail API reference, the PMail Downloader downloads the mail thread ids, reads through each thread id and downloads the messages in each thread.
The mails downloaded by the PMail Downloader are preprocessed to filter out unnecessary attributes and fields. The PMail Preprocessor also screens the body content of the emails.
The Predictor (model) goes through the mail content from the PMail Preprocessor and identifies which mails are deadline mails and which are not. Once a mail is classified as a deadline mail, the label ‘DeadLine’ is created and added to the mail’s labels, where the user can see it in their Gmail account.
The PMail Updater updates the user’s model based on the corrections the user makes to predicted mails. If the user removes a predicted mail from the Deadline label, or adds the Deadline label to a mail predicted as non-deadline, the change is added to a correction count. If the correction count exceeds a threshold, the updater rebuilds the model and replaces the current model with the updated one.
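The correction check could look roughly like the sketch below. The function names and the threshold of 5 are hypothetical; the post does not state the real threshold or how predictions are stored beyond saying that they are kept for cross-checking.

```python
def count_corrections(predicted, current):
    """Count mails whose current label differs from the stored prediction.

    Both arguments map a message id to True (Deadline) or False (Non-Deadline).
    """
    return sum(
        1
        for msg_id, was_deadline in predicted.items()
        if msg_id in current and current[msg_id] != was_deadline
    )


def should_update(predicted, current, threshold=5):
    # threshold=5 is a placeholder, not the value used by PMail.
    return count_corrections(predicted, current) > threshold
```

For example, if the stored predictions are `{"a": True, "b": False}` and the user has flipped both labels, `count_corrections` returns 2.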
The way Gmail stores mail content is not trivial. As a very first step, we simply took the Gmail-encoded html value from the mail json. We then found that the json can contain an html key or a text key holding the content. Sometimes both keys are present and sometimes only one, and sometimes the content is nested. Figuring out the nesting levels and the way the keys are organized in the json was tricky in itself, and we had to go through all sorts of mails to parse them.
The text format of the content comes with forward tags, hidden content, reply tags and so on, and extracting the raw content from a string containing all of these is hard. We therefore chose to extract the content from the html format, since for most mails its html tags let us use an html parser to find and remove unwanted (Gmail-added) information. Some mails were not even organized properly in html format by Gmail, so extra layers of NLP code were used to find and remove unwanted information. The parser evolved as we built the model and debugged the prediction criteria and the words responsible for deadline prediction.
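A minimal sketch of this kind of parsing is shown below. It relies on the Gmail API message format, where a `payload` has a `mimeType`, a `body` whose `data` field is URL-safe base64, and optionally nested `parts`. Treating `<div class="gmail_quote">` as the marker for quoted replies is an assumption for illustration, and the helper names are hypothetical; the real PMail parser handles more cases.

```python
import base64
from html.parser import HTMLParser


def find_bodies(part, found=None):
    """Walk a Gmail payload and collect the first text/html and text/plain bodies."""
    if found is None:
        found = {}
    mime = part.get("mimeType", "")
    data = part.get("body", {}).get("data")
    if data and mime in ("text/html", "text/plain") and mime not in found:
        # Body data is URL-safe base64; restore any stripped padding first.
        padded = data + "=" * (-len(data) % 4)
        found[mime] = base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    for child in part.get("parts", []):
        find_bodies(child, found)  # content may be nested several levels deep
    return found


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping anything inside a quoted-reply <div>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside an (assumed) gmail_quote div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._skip_depth or "gmail_quote" in classes:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Only `<div>` tags are counted while skipping, so void tags such as `<br>` inside a quoted block do not upset the depth tracking.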
PMail Predictor is a model built on a dataset of over 1000 mails. We have built two such models:
- Probabilistic naive Bayes model.
- RNN classifier with ULMFiT.
Classifying mails as deadline or non-deadline comes down to a Bayesian combination of the deadline probabilities of individual words, so we implemented a custom probabilistic model by slightly tweaking the Bayesian filter.
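As an illustration of what a Bayesian combination of per-word probabilities looks like, here is a minimal sketch. It uses the standard naive Bayes combination and is not the tweaked PMail filter; the function names, `top_n=15` and the unknown-word probability of 0.4 are placeholder assumptions, not the tuned PMail values.

```python
import math


def combine(probabilities):
    """Combine per-word deadline probabilities into one mail-level probability.

    Assumes every probability lies strictly between 0 and 1.
    """
    p = math.prod(probabilities)
    q = math.prod(1.0 - x for x in probabilities)
    return p / (p + q)


def score(words, prob_map, top_n=15, unknown=0.4):
    # Keep only the top_n highest-probability words; words missing from the
    # map get the placeholder probability `unknown`.
    probs = sorted((prob_map.get(w, unknown) for w in words), reverse=True)[:top_n]
    return combine(probs)
```

Keeping only the highest-probability words means that low-probability words in a long mail cannot drown out a single strong deadline phrase.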
|Predicted Deadline||Predicted Non-Deadline|
- Training accuracy: 0.85 (85%)
- Testing accuracy: 0.80 (80%)
Building the Model:
In the very first build, we found that the Plan for Spam approach decides by weighing bad words against good words, so a mail is flagged only when one side outweighs the other. In our use case, even a single good (deadline-oriented) word or phrase should be enough to categorize the mail as a deadline mail. With this in mind, we chose to ignore the weight of the bad words, after which the model performed better. Phrases were captured with an n-gram size of 5 (combinations of up to 5 words) when breaking the content into words.
Things like numbers, weekdays and email ids should also be treated alike. For example, in “let’s meet on monday” and “let’s meet on tuesday”, monday and tuesday should be treated in the same way, meaning both should be converted into a common identifier, such as “let’s meet on @@@@”. These words were converted into their identifiers, so the probability map has a key ‘@@@@’ holding the combined probabilities of both weekdays. Similarly, dates (numbers) like “07-11-2015” and “06-09-2019” are converted into “##-##-####”. This preprocessing of the mails increased the model accuracy by a significant factor.
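The normalization and phrase extraction described above can be sketched with a couple of regular expressions. The identifiers `@@@@` and `#` come from the examples in the post; the function names and the exact patterns are assumptions, and email ids are left out because the post does not say which identifier they map to.

```python
import re

WEEKDAY = re.compile(r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b")
DIGIT = re.compile(r"\d")


def normalize(text):
    """Replace weekdays with '@@@@' and every digit with '#'."""
    text = WEEKDAY.sub("@@@@", text.lower())
    return DIGIT.sub("#", text)


def ngrams(text, max_n=5):
    """Return all phrases of 1 to max_n consecutive words."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]
```

With this sketch, `normalize("Let's meet on Monday")` and `normalize("Let's meet on Tuesday")` produce the same string, so both mails contribute to the same keys in the probability map.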
We built thousands of models for different values of hyperparameters such as:
- ‘number of top probability words’ to consider for mail prediction,
- ‘innocent probability’ for a word that does not already exist in the probability map,
- ‘nGram’, the maximum number of words that can be combined to form a phrase,
- ‘deadline word multiplier’, the weight given to each word in the deadline-labeled folder,
- ‘non deadline word multiplier’, the same for non-deadline words,
- ‘word occurrence’, the minimum number of occurrences of a word before its probability is taken into account.
These thousands of models, first built with a small dataset, were also built with a larger dataset to test model consistency.
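To show how the multipliers and the minimum occurrence might enter the probability map, here is a sketch adapted from the formula in Paul Graham's "A Plan for Spam", with deadline words in the role of spam words. It is an assumption about the general shape, not the PMail implementation, and the function name and default values are placeholders.

```python
def build_probability_map(deadline_counts, other_counts, n_deadline, n_other,
                          deadline_multiplier=1.0, other_multiplier=2.0,
                          min_occurrence=5):
    """Map each word to an estimated probability that a mail containing it is a deadline mail.

    deadline_counts / other_counts: word -> occurrences in each folder.
    n_deadline / n_other: number of mails in each folder (both assumed > 0).
    """
    prob_map = {}
    for word in set(deadline_counts) | set(other_counts):
        d = deadline_counts.get(word, 0) * deadline_multiplier
        o = other_counts.get(word, 0) * other_multiplier
        if d + o < min_occurrence:
            continue  # too rare to trust; falls back to the innocent probability
        d_rate = min(1.0, d / n_deadline)
        o_rate = min(1.0, o / n_other)
        p = d_rate / (d_rate + o_rate)
        # Clamp so that no single word is treated as absolute proof.
        prob_map[word] = min(0.99, max(0.01, p))
    return prob_map
```

Raising `deadline_multiplier` relative to `other_multiplier` pushes borderline words towards the deadline class, which is one way a search over these hyperparameters changes the model's behaviour.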
Updating the Model:
To update the model, we follow these steps:
- User-modified mails are fetched and placed in their respective deadline and non-deadline folders.
- Word frequency maps are created with the frequency of each word.
- These maps are combined with the existing deadline and non-deadline word maps.
- The probability map is recreated in the same way as for the base model.
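The frequency-map steps above can be sketched with `collections.Counter`. This assumes that combining the maps means adding the counts for each word, and it splits on whitespace only; the function names are hypothetical and the real pipeline would apply the same preprocessing as the base model before counting.

```python
from collections import Counter


def word_counts(mails):
    """Build a word frequency map from a list of mail bodies."""
    counts = Counter()
    for body in mails:
        counts.update(body.lower().split())
    return counts


def merge_corrections(existing_deadline, existing_other,
                      corrected_deadline, corrected_other):
    """Add the counts from user-corrected mails to the existing word maps."""
    new_deadline = Counter(existing_deadline) + word_counts(corrected_deadline)
    new_other = Counter(existing_other) + word_counts(corrected_other)
    return new_deadline, new_other
```

The merged maps are then fed to the same routine that built the base model's probability map.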
After a month of testing and updating the model from user input, PMail has shown the expected behaviour in its subsequent predictions.
We observed that the updater has to update the model at least three or four times before it correctly predicts mails that are strongly biased towards the wrong prediction.
ULMFiT (Universal Language Model Fine-tuning) is an effective transfer learning method that can be applied to any NLP task, and it introduces techniques that are key for fine-tuning a language model.
The current dataset for building the model is too small. With only 100 labeled examples, ULMFiT matches the performance of training from scratch on 100x more data, and with slanted triangular learning rates and differential learning rates the model converges faster.
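For reference, the two learning-rate techniques mentioned above can be written down as small functions. The formulas and default values (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`, per-layer factor 2.6) follow the ULMFiT paper, which calls the second technique discriminative fine-tuning; they are not necessarily the settings used for PMail, and the function names are ours.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: a short linear warm-up, then a long linear decay.

    Assumes total_steps * cut_frac >= 1 so that the warm-up is not empty.
    """
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(lr_last, n_layers, factor=2.6):
    """Per-layer learning rates, first layer to last; each earlier layer is `factor` times smaller."""
    return [lr_last / factor ** (n_layers - 1 - i) for i in range(n_layers)]
```

The schedule starts at `lr_max / ratio`, peaks at `lr_max` after the first 10% of steps and then decays back, which lets the model move quickly to a good region before refining.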
|Predicted Deadline||Predicted Non-Deadline|
- Training accuracy: 0.901 (90.1%)
- Testing accuracy: 0.88 (88%)
Training and validation accuracy are better even when the model is trained with a smaller dataset (160 mails).
The model converges in fewer epochs (13) than traditional deep learning models, which take at least 50 epochs to yield good results.
Even with so few epochs, computation time is high on a CPU (at least 6 hours and 8 GB of RAM), whereas a GPU with 8 GB of RAM can build the same model in 15 minutes.
Updating the model is a concern: rebuilding the model for each individual user takes more computation power and time.
Because of this update cost, we built and deployed the probabilistic naive Bayes model for the PMail service.
Security is a major concern when it comes to user data, and it is recommended that the server store as little information as possible. To protect users, we do the following:
- MongoDB is auth enabled.
- Refresh tokens are stored in AES-256 encrypted form.
- Credentials are stored as MD5 hashes.
Since we decrypt the refresh token to start the service, the AES-256 encryption is not completely secure: the refresh tokens are stored in the db, which is still vulnerable from the application’s perspective. Since the application has only been tested in a local environment, its large-scale limitations are unknown and yet to be explored.
To avoid intruding on the user’s privacy, the server does not store any email content or user data in the database except the refresh token and the model’s predictions. The stored predictions are used at model update time to cross-check the mails and find out which ones the user has modified.