Extracting text from PDF documents is a common pre-processing task for text analysis and NLP work. The main challenge tools face in extracting content from PDF files is that PDFs are composed of text, graphics and tabular structures encoded in a form designed for printing.
Keeping undesirable content out of social networks and communication channels is a common problem. Our email systems today have sophisticated “spam filters” that protect us from much harm and wasted time. The problem of spam is particularly acute in niche social networks and interest groups, which are small and sensitive to disruption. We run one such niche social network for typography enthusiasts called Fontli, and we like to protect our dear typographers from content that they’re not interested in - which is everything that isn’t typography. The problem is that this is hard … even for humans!
In this post, we talk about a filter we recently developed and deployed to flag and reduce the incidence of non-typographic content on Fontli, using an image classifier based on a deep convolutional neural network. We’ve had modest success and faced some intriguing situations and results along the way.