Keeping undesirable content out of social networks and communication channels is a common problem. Our email systems today have sophisticated "spam filters" thanks to which we're protected from much harm and waste of time. The problem of spam is particularly harsh in niche social networks and interest groups which are small and sensitive to disruption. We run one such niche social network for typography enthusiasts called Fontli and we like to protect our dear typographers from content that they're not interested in - which is everything that isn't typography. The problem is that this is hard ... even for humans!
In this post, we talk about a filter we recently developed and deployed to reduce and flag incidences of non-typographic content on Fontli, using a deep convolutional neural network based image classifier. We've had modest success and faced some intriguing situations and results along the way.
Table of contents
- What is Fontli?
- What’s special about Fontli?
- So what's the problem?
- Initial take - OpenCV & OCR engine
- Deep Learning
- Useful resources
Typographers are a rare breed for whom the myriad shapes of lettering are infinitely fascinating. They see great importance in the shapes of letters used to convey information in print that most others would ignore, and rightfully so. Fontli offers a place where individuals can share pictures of typography in the wild that fascinate them, have discussions and enrich their engagement with the domain and their community.
Beyond what photo-sharing sites like Instagram, Flickr and Snapchat provide, Fontli offers additional discussion opportunities around text and typography. For instance, if you're unsure what font a particular lettering might be based on, you can have that discussion, drawing on existing font samples already included in the app. Photos posted can also be tagged with known typefaces and added to collections so more people can find them.
In short, we have type geeks on our network and we want to keep them happy.
Recently, we noticed an uptick in members posting non-typographic content such as selfies and scenery. While the volume of such content is relatively small, our community is small too, and hence even a small share of unwanted posts can be disruptive.
While we can introduce a moderation process to counter this trend, we decided to try a technological route instead to see if we can sift the wheat from the chaff using recently proven image classification techniques.
So here's the "wheat" -
and here is the "chaff" -
You get the idea. Our journey begins now.
Our initial take on solving this problem was to detect textual content from the images, and to get rid of selfies due to their dominance among the undesirables. Later, we concluded that CNNs (Convolutional Neural Networks) can do a better job at detecting them.
We sorted images into categories ([fig-4][fontli_images]) for a better understanding of the domain, and tried running them through two libraries: OpenCV and the Tesseract OCR engine. We used OpenCV to get rid of selfies and group photos, and Tesseract to detect text within an image.
Both these are great tools at their tasks, but we realized that our core problem is "recognizing typography" and not "filtering spam". The "non-typography" category is simply too large to dedicate techniques to sub-categories within it.
- To detect selfies & group photos ([fig-4][fontli_images]), OpenCV expects the faces in an image to have a proper shape and angle. We don't usually get that on Fontli.
- It rejected posters ([fig-4][fontli_images]), which are valid content on Fontli.
- Tesseract is good at detecting printed text, but not typography and single letters ([fig-4][fontli_images]).
You can see the performance of each model in table-1.
Now, let’s see how we built CNN models ...
To classify an image as typographic or non-typographic content, we thought that deep convolutional neural networks, which had recently been used to great effect in image classification tasks, would be suitable. With the help of the Python Keras library, we started training some networks for typography detection.
To build a good CNN model, we need data and a tuned architecture.
Once we'd collected an adequate number of images, we divided them into training, validation and testing groups. For the architecture, we built an initial model (fnet_4) with 2 convolutional layers and 2 fully connected layers and observed the model's behaviour.
This was a simple, not too deep, architecture we started with for testing the waters.
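    # Keras 1.x API
    from keras.models import Sequential
    from keras.layers import Activation, Convolution2D, Dense, Dropout, Flatten, MaxPooling2D
    from keras.constraints import maxnorm

    # Build the model architecture: 2 convolutional + 2 fully connected layers
    model = Sequential()
    model.add(Convolution2D(32, 3, 3, border_mode='same',
                            input_shape=(128, 128, 1),  # 128x128 greyscale input
                            W_constraint=maxnorm(3)))
    model.add(Activation('relu'))
    model.add(Convolution2D(32, 3, 3))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(3, 3)))
    model.add(Dropout(0.5))  # dropout to reduce overfitting
    model.add(Flatten())
    model.add(Dense(64))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2))  # two classes: typography / non-typography
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])
    model.summary()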
We observed that the training and the validation accuracy of the model were:
- Training accuracy: 85%
- Validation accuracy: 82%
When tested against the testing dataset, the results were:
|Typography (22,500 images)||87.4%|
|Non-Typography (22,500 images)|| |
Testing accuracy: 78.45%
This was an initial tryout to see how the CNN behaves. We observed that a few images with typography ([fig-4]) were wrongly classified by the model. We attributed that to inadequate training data, as we initially didn't have enough images of that sort.
Extra convolutional layers seemed required to extract deeper features and we needed more training data to improve the generalization capability of the network.
However, when we fed the model images of pure text content ([fig-4]), its prediction probability was high.
The model showed a high false positive rate (FT) in the testing set for non-typography images.
The model showed high accuracy in detection of selfies as non-typography compared to OpenCV, when fed such images during training.
The failed cases in the non-typography category were not confined to a particular kind of image. This is mainly because the model's training dataset covered only a small subset of non-typography images, and we needed more.
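The false positive rate mentioned above can be computed directly from labels and predictions. This is a minimal sketch, assuming "typography" is treated as the positive class, so that a false positive is a non-typography image the model accepted as typography. The function name and the toy labels are hypothetical and are not Fontli data.

```python
def false_positive_rate(y_true, y_pred, positive="typography"):
    """Share of actual negatives that the model labelled as positive."""
    negatives = [(t, p) for t, p in zip(y_true, y_pred) if t != positive]
    if not negatives:
        raise ValueError("no negative examples in y_true")
    false_positives = sum(1 for _, p in negatives if p == positive)
    return false_positives / len(negatives)


# Toy labels for illustration only (not Fontli data).
y_true = ["typography", "non-typography", "non-typography",
          "non-typography", "non-typography"]
y_pred = ["typography", "typography", "non-typography",
          "non-typography", "non-typography"]

# One of the four non-typography images was accepted as typography.
fpr = false_positive_rate(y_true, y_pred)
print(fpr)
```

Tracking this number separately from overall accuracy makes it easier to see which class the failures come from.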
We wanted to improve the model and we clearly had to deepen the architecture and expand the data set to cover the training gaps.
We collected about 100k images in all, manually labelled them as typography or non-typography, and grouped them into training, validation and testing datasets.
|No. of images||Training||Validation||Testing|
|Total = 102,000||45,000||5,000||52,000|
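A split with the sizes in the table could be produced along these lines. This is only a sketch: it assumes a simple seeded random shuffle, and the file names and the `split_dataset` helper are hypothetical stand-ins rather than the code used for Fontli.

```python
import random


def split_dataset(items, n_train, n_val, seed=0):
    """Shuffle items and cut them into train/validation/test lists."""
    if n_train + n_val > len(items):
        raise ValueError("split sizes exceed dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # everything left over
    return train, val, test


# Hypothetical file names standing in for the labelled images.
image_ids = ["img_%06d.jpg" % i for i in range(102000)]
train, val, test = split_dataset(image_ids, n_train=45000, n_val=5000)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three groups stable between runs, so results from different architectures are measured on the same test images.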
As for the architecture, the number of convolutional layers in fnet_4 didn't seem sufficient, so we added 3 extra convolutional layers and made the model work with full color images instead of greyscale versions. We call this model fnet_7.
    # Keras 1.x API
    from keras.models import Sequential
    from keras.layers import Convolution2D, Dense, Dropout, Flatten, MaxPooling2D

    # 5 convolutional + 2 fully connected layers, 128x128 full color input
    model = Sequential()
    model.add(Convolution2D(64, 3, 3, border_mode='same',
                            input_shape=(128, 128, 3), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(32, 3, 3, border_mode='same', activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(32, 3, 3, border_mode='same', activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(16, 3, 3, border_mode='same', activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
    model.add(Convolution2D(8, 3, 3, border_mode='same', activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
    model.add(Dropout(0.5))  # dropout to reduce overfitting
    model.add(Flatten())
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.4))
    model.add(Dense(2, activation="softmax"))  # typography / non-typography
    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])
    model.summary()
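One way to see what the extra layers change is to trace the spatial size of the feature maps through both architectures. The sketch below is plain Python arithmetic, not Keras code. It assumes channels-last input and the Keras defaults of 'valid' padding for Convolution2D and a pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, same=False):
    """Output size of a conv or pooling layer along one spatial dimension."""
    if same:
        return -(-size // stride)           # ceil(size / stride)
    return (size - kernel) // stride + 1    # 'valid' padding


def fnet_4_flat_size(size=128):
    size = conv_out(size, 3, same=True)     # Convolution2D(32, 3, 3, border_mode='same')
    size = conv_out(size, 3)                # Convolution2D(32, 3, 3), assumed 'valid'
    size = conv_out(size, 3, stride=3)      # MaxPooling2D(pool_size=(3, 3))
    return size * size * 32                 # Flatten over 32 feature maps


def deeper_model_flat_size(size=128):
    for _ in range(3):                      # three conv('same') + 2x2 pooling stages
        size = conv_out(size, 3, same=True)
        size = conv_out(size, 2, stride=2)
    for _ in range(2):                      # two conv('same') + 2x2 pooling, stride 1
        size = conv_out(size, 3, same=True)
        size = conv_out(size, 2, stride=1)
    return size * size * 8                  # Flatten over 8 feature maps


def dense_params(inputs, units=64):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


print(fnet_4_flat_size(), dense_params(fnet_4_flat_size()))
print(deeper_model_flat_size(), dense_params(deeper_model_flat_size()))
```

Under these assumptions, fnet_4 flattens to 42 × 42 × 32 = 56,448 values, while the deeper model flattens to 14 × 14 × 8 = 1,568, so its first dense layer needs far fewer weights even though the network has more layers.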
We also added data augmentation and regularization techniques to avoid overfitting and to generate more data.
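The post does not say which augmentations were applied, so the following is only an illustration of the idea: generate extra training samples by randomly shifting an image a few pixels. It uses plain Python lists to stay self-contained, and the function names and parameters are hypothetical. In practice a library utility such as Keras' `ImageDataGenerator` would normally do this work.

```python
import random


def shift_image(image, dx, dy, fill=0):
    """Translate a 2D image (list of rows) by dx columns and dy rows."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = image[y][x]   # pixels pushed outside are dropped
    return shifted


def augment(image, max_shift=1, copies=4, seed=0):
    """Return `copies` randomly shifted variants of `image`."""
    rng = random.Random(seed)
    return [
        shift_image(image,
                    rng.randint(-max_shift, max_shift),
                    rng.randint(-max_shift, max_shift))
        for _ in range(copies)
    ]


tiny = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
variants = augment(tiny)
print(shift_image(tiny, 1, 0))
```

Small shifts are used here rather than mirroring, since a mirrored image of lettering is no longer a realistic typography sample.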
We observed that the training and the validation accuracy of the model improved to:
- Training accuracy: 95.29%
- Validation accuracy: 95.38%
When tested against the testing dataset, the results were:
|Typography (10,090 images)||95.02%|
|Non-Typography (41,910 images)|| |
Testing accuracy: 96.11%
When more convolutional layers were added into the model, the training accuracy of the model improved significantly compared to fnet_4.
Adding more images in the dataset helped increase the validation and testing accuracy.
Drawbacks in the approaches of OpenCV and OCR engine are handled by the CNN approach (see table-1).
Most of the images that failed with the traditional approaches and fnet_4 were classified correctly by fnet_7, and the accuracy of the model was significantly better than that of the earlier approaches. Although a few failure cases remained, the overall performance of the model was remarkable compared to the initial fnet_4 architecture.
The data used for training the fnet_7 model is still insufficient to rule out the failures. Our next attempt to improve the model is to use transfer learning, starting from a network pre-trained on the ImageNet data set and using it for the "non-typography" category.
We got some intriguing results for different models when tested with different categories of images (see table-1).
Our experience led us to some interesting conclusions which lay out our hopes and plans for the near future.
Even the most naively designed DCNN did better at eliminating selfies than a custom hand-tuned face detector into which multiple researcher-decades have been invested! This bodes well for using a neural compute architecture as a default building block for such "intelligence" tasks.
Getting a network to generalize well is hard in artistic domains. This is because the training data isn't clear cut. Humans too can't quite tell the difference between art and junk with good consensus. Since ours is an artistic domain, our system needs to face the same uncertainty. So we bias in favour of Fontli network members, so that genuinely interesting posts are not inadvertently marked as "spam".
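One simple way to express such a bias is to flag a post only when the classifier is very confident that it is non-typography. This is a sketch under assumptions: the order of the two softmax outputs and the 0.9 threshold are hypothetical, and the post does not state how the bias is actually implemented.

```python
def should_flag(probabilities, threshold=0.9):
    """Flag a post only when non-typography is predicted with high confidence.

    `probabilities` is assumed to be (p_typography, p_non_typography),
    i.e. the two softmax outputs in that (hypothetical) order.
    """
    p_typography, p_non_typography = probabilities
    return p_non_typography >= threshold


print(should_flag((0.40, 0.60)))   # uncertain: give the member the benefit of the doubt
print(should_flag((0.05, 0.95)))   # confident: flag for review
```

Raising the threshold lets more borderline posts through, which trades a little extra clutter for fewer wrongly flagged members.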
The compute is the easy part today with very high level expressive libraries like Keras being available. Most of our work was with preparing the data and monitoring the adequacy of the data set.
If you're going to experiment using DCNNs, do yourself a favour and get a fairly powerful desktop computer with a good NVidia graphics card - at least a GTX 1080. The now launched GTX 1080 Ti looks promising. The more cards, the merrier.
Image classification is well understood now, but we don't quite understand how the nets do it at the end of the day. Work such as "deep dreaming" provides some insights into nets that we need to follow up on. So if you're looking for domain knowledge as an outcome of such a study, there's still a way to go.