Image Binarization with Convolutional Neural Networks
We have been using Tesseract to extract text from document images. Our in-house dataset consists of document images captured with mobile phones, scanners, and digital PDFs. We have identified two issues where Tesseract fails to extract text:
a) When the illumination in the image is not uniform, Tesseract fails to recognize the presence of text wherever the illumination is darker.
b) Folds in the paper cause Tesseract's predictions to contain noisy text.
To alleviate these issues, we binarize the document images using a convolutional neural network pretrained on a set of real 300 DPI images and their corresponding binarized targets.
- The convolutional neural network is a U-Net architecture.
- The network is trained on patches: each image is split into $128 \times 128$ patches, each patch is passed through the network, and the corresponding ground-truth patch is used as the target.
- At prediction time the image is again split into $128 \times 128$ patches, and the per-patch predictions are collated into one final binarized image.
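The patch split and collate steps above can be sketched as follows. This is a minimal NumPy sketch under assumed conventions (non-overlapping tiles, edge padding to a multiple of the patch size); the function names are illustrative, not from our codebase, and in practice the patch batch would be fed through the U-Net between the two calls.

```python
import numpy as np

PATCH = 128  # patch size used for both training and inference

def split_into_patches(img):
    """Pad a grayscale image to a multiple of PATCH and cut it into
    non-overlapping PATCH x PATCH tiles (row-major order)."""
    h, w = img.shape
    padded = np.pad(img, ((0, (-h) % PATCH), (0, (-w) % PATCH)), mode="edge")
    ph, pw = padded.shape
    patches = (padded
               .reshape(ph // PATCH, PATCH, pw // PATCH, PATCH)
               .swapaxes(1, 2)
               .reshape(-1, PATCH, PATCH))
    return patches, (h, w), (ph, pw)

def collate_patches(patches, orig_shape, padded_shape):
    """Inverse of split_into_patches: stitch per-patch predictions back
    into a full image and crop away the padding."""
    ph, pw = padded_shape
    grid = patches.reshape(ph // PATCH, pw // PATCH, PATCH, PATCH)
    full = grid.swapaxes(1, 2).reshape(ph, pw)
    h, w = orig_shape
    return full[:h, :w]
```

At inference, `patches` would be replaced by the network's output for each tile (e.g. `model.predict(patches)` in a Keras-style API) before collating.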
Figure: An illustration of document image binarization.
We have randomly sampled 44 documents from our in-house dataset and examined the OCR output of both the original and the binarized image to judge whether binarization improved the quality of the OCR prediction. We observed that in 36% of the images binarization improved the OCR output, since it resolved the two issues mentioned above. In another 36% of the images binarization made no difference to the OCR prediction. And in 18%, binarization degraded the image quality by introducing speckle noise and broken text, which led to heavy degradation of the OCR output.
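One way to make the improved / unchanged / degraded call reproducible is to score each OCR output against a reference transcription with a character-level similarity ratio. The sketch below uses Python's `difflib`; the helper names and the tolerance threshold are illustrative assumptions, not part of our pipeline.

```python
from difflib import SequenceMatcher

def ocr_similarity(predicted, reference):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, predicted, reference).ratio()

def binarization_verdict(ocr_original, ocr_binarized, reference, tol=0.01):
    """Classify a document as improved / unchanged / degraded by
    comparing both OCR outputs against a reference transcription."""
    s_orig = ocr_similarity(ocr_original, reference)
    s_bin = ocr_similarity(ocr_binarized, reference)
    if s_bin > s_orig + tol:
        return "improved"
    if s_bin < s_orig - tol:
        return "degraded"
    return "unchanged"
```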
Further, we investigated why binarization led to degraded performance on some of the images. In particular, we observed that lower-quality images (less than 300 DPI) and blurry images lead to degraded binarization output.
We believe that preparing a new dataset of low-quality images with their corresponding binarized targets and retraining the neural network would resolve these issues.
Using text localization information from Tesseract to improve Tesseract's output
Tesseract predicts each word's location as a bounding box in the image. We have observed that Tesseract was able to recognize the text correctly when we passed it the cropped image corresponding to a word it had previously predicted.
We have also observed that this method sometimes fails due to inaccuracies in the bounding-box predictions. For example, when the document image is warped, the crop for a particular word may contain fragments of words from other lines.
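The re-crop strategy can be sketched as below. The `boxes` dict mirrors the parallel-list layout returned by `pytesseract.image_to_data(..., output_type=Output.DICT)` (keys `left`, `top`, `width`, `height`, `text`); the function name and the small margin are illustrative assumptions.

```python
import numpy as np

def crop_word_boxes(img, boxes, margin=2):
    """Crop each predicted word region (with a small margin, clipped to
    the page) from the page image, skipping empty detections."""
    h, w = img.shape[:2]
    crops = []
    for i, word in enumerate(boxes["text"]):
        if not word.strip():
            continue  # Tesseract emits empty strings for non-word boxes
        x0 = max(boxes["left"][i] - margin, 0)
        y0 = max(boxes["top"][i] - margin, 0)
        x1 = min(boxes["left"][i] + boxes["width"][i] + margin, w)
        y1 = min(boxes["top"][i] + boxes["height"][i] + margin, h)
        crops.append((word, img[y0:y1, x0:x1]))
    return crops
```

Each crop would then be passed back through Tesseract in single-word mode (e.g. `pytesseract.image_to_string(crop, config="--psm 8")`) to obtain the re-recognized text.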