PDF to Text Extraction

Extracting text from PDF documents is a common pre-processing task for text analysis and NLP work. The main challenge tools face in extracting content from PDF files is that PDFs are composed of text, graphics and tabular structures encoded in a form designed for printing rather than for recovering the logical structure of the content.

The following factors can influence how various tools parse PDF content:

  • typography
  • content ordering which may be different from presentation ordering
  • whitespace
  • graphical content and its presentation
  • meta information
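
To make the content-ordering point concrete, here is a minimal sketch in plain Python (no PDF library involved). It assumes a tool has already handed us text fragments with page coordinates, in the order they appear in the content stream, and that the page has a single column. The `Fragment` type, the `reading_order` function, the `line_tolerance` parameter and the sample data are all made up for illustration.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float  # left edge, in PDF points
    y: float  # baseline, measured from the bottom of the page


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then read top-to-bottom, left-to-right."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    # PDF y coordinates grow upwards, so the highest y is the first line.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for _, line in lines
    )


# Hypothetical fragments, listed in content-stream order rather than visual order.
stream_order = [
    Fragment("Total:", 72, 700),
    Fragment("Invoice", 72, 740),
    Fragment("42.00", 300, 700.5),
    Fragment("#1001", 130, 740),
]

print(reading_order(stream_order))
```

Reading the fragments in stream order would give "Total: Invoice 42.00 #1001", while sorting by position recovers the two lines as they appear on the page. Multi-column layouts and tables need more than this simple sort, which is one reason tools differ in their output.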


Available PDF tools, by implementation language

  1. Python: PDFMiner, PyPDF2, reportlab, pdfrw
  2. Java: iText, Apache PDFBox, PDF Clown, PDFXStream
  3. C/C++: pdflib, qpdf, GNUpdf, hummuspdf, libharu
  4. PHP: fpdf, tcpdf, mPDF

Notes on tools we’ve used in our projects

OUR CHOICE - PDFMiner: PDFMiner is our tool of choice for pythonistas, though it can have some issues with tabular content. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.
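
As a rough illustration of getting text together with its location, the sketch below assumes the maintained fork pdfminer.six (installed with `pip install pdfminer.six`), not necessarily the PDFMiner version we used in our projects. `extract_pages` and `LTTextContainer` come from that library; the helper names `describe_box` and `iter_text_boxes` are our own and purely illustrative.

```python
def describe_box(bbox, text):
    """Format one text box as 'x0,y0,x1,y1<TAB>text', coordinates to 0.1 pt."""
    x0, y0, x1, y1 = bbox
    return f"{x0:.1f},{y0:.1f},{x1:.1f},{y1:.1f}\t{text.strip()}"


def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text box pdfminer.six reports."""
    # Imported lazily so describe_box works even without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            # Skip figures, lines and other non-text layout elements.
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in points, origin at the bottom-left.
                yield page_number, element.bbox, element.get_text()
```

Looping over `iter_text_boxes("report.pdf")` (a hypothetical file name) and passing each `bbox` and `text` to `describe_box` prints one line per text box. If only the plain text is needed, `pdfminer.high_level.extract_text` returns it as a single string.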

PyPDF2xml: Converts PDF to XML as an intermediate format. It is built on pdfminer and started as an alternative to poppler’s pdftoxml, which didn’t properly decode CID Type2 fonts in PDFs.

Reportlab: There is an open source version, and a paid version which adds the Report Markup Language (an alternative method of defining your document).

PDFRW: One of the fastest pure Python PDF parsers available. It can be used with rst2pdf to faithfully reproduce vector images. It can be used standalone, or in conjunction with reportlab to reuse existing PDFs in new ones.

OUR CHOICE - iText: iText is a library that allows you to generate PDF files on the fly. The iText classes are very useful for people who need to generate read-only, platform-independent documents containing text, lists, tables and images. The library is especially useful in combination with Java™ technology-based Servlets: the look and feel of HTML is browser dependent, whereas with iText and PDF you can control exactly how your servlet’s output will look.

OUR CHOICE - Apache PDFBox: For Java devs, iText and PDFBox are our tools of choice when the requirement focuses on the text part.

PDF Clown: PDF Clown for Java (PDF Jester) is a Java 1.5 library for reading, manipulating and writing PDF files, with multiple abstraction layers to satisfy different programming styles: from the lower level (PDF object model) to the higher (PDF document structure and content streaming).

PDFXStream: It is written in 100% pure Java, with no native components or dependencies. Its only requirement is a compliant Java 1.5 (or higher) JVM. It supports both text and image extraction from PDF and outperforms most of the other PDF parsers. However, it is licensed software, and the free version only supports text extraction.

About Nandyala Pavan Kumar, "Pavan"
Data science and engineering team.