Extracting text from PDF documents is a common pre-processing task for text
analysis and NLP work. The main challenge tools face in extracting content
from PDF files is that PDFs are composed of text, graphics and tabular
structures encoded in a form designed for printing.
The following factors can influence how various tools parse PDF content:
- content ordering, which may differ from presentation ordering
- graphical content and its presentation
- metadata
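To make the first factor concrete, here is a minimal sketch of recovering reading order from fragments that arrive in content-stream order. It assumes each fragment comes with a bounding box in PDF coordinates (y grows upward). The TextBox type, the reading_order helper and the 2-point line tolerance are hypothetical choices made for illustration, not part of any library listed here.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A text fragment with its bounding box (hypothetical illustration type)."""
    text: str
    x0: float
    y0: float  # bottom edge; the PDF y axis grows upward
    x1: float
    y1: float  # top edge


def reading_order(boxes, line_tolerance=2.0):
    """Return boxes sorted top-to-bottom, then left-to-right within a line."""
    # Highest top edge first, because y grows upward in PDF coordinates.
    by_top = sorted(boxes, key=lambda b: -b.y1)

    # Group boxes whose top edges are within the tolerance into one line.
    lines = []
    for box in by_top:
        if lines and abs(lines[-1][0].y1 - box.y1) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])

    # Within each line, order fragments by their left edge.
    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda b: b.x0))
    return ordered


# Fragments as they might appear in the content stream, not in visual order.
stream_order = [
    TextBox("world", 60, 700, 100, 712),
    TextBox("Footer", 10, 20, 60, 32),
    TextBox("Hello", 10, 700, 50, 712),
]
print(" ".join(b.text for b in reading_order(stream_order)))
# prints "Hello world Footer"
```

Real documents with multiple columns or rotated text need more than this simple sort, which is one reason the tools below differ in output quality.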
Available PDF tools, by implementation language
- Python: PDFMiner, PyPDF2, reportlab, pdfrw
- Java: iText, Apache PDFBox, PDF Clown, PDFXStream
- C/C++: pdflib, qpdf, GNUpdf, hummuspdf, libharu
- PHP: fpdf, tcpdf, mPDF
Notes on tools we've used in our projects
OUR CHOICE - PDFMiner: PDFMiner is our
tool of choice for Python developers, though it can have issues with tabular
content. Unlike other PDF-related tools, it focuses entirely on getting and
analyzing text data. PDFMiner allows one to obtain the exact location of text
on a page, as well as other information such as fonts or lines. It includes a
PDF converter that can transform PDF files into other text formats (such as
HTML). It has an extensible PDF parser that can be used for purposes other than
text analysis.
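As an illustration of getting text together with its location, here is a minimal sketch. It assumes the maintained pdfminer.six fork is installed (the original PDFMiner exposes a lower-level API), and the helper names iter_text_boxes and describe are hypothetical.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text block in a PDF."""
    # Assumption: pdfminer.six is installed. Imported lazily so the helper
    # below can still be used where the package is absent.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures and lines are skipped.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text()


def describe(page_number, bbox, text):
    """Format one text block as a single readable line."""
    x0, y0, x1, y1 = bbox
    return f"p{page_number} ({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}): {text.strip()!r}"
```

A typical use would loop over iter_text_boxes(path) for some PDF path of your own and print describe(*item) for each item. The bounding boxes are in PDF points with the origin at the bottom-left of the page.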
PyPDF2xml: Converts PDF to XML as an
intermediate format. Built on PDFMiner. It started as an alternative to poppler’s
pdftoxml, which didn’t properly decode CID Type2 fonts in PDFs.
ReportLab: Available in an open source version and a
paid version that adds the Report Markup Language (an alternative method of
defining your document).
PDFRW: One of the fastest pure Python PDF
parsers available. It can be used with rst2pdf to faithfully reproduce vector
images. It can be used standalone, or in conjunction with reportlab to
reuse existing PDFs in new ones.
OUR CHOICE - iText: iText is a library that
allows you to generate PDF files on the fly. The iText classes are very useful
for people who need to generate read-only, platform-independent documents
containing text, lists, tables and images. The library is especially useful in
combination with Java(TM) technology-based servlets: the look and feel of HTML
is browser dependent, whereas with iText and PDF you can control exactly how your
servlet's output will look.
OUR CHOICE - Apache PDFBox: For Java devs,
iText/PDFBox are our tools of choice if the requirement focuses on the text
PDF Clown: PDF Clown for Java (PDF Jester) is a Java
1.5 library for reading, manipulating and writing PDF files, with multiple
abstraction layers to satisfy different programming styles: from the lower
level (PDF object model) to the higher (PDF document structure and content
streaming).
PDFXStream: It is written in 100% pure Java, with
no native components or dependencies. Its only requirement is a compliant Java
1.5 (or higher) JVM. It supports both text and image extraction from PDF and
outperforms most other PDF parsers, but it is licensed software
whose free version supports only text extraction.