Pdfminer


We have tried to get an impression of PyMuPDF’s performance. While we know this is very hard and a fair comparison is almost impossible, we feel that we should at least provide some quantitative information to justify our bold comments on MuPDF’s top performance.

Often, you’ll get data from coworkers in .pdf form. This is visually appealing and easy to skim, but an absolute nightmare to get data from. For example, I receive about 50 PDF files every two weeks and need to extract data from tables on the first and fifth pages. In order to use pdfminer.highlevel, you will need to run pip3 install pdfminer.six. Then, to use the package in your code, add the line import pdfminer.highlevel after your import pdfminer line. This is necessary because Python does not automatically import subpackages.
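As a minimal sketch of the first-and-fifth-page scenario above, assuming pdfminer.six is installed: extract_text in pdfminer.highlevel takes zero-based page_numbers, so printed page numbers have to be shifted by one. The helper names to_page_indices and extract_pages are hypothetical and not part of pdfminer.

```python
def to_page_indices(page_numbers):
    """Convert 1-based page numbers (as printed) to the 0-based indices pdfminer expects."""
    indices = []
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        indices.append(number - 1)
    return indices


def extract_pages(pdf_path, page_numbers):
    """Return the text of the given 1-based pages as one string."""
    # Imported here (assumption: pdfminer.six is installed) so that the
    # pure helper above stays usable without the package.
    import pdfminer.highlevel

    return pdfminer.highlevel.extract_text(
        pdf_path, page_numbers=to_page_indices(page_numbers)
    )
```

Calling extract_pages("report.pdf", [1, 5]) (with a placeholder file name) would return plain text only; turning that text back into table rows is a separate step that depends on how each table is laid out.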

Following are three sections that deal with different aspects of performance:

  • document parsing
  • text extraction
  • image rendering

In each section, the same fixed set of PDF files is being processed by a set of tools. The set of tools varies – for reasons we will explain in the section.

Here is the list of files we are using. Each file name is accompanied by further information: size in bytes, number of pages, number of bookmarks (toc entries), number of links, text size as a percentage of file size, KB per page, PDF version and remarks. text % and KB index are indicators for whether a file is text or graphics oriented. E.g. Adobe.pdf and PyMuPDF.pdf are clearly text oriented, all other files contain many more images.

Part 1: Parsing

How fast is a PDF file read and its content parsed for further processing? The sheer parsing performance cannot be compared directly, because batch utilities always execute a requested task completely, in one go, front to end. pdrfw-style lazy parsing complicates this further: pdfrw only parses those parts of a document that are required at any given moment.

To still find an answer to the question, we therefore measure the time each tool takes to copy a PDF file to an output file, doing nothing else.
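The copy measurement can be sketched with the standard library alone. This is only an illustration under assumptions: the names time_copy and byte_copy are hypothetical, and taking the best of several runs is a choice made here, not something the measurements in this chapter are stated to have done.

```python
import shutil
import time


def time_copy(copy_func, src, dst, repeats=3):
    """Return the best wall-clock time in seconds of copy_func(src, dst)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_func(src, dst)
        best = min(best, time.perf_counter() - start)
    return best


def byte_copy(src, dst):
    """Baseline: copy the raw bytes without parsing the PDF at all."""
    shutil.copyfile(src, dst)
```

Any tool under test would be wrapped in a function with the same (src, dst) signature and passed to time_copy in place of byte_copy, so that every tool is timed the same way.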

These were the tools

All tools are either platform independent, or can at least run on both Windows and Unix / Linux (pdftk).

Poppler is missing here, because it is specifically a Linux tool set, although we know that Windows ports exist (apparently created with considerable effort). Technically, it is a C/C++ library for which a Python binding exists, and in that respect it is somewhat comparable to PyMuPDF. In contrast, however, Poppler is tightly coupled to Qt and Cairo. We may still include it in the future, when a handier Windows installation is available. We have, however, seen some analysis that hints at a much lower performance than MuPDF. Our comparison of text extraction speeds also shows a much lower performance of Xpdf, Poppler’s PDF code base.

Image rendering of MuPDF also is about three times faster than the one of Xpdf when comparing the command line tools mudraw of MuPDF and pdftopng of Xpdf – see part 3 of this chapter.

Tool      Description
PyMuPDF   tool of this manual, appearing as “fitz” in reports
pdfrw     a pure Python tool, is being used by rst2pdf, has interface to ReportLab
PyPDF2    a pure Python tool with a very complete function set
pdftk     a command line utility with numerous functions

This is how each of the tools was used:

PyMuPDF:

pdfrw:

PyPDF2:

pdftk:

Observations

These are our run time findings (in seconds, please note the European number convention: meaning of decimal point and comma is reversed):

If we leave out the Adobe manual, this table looks like

PyMuPDF is by far the fastest: on average 4.5 times faster than the second best (the pure Python tool pdfrw, chapeau pdfrw!), and almost 20 times faster than the command line tool pdftk.

Where PyMuPDF requires less than 13 seconds to process all files, pdftk takes almost 4 minutes.

By far the slowest tool is PyPDF2: it is more than 66 times slower than PyMuPDF and 15 times slower than pdfrw! The main reason for PyPDF2’s poor showing is the Adobe manual. It is obviously slowed down by the linear file structure and the immense number of bookmarks in this file. If we take out this special case, then PyPDF2 is only 21.5 times slower than PyMuPDF, 4.5 times slower than pdfrw and 1.2 times slower than pdftk.
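The "times slower" figures above are plain ratios of run times. The sketch below shows the arithmetic, including the conversion from the European number format used in the tables. The function names are hypothetical, and the sketch assumes that a dot, where present, is a thousands separator.

```python
def parse_european(number_text):
    """Parse a number written with a decimal comma, e.g. '1.234,5' -> 1234.5."""
    return float(number_text.replace(".", "").replace(",", "."))


def times_slower(runtime, baseline):
    """Return how many times slower `runtime` is than `baseline` (both in seconds)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return runtime / baseline
```

For instance, with two made-up run times of "45,0" and "10,0" seconds (placeholders, not values from the measurements), times_slower(parse_european("45,0"), parse_european("10,0")) gives a factor of 4.5.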

If we look at the output PDFs, there is one surprise:

Each tool created a PDF of similar size as the original. Apart from the Adobe case, PyMuPDF always created the smallest output.

Adobe’s manual is an exception: The pure Python tools pdfrw and PyPDF2 reduced its size by more than 20% (and yielded a document which is no longer linearized)!

PyMuPDF and pdftk in contrast drastically increased the size by 40% to about 50 MB (also no longer linearized).

So far, we have no explanation of what is happening here.

Part 2: Text Extraction

We also have compared text extraction speed with other tools.

The following table shows a run time comparison. PyMuPDF’s methods appear as “fitz (TEXT)” and “fitz (JSON)” respectively. The tool pdftotext.exe of the Xpdf toolset appears as “xpdf”.

  • extractText(): basic text extraction without layout re-arrangement (using GetText(…, output = “text”))
  • pdftotext: a command line tool of the Xpdf toolset (which also is the basis of Poppler’s library)
  • extractJSON(): text extraction with layout information (using GetText(…, output = “json”))
  • pdfminer: a pure Python PDF tool specialized on text extraction tasks

All tools have been used with their most basic, no-frills functionality – no layout re-arrangements, etc.

For demonstration purposes, we have included a version of GetText(doc, output = “json”), that also re-arranges the output according to occurrence on the page.

Here are the results using the same test files as above (again: decimal point and comma reversed):

Again, (Py-) MuPDF is the fastest around. It is 2.3 to 2.6 times faster than xpdf.

pdfminer, as a pure Python solution, of course is comparatively slow: MuPDF is 50 to 60 times faster and xpdf is 23 times faster. These observations in order of magnitude coincide with the statements on this web site.

Part 3: Image Rendering

We have tested the rendering speed of MuPDF against pdftopng.exe, a command line tool of the Xpdf toolset (the PDF code basis of Poppler).

MuPDF invocation using a resolution of 150 dpi (Xpdf default):

PyMuPDF invocation:

Xpdf invocation:

The resulting runtimes can be found here (again: meaning of decimal point and comma reversed):

  • MuPDF and PyMuPDF are both about 3 times faster than Xpdf.
  • The 2% speed difference between MuPDF (a utility written in C) and PyMuPDF is the Python overhead.

This page explains how to use PDFMiner as a library from other applications.

Overview

PDF is evil. Although it is called a PDF 'document', it's nothing like a Word or HTML document. PDF is more like a graphic representation. PDF contents are just a bunch of instructions that tell how to place the stuff at each exact position on a display or paper. In most cases, it has no logical structure such as sentences or paragraphs and it cannot adapt itself when the paper size changes. PDFMiner attempts to reconstruct some of those structures by guessing from its positioning, but there's nothing guaranteed to work. Ugly, I know. Again, PDF is evil.

[More technical details about the internal structure of PDF: 'How to Extract Text Contents from PDF Manually' (part 1) (part 2) (part 3)]

Because a PDF file has such a big and complex structure, parsing a PDF file as a whole is time and memory consuming. However, not every part is needed for most PDF processing tasks. Therefore PDFMiner takes a strategy of lazy parsing, which is to parse the stuff only when it's necessary. To parse PDF files, you need to use at least two classes: PDFParser and PDFDocument. These two objects are associated with each other. PDFParser fetches data from a file, and PDFDocument stores it. You'll also need PDFPageInterpreter to process the page contents and PDFDevice to translate it to whatever you need. PDFResourceManager is used to store shared resources such as fonts or images.

Figure 1 shows the relationship between the classes in PDFMiner.


Figure 1. Relationships between PDFMiner classes

Basic Usage

A typical way to parse a PDF file is the following:
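A minimal sketch of that flow, assuming the pdfminer.six package and its module layout (pdfminer.pdfparser, pdfminer.pdfdocument, pdfminer.pdfpage, pdfminer.pdfinterp, pdfminer.pdfdevice). The function name process_pages is hypothetical, and the imports are kept inside the function only so that the module loads where the package is absent.

```python
def process_pages(pdf_path, password=""):
    """Run every page of a PDF through an interpreter and return the page count."""
    # Assumption: pdfminer.six is installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice

    with open(pdf_path, "rb") as fp:
        parser = PDFParser(fp)                              # fetches data from the file
        document = PDFDocument(parser, password=password)   # stores the parsed structure
        rsrcmgr = PDFResourceManager()                      # shared fonts, images, ...
        device = PDFDevice(rsrcmgr)                         # base device: discards its input
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        count = 0
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            count += 1
        return count
```

The base PDFDevice does nothing with the content it receives, so this sketch only demonstrates how the classes are wired together; a real task would substitute a device that records what it is given.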

Performing Layout Analysis


Here is a typical way to use the layout analysis function:
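One way this can look, again assuming pdfminer.six: a PDFPageAggregator (from pdfminer.converter) configured with LAParams serves as the device, and its get_result() method returns the layout of the page processed last. The generator name iter_layouts is hypothetical.

```python
def iter_layouts(pdf_path):
    """Yield one layout object (LTPage) per page of the PDF."""
    # Assumption: pdfminer.six is installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # Passing LAParams switches layout analysis on; the defaults are used here.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield device.get_result()
```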

A layout analyzer returns an LTPage object for each page in the PDF document. This object contains child objects within the page, forming a tree structure. Figure 2 shows the relationship between these objects.
LTPage
Represents an entire page. May contain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
LTTextBox
Represents a group of text chunks that can be contained in a rectangular area. Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects. The get_text() method returns the text content.
LTTextLine
Contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontally or vertically, depending on the text's writing mode. The get_text() method returns the text content.
LTChar
LTAnno
Represent an actual letter in the text as a Unicode string. Note that, while an LTChar object has actual boundaries, LTAnno objects do not, as these are 'virtual' characters, inserted by a layout analyzer according to the relationship between two characters (e.g. a space).
LTFigure
Represents an area used by PDF Form objects. PDF Forms can be used to present figures or pictures by embedding yet another PDF document within a page. Note that LTFigure objects can appear recursively.
LTImage
Represents an image object. Embedded images can be in JPEG or other formats, but currently PDFMiner does not pay much attention to graphical objects.
LTLine
Represents a single straight line. Could be used for separating text or figures.
LTRect
Represents a rectangle. Could be used for framing other pictures or figures.
LTCurve
Represents a generic Bezier curve.
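The tree structure described above can be explored with a small recursive walk. The sketch below is duck-typed: it treats any iterable object as a container and everything else as a leaf, which assumes (as in pdfminer.six) that container objects such as LTPage, LTTextBox, LTTextLine and LTFigure can be iterated over while LTChar, LTImage and the graphical objects cannot. The name outline_layout is hypothetical.

```python
def outline_layout(obj, depth=0):
    """Yield one indented line per layout object, visiting the tree depth-first."""
    yield "  " * depth + type(obj).__name__
    try:
        children = iter(obj)
    except TypeError:
        return  # not iterable: a leaf such as a character, image or line
    for child in children:
        yield from outline_layout(child, depth + 1)
```

Printing "\n".join(outline_layout(layout)) for an LTPage gives a quick overview of which boxes, lines and figures the analyzer found on that page.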

Also, check out a more complete example by Denis Papathanasiou.

Obtaining Table of Contents

PDFMiner provides functions to access the document's table of contents ('Outlines').
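A minimal sketch, assuming pdfminer.six, where PDFDocument.get_outlines() yields tuples of the form (level, title, dest, a, se) and PDFNoOutlines (imported here from pdfminer.pdfdocument) is raised when the document has no outline. The helper names format_outline and read_outline are hypothetical.

```python
def format_outline(entries):
    """Turn (level, title, ...) outline tuples into indented text lines."""
    # Top-level entries are assumed to have level 1, so they get no indent.
    return ["  " * max(level - 1, 0) + str(title) for level, title, *_ in entries]


def read_outline(pdf_path, password=""):
    """Return the document outline as indented lines, or [] if there is none."""
    # Assumption: pdfminer.six is installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines

    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp), password=password)
        try:
            return format_outline(document.get_outlines())
        except PDFNoOutlines:
            return []
```

The dest and a fields of each tuple carry the destination or action of the entry; they are ignored here because resolving them to page numbers takes further work.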

Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there's no way to tell exactly which part of text these destinations are referring to.

Extending Functionality

You can extend the PDFPageInterpreter and PDFDevice classes in order to process them differently / obtain other information.


Yusuke Shinyama