Parsing PDF with Python
Oct. 5, 2011, 10:01 p.m.
There's a library for parsing PDF files with Python called PDFMiner. The website provides installation instructions (basically, download it and run 'python setup.py install'). With this library you can search text within PDF files, extract images, and parse all kinds of other layout objects. A good starting point for using PDFMiner can be found on Denis Papathanasiou's blog.
Inspired by his work, I've written a little script using PDFMiner that can search text and extract images. The search phrase is given as a regular expression on the command line, and the script then searches all PDF documents within a given folder. You can also pass a path to a folder in which to save the images found within the PDFs. The basic usage:
./searchpdf.py -h
usage: searchpdf.py [-h] -p PDF_PATH [-e REGEXLIST] [-i IMAGE_PATH]
optional arguments:
  -h, --help     show this help message and exit
  -p PDF_PATH    Path to a directory with PDF files to be parsed
  -e REGEXLIST   Regular expression for searching PDFs
  -i IMAGE_PATH  Optional path for saving images found within PDF
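To give an idea of how a command line like this can be wired up, here is a minimal sketch. It is not the actual searchpdf.py. The function names (build_parser, list_pdfs, search_text, extract_text, main) are made up for illustration, and I'm assuming that -e can be repeated to collect several expressions. For the text extraction the sketch assumes the pdfminer.six fork is installed, which offers pdfminer.high_level.extract_text; the original PDFMiner needs more setup code for the same step. Image saving (-i) is parsed but not implemented here.

```python
import argparse
import os
import re


def build_parser():
    """Build a parser for the options shown in the usage text above."""
    parser = argparse.ArgumentParser(prog="searchpdf.py")
    parser.add_argument("-p", dest="pdf_path", required=True,
                        help="Path to a directory with PDF files to be parsed")
    # Assumption: -e may be given several times; each value is appended.
    parser.add_argument("-e", dest="regexlist", action="append", default=[],
                        help="Regular expression for searching PDFs")
    parser.add_argument("-i", dest="image_path",
                        help="Optional path for saving images found within PDF")
    return parser


def list_pdfs(folder):
    """Return the sorted paths of all *.pdf files directly inside folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
    )


def search_text(text, patterns):
    """Return (pattern, matched string) pairs for every hit in text."""
    hits = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            hits.append((pattern, match.group(0)))
    return hits


def extract_text(path):
    """Extract the text of one PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the rest of the sketch works without the library.
    from pdfminer.high_level import extract_text as pdfminer_extract_text
    return pdfminer_extract_text(path)


def main(argv=None):
    """Search every PDF in the given folder and print the matches."""
    args = build_parser().parse_args(argv)
    for path in list_pdfs(args.pdf_path):
        for pattern, found in search_text(extract_text(path), args.regexlist):
            print("%s: %r matched %r" % (path, pattern, found))
```

Calling main(["-p", "./pdfs", "-e", "invoice"]) would then print one line per match for every PDF in ./pdfs (the folder name and pattern are just examples). Keeping search_text separate from the extraction step makes the regular expression part easy to try out on plain strings.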
If you want to have a look at the script code, you can download it here: searchpdf.py.txt. Just remove the suffix .txt and make it executable.