Mozilla Festival Day 1: Notes from Disassembling the world’s worst data wrapper: PDFs

It’s no secret that PDFs are a terrible way to distribute data, so this session offered some tips and tools for extracting data and information from them.

Tabula

For extracting data in tables. There is an online version at try.tabula.technology, and a version you can download and run locally.

If you have any issues, try the other detection mode.

Data can be exported to CSV and some other formats. The PDF must already contain text-based characters; Tabula won’t do OCR for you.

You can use the online version to select the area you want, export the script, and paste that script into the command line.
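As a rough sketch of what such an exported script amounts to, the Python snippet below assembles a tabula-java style command line. The jar file name, the PDF name, and the area coordinates are placeholders made up for illustration. The `-a` (area as top,left,bottom,right), `-p` (pages), and `-f` (output format) options are assumed from the tabula-java command line tool, so check `--help` on your own version before relying on them.

```python
import shlex


def build_tabula_command(pdf_path, area, pages="1", jar_path="tabula-java.jar"):
    """Build an argument list for a tabula-java style invocation.

    jar_path is a placeholder: point it at wherever your Tabula jar lives.
    area is (top, left, bottom, right) in PDF points, as selected in the UI.
    """
    area_arg = ",".join(str(value) for value in area)
    return [
        "java", "-jar", jar_path,
        "-a", area_arg,     # restrict extraction to the selected area
        "-p", str(pages),   # page(s) to process
        "-f", "CSV",        # output format
        pdf_path,
    ]


# Hypothetical file name and coordinates, for illustration only.
command = build_tabula_command("report.pdf", (100.5, 50.0, 400.0, 550.0))

# Print a copy-and-paste version for the shell.
print(shlex.join(command))
```

Building the command as a list rather than a single string means it can also be handed straight to `subprocess.run` without worrying about quoting file names that contain spaces.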

If you can, use the local version: it will be much faster and has more options.

pdftotext

Command line application that dumps text from a PDF and attempts to preserve the layout (with the -layout switch), but you will generally need regular expressions to parse the information.
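To make the regular expression step concrete, here is a minimal sketch in Python. It assumes pdftotext is installed and on your PATH, and that columns in the layout-preserved output are separated by runs of two or more spaces, which is common but not guaranteed. The sample text and the function names are made up for illustration.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext with -layout; the trailing '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Assumption: a gap of two or more spaces marks a column boundary.
COLUMN_GAP = re.compile(r" {2,}")


def split_rows(layout_text):
    """Split each non-blank line into cells on gaps of two or more spaces."""
    rows = []
    for line in layout_text.splitlines():
        line = line.strip()
        if line:
            rows.append(COLUMN_GAP.split(line))
    return rows


# Made-up sample standing in for real pdftotext output.
sample = (
    "Name            Year    Total\n"
    "\n"
    "North Branch    2014    1,204\n"
    "South Branch    2014      987\n"
)
rows = split_rows(sample)
for row in rows:
    print(row)
```

This simple split breaks down when a cell is empty or contains a double space of its own; in those cases, slicing each line at fixed character positions is likely to be more reliable.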

mudraw

Another command line tool that will extract text from PDF.

pdftk

pdfimages extracts images from PDF files. It is distributed with the Xpdf/Poppler utilities (alongside pdftotext) rather than with pdftk itself.

Notes

More notes on the session etherpad.

Published by

Cynthia

A librarian learning the ways of technology, accessibility, metadata, and people
