Extracting tabular data from PDFs with Camelot & Excalibur

By Vinayak Mehta

Elevator Pitch

Extracting tables from PDFs is hard since the format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. Camelot and Excalibur can help you solve this problem with their easy-to-use Python and web interfaces.

Description

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn’t work.

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs with the current ecosystem of libraries and tools, and demonstrate how Camelot and Excalibur handle these problems better and at scale. Both packages automatically detect and extract tables from PDFs, give you access to the extracted tables as pandas DataFrames, and let you download them in your favorite format, including CSV and Excel.
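
To illustrate why copy-and-paste fails, here is a minimal sketch under the assumption that a PDF page stores text as positioned fragments with no notion of rows or columns. The fragment data, the `rebuild_rows` helper and its `y_tolerance` parameter are hypothetical and chosen for illustration only; this is a naive heuristic, not how Camelot works internally.

```python
from collections import defaultdict

# Hypothetical text fragments as a PDF page might store them: (x, y, text).
# Nothing says which fragments form a row or a column, and the storage
# order need not match the reading order.
fragments = [
    (200, 700, "Qty"),
    (72, 700, "Item"),
    (72, 680, "Apples"),
    (200, 680, "3"),
    (200, 660, "5"),
    (72, 660, "Pears"),
]


def rebuild_rows(fragments, y_tolerance=2):
    """Group fragments into rows by y position, then order each row by x."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so fragments on nearly the same baseline share a row.
        rows[round(y / y_tolerance)].append((x, text))
    # PDF y coordinates grow upwards, so the largest bucket is the top row.
    ordered = sorted(rows.items(), reverse=True)
    return [[text for _, text in sorted(cells)] for _, cells in ordered]


table = rebuild_rows(fragments)
for row in table:
    print(row)
```

Even this toy case needs a guess about row tolerance; real documents add merged cells, wrapped text and missing ruling lines, which is where a dedicated library helps.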

Notes

Outline

Introductions (3 mins)
    - Greetings
    - Introduce myself
    - Set expectations for the talk
        - A Jupyter notebook walkthrough of the code usage shown in the README of the GitHub repo, with a brief explanation of each step

History of the Portable Document Format (2 mins)
    - The Camelot Project
    - PostScript, the page description language
    - Universal need for sharing documents

“I want to extract tables from this PDF. What do I do?” (5 mins)
    - How/where I stumbled across the problem
    - Problems with tabular data being released in PDFs
    - Why another library/tool?
        - Problems with existing libraries/tools

Camelot: PDF Table Extraction for Humans (7 mins)
    - Why the name? (1 min)
        - Monty Python and the Holy Grail reference
        - Fun fact about a Monty Python reference used in the Python programming language
    - How to install and run? (1 min)
        - pip install camelot-py
        - A simple API inspired by requests and pandas
    - How to use? (5 mins)
        - Visual debugging
        - Add table areas and columns
        - Flag superscripts and subscripts
        - Shift and copy text in spanning cells of extracted table
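
The usage steps above map onto a handful of keyword arguments to `camelot.read_pdf`. The following is a minimal sketch, assuming the camelot-py API as documented at the time of writing; the helper names `read_pdf_options` and `extract_tables`, the file name `foo.pdf` and all coordinates are hypothetical and only serve the illustration.

```python
def read_pdf_options(flavor="lattice", table_area=None, columns=None):
    """Build keyword arguments for camelot.read_pdf (hypothetical helper)."""
    # flag_size asks Camelot to flag superscripts and subscripts in the text.
    options = {"flavor": flavor, "flag_size": True}
    if table_area is not None:
        # A table area is passed as an "x1,y1,x2,y2" string in PDF coordinates.
        options["table_areas"] = [",".join(map(str, table_area))]
    if flavor == "stream" and columns is not None:
        # Column separators (x coordinates) apply to the stream flavor only.
        options["columns"] = [",".join(map(str, columns))]
    if flavor == "lattice":
        # Copy text vertically across spanning cells (lattice flavor only).
        options["copy_text"] = ["v"]
        # Shift text in spanning cells to the left and top.
        options["shift_text"] = ["l", "t"]
    return options


def extract_tables(path, pages="1", **options):
    """Return the tables found on the given pages as pandas DataFrames."""
    import camelot  # imported lazily; requires `pip install camelot-py`

    tables = camelot.read_pdf(path, pages=pages, **options)
    return [table.df for table in tables]


stream_options = read_pdf_options(
    "stream", table_area=(316, 499, 566, 337), columns=(72, 95, 209)
)
print(stream_options)
# With Camelot installed and a real file at hand, one might then call:
# frames = extract_tables("foo.pdf", pages="1", **stream_options)
```

Visual debugging is left out of the sketch; in recent Camelot releases it is available through `camelot.plot`, which draws the detected text, grid or table contours for a single extracted table.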

Excalibur: The Web Interface to Camelot (7 mins)
    - Why the name? (1 min)
        - Monty Python and the Holy Grail reference
        - Fun fact about a Monty Python reference used in the Python programming language
    - How to install and run? (1 min)
        - pip install excalibur-py
        - Built with configurability and scale in mind, Airflow-esque
    - How to use? (5 mins)
        - Upload and select page numbers
        - Table auto-detection
        - Draw table areas/columns
        - Input Camelot advanced settings
        - Save settings and select pre-saved settings
        - Download extracted tables in any format (CSV, Excel, JSON, HTML)

How to get involved (1 min)
    - Contributions are welcome!
    - Planned enhancements
        - OCR to extract text and tables from images
        - Removing OpenCV as a dependency
    - Links to documentation and issue tracker
    - Parting note and thank you

Questions (5 mins)

Audience

Basic familiarity with the Python programming language will help the audience understand the talk better. The talk will briefly touch on the history of PDF and the PDF table extraction use-case, so knowing about that isn’t a prerequisite. The talk can be particularly helpful for data analysts, scientists and journalists since they work with a lot of open data (a lot of which is shared as PDFs) and have a recurrent need to extract tables from PDFs for analysis and record-keeping.

After watching this talk, the audience will have a high-level understanding of how the Portable Document Format works. They will also learn how to easily extract tabular data from any type of PDF (the table structures can be bizarre!) using Camelot (the Python library) or Excalibur (the web interface), access extracted tables as pandas DataFrames and save them into CSVs or Excel files.

Why me?

I’m Vinayak Mehta, author of both the Python library (Camelot [1]) and the web interface (Excalibur [2]). I have also written three blog posts on this topic and published two of them ([3] and [4]) on Hacker Noon. I’ll also be speaking on this topic at PyCon US this year [5].

I have given multiple 10-15 minute talks on this topic in Friday demos at SocialCops, where Camelot was created. I’ve also given an introductory talk on Apache Airflow at a PyData meetup in Delhi [6].

[1] https://github.com/socialcopsdev/camelot
[2] https://github.com/camelot-dev/excalibur
[3] https://hackernoon.com/announcing-camelot-a-python-library-to-extract-tabular-data-from-pdfs-605f8e63c2d5
[4] https://hackernoon.com/an-open-source-science-tool-to-extract-tables-from-pdfs-into-excels-3ed3cc7f22e1
[5] https://us.pycon.org/2019/schedule/presentation/232/
[6] https://speakerdeck.com/vinayakmehta/introduction-to-apache-airflow-at-pydata-delhi-meetup-number-25