Manipulate and Extract/Burst PDF Files Into Images, Text and Other Components with Docsplit

Docsplit is a command-line utility written in Ruby (also usable as a Ruby library) that splits documents such as PDF (Portable Document Format) files into their components: plain text, single pages, page images, and metadata (title, author, etc.).

Internal Details

Docsplit by DocumentCloud is basically a thin wrapper around these excellent libraries:

GraphicsMagick – A set of tools and libraries for reading, writing and manipulating images, somewhat similar to ImageMagick. It can be used from the command line as well as from programming languages like C, C++, Lua, Perl, PHP, Python, Tcl, Ruby, etc. Docsplit uses it to generate page images, which in turn requires Ghostscript.
Poppler – A PDF rendering library (in some ways similar to Ghostscript) used by Docsplit to extract text from PDFs as well as metadata like author, date, creator, subject, title, page count, etc.
Ghostscript – Used to render and rasterize PDF and PostScript files. It's a dependency of GraphicsMagick.
PDFtk (The PDF Toolkit) – An open source cross-platform tool that allows various operations on PDF files like merging, splitting, rotating, watermarking, decrypting and encrypting, filling forms, updating metadata, bursting into single pages and much more.
Tesseract – An open source Optical Character Recognition (OCR) engine.
LibreOffice – A free and open source office suite that Docsplit uses to convert documents to PDF when required.

Apart from GraphicsMagick and Poppler, all the other libraries are optional; install them only if you need the functionality they provide.

Installation

Assuming you have Ruby installed, grabbing Docsplit is easy:

[bash]
$ gem install docsplit
[/bash]

Installing GraphicsMagick:

[bash]
$ brew install graphicsmagick # Mac
$ sudo apt-get install graphicsmagick # Ubuntu
[/bash]

Installing Poppler:

[bash]
$ brew install poppler # Mac
$ sudo apt-get install poppler-utils poppler-data # Ubuntu
[/bash]

Installing Ghostscript:

[bash]
$ brew install ghostscript # Mac
$ sudo apt-get install ghostscript # Ubuntu
[/bash]

Installing Tesseract:

[bash]
$ brew install tesseract # Mac
$ sudo apt-get install tesseract-ocr # Ubuntu
[/bash]

Installing PDFtk:

[bash]
$ sudo apt-get install pdftk # Ubuntu
[/bash]

For Mac, download the PDFtk Server binary.

Installing LibreOffice:

[bash]
$ sudo apt-get install libreoffice # Ubuntu
[/bash]

For Mac, download and install the latest release from the LibreOffice website.
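Before moving on, it can help to confirm which of these tools are actually reachable from your shell. The following is a minimal sketch and not part of Docsplit itself: `missing_tools` is a hypothetical helper, and the binary names listed in the comment are the usual ones but may differ on your platform.

```shell
#!/bin/sh
# Print every named command that is NOT found on PATH, one per line.
# (missing_tools is a hypothetical helper, not something Docsplit ships.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usual binary names (assumed; they may vary by platform or package):
#   gm (GraphicsMagick), pdftotext and pdfinfo (Poppler), gs (Ghostscript),
#   tesseract, pdftk, soffice (LibreOffice)
# Example:
#   missing_tools docsplit gm pdftotext pdfinfo gs tesseract pdftk soffice
```

If the example call prints nothing, everything was found; any name it does print is a tool that still needs to be installed (or added to your PATH).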

Usage

Let’s quickly look at how to use Docsplit for various purposes. First, we’ll burst a PDF to generate an image (PNG by default) for each page.

[bash]
$ docsplit images file.pdf
$ docsplit images file.pdf --format jpg --pages 5-10,25 # jpg format and pages 5-10 and 25
[/bash]
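To process several PDFs in one go, the same command can be wrapped in a small loop. This is only a sketch: `render_images` and the `DOCSPLIT` override variable are hypothetical names introduced here, and it relies on Docsplit's `--output` option to give each document its own output directory.

```shell
#!/bin/sh
# Render JPG page images for each PDF into <out_root>/<pdf basename>/.
# DOCSPLIT is a hypothetical override: set DOCSPLIT=echo to print the
# commands instead of running them.
render_images() {
    out_root=$1
    shift
    for pdf in "$@"; do
        name=$(basename "$pdf" .pdf)
        mkdir -p "$out_root/$name"
        ${DOCSPLIT:-docsplit} images "$pdf" --format jpg --output "$out_root/$name"
    done
}

# Example:
#   render_images page-images report.pdf slides.pdf
```

Keeping each document in its own directory avoids mixing page images from different PDFs in one place.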

Split (burst) a document into multiple single-page PDFs:

[bash]
$ docsplit pages file.pdf
[/bash]

Extract the complete plain text (UTF-8 encoded) of a document into a single file or into separate files:

[bash]
$ docsplit text file.pdf # extracts text in a single file
$ docsplit text file.pdf --pages all # extracts text in multiple files (1 per page)
[/bash]
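Once the text is split into one file per page, ordinary shell tools can tell you which pages mention a given term. The helper below is a rough sketch: `pages_mentioning` is a hypothetical name, and it simply assumes the per-page text files end in `.txt` and sit together in one directory.

```shell
#!/bin/sh
# List the per-page text files in a directory that contain a term
# (case-insensitive). pages_mentioning is a hypothetical helper.
pages_mentioning() {
    term=$1
    dir=$2
    grep -il -- "$term" "$dir"/*.txt
}

# Example, after extracting per-page text into the current directory:
#   pages_mentioning "invoice" .
```

The output is a list of file names, one per matching page, which you can then open or feed to another command.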

Convert any document that LibreOffice can read (doc, docx, ppt, xls, as well as html, odf, rtf, swf, svg, and wpd) into a PDF:

[bash]
$ docsplit pdf document.doc
[/bash]

Retrieve information about the document, such as author, date, creator, keywords, producer, subject, title, and length:

[bash]
$ docsplit length file.pdf
70 # pages
$ docsplit date file.pdf
Thu Jun 13 12:03:06 2013
[/bash]
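To see all of these fields at once, the individual commands can be looped over. This is a small sketch, assuming each field name listed above works as a Docsplit subcommand in the same way as `length` and `date`; `print_metadata` and the `DOCSPLIT` override variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Print "key: value" for each metadata field of a PDF.
# DOCSPLIT is a hypothetical override that lets you substitute another
# command or shell function for the real docsplit (handy for testing).
print_metadata() {
    pdf=$1
    for key in author date creator keywords producer subject title length; do
        printf '%s: %s\n' "$key" "$(${DOCSPLIT:-docsplit} "$key" "$pdf")"
    done
}

# Example:
#   print_metadata file.pdf
```

Fields that are not set in the PDF will show up with an empty value.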

Conclusion

Docsplit is an excellent tool for manipulating PDF files: it splits them into their components and retrieves a lot of information about them. If you’ve got any questions, drop them in the comments below.

Author: Rishabh

Rishabh is a full stack web and mobile developer from India.