Manipulate and Extract/Burst PDF Files Into Images, Text and Other Components with Docsplit

Docsplit is a command-line utility written in Ruby (it can also be used as a Ruby library) that splits documents such as PDFs (Portable Document Format) into their components: plain text, single pages, page images and metadata (title, author, etc.).

Internal Details

Docsplit by DocumentCloud is basically a thin wrapper around these excellent libraries:

  1. GraphicsMagick – A set of tools and libraries for reading, writing and manipulating images, somewhat similar to ImageMagick. It can be used from the command line as well as from programming languages like C, C++, Lua, Perl, PHP, Python, Tcl, Ruby, etc. Docsplit uses it to generate page images, which in turn requires Ghostscript.
  2. Poppler – A PDF rendering library (in some ways similar to Ghostscript) used by Docsplit to extract text from PDFs as well as metadata like author, date, creator, subject, title, page count, etc.
  3. Ghostscript – Used to render and rasterize PDF and PostScript files. It's a dependency of GraphicsMagick.
  4. PDFtk (The PDF Toolkit) – An open source cross-platform tool that allows various operations on PDF files like merging, splitting, rotating, watermarking, decrypting and encrypting, filling forms, updating metadata, bursting into single pages and much more.
  5. Tesseract – An open source optical character recognition (OCR) engine.
  6. LibreOffice – A free and open source office suite that Docsplit uses to convert documents to PDF when required.

Only GraphicsMagick and Poppler are required for Docsplit to work; all the other libraries are optional unless you need the functionality they provide.
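To see which of these dependencies are already present, a small Ruby script can look for their executables on the PATH. This is only a minimal sketch: the helper names (find_executable, dependency_report) are hypothetical, and the executable names (gm, pdftotext, gs, pdftk, tesseract, soffice) are the ones these packages commonly install, which is an assumption that may not hold on every system.

```ruby
# Assumed mapping from each dependency to an executable it usually installs.
DEPENDENCIES = {
  "GraphicsMagick" => "gm",
  "Poppler"        => "pdftotext",
  "Ghostscript"    => "gs",
  "PDFtk"          => "pdftk",
  "Tesseract"      => "tesseract",
  "LibreOffice"    => "soffice"
}.freeze

# Return the full path of +name+ if it is an executable file in one of the
# directories listed in +path+, or nil when it cannot be found.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Build a hash of dependency label => executable location (or nil).
def dependency_report(deps = DEPENDENCIES, path = ENV.fetch("PATH", ""))
  deps.map { |label, exe| [label, find_executable(exe, path)] }.to_h
end

dependency_report.each do |label, location|
  puts "#{label.ljust(15)} #{location || 'not found'}"
end
```

Anything reported as "not found" only matters if you need the feature it backs, for example Tesseract for OCR.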

Installation

Given that you have Ruby installed, grabbing docsplit is easy:

$ gem install docsplit

Installing GraphicsMagick:

$ brew install graphicsmagick # Mac
$ sudo apt-get install graphicsmagick # Ubuntu

Installing Poppler:

$ brew install poppler # Mac
$ sudo apt-get install poppler-utils poppler-data # Ubuntu

Installing Ghostscript:

$ brew install ghostscript # Mac
$ sudo apt-get install ghostscript # Ubuntu

Installing Tesseract:

$ brew install tesseract # Mac
$ sudo apt-get install tesseract-ocr # Ubuntu

Installing PDFtk:

$ sudo apt-get install pdftk # Ubuntu

For Mac, download the PDFtk Server binary.

Installing LibreOffice:

$ sudo apt-get install libreoffice # Ubuntu

For Mac, download and install the latest release from the LibreOffice website.

Usage

Let’s quickly look at how to use the library for various purposes. First we’ll burst a PDF to generate an image (PNG by default) for each page:

$ docsplit images file.pdf
$ docsplit images file.pdf --format jpg --pages 5-10,25 # jpg format and pages 5-10 and 25
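The --pages value in the second command mixes a range with a single page. As a rough illustration of how such a specification expands into individual page numbers, here is a small Ruby sketch. The expand_pages helper is hypothetical and is not necessarily how Docsplit parses the option internally.

```ruby
# Hypothetical helper: expand a page specification such as "5-10,25"
# into a sorted list of unique page numbers.
def expand_pages(spec)
  pages = spec.split(",").flat_map do |part|
    part = part.strip
    if part.include?("-")
      first, last = part.split("-", 2).map { |n| Integer(n) }
      raise ArgumentError, "descending range: #{part}" if first > last
      (first..last).to_a
    else
      [Integer(part)]
    end
  end
  pages.uniq.sort
end

p expand_pages("5-10,25")  # prints [5, 6, 7, 8, 9, 10, 25]
```

Under this reading, the command above would produce seven images: pages 5 through 10, plus page 25.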

Split (burst) a document into multiple single-page PDFs:

$ docsplit pages file.pdf

Extract the complete plain text (UTF-8 encoded) of a document into a single file or into separate files:

$ docsplit text file.pdf # extracts text in a single file
$ docsplit text file.pdf --pages all # extracts text in multiple files (1 per page)

Convert any document readable by LibreOffice (doc, docx, ppt, xls, as well as html, odf, rtf, swf, svg and wpd) into a PDF:

$ docsplit pdf document.doc
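Since Docsplit can also be used as a Ruby library, the image, page and text extraction shown so far can be driven from a script as well. The sketch below wraps three of the gem's extract_* methods in a hypothetical burst helper; the output directory layout and the backend: parameter (which lets you pass a stand-in object when the gem is not installed) are assumptions made for this example, not part of Docsplit.

```ruby
begin
  require "docsplit"
rescue LoadError
  warn "docsplit gem not found; install it with `gem install docsplit`"
end

# Hypothetical helper: writes page images, single-page PDFs and plain text
# for +pdf+ into subdirectories of +out_dir+ and returns +out_dir+.
# +backend+ defaults to the Docsplit module and is only looked up when the
# method is called without an explicit backend.
def burst(pdf, out_dir, backend: Docsplit)
  backend.extract_images(pdf, :format => [:jpg], :output => File.join(out_dir, "images"))
  backend.extract_pages(pdf, :output => File.join(out_dir, "pages"))
  backend.extract_text(pdf, :output => File.join(out_dir, "text"))
  out_dir
end

# Usage, assuming the gem and its dependencies are installed:
#   burst("file.pdf", "output")
```

Keeping the three calls in one helper makes it easy to process a whole directory of PDFs in a loop.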

Retrieve document metadata such as author, date, creator, keywords, producer, subject, title and length:

$ docsplit length file.pdf
70 # pages
$ docsplit date file.pdf
Thu Jun 13 12:03:06 2013
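If you want to use the date in a script, the printed line can be turned into a Ruby Time object. This is a minimal sketch that assumes the output always follows the format of the sample above; parse_docsplit_date and DOCSPLIT_DATE_FORMAT are hypothetical names introduced for this example.

```ruby
require "time"

# Assumed format, taken from the sample output "Thu Jun 13 12:03:06 2013".
DOCSPLIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y"

# Hypothetical helper: parse one line of `docsplit date` output.
# The result is interpreted in the local time zone.
def parse_docsplit_date(line)
  Time.strptime(line.strip, DOCSPLIT_DATE_FORMAT)
end

date = parse_docsplit_date("Thu Jun 13 12:03:06 2013")
puts date.strftime("%Y-%m-%d")  # prints 2013-06-13
```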

Conclusion

Docsplit is an excellent tool for splitting PDF files into their components and for retrieving a lot of information about them. If you’ve got any questions, drop them in the comments below.

Author: Rishabh

Rishabh is a full-stack web and mobile developer from India.

2 thoughts on “Manipulate and Extract/Burst PDF Files Into Images, Text and Other Components with Docsplit”

  1. Gee, I had to jump through all these hoops to install everything on OS X, only to find that this creates images that are re-sized smaller than originals. 1500×1100 instead of 2000×1500. What good is that? I can specify the -d parameter but this re-sizes the image so the quality is reduced. I can already do that with ImageMagick and GhostScript. I can dump out full size images from Acrobat already. I need to be able to do it from command line.

  2. OK, I found what I need. It’s ‘pdfimages’ from the package ‘poppler-utils’, as installed above. It extracts the full images in seconds. Perfect.

    Usage: pdfimages [options] <PDF-file> <image-root>

    e.g.,

    pdfimages -j file.pdf file

    extracts JPG images to file-000.jpg .. file-xxx.jpg

    -j Normally, all images are written as PBM (for monochrome images) or PPM (for non-monochrome images) files. With this option, images in DCT format are saved as JPEG files. All non-DCT images are saved in PBM/PPM format as usual.