Docsplit is a command-line utility written in Ruby (it can also be used as a Ruby library) that splits documents such as PDFs (Portable Document Format) into their components: plain text, single pages, page images, and metadata (title, author, etc.).
Docsplit by DocumentCloud is basically a thin wrapper around these excellent libraries:
- GraphicsMagick – A set of tools and libraries for reading, writing and manipulating images, somewhat similar to ImageMagick. It can be used from the command line as well as from programming languages like C, C++, Lua, Perl, PHP, Python, Tcl, Ruby, etc. Docsplit uses it to generate page images, which in turn requires Ghostscript.
- Poppler – A PDF rendering library (in some ways similar to Ghostscript) used by Docsplit to extract text from PDFs as well as metadata like author, date, creator, subject, title, page count, etc.
- Ghostscript – Used to render and rasterize PDF and PostScript files. It is a dependency of GraphicsMagick.
- PDFtk (The PDF Toolkit) – An open source cross-platform tool that allows various operations on PDF files like merging, splitting, rotating, watermarking, decrypting and encrypting, filling forms, updating metadata, bursting into single pages and much more.
- Tesseract – An open source optical character recognition (OCR) engine.
- LibreOffice – A free and open source office suite that Docsplit uses to convert documents to PDF when required.
Only GraphicsMagick and Poppler are required for Docsplit to work; all the other libraries are optional and needed only if you use the features they provide.
Given that you have Ruby installed, grabbing docsplit is easy:
$ gem install docsplit
$ brew install graphicsmagick # Mac
$ sudo apt-get install graphicsmagick # Ubuntu
$ brew install poppler # Mac
$ sudo apt-get install poppler-utils poppler-data # Ubuntu
$ brew install ghostscript # Mac
$ sudo apt-get install ghostscript # Ubuntu
$ brew install tesseract # Mac
$ sudo apt-get install tesseract-ocr # Ubuntu
$ sudo apt-get install pdftk # Ubuntu
For Mac, download the PDFtk Server binary.
$ sudo apt-get install libreoffice # Ubuntu
For Mac, download and install the latest release from the LibreOffice website.
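With that many moving parts, it can be handy to check which of the tools actually ended up on your PATH. Here is a minimal Ruby sketch of such a check. The helper name `find_executable` and the `DOCSPLIT_TOOLS` table are made up for this post and are not part of Docsplit. The binary names are the ones these packages usually install, so treat them as assumptions (on a Mac, for instance, LibreOffice's `soffice` often lives inside the app bundle rather than on the PATH).

```ruby
# Hypothetical helper, not part of Docsplit: report which of the
# dependencies are reachable through the PATH.
# The binary names are the usual ones and may differ on your platform.
DOCSPLIT_TOOLS = {
  "GraphicsMagick" => "gm",
  "Poppler"        => "pdftotext",
  "Ghostscript"    => "gs",
  "PDFtk"          => "pdftk",
  "Tesseract"      => "tesseract",
  "LibreOffice"    => "soffice"
}.freeze

# Return the full path of `name` if an executable file with that name
# exists in one of `dirs`, otherwise nil.
def find_executable(name, dirs = ENV.fetch("PATH", "").split(File::PATH_SEPARATOR))
  dirs.each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Print one line per tool with its location, or "not found".
DOCSPLIT_TOOLS.each do |label, binary|
  path = find_executable(binary)
  puts format("%-15s %s", label, path || "not found")
end
```

A "not found" for an optional tool only matters if you need the feature it provides.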
Let’s quickly look at how to use this library for various purposes. First, we’ll burst a PDF to generate an image (PNG by default) for each page.
$ docsplit images file.pdf
$ docsplit images file.pdf --format jpg --pages 5-10,25 # jpg format and pages 5-10 and 25
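To make the page syntax concrete, here is a small Ruby sketch that expands a spec like `5-10,25` into the individual page numbers it stands for. This is only an illustration of the notation, not Docsplit's own parser, and `expand_pages` is a hypothetical name.

```ruby
# Illustrative only: expand a page spec such as "5-10,25" into page numbers.
# This is not Docsplit's parser; expand_pages is a hypothetical helper.
def expand_pages(spec)
  spec.split(",").flat_map do |part|
    # "5-10" gives first = 5, last = 10; a single page "25" gives last = nil.
    first, last = part.strip.split("-", 2).map { |n| Integer(n, 10) }
    last ||= first
    raise ArgumentError, "descending range: #{part}" if last < first
    (first..last).to_a
  end
end

p expand_pages("5-10,25")
# => [5, 6, 7, 8, 9, 10, 25]
```

So the second command above would produce seven images: one for each of pages 5 through 10, plus one for page 25.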
Split (burst) a document into multiple single-page PDFs:
$ docsplit pages file.pdf
Extract the complete plain text (UTF-8 encoded) of a document into a single file or into separate files:
$ docsplit text file.pdf # extracts text in a single file
$ docsplit text file.pdf --pages all # extracts text in multiple files (1 per page)
Convert any document that LibreOffice can read (doc, docx, ppt, xls, as well as html, odf, rtf, swf, svg, and wpd) into a PDF:
$ docsplit pdf document.doc
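All the commands so far share one shape: `docsplit`, a subcommand, a file, and optionally some flags. If you want to drive them from a Ruby script, one possible approach is to build the argument list first and hand it to `system`. The sketch below assumes only the `--format` and `--pages` flags shown above; `docsplit_argv` is a hypothetical helper, not something Docsplit ships.

```ruby
# Hypothetical helper, not part of Docsplit: build the argument list
# for a docsplit call. Only the --format and --pages flags used in
# this post are handled.
def docsplit_argv(command, file, format: nil, pages: nil)
  argv = ["docsplit", command, file]
  argv += ["--format", format] if format
  argv += ["--pages", pages] if pages
  argv
end

argv = docsplit_argv("images", "file.pdf", format: "jpg", pages: "5-10,25")
puts argv.join(" ")
# prints: docsplit images file.pdf --format jpg --pages 5-10,25

# To actually run it (requires docsplit to be installed):
#   system(*argv) or abort("docsplit failed")
```

Passing the arguments to `system` as separate strings skips the shell, so file names containing spaces need no extra quoting.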
Retrieve information about the document, such as author, date, creator, keywords, producer, subject, title, and length:
$ docsplit length file.pdf
70 # pages
$ docsplit date file.pdf
Thu Jun 13 12:03:06 2013
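If you capture that date output in a script, you will probably want a real time object rather than a string. Here is a minimal Ruby sketch using the standard library's `Time.strptime`. The format string is inferred from the single sample above, so treat it as an assumption; `parse_docsplit_date` and `DOCSPLIT_DATE_FORMAT` are hypothetical names.

```ruby
require "time"

# Format assumed from the sample output: "Thu Jun 13 12:03:06 2013".
DOCSPLIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y"

# Hypothetical helper: turn one line of date output into a Time
# (interpreted in the local time zone, since the sample has no zone).
def parse_docsplit_date(line)
  Time.strptime(line.strip, DOCSPLIT_DATE_FORMAT)
end

created = parse_docsplit_date("Thu Jun 13 12:03:06 2013")
puts created.strftime("%Y-%m-%d")
# prints: 2013-06-13
```

`Time.strptime` raises an `ArgumentError` when the input does not match the format, which is a reasonable signal that your documents produce a different date layout.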
Docsplit is an excellent tool for splitting PDF files into their components and for retrieving a lot of information about them. If you’ve got any questions, drop them in the comments below.