Islandora 8 RDM Text Extraction

The Islandora 8 RDM Project, funded by CANARIE, is pleased to announce that it is now possible to extract, edit and index text from images and PDFs on an Islandora 8 installation.

The Text Extraction module implements a new Media type, Extracted Text, which holds text produced by either OCR or PDF extraction. The module consists of two parts:

OCR Text Extraction

This module was a collaborative effort that uses Danny Lamb’s Hypercube microservice, a key part of the Crayfish suite. Hypercube uses Google’s open-source Tesseract project to extract text from images. OCR has many uses; the most common is pulling text from scanned newspapers and books for cleanup, analysis, and discovery.

If a resource node is tagged as both an Image and a Digital Document, any Media attached to that node and tagged as Original File is sent to the Hypercube microservice for extraction. The extracted text is returned to Islandora 8 and attached to the original resource node, through the media_of field, as a separate Extracted Text media. This text is editable and indexed by Solr, making it fully searchable and adding value to your digital assets.

PDF Text Extraction

Any media tagged as Original File with an application/pdf MIME type, irrespective of the resource node’s model, is sent to the Hypercube microservice to have its text programmatically extracted. The resulting text is returned to Islandora 8 and attached to the original parent node as a media with an editable block of text. As with OCR, the text block is editable and indexed by Solr, so users can find ingested PDFs by searching for words contained in a PDF’s body.
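
The two routing rules described above can be summarized in a short sketch. This is illustrative Python only, not the module's actual code (Islandora 8 itself is built on Drupal); the Node and Media classes, the extraction_kind function, and the tag strings are hypothetical names chosen for the example. The sketch also assumes that the PDF rule is checked first, since the announcement says it applies irrespective of the node's model.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Node:
    # Hypothetical stand-in for a resource node and the tags applied to it.
    tags: Set[str] = field(default_factory=set)


@dataclass
class Media:
    # Hypothetical stand-in for a media entity attached to a node.
    mime_type: str
    usage_tags: Set[str] = field(default_factory=set)


def extraction_kind(node: Node, media: Media) -> Optional[str]:
    """Return "pdf", "ocr", or None, following the rules in the text above."""
    # Only media tagged as Original File are ever sent for extraction.
    if "Original File" not in media.usage_tags:
        return None
    # PDF rule: applies whatever the node's model is.
    # Checking it first is an assumption of this sketch.
    if media.mime_type == "application/pdf":
        return "pdf"
    # OCR rule: the node carries both tags, as the announcement words it.
    if {"Image", "Digital Document"} <= node.tags:
        return "ocr"
    return None
```

For example, a TIFF tagged as Original File on a node tagged both Image and Digital Document would be routed to OCR, while a PDF tagged as Original File would be routed to PDF extraction on any node.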

The Text Extraction module is now incorporated into Islandora 8 core and is freely available for use.