The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

Title: The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents
Publication Type: Journal Article
Year of Publication: 2017
Authors: Damerow, Julia
Secondary Authors: Peirson, B.R. Erick
Tertiary Authors: Laubichler, Manfred
Journal: Journal of Open Research Software
Volume: 5
Start Page: 26
Issue: 1
Date Published: 09/2017
Abstract

In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/).

URL: http://doi.org/10.5334/jors.164
DOI: 10.5334/jors.164