The GoldenGATE Document Editor
Download the demo version of the GoldenGATE editor here (version date 2011.11.21.12.15, archive last modified 2011.11.21.12.15).
The demo version of the GoldenGATE editor includes all the resources needed to convert OCRed biosystematics documents into XML content marked up in the TaxonX XML schema. The description will guide you in marking up any of the test documents:
Description
The GoldenGATE editor is intended to bridge the gap between NLP components and the XML markup of natural language text according to arbitrary XML schemas. It allows NLP components to be deployed for marking up the bodies of literature they were designed for. In this way, it enables transforming the texts into XML content according to an XML schema designed to gain maximum benefit from the knowledge they contain.
The GoldenGATE editor picks up the ideas of plug-in processing resources and pipelined processing implemented in the GATE framework (http://www.gate.co.uk), which has been widely used in many areas of NLP research. At the same time, it provides a full XML editor including assistance for manipulation of both text and markup, thus allowing users to improve data quality by manual intervention.
To achieve maximum flexibility and extensibility, the GoldenGATE editor provides plug-and-play interfaces on many levels: individual automated components for markup creation and manipulation, entire groups of functionalities, components accessing documents in arbitrary storage locations, and components supporting arbitrary document data formats.
Publications
- Sautter, G., K. Böhm, and D. Agosti. 2006. A combining approach to find all taxon names (FAT) in legacy biosystematics literature. Biodiversity Informatics 3, 41-53. (http://jbi.nhm.ku.edu/index.php/jbi/article/view/34/16)
- Sautter, G., D. Agosti, K. Böhm. 2007. Semi-Automated XML Markup of Biosystematics Legacy Literature with the GoldenGATE Editor. In Proceedings of PSB 2007, Wailea, HI, USA, 2007 (http://psb.stanford.edu/psb-online/proceedings/psb07/sautter.pdf)
- Sautter, G., K. Böhm, F. Padberg, and W. Tichy. 2007. Empirical Evaluation of Semi-Automated XML Annotation of Text Documents with the GoldenGATE Editor. In Proceedings of ECDL 2007, Budapest, Hungary, 2007 (http://idaho.ipd.uka.de/GoldenGATE/ecdl2007.pdf)
- Sautter, G., D. Agosti, K. Böhm, C. Klingenberg. 2009. Creating Digital Resources from Legacy Documents - an Experience Report from the Biosystematics Domain. In Proceedings of ESWC, Heraklion, Greece, 2009 (pdf)
- Sautter, G., K. Böhm, C. Kühne, T. Mathäß. 2010. ProcessTron: Efficient Semi-Automated Markup Generation for Scientific Documents. In Proceedings of JCDL, Gold Coast, Australia, 2010 (pdf)
- Sautter, G., K. Böhm. 2011. High-Throughput Crowdsourcing Mechanisms for Complex Tasks. In Proceedings of SocInfo, Singapore, 2011 (Best Paper Award) (pdf)
- Sautter, G., K. Böhm. 2012. Improved Bibliographic Reference Parsing Based on Repeated Patterns. In Proceedings of TPDL 2012, Pafos, Cyprus, 2012 (pdf)
- TaxonX wiki (http://wiki.cs.umb.edu/twiki/bin/view/Ants/WebHome)
Markup Examples
- Marked-up taxonomic treatment of the ant species Probolomyrmex tani, extracted from Fisher, B. L. 2007. A new species of Probolomyrmex from Madagascar. In: Advances in ant systematics (Hymenoptera: Formicidae): Homage to E.O. Wilson - 50 years of contributions, pp. 146-152: 148-150
- Marked-up taxonomic treatment of the ant species Monomorium dentatum, extracted from Sharaf, M. R. 2007. Monomorium dentatum sp. n., a new ant species from Egypt (Hymenoptera: Formicidae) related to the fossulatum group. Zoology in the Middle East (41), pp. 93-98: 94-97
Contact Information
- Christiana Klingenberg (Designer of the markup process and manual), State Museum of Natural History Karlsruhe, Karlsruhe, Germany
- Donat Agosti (Specialist for taxonomic literature), American Museum of Natural History, New York, USA
- Terry Catapano (Designer of TaxonX XML schema), Columbia University, New York, USA
- Guido Sautter (Developer of the GoldenGATE Document Markup System), Universität Karlsruhe (TH), Karlsruhe, Germany
Related Links
- TaxonX (XML schema for taxonomic publications)
- TaxonX Wiki
- Documents (documents digitized by the Madagascar dig lib project)
- Search Portal (search treatments from the Madagascar documents using the semantic markup created in GoldenGATE)
The package natively includes:
Resources for automated and semi-automated markup:
- Management and application of gazetteer lists for automated markup creation
- Management and application of regular expressions for automated markup
- Integration of third-party NLP applications
- Management and application of markup transformation rules
- Management and application of JAPE grammars
- Bundling of automated processing steps into pipelines, making them accessible as a single resource
- Management and application of markup modification scripts, including a scripting console
- Batch processing of lists of input files with an automated processor
- Management and application of AnnotationDiffer style markup quality assessment
- Management and application of markup quality assessment scripts
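The first few resource types in this list (gazetteer lists, regular expressions, and pipelines) can be illustrated with a small, self-contained Java sketch. It does not use the GoldenGATE API: the `Annotation` class, the `Annotator` interface, and the `gazetteer`, `regex`, and `pipeline` helpers are hypothetical names introduced only for this illustration, and the annotation types and sample sentence are made up. The sketch only shows the general idea of producing typed text spans from a term list and a pattern, and of bundling both steps so they can be applied as one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; none of these types belong to GoldenGATE.
public class MarkupPipelineSketch {

    /** A typed character span over the document text (hypothetical). */
    static final class Annotation {
        final String type;
        final int start;
        final int end;
        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A processing step that reads the text and returns new annotations (hypothetical). */
    interface Annotator {
        List<Annotation> annotate(String text);
    }

    /** Gazetteer step: marks every whole-word occurrence of a listed term. */
    static Annotator gazetteer(String type, Set<String> terms) {
        return text -> {
            List<Annotation> found = new ArrayList<>();
            for (String term : terms) {
                Matcher m = Pattern.compile("\\b" + Pattern.quote(term) + "\\b").matcher(text);
                while (m.find()) {
                    found.add(new Annotation(type, m.start(), m.end()));
                }
            }
            return found;
        };
    }

    /** Regular expression step: marks every match of the given pattern. */
    static Annotator regex(String type, String pattern) {
        Pattern p = Pattern.compile(pattern);
        return text -> {
            List<Annotation> found = new ArrayList<>();
            Matcher m = p.matcher(text);
            while (m.find()) {
                found.add(new Annotation(type, m.start(), m.end()));
            }
            return found;
        };
    }

    /** Pipeline: bundles several steps and returns their annotations in text order. */
    static Annotator pipeline(List<Annotator> steps) {
        return text -> {
            List<Annotation> all = new ArrayList<>();
            for (Annotator step : steps) {
                all.addAll(step.annotate(text));
            }
            all.sort(Comparator.comparingInt((Annotation a) -> a.start));
            return all;
        };
    }

    public static void main(String[] args) {
        String text = "Probolomyrmex tani was collected in Madagascar in 2007.";
        Annotator markup = pipeline(List.of(
                gazetteer("location", Set.of("Madagascar", "Egypt")),
                regex("year", "\\b(1[89]|20)\\d{2}\\b")));
        for (Annotation a : markup.annotate(text)) {
            System.out.println(a.type + ": " + text.substring(a.start, a.end));
        }
    }
}
```

Because the pipeline is itself an `Annotator` in this sketch, a bundled sequence of steps can be used anywhere a single step is expected, which mirrors the idea of making a pipeline "accessible as one".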
Components for viewing documents in specialized ways:
- View and edit a document as an XML tree
- View and edit annotations in a list using predefined custom or ad hoc XPath filters
- View and edit a part of a document using a sliding window
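As a rough illustration of what XPath filtering of annotations means, the following Java sketch selects elements from a small marked-up fragment. It uses the standard `javax.xml.xpath` API of the JDK rather than GoldenGATE's own XPath implementation, and the element and attribute names (`treatment`, `name`, `rank`, `location`) are made up for the example; they are not taken from the TaxonX schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only; uses the JDK's XML APIs, not GoldenGATE's.
public class XPathFilterSketch {

    /** Returns the text content of every node selected by the XPath expression. */
    static List<String> filter(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or an invalid expression ends up here.
            throw new IllegalArgumentException("Could not apply filter: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical markup, not TaxonX.
        String xml = "<treatment>"
                + "<name rank=\"species\">Probolomyrmex tani</name>"
                + "<name rank=\"genus\">Probolomyrmex</name>"
                + "<location>Madagascar</location>"
                + "</treatment>";
        // An ad hoc filter: only names annotated with rank 'species'.
        System.out.println(filter(xml, "//name[@rank='species']"));
    }
}
```

Run as is, the example prints `[Probolomyrmex tani]`; changing the expression to `//name` would list both name annotations.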
Components for loading and storing documents:
- Document IO using file system
- Document IO using URLs
- Document upload to the Search & Retrieval Server backing the search portal
Components providing support for a variety of document formats:
Documentation:
Read the online help for the GoldenGATE editor here.
Read the JavaDoc of the GoldenGATE editor and its backing components (not for the actual implementations of the plugin interfaces, though)
- StringUtils (utilities for handling Strings and CSV data) download
- HtmlXmlUtils (utilities for handling XML, including XPath, based on a flexible, error correcting parser) download
- Gamta (basically a slightly reduced implementation of LMNL, but XML compatible and editable in exchange, requires StringUtils and HtmlXmlUtils on the classpath) download
- EasyIO (utilities for IO (file system, web, JDBC), especially for configuration data, plus eMail, requires mail.jar, StringUtils and HtmlXmlUtils on the classpath) download
- GoldenGATE (the core of the GoldenGATE editor, platform for all plug-ins, requires StringUtils, HtmlXmlUtils, Gamta and EasyIO on the classpath) download
Update Sources:
URLs from which the GoldenGATE editor can fetch updates. These URLs are not meant to be accessed manually through a browser, but to be listed in the UpdateHosts.cnfg file in your GoldenGATE root folder. The GoldenGATE editor automatically downloads and installs updates from the URLs listed in this file.
- http://idaho.ipd.uka.de/GoldenGATE/Udates
- http://plazi.cs.umb.edu/GgServer/Updates
- http://plazi2.cs.umb.edu/GgServer/Updates
- http://plazi.org:8080/GgSRS/GgUpdates
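The exact layout of UpdateHosts.cnfg is not described here. As a minimal sketch, assuming the file simply holds one update URL per line, a list of hosts could be read as follows. The class name `UpdateHostsSketch`, the `parseHosts` helper, and the one-URL-per-line layout are assumptions made for illustration, not the editor's actual implementation or file format.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the one-URL-per-line layout is an assumption.
public class UpdateHostsSketch {

    /** Turns the lines of a host list into URIs, skipping blank lines. */
    static List<URI> parseHosts(List<String> lines) {
        List<URI> hosts = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            hosts.add(URI.create(trimmed));
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Sample content using two of the update sources listed above.
        List<String> lines = List.of(
                "http://plazi.cs.umb.edu/GgServer/Updates",
                "",
                "http://plazi.org:8080/GgSRS/GgUpdates");
        for (URI host : parseHosts(lines)) {
            System.out.println(host.getHost() + " -> " + host.getPath());
        }
    }
}
```

In practice the lines would come from the file in the GoldenGATE root folder, for example via `java.nio.file.Files.readAllLines`.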