Integrated Annotation of Biomedical Text: Creating the PennBioIE Corpus
Mark A. Mandel
(Institute for Research in Cognitive Science
Linguistic Data Consortium, University of Pennsylvania)
In 2001 the Institute for Research in Cognitive Science of the University of
Pennsylvania received a grant from the National Science Foundation to
develop "qualitatively better methods for automatically extracting
information from the biomedical literature". In this presentation I will
concentrate on the development of our corpora. Our texts consist of PubMed
abstracts in two biomedical domains (the molecular genetics of cancer and
the inhibition of cytochrome P-450 enzymes). All our annotation is standoff
rather than in-line, using software developed at Penn. After paragraph,
sentence, and word/token segmentation, we annotate each abstract for as many
as about 20 different entity types before applying and manually correcting
part of speech annotation, which in turn is necessary for manual syntactic
annotation (treebanking). The highly technical "dialects" of these abstracts
require correspondingly specialized tokenization and part of speech
procedures, and the entity annotators are permitted and required to correct
the results of the automatic tokenization as necessary. Similarly, all the
development of our definitions and guidelines has involved continual
intensive interaction between the biomedical specialists, the software
developers, and the annotators, whose insights and feedback have proved
essential to the process and thus to the success of the entire project.
[This material is based upon work supported by the (US) National Science
Foundation under Grant No. ITR-0205448.]