Summary
Traditional information extraction approaches either run algorithms directly on existing text, or first collect human annotations of that text and feed both the annotations and the text to an extraction algorithm. This post presents an approach that helps human writers work interactively with existing information extraction algorithms to create documents that those algorithms can parse more successfully.
System and Process
I created a proof-of-concept text-editing tool that displays an existing sentence and prompts the user to re-write it so that as much as possible of the information in the original sentence is reflected in the subject-relation-object triples extracted from the re-written sentence. The tool displays the triples extracted from the re-written text, so the user can track their progress as they revise. By showing writers what an information extraction algorithm extracts from each of their sentences, the tool lets them keep rewriting until the algorithm extracts more complete and accurate information. The editor is implemented as a web application and is publicly available [1][2][3]. The underlying information extraction algorithm is the Stanford CoreNLP OpenIE library [4].
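The extraction step behind the editor can be sketched as a call to a running CoreNLP server. This is a minimal illustration rather than the editor's actual code: the server URL and both function names are assumptions, and it presumes a local CoreNLP server has been started with the openie annotator available.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical server location; assumes a CoreNLP server is already running here.
CORENLP_URL = "http://localhost:9000"

def extract_triples(sentence, server_url=CORENLP_URL):
    """POST a sentence to a CoreNLP server and return (subject, relation, object) triples."""
    props = json.dumps({"annotators": "openie", "outputFormat": "json"})
    url = server_url + "/?" + urllib.parse.urlencode({"properties": props})
    request = urllib.request.Request(url, data=sentence.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return parse_openie_response(json.loads(response.read().decode("utf-8")))

def parse_openie_response(annotation):
    """Pull triples out of CoreNLP's JSON output (one "openie" list per sentence)."""
    return [
        (t["subject"], t["relation"], t["object"])
        for sentence in annotation.get("sentences", [])
        for t in sentence.get("openie", [])
    ]
```

In an interactive editor, something like `extract_triples` would be re-run on each revision so the displayed triples stay in sync with the user's latest text.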
A Very Casual Experiment
Experimental Setup
I drew a sample of sentences from an information extraction benchmark [5] described in a paper by Gabriel Stanovsky and Ido Dagan [6]. In the first step of the experiment, a human re-writer was presented with sentences from the benchmark and prompted to re-write them in the interactive editing tool so that as much as possible of each original sentence's information was parsed by the information extraction algorithm.
The second step was to show an evaluator the original sentence, the subject-relation-object triples produced by running the information extraction algorithm on the original sentence, and the triples produced by running it on the re-written sentence from the first step. The evaluator was asked to pick which set of triples captured the meaning of the original sentence more accurately and completely, with ties going to the original sentence's triples.
Experimental Results
Of the 41 sentences presented by the editing tool, the re-writer was able to re-write 27 (66%) such that they believed the triples the information extraction algorithm parsed from the re-written sentence improved on the triples parsed from the original sentence.
The evaluator in the second step judged that 18 of the 41 re-written sentences (44%) produced more complete and accurate subject-relation-object triples than the original sentences. Restricting to the sentences the re-writer believed they had improved, the evaluator preferred 18 of 27 (67%) of the re-written sentences' extraction results.
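The headline percentages above follow directly from the three counts reported in the text:

```python
TOTAL = 41               # sentences presented to the re-writer
REWRITER_IMPROVED = 27   # re-writer believed the extracted triples improved
EVALUATOR_AGREED = 18    # evaluator judged the re-written triples better

def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

print(pct(REWRITER_IMPROVED, TOTAL))             # 66
print(pct(EVALUATOR_AGREED, TOTAL))              # 44
print(pct(EVALUATOR_AGREED, REWRITER_IMPROVED))  # 67
```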
The re-writer reported spending less than one minute per sentence re-writing sentences in the editing tool.
Bibliography
[1] http://meaning-extractor.com.s3-website-us-east-1.amazonaws.com/
[2] https://github.com/pineapplevendor/interactive-nlp-api
[3] https://github.com/pineapplevendor/interactive_nlp_interface
[4] https://nlp.stanford.edu/software/openie.html
[5] https://github.com/gabrielStanovsky/oie-benchmark
[6] https://pdfs.semanticscholar.org/a32d/7aba28ce9f130934b8e892df5bf2cad97e21.pdf