The Guardian recently announced that it has joined forces with Agence France-Presse (AFP) to build a machine learning solution that accurately extracts quotes from news articles and matches them with the right source. The Guardian says that existing solutions did not perform well on its content: the models struggled to recognise quotes that did not follow a classic pattern, and some returned too many false positives, identifying generic statements as quotes.
Co-referencing, or the process of establishing the source of a quote by finding the correct reference in the text, was also an issue, especially when the source’s name was mentioned several sentences or even paragraphs before the quote itself.
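To make the difficulty concrete, here is a minimal sketch of a naive baseline, not the approach the team used: when a quote is attributed to a pronoun, fall back to the most recent name mentioned in an earlier sentence. The toy names, the `PRONOUNS` set and the `resolve_source` function are all hypothetical.

```python
PRONOUNS = {"he", "she", "they"}  # assumed and deliberately incomplete


def resolve_source(mentions, quote_index, source):
    """Naive baseline: map a pronoun source to the nearest preceding name.

    mentions[i] lists the names found in sentence i (assumed to come from
    an upstream named-entity step). Returns None when no earlier name exists.
    """
    if source.lower() not in PRONOUNS:
        return source  # already a name or description, nothing to resolve
    # Walk backwards through the sentences before the quote.
    for i in range(quote_index - 1, -1, -1):
        if mentions[i]:
            return mentions[i][-1]  # last name in the closest earlier sentence
    return None


# Toy article: names per sentence; the quote sits in sentence 3.
mentions = [["Jane Smith"], ["John Doe"], [], []]
resolved = resolve_source(mentions, 3, "she")
print(resolved)  # John Doe, although "she" more plausibly refers to Jane Smith
```

The nearest-name rule ignores gender, number and how prominent each person is in the story, which is why it breaks down when the right name appears several sentences or paragraphs earlier.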
To train a model to identify quotes in the text, the Guardian used two tools created by Explosion: spaCy, one of the leading open-source libraries for advanced natural language processing using deep neural networks, and Prodigy, an annotation tool that provides an easy-to-use web interface for quick and efficient labelling of training data.
Together with AFP, the team manually annotated more than 800 news articles with three entities: content (the quote, in quotation marks), source (the speaker, which might be a person, an organisation, etc), and cue (usually a verb phrase, indicating the act of speech or expression).
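The three-entity scheme can be illustrated with a minimal sketch, assuming character-offset spans in the style of Prodigy's JSONL records (a `text` field plus a `spans` list with `start`, `end` and `label`). The label names `CONTENT`, `SOURCE` and `CUE`, the example sentence and the `make_span` helper are illustrative assumptions, not the team's actual data or code.

```python
import json

# Hypothetical example sentence; not taken from the annotated dataset.
TEXT = '"We will publish the results next week," said the minister.'

LABELS = {"CONTENT", "SOURCE", "CUE"}  # assumed label names


def make_span(text, phrase, label):
    """Return a character-offset span for the first occurrence of phrase."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"phrase not found: {phrase!r}")
    return {"start": start, "end": start + len(phrase), "label": label}


record = {
    "text": TEXT,
    "spans": [
        # The content span includes the quotation marks.
        make_span(TEXT, '"We will publish the results next week,"', "CONTENT"),
        make_span(TEXT, "said", "CUE"),
        make_span(TEXT, "the minister", "SOURCE"),
    ],
}

# One JSON object per line is the usual JSONL layout for annotation data.
line = json.dumps(record)
```

Storing offsets rather than copied strings keeps each label tied to an exact position in the article, which matters when the same word appears more than once.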
The main challenge in building the training dataset was navigating the ambiguity of different journalistic styles. The first batch of annotations turned out to be quite noisy and inconsistent, but the team's annotations became more consistent with each iteration.
The model correctly identified all three entities (content, source, cue) in 89% of cases. Considering each entity separately, content scored the highest (93%), followed by cue (86%) and source (84%).
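The article does not say how these scores were computed. As a sketch only, the snippet below uses exact-match scoring on toy data and reports three numbers: a rate per entity, a pooled rate over all entity spans, and the share of quotes where all three entities match. A strict all-three rate can never exceed the lowest per-entity rate, so an overall figure that sits between the per-entity figures is more likely a pooled score of some kind. The `score` function and the data are hypothetical.

```python
ENTITIES = ("content", "source", "cue")


def score(gold, predicted):
    """Exact-match rates: per entity, pooled over all spans, and all-three."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    total = len(gold)
    pairs = list(zip(gold, predicted))
    # Number of exact matches for each entity type.
    hits = {name: sum(g[name] == p[name] for g, p in pairs) for name in ENTITIES}
    per_entity = {name: hits[name] / total for name in ENTITIES}
    # Pooled: every entity span counts equally, regardless of type.
    pooled = sum(hits.values()) / (total * len(ENTITIES))
    # Strict: a quote counts only if content, source and cue all match.
    all_three = sum(all(g[n] == p[n] for n in ENTITIES) for g, p in pairs) / total
    return per_entity, pooled, all_three


# Toy data: each quote maps an entity to its (start, end) character span.
gold = [
    {"content": (0, 40), "source": (46, 58), "cue": (41, 45)},
    {"content": (10, 30), "source": (0, 4), "cue": (5, 9)},
]
predicted = [
    {"content": (0, 40), "source": (46, 58), "cue": (41, 45)},
    {"content": (10, 30), "source": (0, 7), "cue": (5, 9)},  # wrong source span
]

per_entity, pooled, all_three = score(gold, predicted)
```

On this toy data the source rate and the all-three rate are both one in two, while the pooled rate is five in six, which shows how far the two kinds of overall figure can diverge.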
The Guardian says that it looks forward to building a robust co-reference resolution system and exploring deep learning further. It also plans to address challenges such as identifying meaningful quotes and content.