Creating an ML Solution That Accurately Extracts Quotes From News Articles

The Guardian recently announced that it has joined forces with Agence France-Presse (AFP) to work on a machine learning solution that accurately extracts quotes from news articles and matches them with the right source. The Guardian says that existing solutions did not perform well on their content: the models struggled to recognise quotes that did not match a classic pattern, and some returned too many false positives, identifying generic statements as quotes.
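To make the "classic pattern" problem concrete, here is a minimal sketch of a rule-based extractor. The article does not define the pattern or name the existing solutions, so the regular expression, the cue-verb list, and the example sentences below are all assumptions chosen for illustration.

```python
import re

# Assumed "classic" shape: "quote," <cue verb> <source>.
# The cue verbs listed here are illustrative, not taken from the article.
CLASSIC_QUOTE = re.compile(
    r'"(?P<content>[^"]+)"\s+(?P<cue>said|says|added)\s+(?P<source>[^.,]+)'
)


def extract_classic(sentence):
    """Return content, cue and source if the sentence fits the classic shape."""
    match = CLASSIC_QUOTE.search(sentence)
    return match.groupdict() if match else None


# A sentence in the classic shape is picked up.
classic = extract_classic('"We will appeal," said the union leader.')

# A quote embedded mid-sentence, with no cue verb after it, is missed.
embedded = extract_classic(
    'The union leader promised an appeal, calling the ruling "deeply flawed".'
)

print(classic)
print(embedded)
```

A rule like this handles only the sentence shapes it was written for, which is one plausible reason a trained model was preferred.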

Co-reference resolution, the process of establishing the source of a quote by finding the correct reference in the text, was also an issue, especially when the source’s name was mentioned several sentences or even paragraphs before the quote itself.
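As a rough illustration of the task, the sketch below resolves a pronoun source to the nearest name mentioned earlier in the article. This is a deliberately naive heuristic and not the approach described by the Guardian; the function name, the pronoun list, and the example data are hypothetical.

```python
# Pronouns treated as unresolved sources (an assumed, incomplete list).
PRONOUNS = {"he", "she", "they"}


def resolve_source(source, source_sentence, name_mentions):
    """Map a pronoun source to the nearest name mentioned at or before it.

    name_mentions is a list of (name, sentence_index) pairs in document order.
    Non-pronoun sources are returned unchanged; None means nothing was found.
    """
    if source.lower() not in PRONOUNS:
        return source
    preceding = [name for name, index in name_mentions if index <= source_sentence]
    return preceding[-1] if preceding else None


# Hypothetical article: one name appears in sentence 0, another in sentence 5.
mentions = [("Jane Smith", 0), ("Omar Khan", 5)]

resolved = resolve_source("she", 3, mentions)          # pronoun in sentence 3
unchanged = resolve_source("Omar Khan", 5, mentions)   # already a name
unresolved = resolve_source("he", 0, [("Omar Khan", 2)])  # no earlier name

print(resolved, unchanged, unresolved)
```

The heuristic ignores gender, number, and intervening speakers, so it picks the wrong person as soon as two candidates precede the quote. That is the kind of case that makes distance between name and quote a real difficulty.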

To train a model to identify quotes in the text, the team used two tools created by Explosion: spaCy, one of the leading open-source libraries for advanced natural language processing using deep neural networks, and Prodigy, an annotation tool that provides an easy-to-use web interface for quick and efficient labelling of training data.

Together with AFP, the team manually annotated more than 800 news articles with three entities: content (the quote, in quotation marks), source (the speaker, which might be a person, an organisation, etc.), and cue (usually a verb phrase, indicating the act of speech or expression).
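One way to picture such an annotation is as labelled character spans over the article text. The sketch below is an assumption about the data layout, not the team's actual schema: the label names, the `Span` class, the `make_span` helper, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass

# Assumed label names mirroring the three entities described above.
LABELS = {"CONTENT", "SOURCE", "CUE"}


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def make_span(text, phrase, label):
    """Label the first occurrence of `phrase` in `text`."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


text = '"We will publish the findings next week," said the minister.'
annotation = {
    "text": text,
    "spans": [
        # Content includes the quotation marks, as in the article's definition.
        make_span(text, '"We will publish the findings next week,"', "CONTENT"),
        make_span(text, "said", "CUE"),
        make_span(text, "the minister", "SOURCE"),
    ],
}

for span in annotation["spans"]:
    print(span.label, repr(text[span.start:span.end]))
```

Storing offsets rather than copied strings keeps each label tied to an exact position, which matters when the same word appears more than once in an article.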

The main challenge in building the training dataset was navigating the ambiguity of different journalistic styles. The first batch of annotations turned out to be quite noisy and inconsistent, but the team improved with each iteration.

The model correctly identified all three entities (content, source, cue) in 89% of cases. Considering each entity separately, content scored the highest (93%), followed by cue (86%) and source (84%).
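The article does not say which metric these percentages represent. As a sketch under that caveat, the code below computes two plausible readings on toy data: a per-entity F1 score based on exact span matches, and the share of examples where every span is predicted correctly. The function names and the toy spans are hypothetical.

```python
from collections import Counter


def per_label_f1(gold, predicted):
    """F1 per label, counting a span as correct only on an exact match.

    gold and predicted are lists (one item per example) of sets of
    (start, end, label) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_spans, pred_spans in zip(gold, predicted):
        for _, _, label in gold_spans & pred_spans:
            tp[label] += 1
        for _, _, label in pred_spans - gold_spans:
            fp[label] += 1
        for _, _, label in gold_spans - pred_spans:
            fn[label] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        found = tp[label] + fp[label]
        expected = tp[label] + fn[label]
        precision = tp[label] / found if found else 0.0
        recall = tp[label] / expected if expected else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def all_correct_rate(gold, predicted):
    """Share of examples whose predicted spans equal the gold spans exactly."""
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Toy data: the second example gets the source boundary wrong.
gold = [
    {(0, 10, "CONTENT"), (11, 15, "CUE"), (16, 28, "SOURCE")},
    {(0, 20, "CONTENT"), (21, 25, "CUE"), (26, 30, "SOURCE")},
]
predicted = [
    {(0, 10, "CONTENT"), (11, 15, "CUE"), (16, 28, "SOURCE")},
    {(0, 20, "CONTENT"), (21, 25, "CUE"), (26, 35, "SOURCE")},
]

scores = per_label_f1(gold, predicted)
rate = all_correct_rate(gold, predicted)
print(scores, rate)
```

Under exact matching, a source span that is off by a few characters counts as both a miss and a false alarm, so boundary errors can pull an entity's score down.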

The Guardian says that it looks forward to building a robust co-reference resolution system and exploring deep learning further. It also plans to address challenges such as identifying meaningful quotes and content.

Victor Dey

Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.