NLP News, Stories and Latest Updates https://analyticsindiamag.com/news/nlp/ Mon, 15 Jul 2024 12:29:22 +0000

Kapur’s AlterEgo ‘Thinks’ Ahead of Musk’s Neuralink https://analyticsindiamag.com/ai-breakthroughs/kapurs-alterego-thinks-ahead-of-musks-neuralink/ Wed, 31 Jan 2024 11:52:40 +0000

AlterEgo, developed by Arnav Kapur, uses neuromuscular signals to produce speech through a sensory and auditory feedback system.

The post Kapur’s AlterEgo ‘Thinks’ Ahead of Musk’s Neuralink appeared first on AIM.


While everyone is busy talking about Elon Musk’s Neuralink and its first human implant, calling it Telepathy and whatnot, AlterEgo creator Arnav Kapur, who ordered a pizza with just a thought five years ago, seems to have already surpassed its purpose. Simply put, AlterEgo offers mind control without implants. 

“The idea was conceptualised 10-12 years ago, but I got to build a business in hardware around it in 2015-16. It became public only in 2018,” said Kapur, in an exclusive interview with AIM, sharing the details about his AI-enabled invention, AlterEgo, that uses neuromuscular signals to produce speech through a sensory and auditory feedback system. 

It was easier said than done. “We were doing this from scratch, because nothing like this had been built,” he added, attributing the delay to the integration of multiple components, such as signal-processing hardware, AI, and an NLP engine, into the device. 

Arnav Kapur demonstrating AlterEgo on the show 60 Minutes. Source: CBS News

Accessibility in Tech

While the device may have manifold uses, AlterEgo was primarily built to help people with speech disabilities. “The project is close to my heart. At the time, I didn’t know about the number of people that were suffering from unusual speech pathologies, and there are a range of conditions that are not even categorised properly. Everybody’s speech is very different. We could augment their ability to communicate,” said Kapur, explaining how his device could assist people with disabilities. 

In his 2019 TED talk, Kapur spoke about an elderly patient who had been living with ALS for over 12 years and had lost the ability to speak. Using AlterEgo, the patient’s surprising first message was a request to reboot his existing computer system. 

Doug, who was diagnosed with ALS 12 years ago, uses AlterEgo. Source: TED

Speaking about the efforts in bringing accessibility to tech, Kapur wishes there was more enthusiasm and vigour around it. “There are a lot of challenges working with such devices and patients and I think big tech could do a lot more in looking at these problems,” he said. 

Having worked with visually impaired people, Kapur believes that the tech that is built for them needs to be inclusive. “A lot of people don’t like to use something that’s specially designed for them. People with different conditions want to feel included rather than have a different lifestyle,” he said. 

Bringing Seamless AI Integration to the World

In contrast to Musk’s concept of BCI hardware, Kapur’s concept of AI involves seamless integration, the most important aspect being that it is non-invasive: “nothing that requires surgery,” he said. “I always thought of AI as an extension, and something that would complement human intelligence. AI should be seamlessly integrating and also complementing you. It should not unplug you from the world.” 

The team at AlterEgo is looking to scale the product and make it commercial. They have collaborated with a number of labs and individuals; however, they are taking the time to refine the device to ensure a release that people can use very intuitively. “I think you have to get it right in terms of the actual interface,” he said. 

Aligns with OpenAI

Interestingly, OpenAI’s CTO Mira Murati also held a similar belief; she hoped to make interacting with computers ‘as intuitive as playing with a ball.’

Kapur, who looks at technology and AI under different lenses, believes that in terms of science of AI, there is still a big piece to be solved. “I think the fundamental behind ChatGPT is that it’s found a very efficient compression scheme of the internet. It’s trained to do question answering very well, instead of just decoding text, but I think there’s still a huge puzzle piece missing,” he said. 

Kapur’s take on gadgets and devices is quite different too. “Even though smartphones and computers are great, they’re not exactly designed to augment you. We don’t test, we are hunched, we are sort of interfacing. It’s designed as an external box, and I think we tend to give AI and computers way more personality than we ought to.”  

A Free Soul 

Born and raised in Delhi, Kapur has been interested in science and the arts since childhood, and he has never distinguished between the two. Having dabbled in different streams, including computational biology at Harvard Medical School (prior to joining MIT for his master’s and PhD), Kapur’s fascination with both the medical and tech worlds has consistently fuelled his research projects. “I love to code. I like working on hardware interfaces, and I like reading theory as well,” said Kapur, who is also fascinated by calculus, which he believes is the foundation of AI. 

He is currently working on a number of independent projects, helping companies with their AI infrastructure. “I’m sort of like a nomad, so I go from place to place, and part-time sort of work on those projects, and the idea is to fund my research,” he said. Kapur has also done a short stint with an aerospace company in Bangalore. 

Former NVIDIA Exec-led RagaAI Aims to Fix Failing AI Models https://analyticsindiamag.com/ai-breakthroughs/former-nvidia-exec-led-ragaai-aims-to-fix-failing-ai-models/ Thu, 25 Jan 2024 09:30:00 +0000

“I believe that Gen AI and AI is magical, and the impact it has on our lives and society is profound, but the failures are also very detrimental,” said Gaurav Agarwal, founder and CEO of RagaAI

The post Former NVIDIA Exec-led RagaAI Aims to Fix Failing AI Models appeared first on AIM.


Years ago in San Francisco, Gaurav Agarwal narrowly escaped death while driving a semi-autonomous vehicle: at the last moment, he had to intervene and override the system when the car failed to apply the brakes at a crucial juncture. The incident got him thinking about how such failures could be avoided, and he found that they are extremely difficult to detect, diagnose, and fix. Those efforts eventually led him to build RagaAI.

The AI-focused startup RagaAI has come out of stealth mode with $4.7 million in funding. Building what it calls the world’s first automated platform for detecting, diagnosing, and fixing AI issues, RagaAI aims to remove the need for human intervention in continuous model improvement. 

“At RagaAI, we are addressing one of the most important problem statements of the AI industry, which is AI failures,” said Gaurav Agarwal, founder and CEO of RagaAI, in an exclusive interaction with AIM. “If AI from the leading companies of the world, OpenAI’s of the world, are failing, then you can think about the thousands of companies that are building and deploying AI.” 

The Need for a Fix

Over the years, Agarwal has witnessed the struggles of teams building AI. “As a data scientist, one wants to spend a lot of time building AI, rather than testing, which comes as an afterthought. And, this becomes a big issue,” said Agarwal. 

“We have been working heads down to build the core technology which is a foundation model that we call RagaAI DNA. We have tested the technology and have proven the technology with our customers, and now we believe it’s time for us to start scaling,” said Agarwal. “I believe that Gen AI and AI is magical, and the impact it has on our lives and society is profound, but the failures are also very detrimental.”

AI Failures Are Imminent

These failures fall broadly into three buckets: the data level, the AI model/architecture level, and the level of operational and environmental conditions. “For example, when we train AI on large GPUs, we usually end up deploying it on a different GPU. So, the environment has changed, which can lead to performance, accuracy, latency and cost change,” said Agarwal. 

‘RagaAI’ simply translates to ‘tuning of AI’, referencing the company’s mission of finding and fixing errors in AI models. The platform is built on the company’s foundational model, RagaAI DNA, with over 300 built-in tests covering all types of AI failures. 

RagaAI Flow Diagram. Source: RagaAI

Robust Model For All

Speaking about the evolution of AI, Agarwal highlighted the adoption of AI models over the years. “If you go back to 20 years ago, it was regression models or simplistic statistical models. 10 years ago, it was deep learning models which became very important in 2015, and now in the past two to three years, it’s been generative models,” he said.

RagaAI is built on top of open-source and proprietary models. “We have basically done a lot of fine tuning and customization on top of that,” said Agarwal. 

Interestingly, the platform supports testing for all kinds of models, which Agarwal buckets into four broad categories: large language models used in chatbots; image and video in computer vision models, used by customers in video surveillance, retail, aerospace, and other sectors; structured tabular data, which benefits customers in finance and insurance; and NLP speech/audio, which call centres use to monitor sales calls. 

“Obviously, you have to do customisation for different domains, but we are one platform which caters to all different elements,” said Agarwal. 

RagaAI has Fortune 200 and Fortune 500 customers from the US, Europe and India. Agarwal also confirmed that their customers have seen over a 90% reduction in failures with RagaAI. The company has strategic partnerships with tech giants NVIDIA, Qualcomm, and AWS. 

As for the competitive landscape, other platforms are building along similar lines, but, according to Agarwal, none matches the extent of RagaAI’s testing modules. “We have seen other companies offering some of this, but they provide up to 10% of what we are offering, maybe 20 or 30 tests. We offer 300+ tests, and also address different kinds of modalities.” 

Powerful Team Expertise

Agarwal, who has over 20 years of experience in the field of AI, has built a robust team at RagaAI. Having worked on the business and product team at Texas Instruments, and then headed autonomous mobility segments at NVIDIA and Ola, Agarwal says his interactions with customers and third-party users built a wealth of personal experience, which “has been a real motivator for me to start RagaAI.”  

The company has large teams in the US and India, with around 40 people in India and a leadership team of five members. They bring experience from big tech companies such as Amazon and from premier graduate schools including Harvard, IIT and IIM. The company is looking to hire in Europe as well. 

Ex-Nvidia & Ola Exec Launches RagaAI for testing and fixing AI https://analyticsindiamag.com/ai-news-updates/ex-nvidia-ola-exec-launches-ragaai-for-testing-and-fixing-ai/ Tue, 23 Jan 2024 13:20:51 +0000

Led by tech pioneer Gaurav Agarwal, multimodal AI testing platform RagaAI emerges from stealth mode

The post Ex-Nvidia & Ola Exec Launches RagaAI for testing and fixing AI appeared first on AIM.


RagaAI, an AI-focused startup, has come out of stealth mode and has successfully closed a $4.7m seed funding round. Pi Ventures spearheaded the funding round, joined by international investors such as Anorak Ventures, TenOneTen Ventures, Arka Ventures, Mana Ventures, and Exfinity Venture Partners. 

RagaAI addresses the need to ensure the performance, safety and reliability of AI models by providing companies with an automated and comprehensive AI testing platform. RagaAI is backed by advisors from Amazon, Google, Meta, Microsoft and NVIDIA. 

RagaAI DNA

RagaAI DNA, the foundational model behind the platform, uses automation to detect, diagnose and fix issues. Offering over 300 different tests, the model can identify issues such as data drift, edge cases, poor data labelling, bias in data, and many more. Furthermore, it is a multimodal platform that supports LLMs, images/videos, 3D, audio, NLP and structured data. The company claims it reduces risks by 90% while accelerating AI development by more than 3x. 
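Data drift, the first issue on that list, is commonly flagged with statistical tests over feature distributions. RagaAI’s own tests are proprietary, but a minimal sketch of one standard technique, the population stability index (PSI), might look like this (the 0.2 threshold is a common rule of thumb, not a RagaAI parameter):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a
    production sample of one numeric feature. PSI > 0.2 is a common
    rule-of-thumb signal of significant drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float("inf")  # catch production values above the training max

    def frac(sample, i):
        count = sum(1 for v in sample if edges[i] <= v < edges[i + 1])
        return max(count / len(sample), 1e-6)  # avoid log(0) for empty bins

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

train = [i / 100 for i in range(100)]          # uniform on [0, 1)
same = [i / 100 for i in range(100)]           # no drift
shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to the upper half

print(psi(train, same) < 0.1)     # True: distributions match
print(psi(train, shifted) > 0.2)  # True: drift detected
```

A production monitor would run a check like this per feature on each batch of incoming data and raise an alert, or trigger retraining, when the index crosses the threshold.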

Tech Pioneer

RagaAI was founded in January 2022 by Gaurav Agarwal, who comes from a rich technology background in computer vision and machine learning. He previously worked at Texas Instruments before moving on to head the mobility business at Ola and at computing giant NVIDIA. 

“At Ola & NVIDIA, I saw the significant consequences of AI failures due to lack of comprehensive testing. Our Foundation Models “RagaAI DNA” is already solving this problem across large fortune 500 companies,” said Gaurav Agarwal, CEO and founder of RagaAI. 

The founding team at RagaAI has a collective AI expertise of over 50 years. The company has already provided solutions for companies in various sectors including ecommerce, automotive, and others, with multiple use cases. 

The funds from RagaAI’s round will be used to advance research and development, with a focus on improving AI testing tools. 

“Driven by their patent-pending drift detection technology, RagaAI, an AI testing platform, is well-suited to solve these massive problems for the AI deployments globally. At pi Ventures, we believe in backing founders who can create disruptive solutions for global impact. In our view, Gaurav and his stellar team at Raga are fulfilling that goal in a big way. We are pleased to be associated with them,” said Manish Singhal, founding partner of pi Ventures that spearheaded the funding round. 

What BloombergGPT Brings to the Finance Table https://analyticsindiamag.com/innovation-in-ai/what-bloomberggpt-brings-to-the-finance-table/ Tue, 04 Apr 2023 07:30:00 +0000

The latest LLM by Bloomberg, trained on 700 billion tokens, is an ingredient model said to boost the Bloomberg Terminal service

The post What BloombergGPT Brings to the Finance Table appeared first on AIM.


Last week, Bloomberg released a research paper on its large language model, BloombergGPT. With over 50 billion parameters, the LLM is a first-of-its-kind generative AI model catering to the finance industry. While the move may set a precedent for other companies, for now, the announcement sounds like a push for the data and news company to seem relevant in the AI space. 

Interestingly, Bloomberg already has Bloomberg Terminal, which employs NLP and ML-trained models for offering financial data. So, naturally, the question that arises is: how much of a value-add is BloombergGPT and where does it stand in comparison to other GPT models? 

Training and Parameters

Bloomberg’s vast repository of financial data from the past forty years has been used to train the GPT model. It is trained on a 363-billion-token proprietary dataset of financial documents from Bloomberg. In addition, public datasets contributing 345 billion tokens were incorporated, bringing the total to roughly 700 billion training tokens. 
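The figures above are the article’s own; the arithmetic below simply confirms the roughly 700-billion-token total and shows the proprietary share of the mix:

```python
# Reported training mix for BloombergGPT
proprietary_tokens = 363e9  # FinPile: Bloomberg's financial documents
public_tokens = 345e9       # general-purpose public datasets

total = proprietary_tokens + public_tokens
print(f"total: {total / 1e9:.0f}B tokens")                   # 708B, i.e. roughly 700 billion
print(f"financial share: {proprietary_tokens / total:.0%}")  # 51%: about half the mix
```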

The company claims that the new model will help improve its existing NLP tasks, such as sentiment analysis (a method that helps predict market prices), news classification, headline generation, question answering, and other query-related tasks. 

On the face of it, the new LLM appears great, but it is still quite limited in its approach: it is not multilingual, it carries biases and toxicity, and it is a closed model.

Multilingual

BloombergGPT, the 50-billion-parameter ‘decoder-only causal language model’, is not trained on multilingual data. Its training dataset, called FinPile, includes news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives, all in English. For instance, company press conferences were included as English transcripts produced through speech recognition. The absence of other languages limits the input training data. 

BLOOM, which has the same model architecture and software stack as BloombergGPT (though BLOOM is larger, at 175 billion parameters), is multilingual. The same goes for GPT-3, which also has 175 billion parameters and multilingual training data. 

Biases and Toxicity

Bloomberg has mentioned that the possibility of the “generation of harmful language remains an open question”. LLMs are known for their biases and hallucinations, a problem that large trained models such as ChatGPT are also combatting. LLM bias can be highly detrimental in finance models, where accurate and factual information determines the reliable prediction of market sentiment. BloombergGPT does not address this concern completely: the company is still evaluating the model and believes that “existing test procedures, risk and compliance controls” will help reduce the problem. Bloomberg also notes that its FinPile dataset contains less biased and toxic language, which should ultimately curb the generation of inappropriate content. 

Closed Model

BloombergGPT is a closed model. Apart from the parameters and general information, details such as the model’s weights are not given in the research paper. Since the model is based on decades of Bloomberg data, combined with the sensitive nature of that information, the LLM is unlikely to be open-sourced. Besides, the model is set to target Bloomberg Terminal users, who already pay a subscription for the service. The company does, however, plan to release the model’s training logs. 

In a conversation with AIM, Anju Kambadur, head of AI Engineering at Bloomberg, said: “BloombergGPT is about empowering and augmenting human professionals in finance with new capabilities to deal with numerical and computational concepts in a more accessible way.” Bloomberg has been using AI, Machine Learning and NLP for more than a decade but each of them required a custom model. “With BloombergGPT, we will be able to develop new applications quicker and faster, some of which have been thought about for years and not developed yet,” he said. 

“Conversational English can be used to post queries using Bloomberg Query Language (BQL) to pinpoint data, which can then be imported into data science and portfolio management tools.” 

Kambadur clarified that BloombergGPT is not a chatbot. “It is an ingredient model that we are using internally for product development and feature enhancement.” The model will help power AI-enabled applications like the Bloomberg Terminal, but will also power back-end workflows within the company’s data operations. Clients may not engage with the model directly, but will use it through Terminal functions in the future. 

Comparison

Below is a comparison with other models: GPT-NeoX (20B parameters) and FLAN-T5-XXL (11B parameters). BloombergGPT, trained on more recent information, is able to answer the questions more accurately than other similarly trained LLMs.  

Source: arxiv.org

BloombergGPT fared better on financial tasks when compared to other similar open models of the same size and was even evaluated on the ‘Bloomberg internal benchmarks’ and other general-purpose NLP benchmarks such as BIG-bench Hard, knowledge assessments, reading comprehension and linguistic tasks.  

Google and Replit’s Quest to Become the Next Copilot X https://analyticsindiamag.com/ai-origins-evolution/google-and-replits-quest-to-become-the-next-copilot-x-2/ Thu, 30 Mar 2023 10:00:00 +0000

With the Google partnership, Replit believes that they will now get access to newer models as they are released which will ultimately reach the developers and help with the goal of “accelerating tech into everyone’s hands”

The post Google and Replit’s Quest to Become the Next Copilot X appeared first on AIM.


Google Cloud recently announced its partnership with Replit, a cloud-based integrated development environment that allows developers to write and deploy code in various programming languages from their web browsers. 

The partnership aims to turn “non-developers into developers”. With the Google partnership, Replit will get full access to Google Cloud’s infrastructure and Google’s machine learning platform, Vertex AI. Along with improving productivity, Replit claims that programmers will be able to code complex-architected software in 1/1000th of the time. 

A day after the partnership announcement, CEO and head of engineering at Replit, Amjad Masad, confirmed that the company will continue to remain an “open platform” and that it is open to working with more companies to “expand the ecosystem”.

The Need

Masad, in an interview with Semafor, explained the limitations the company faced in its early days. When GPT-2 came out in 2019, he started playing around with code generation, but it was only after GPT-3’s release that its potential became obvious to people. They were also unable to do much because OpenAI was strict about what got productised and what did not. 

“In order to produce something like a Copilot, you have to do a lot of low-level engineering, have access to weights and be really fast.” This was something Replit did not have access to, but Microsoft had that advantage due to its “special relationship” with OpenAI. It was only after models started getting open-sourced that the company was able to build on its own. 

With the Google partnership, Amjad Masad believes that they will now get access to newer models as they are released which will ultimately reach the developers and help with the goal of “accelerating tech into everyone’s hands”.

Platforms and products that work seamlessly in silos are ultimately limited when it comes to full adoption. For example, a developer who wants to use LLMs in their daily work needs an integrated development environment (IDE) where LLMs are implemented for wider functionality. This is probably where Replit and Google Cloud’s partnership will shine.

Replit has already been implementing artificial intelligence through Ghostwriter, an AI-powered ‘coding partner’ launched in October 2022. Ghostwriter runs on an LLM trained on publicly available code and fine-tuned by Replit. The company even launched a chatbot for Ghostwriter last month, named ‘Ghostwriter Chat’. It is considered the ‘first conversational AI programmer’, offering an interactive experience like ChatGPT.

Over 30% of the code developed in Ghostwriter is generated by its coding AI. Powered by LLM chat applications, full program code can be generated. 

Vertex AI allows users to train and deploy AI applications and ML models. AutoML, a model-training option provided by Vertex AI, allows users to train on image, text, or video data without writing code. Vertex AI’s multimodal training models will subsequently help elevate user functionality in Replit. 

Battle of the Behemoths

Running similar functionalities, Replit’s closest competitor is GitHub. With the announcement of Google Cloud’s partnership, the spotlight has returned to the race of the tech giants supporting both companies. 

Microsoft’s GitHub was first launched in 2008 and is used by over 94 million developers. The comparatively new player Replit, founded in 2016, has racked up over 20 million developers, as mentioned in its company blog. With the Google Cloud partnership, Replit aims to support “one billion software creators” and ultimately also support the goal of enabling companies to promote development using AI. 

While there are similar functionalities for both, including the presence of AI features, Replit’s Ghostwriter has an edge over GitHub in certain parameters. In Replit, there is a “real-time multiplayer editor” option, and users can build, test and deploy “directly from the browser”—a unique function that is exclusive to Replit. In addition, the Replit app enables ‘voice commands’. Users can instruct the application via voice prompts to say “make an app” for a specific need. The application will also provide the source code, in case the user needs further modifications. 

Is Partnership the Way Ahead?

With Microsoft’s GitHub Copilot considered an essential for coders and Microsoft bringing ChatGPT-like capabilities to GitHub with Copilot X, Google’s push to make a mark in the developer community is evident through its new partnership with Replit.  

To remain relevant in the AI race, and probably take on fellow giant Microsoft, forming crucial partnerships with existing players is a hopeful route. 

Not far behind is another power partnership: AWS and Hugging Face. Amazon Web Services’ partnership with Hugging Face, a company that develops and maintains open-source libraries for natural language processing (NLP) and machine learning, is another initiative taken by a tech giant to accelerate next-generation ML models by helping developers build them.

Gladio Announces Audio Transcription API built on OpenAI Whisper https://analyticsindiamag.com/ai-news-updates/gladio-announces-audio-transcription-api-built-on-openai-whisper/ Thu, 16 Feb 2023 08:56:51 +0000

Gladio’s Audio transcription API is built on Whisper-Large-v2 of OpenAI and has a WER of 1%

The post Gladio Announces Audio Transcription API built on OpenAI Whisper appeared first on AIM.


Jean-Louis Queguiner, the founder of Gladio, which works with AI deployment, announced the release of its audio transcription alpha. Built on OpenAI’s Whisper-Large-v2, the speech-to-text API is able to transcribe a one-hour file in 10 seconds with a Word Error Rate as low as 1%, which the company claims is at least five times more accurate than other products in the market. The company believes this will open up immense scope in the audio intelligence space and broaden future AI applications with plug-and-play APIs.  

Whisper is a pre-trained model for Automatic Speech Recognition (ASR), trained on 680k hours of audio data and proposed by Alec Radford and colleagues at OpenAI. The large-v2 model is trained for 2.5 times more epochs for improved performance. Whisper generates human-readable transcriptions, meaning the ASR system can output commas, periods, hyphens and other punctuation marks. This results in high-quality transcriptions and a low Word Error Rate (WER). 
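WER, the accuracy figure quoted throughout, is the word-level edit distance between the hypothesis transcript and a reference, divided by the number of reference words. A minimal generic implementation (a standard dynamic-programming sketch, not Gladio’s code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on the mat"))  # 0.0
print(wer("the cat sat on the mat", "the cat sit on mat"))      # 2 errors / 6 words ≈ 0.333
```

A 1% WER therefore means roughly one wrong, missing, or extra word per hundred words of reference transcript.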

Integrating the latest NLP and deep learning research, the alpha API is built on neural-network optimisation, which has improved inference speed by around 60 times compared to similar providers in the market. Gladio is currently working on 250 models to create a “holistic intelligence solution” that can perform more than 45 tasks, including translation, summarisation, gender detection and sentiment analysis. 

Inference speed is another parameter considered. The baseline was established by comparing the inference speeds of other STT providers: at a 16 kHz sampling rate and 16-bit encoding, the alpha transcribed one hour of audio in both mono and stereo configurations, and the results were compared with those of other models performing the same task under the same parameters. 
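Speech-to-text speed is often summarised as a real-time factor (RTF), the ratio of processing time to audio duration. Plugging in the one-hour-in-ten-seconds figure claimed above (back-of-the-envelope arithmetic, not Gladio’s benchmark harness):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds

audio_seconds = 60 * 60  # the one-hour test file
rtf = real_time_factor(10, audio_seconds)

print(f"RTF: {rtf:.4f}")                                 # 0.0028
print(f"speedup: {audio_seconds / 10:.0f}x real time")   # 360x
```

A 60x speed advantage over a competitor would then correspond to a competitor RTF of roughly 0.17, i.e. about ten minutes to transcribe the same hour of audio.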

Source: Twitter

The company also believes that “democratizing access” to AI should not only be cost-centric. It should be about simplifying the complexity of the tools used. 

Interesting AI Papers Submitted at ICLR 2023 https://analyticsindiamag.com/ai-origins-evolution/interesting-ai-papers-submitted-at-iclr-2023/ Mon, 03 Oct 2022 09:30:00 +0000

In a recent announcement, ICLR 2023 confirmed its submission dates and marked January 20, 2023 as the final decision date.

The post Interesting AI Papers Submitted at ICLR 2023 appeared first on AIM.


The International Conference on Learning Representations (ICLR) is one of the largest AI conferences held annually, with 2023 marking its eleventh edition. In a recent announcement, ICLR 2023 confirmed its submission dates and marked January 20, 2023 as the final decision date. 

Here are a few papers from the recent ICLR 2023 submission release: 


DreamFusion: Text-to-3D using 2D Diffusion 

Diffusion models, trained on billions of image-text pairs, have driven recent advances in text-to-image synthesis. This work eliminates the need for large-scale 3D datasets by employing a pre-trained 2D text-to-image diffusion model to perform text-to-3D synthesis. The paper examines ‘DreamFusion’ as a method that lifts text-to-image models to 3D by optimising NeRFs, removing the need for datasets of 3D objects and labels. The approach requires no 3D training data and no changes to the image diffusion model, indicating the efficacy of pre-trained image diffusion models as priors. 

Read the full paper here

Quantum Reinforcement Learning 

The paper introduces a new vision for intelligent quantum cloud computing in the financial system. It combines effective learning methods to reduce the risk of fraud by integrating fraud detection into financial services through the ‘Quantum Reinforcement Learning’ method. The research improves on simulating financial trading systems and building financial forecasting models, offering promising prospects for managing portfolio risk in the financial system and deploying algorithms that analyse large-scale data in real time. 

Read the full paper here

Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics

In this paper, the authors study the impact of label errors on group-based model disparity metrics. In particular, they empirically characterise how varying levels of label error in test and training data affect disparity metrics, specifically group calibration, and run empirical sensitivity tests to measure the corresponding change in the disparity metric. The results suggest that real-world label errors are less pernicious to model learning dynamics than synthetic flipping. They also propose an approach, evaluated on a variety of datasets, that achieves a 10–40% improvement over alternative methods in identifying the training inputs that most affect a model’s disparity metric. Overall, the work shows how the proposed approach can help surface training inputs for correction and improve a model’s group-based disparity metrics. 
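The paper’s central measurement, how a group-based disparity metric shifts when labels are corrupted, can be sketched with a toy experiment: compute the accuracy gap between two groups, synthetically flip a fraction of the labels, and re-measure. The group sizes, accuracies, and flip rate below are invented for illustration; the paper’s datasets and metrics are far more elaborate:

```python
import random

def group_disparity(labels, preds, groups):
    """Absolute accuracy gap between group 0 and group 1."""
    def accuracy(g):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        return sum(labels[i] == preds[i] for i in idx) / len(idx)
    return abs(accuracy(0) - accuracy(1))

random.seed(0)
n = 1000
groups = [i % 2 for i in range(n)]
labels = [random.randint(0, 1) for _ in range(n)]
# a hypothetical model: roughly 90% accurate on group 0, 75% on group 1
preds = [y if random.random() < (0.90 if g == 0 else 0.75) else 1 - y
         for y, g in zip(labels, groups)]

clean_gap = group_disparity(labels, preds, groups)

# synthetic flipping: corrupt 10% of the test labels, then re-measure
flipped = [1 - y if random.random() < 0.10 else y for y in labels]
noisy_gap = group_disparity(flipped, preds, groups)

print(f"disparity on clean labels:   {clean_gap:.3f}")
print(f"disparity on flipped labels: {noisy_gap:.3f}")
```

The gap between the two printed numbers is exactly the kind of sensitivity the paper quantifies, then exploits to flag the training points most responsible for the disparity.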

Read the full paper here.

Suppression Helps: Lateral Inhibition-inspired Convolutional Neural Network For Image Classification

This paper proposes a lateral inhibition-inspired (LI) design for convolutional neural networks for image classification. The design applies to both plain convolutions and convolutional blocks with residual connections, while staying compatible with existing modules. The researchers further explore filtering in the lateral direction, incorporating a low-pass filter to model inhibition decay. Their results demonstrate accuracy improvements on the ImageNet classification benchmark with only a minor increase in parameters, showing both the advantage of the design and the value of lateral inhibition for feature learning in image classification. 

Read the full paper here. 

Towards Robust Online Dialogue Response Generation

This paper aims to improve online dialogue generation, proposing hierarchical sampling-based methods to ease the disparity between training and real-world testing. The work targets chatbots that generate uneven responses in real-world applications, particularly in multi-turn settings. Digging deeper, it adopts reinforcement learning and re-ranking methods to optimise dialogue coherence during both training and inference. The researchers also present experiments showing the usefulness of the method in generating robust online responses in both bot conversations and self-talk conversations. Essentially, this research reduces the online dialogue discrepancy while implicitly enhancing dialogue coherence. 

Read the full paper here. 

FARE: Provably Fair Representation Learning

This work introduces FARE (Fairness with Restricted Encoders), the first fair representation learning (FRL) method with provable fairness guarantees. FRL aims to produce fair classifiers via data preprocessing; prior methods offer no such guarantees and achieve worse accuracy-fairness tradeoffs. FARE produces tight upper bounds on unfairness across several datasets while delivering practical fairness and accuracy tradeoffs. 

Read the full paper here. 

Towards a Complete Theory of Neural Networks with Few Neurons

This work examines the loss landscape of neural networks with few neurons. The authors study the dynamics of overparameterised networks, proving that a student network with one neuron has only one critical point (its global minimum) when learning from a teacher network with several neurons. They further prove how a neuron-addition mechanism turns a minimum into a line of critical points, with transitions from saddles to local minima via non-strict saddles. The researchers then discuss how the insights gleaned from their novel proof techniques are likely to shed light on the dynamics of few-neuron networks more broadly.

Read the full paper here. 

The post Interesting AI Papers Submitted at ICLR 2023 appeared first on AIM.

Amazon Unveils New AI Language Model that Beats GPT-3 https://analyticsindiamag.com/ai-news-updates/amazon-unveils-new-ai-language-model-that-beats-gpt-3/ Thu, 08 Sep 2022 15:11:26 +0000 https://analyticsindiamag.com/?p=10074698

The new language model outperformed OpenAI’s GPT-3 and Google’s PaLM on various NLP benchmarks

The post Amazon Unveils New AI Language Model that Beats GPT-3 appeared first on AIM.


Amazon Alexa AI researchers recently unveiled the Alexa Teacher Model (AlexaTM 20B), which beats GPT-3 on NLP benchmarks. The 20-billion-parameter sequence-to-sequence (seq2seq) language model showcases SOTA capabilities in few-shot learning. The model is yet to be released publicly. 

Check out the GitHub repository here

Unlike OpenAI’s GPT-3 or Google’s PaLM, which are decoder-only models, AlexaTM 20B is a seq2seq model that contains both an encoder and a decoder, allowing better performance on machine translation (MT) and summarisation. 

A sequence-to-sequence model is a class of neural network architecture, originally built on recurrent networks, typically used to solve complex language problems including machine translation, chatbots, question answering and text summarisation. 
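As a rough illustration of the encoder-decoder idea (a toy sketch with random weights, not Amazon's architecture), an encoder folds a variable-length input sequence into a single context vector, and a decoder unrolls outputs conditioned on it:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16  # embedding size, hidden size

# Toy parameter matrices (random here; a real model learns these).
W_in = rng.normal(size=(H, D))
U = rng.normal(size=(H, H))
W_out = rng.normal(size=(H, H))

def encode(tokens):
    """Encoder: fold the input sequence into one context vector."""
    h = np.zeros(H)
    for x in tokens:
        h = np.tanh(W_in @ x + U @ h)
    return h

def decode(context, steps):
    """Decoder: unroll output states conditioned on the encoder's context."""
    h, outputs = context, []
    for _ in range(steps):
        h = np.tanh(W_out @ h)
        outputs.append(h)
    return np.stack(outputs)

source = rng.normal(size=(5, D))      # a 5-token "sentence" of embeddings
context = encode(source)
translation = decode(context, steps=3)
print(context.shape, translation.shape)
```

A decoder-only model like GPT-3 skips the separate encoder entirely, which is why seq2seq architectures can have an edge on tasks like MT and summarisation that map one sequence onto another.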

With an eighth of GPT-3’s parameter count, the new language model by Amazon outperformed GPT-3 on the SQuADv2 and SuperGLUE benchmarks. The multilingual model also achieves excellent performance on few-shot MT tasks on the Flores-101 dataset, even for low-resource languages. 

On several other benchmarks like MLSum, AlexaTM outperformed all other models for 1-shot summarization in Spanish, German, French and most language pairs on 1-shot MT tasks. On low-resourced languages like Tamil, Telugu, and Marathi, the improvement was significant. On English-based languages, the model outperformed GPT-3 on MT tasks but came second to the larger PaLM model.

Saleh Soltan, senior applied scientist at Amazon, said, “The proposed style of pretraining enables seq2seq models that outperform much larger decoder-only LLMs across different tasks, both in a few-shot setting and in fine-tuning.”

BITS Pilani launches MTech in AI and ML https://analyticsindiamag.com/ai-news-updates/bits-pilani-launches-mtech-in-ai-and-ml/ Thu, 18 Aug 2022 05:29:34 +0000 https://analyticsindiamag.com/?p=10072983 BITS Pilani launches MTech in AI and ML

The program consists of four semesters including a dissertation in the final semester and will cover areas like NLP, computer vision, robotics and cyber security.

The post BITS Pilani launches MTech in AI and ML appeared first on AIM.


BITS Pilani announced the launch of ‘MTech in AI and ML’ for working professionals to enhance their conceptual and hands-on knowledge about contemporary AI and ML techniques like deep learning and reinforcement learning.


Launched by the Work Integrated Learning Programmes (WILP) division of the institute, the programme consists of four semesters, including a dissertation in the final semester that will cover a wide range of skills for technology professionals and help them advance their careers as AI and ML scientists. 

The course also features online lectures on weekends conducted by BITS Pilani faculty.

The course will focus on AI application areas such as natural language processing (NLP), computer vision, robotics and cyber security, along with software application and associated system support, with implementation beyond data science using tools and technologies like TensorFlow for deep learning, Python libraries, OpenCV for computer vision and NLTK for NLP.

Professor Anita Ramachandran, head of the computer science and information systems group of WILP at BITS Pilani, emphasised the importance of the course. She said the subjects and electives of the programme are designed to aid the comprehensive development of knowledge and skills, and to familiarise ML engineers with supervised, unsupervised and reinforcement learning algorithms, along with application areas like NLP, computer vision, robotics and cyber security.

Professor Ramachandran added that this program will help engineers better understand the underlying ethical issues in applying AI and ML.

According to reports, the global AI market was valued at USD 65.48 billion in 2020 and is projected to reach USD 1581.7 billion in value by 2030.

Click here to download the brochure and apply for the course.

The last date to apply for admission to the course is September 12, 2022.

The Time to Move towards Responsible CX is Now or Never https://analyticsindiamag.com/ai-origins-evolution/the-time-to-move-towards-responsible-cx-is-now-or-never/ Thu, 11 Aug 2022 06:30:00 +0000 https://analyticsindiamag.com/?p=10072604 Why are Consulting Firms Building LLMs

According to a 2022 global study of over 23,000 consumers, 80% believe businesses need to improve their customer experience (CX). The same report warns that 9.5% of the revenue might be at risk due to bad CX. In the current digitised world, there are no two ways about the fact that businesses that truly stand […]

The post The Time to Move towards Responsible CX is Now or Never appeared first on AIM.


According to a 2022 global study of over 23,000 consumers, 80% believe businesses need to improve their customer experience (CX). The same report warns that 9.5% of the revenue might be at risk due to bad CX. In the current digitised world, there are no two ways about the fact that businesses that truly stand out from their competitors are those that provide top-notch, delightful customer experiences.

To be precise, CX is king! That’s resolved. But can we afford a king with unlimited powers? The answer is a thunderous NO.

Are brands crossing the line? 

Understanding customers and strategizing products as per their needs enhances CX and accelerates the company’s revenue growth. However, on the flip side, we are on the verge of a sustainability crisis. To offer better CX to consumers, brands are crossing a line – the red line. 

To begin with, let me take you through a simple example. To target lower-income customers, single servings of foodstuffs like soup and sauces are sold at low rates in tiny, multilayered packaging. Flexible packaging material is made of layers of different types of plastic, each providing different qualities. It is challenging to separate these layers, and the components have little or no commercial value. Hence, most plastic sachets cannot be recycled, forming a huge chunk of our non-biodegradable waste. 

With the advancement of technologies like AI, companies are chasing customers aggressively. This calls for a balance between customer experience and environmental choices. So, what is the way out? 

Need to move towards responsible CX 

Today, businesses worldwide are experimenting with artificial intelligence, machine learning, and advanced analytics to boost CX.  However, many of these added advantages have a flip side that needs to be explored. 

First, the population of Gen Y is constantly on the rise, and so is the demand for digitisation. However, the race towards extensive digitisation is hurting energy efficiency. A clear illustration of the potential effects of digitisation on efficiency and demand may be seen in video-streaming services. 

Although it takes significantly less energy to download a video than it does to create, market and buy a DVD, the accessibility and convenience make it easy to download and watch different content every night. The availability of digitalised services can unquestionably increase utilisation and the resultant energy demand. 

Second comes AI, or natural language processing (NLP) to be specific, a darling of the retail sector, especially when it comes to automating customer engagement 24×7 throughout the year. Researchers from the University of Massachusetts Amherst recently conducted a study to evaluate the life cycle of training various well-known large NLP models. To their surprise, the process can produce more than 626,000 pounds of CO2 equivalent, almost five times the lifetime emissions of the average American car (including the manufacture of the car itself). 
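A quick back-of-the-envelope check of that comparison (the roughly 126,000 lbs CO2-equivalent car-lifetime figure is the one commonly cited alongside this study, and is assumed here):

```python
# Sanity check of the "five cars" comparison quoted above.
model_training_lbs = 626_000   # CO2e for training one large NLP model (study figure)
car_lifetime_lbs = 126_000     # approx. lifetime CO2e of an average US car (assumed)

ratio = model_training_lbs / car_lifetime_lbs
print(f"One training run ≈ {ratio:.1f} car lifetimes of emissions")
```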

Additionally, personalisation engines let marketers decide what kind of experience is best for each customer or prospect based on their previous interactions, the current context, and their anticipated purpose. These engines assist marketers in identifying, selecting, customising, and delivering communications such as content, offers, and other interactions across customer touchpoints. As a result, we are moving towards promoting too much consumerism. The need to create things rises along with the demand for them. This causes more pollution emissions, increased land use, deforestation, and finally, climate change. 

The way ahead 

With the adoption of ML models across industries, the need to gather and analyse enormous volumes of data, and thereby the demand for larger data centres, has boomed. There is a need to set boundaries. An aggressive and careless approach to enhancing customer experience may be beneficial in the short run, but disastrous in the long run. 

According to a survey by The Economist, consumers feel that companies and governments bear equal responsibility for bringing about beneficial environmental change. Online searches for sustainable products have increased by more than 70% globally, and the way consumers interact with sustainable businesses has changed. This pattern is not exclusive to first-world nations; concerns about global warming are also linked to consumer satisfaction in emerging and developing nations. Now, it is for businesses to open their eyes and move away from irresponsible CX to responsible CX.

The dazzling ascent of Hugging Face https://analyticsindiamag.com/intellectual-ai-discussions/the-dazzling-ascent-of-hugging-face/ Wed, 18 May 2022 11:30:00 +0000 https://analyticsindiamag.com/?p=10067374

Hugging Face started out as an NLP-powered personalised chatbot.

The post The dazzling ascent of Hugging Face appeared first on AIM.


Hugging Face has built serious street cred in the AI & ML space in a short span. The north star of the company is to become the Github of machine learning. To that end, Hugging Face is doubling down on its efforts to democratise AI and ML through open source and open science. Today, the platform offers 100,000 pre-trained models and 10,000 datasets for NLP, computer vision, speech, time-series, biology, reinforcement learning, chemistry and more.

“Companies today can not only host models and datasets on Hugging Face, but test them, collaborate on them, run them in production and assess them for a more ethical use,” said Julien Chaumond, co-founder and CTO at Hugging Face.

The beginning

Hugging Face started out as an NLP-powered personalised chatbot. To improve the NLP capabilities, the startup built a library with machine learning models and natural language datasets. Additionally, the founders open-sourced parts of the library.

“We realised that Conversational AI is the hardest task of ML. Our CSO Thomas Wolf was training really cool models and taking pre-trained models and adapting them to do Conversational AI. It was hard! Nonetheless, the tools required to do that were not limited to just achieving Conversational AI but could be applied to all NLP tasks and even most ML tasks too.

What we have seen in ML is the rise in transfer learning, where pre-trained models are used on large amounts of data; that works for all modalities, not just text.

It started with computer vision, when people worked on ImageNet; transfer learning really got amplified in 2017–2018 with the release of BERT and GPT-2, among others. But now we see transfer learning working for every single subfield of machine learning, like audio, time series and RL. The tools we have built, like our hub, work for everything in machine learning. So our focus is to double down on the hub and make sure we do everything for the community,” said Julien. 

Also Read: Why Is Hugging Face Special?

ML for all

“I started working on ML back in 2005-06. But back then there were no real-world applications, so I mostly stuck with software engineering. What I feel now is there is a solid intersection of software engineering and machine learning. ML was a detached field from software engineering. It was a lot more “sciencey.” A lot of success we have had on Hugging Face comes from the fact that we have made a good blend of machine learning and software engineering. We can make machine learning a lot more accessible using best practices in software engineering. We make it easy for everyone to get into ML,” said Julien.

In 2017, Google researchers introduced the ‘transformer’ architecture and took NLP to the next level. However, most of the companies looking to harness the power of NLP didn’t have the resources to build models from the ground up. Enter Hugging Face: the startup’s open-source library, launched around the same time, allowed these companies to ride the NLP wave.

“Hugging Face believes machine learning is the technology trend of the decade and is quickly becoming the default method of building technology. We realised early on that our platform must be extensible, modular, and open rather than an off-the-shelf API for machine learning to truly empower companies and the ML community at large,” said Clement Delangue, CEO and co-founder at Hugging Face.

“We never wanted to be the product of a single company, but rather the collaboration of hundreds of different companies. As a result, we’ve always taken an open-source, collaborative, platform-based approach to machine learning,” said Clement.

“Good” machine learning

“For the longest time, machine learning was driven by engineers and scientists. It was all about trying to achieve the best potential metrics on datasets, and not a lot of people were actually thinking about how to build a good dataset. It was mostly viewed as an engineering issue where you would try to maximise the accuracy of your model on a specific task. Over the last few years, ML as a field has matured a lot. Now, models are used in production for real-world usage, which did not happen before,” said Julien.

Hugging Face has built a platform with a community-first approach (just like GitHub), giving tens of thousands of companies the ability to build machine learning models at a fraction of the cost. 

“Everyone in the community is more aware that using a bad model (a model that is trained on a really biased data set or a super partial data set) that doesn’t reflect what is going to happen in the real world is really bad machine learning. Good machine learning is about trying to set out the data collection in a way that the data set is going to reflect the real-world usage of the model and is unbiased. You should be able to tweak your model in a way that is going to limit the biases or remove them entirely. Make it more transparent.

The community as a whole is improving on these subjects, and we are trying to help in any way we can,” he added.

Forget AGI! There are bigger things to focus on! 

“Though many practitioners emphasise the long term impact of machine learning and eventually AGI that mostly points towards singularity or a “terminator” effect, we chose to focus on the limitations and challenges of ML that need to be tackled now like biases, privacy, and energy consumption. We want to focus on short-term issues like these rather than worry about AGI which we may or may not achieve in the next 50 years,” said Clement.

Clement believes through openness, transparency and collaboration, the ML community can drive responsible and inclusive progress, understanding and accountability. In August 2021, Hugging Face onboarded AI ethicist Dr Margaret Mitchell, who co-created Model Cards. She now guides Hugging Face to create tools to bring fairness to algorithms. “Many NLP models today are incredibly biased. So, I believe it is critical in our field today to simply acknowledge that and build transparency tools, bias mitigation tools so that we can take that into account and make sure we use them the right way,” he said.

The company aims to build a better AI founded on open source, open science, ethics and collaboration.

Hugging Face also has BigScience, a collaborative workshop around large language models gathering more than 1,000 researchers of all backgrounds and disciplines. The community is now working towards training the world’s largest open-source multilingual language model.

“There has always been this trend and this ability to release research for the entire field to have access to and be able to, for example, mitigate biases, create counter powers, and mitigate the negative effects that it can have. To me, it’s critical that researchers continue to be able to share their models and data sets publicly so that the entire field can benefit from them. Perhaps, just to complement what we’ve done with Hugging Face, an initiative called BigScience has been launched, bringing together nearly a thousand researchers from around the world,” said Clement.

Future perfect

Early this month, Hugging Face raised USD 100 million in Series C funding led by Lux Capital. Sequoia, Coatue and existing investors including Addition, Betaworks, AIX Ventures, Cygni Capital, Kevin Durant, Olivier Pomel (co-founder & CEO at Datadog) etc participated in the round. 

“Given the value of machine learning and its increasing popularity, usage is deferred revenue. I don’t see a world in which machine learning is the default way to build technology and Hugging Face is the leading platform for it, and we don’t generate several billion dollars in revenue,”

Clement Delangue

Hugging Face aims to create a positive impact on the AI field by focusing on responsible AI through openly sharing models, datasets, training procedures, and evaluation metrics. The team believes open source and open science bring trust, robustness, reproducibility, and continuous innovation.

Is NLP innovating faster than other domains of AI https://analyticsindiamag.com/ai-origins-evolution/is-nlp-innovating-faster-than-other-domains-of-ai/ Mon, 16 May 2022 04:30:00 +0000 https://analyticsindiamag.com/?p=10067040 NLP Innovation

Why is there such intense competition in this field, or in other words, are other AI domains lagging behind NLP in terms of innovation?

The post Is NLP innovating faster than other domains of AI appeared first on AIM.


Meta recently introduced a 175 billion parameter Open Pretrained Transformer (OPT) model. Meta claims that this massive model, which is trained on publicly available data sets, is the first language technology system of this size to be released with its pretrained models and training code. In what can be considered a rare occurrence, Meta open-sourced this model. 

The OPT model joins the ranks of several other advanced language models that have been developed and introduced recently. The NLP field of AI has seen a massive innovation in the past few years, with participation from leading tech companies of the world. Why is there such intense competition in this field, or in other words, are other AI domains lagging behind NLP in terms of innovation?

Progress in NLP

The field of AI is fragmented broadly into domains that target different kinds of problems. Some systems solve problems involving navigation and movement through physical spaces, like autonomous vehicles and robotics; others deal with computer vision applications, differentiating and categorising images and patterns; still others target common-sense reasoning. Other forms of AI solve critical, specific problems: DeepMind’s AlphaFold, for instance, solved a 50-year-old protein-structure challenge, accelerating the drug discovery process manifold. 

That said, natural language processing is arguably the hottest field of AI. Even in humans, being multilingual and having language proficiency have been considered major indicators of intelligence. It is generally considered suggestive of an ability to parse complex messages and decipher coding variations across context, slang, and dialects. It is hardly surprising that AI researchers consider teaching machines the ability to understand and respond to natural language a great feat and even a step toward achieving general intelligence.

Speaking of innovation in this field, a widely considered breakthrough, the 175 billion parameter GPT-3 was released by OpenAI in 2020. A complex neural network, GPT-3 has been trained on 700 gigabytes of data scraped from across the web, including Wikipedia and digitalised books. GPT-3 set a precedent for even larger, advanced and, in some cases, computationally inexpensive models. 

Innovation that supports NLP

There have been several stages in the evolution of natural language processing: it started in the 80s with expert systems, moved on to the statistical revolution and, finally, the neural revolution. The neural revolution was enabled by the combination of deep neural architectures, specialised hardware and large amounts of data. That said, the revolution in the NLP domain was much slower than in fields like computer vision, which benefitted greatly from the emergence of large-scale pre-trained models, enabled in turn by large datasets like ImageNet. Pretrained ImageNet models helped achieve state-of-the-art results in tasks like object detection, human pose estimation, semantic segmentation and video recognition. They enabled the application of computer vision to domains where training examples are few and annotation is expensive. 

One of the most definitive inventions in recent times is the Transformer. Developed at Google Brain in 2017, the Transformer is a novel neural network architecture based on the concept of the self-attention mechanism. The model outperformed both recurrent and convolutional models. It was also observed that a Transformer requires less computational power to train and is a better fit for modern machine learning hardware, speeding up training by an order of magnitude. It became the architecture of choice for NLP problems, replacing earlier models like LSTM. The additional training parallelisation allowed training on much larger datasets than was once possible. 
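A minimal sketch of the self-attention mechanism at the heart of the Transformer (toy dimensions and random weights, not any production implementation): each token is projected into queries, keys and values, and every output is a weighted mix of all input tokens.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of token embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # pairwise token affinities, scaled
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True) # softmax over keys: rows sum to 1
    return weights @ V, weights                 # each output mixes all input tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)
```

Because every token attends to every other token in one matrix product, the whole sequence is processed in parallel, which is exactly the property that makes Transformers friendlier to modern accelerators than step-by-step recurrent models.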

Thanks to Transformers and the subsequent invention of BERT, NLP achieved its ‘ImageNet moment’. BERT revolutionised NLP, and since then, a wide range of variations of these models have been proposed, such as RoBERTa, ALBERT, and XLNet. Beyond Transformers, several representation techniques like ELMo and ULMFiT have made headlines by demonstrating that pretrained language models can achieve state-of-the-art results on a range of NLP tasks.

“Transformer architecture has revolutionised NLP by enabling language generation and fine-tuning on a scale never previously seen in NLP. Furthermore, these models perform better when trained on large amounts of data; hence organisations are focusing on training larger and larger language models with little change in the model architecture. Big firms like Google and Meta, which can afford this type of training, are developing novel language models, and I expect more of the same from other large corporations,” said Shameed Sait, head of artificial intelligence at tmrw.

Echoing the same sentiment, Anoop Kunchukuttan, Microsoft researcher and the co-founder of AI4Bharat, said, “Interestingly, deep learning’s benefits were initially seen largely in the field of computer vision and speech. What happened was that NLP got some kind of a headstart in terms of the kind of models that were introduced subsequently. The attention-based mechanism, for example, led to great advancements in NLP. Also, the introduction of self-supervised learning influenced progress in the NLP field.”

Access to massive data

One of the major advantages NLP enjoys is the availability of massive datasets to train advanced models on. Hugging Face, a startup building the ‘GitHub for machine learning’, has been working on democratising AI, with a special focus on NLP. Last year, Hugging Face released Datasets, a community library for NLP developed over a year by more than 250 developers. The library contains 650 unique datasets and aims to standardise the end-user interface, version control and documentation, offering a lightweight frontend for internet-scale corpora.

Similarly, Facebook AI open-sourced FLORES-101 database to improve multilingual translation models. It is a many-to-many evaluation dataset covering 101 different languages. By making this information available publicly, Facebook wants to accelerate progress in NLP by enabling developers to generate more diverse and locally relevant tools.

The biggest benefit language modelling has is that training data comes free with any text corpus. The availability of a potentially unlimited amount of training data is particularly important, as NLP does not deal only with the English language.
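A toy sketch of why raw text is “free” training data: under a next-token-prediction objective, every position in a corpus yields a labelled (context, target) example with no human annotation (the sentence here is purely illustrative).

```python
corpus = "natural language processing needs no manual labels".split()

# Next-token prediction: each position in raw text gives one (context, target) pair.
pairs = [(corpus[:i], corpus[i]) for i in range(1, len(corpus))]

for context, target in pairs[:3]:
    print(context, "->", target)
print(len(pairs), "training examples from", len(corpus), "words")
```

The same trick works in any language with a written corpus, which is why the "free labels" property matters so much beyond English.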

Towards AGI? Just not there yet

When the GPT-3 model was released, a lot of over-enthusiastic publications termed it the first step toward AGI. While a model of this magnitude and processing power is nothing short of a technological marvel, considering it a move towards AGI is a bit of a stretch.

New York University emeritus professor Gary Marcus, author of the recent book ‘Rebooting AI’, said in an earlier interview with Analytics India Magazine, “The specific track we are on is large language models, an extension of big data. My view about those is not optimistic. They are less astonishing in their ability not to be toxic, tell the truth, or be reliable. I don’t think we want to build a general intelligence that is unreliable, misinforms people, and is potentially dangerous. For instance, you have GPT-3 recommending that people commit suicide.

There’s been enormous progress in machine translation, but not in machine comprehension. Moral reasoning is nowhere, and I don’t think AI is a healthy field right now.”

In a rare occurrence, Marcus’s rival Yann LeCun seems to agree with him. In a separate conference, LeCun called language an epiphenomenon of human intelligence, adding that there is a lot to intelligence that has nothing to do with language. “That’s where we should attack things first. … [Language] is number 300 in the list of 500 problems that we need to face,” LeCun said.

So while language models and the domain of NLP might be certainly important to achieve AGI, it is simply not enough. For the time being, with the impending GPT-4 announcement and other language models waiting to be introduced, one may continue to see accelerated progress in the field for a long time to come.

Hugging Face raises USD 100 Mn in Series C https://analyticsindiamag.com/ai-news-updates/hugging-face-raises-usd-100-mn-in-series-c/ Mon, 09 May 2022 16:30:56 +0000 https://analyticsindiamag.com/?p=10066629

With the new funding, we will be doubling down on research, open-source, products and responsible democratisation of AI.

The post Hugging Face raises USD 100 Mn in Series C appeared first on AIM.


Hugging Face has raised USD 100 million in Series C funding led by Lux Capital. Sequoia, Coatue and existing investors including Addition, Betaworks, AIX Ventures, Cygni Capital, Kevin Durant, Olivier Pomel (co-founder & CEO at Datadog) etc participated in the round. 

“Machine learning is becoming the default way to build technology. Hugging Face is the most used ML platform and community with over 10,000 companies using it, 100,000 pre-trained models & 10,000 datasets shared on the hub for NLP, computer vision, speech, time-series, biology, reinforcement learning, chemistry and more. Not only does Hugging Face host models and datasets but empowers companies to test them, collaborate on them, run them in production and assess them for a more ethical use thanks to its amazing community,” said Julien Chaumond, co-founder and CTO at Hugging Face.

Hugging Face aims to create a positive impact on the AI field by focusing on responsible AI through openly sharing models, datasets, training procedures, and evaluation metrics. The team believes that open source and open science bring trust, robustness, reproducibility, and continuous innovation.

“Though many practitioners emphasise the long term impact of machine learning and eventually AGI that mostly points towards singularity or a “terminator” effect, we chose to focus on the limitations and challenges of ML that need to be tackled now like biases, privacy, and energy consumption. We believe that through openness, transparency and collaboration, we as a community can foster responsible and inclusive progress, understanding and accountability to mitigate these challenges. This way, we aim to build a better future where AI is founded on open source, open science, ethics and collaboration. With the new funding, we will be doubling down on research, open-source, products and responsible democratisation of AI,” said Clement Delangue, CEO and co-founder at Hugging Face.

Hugging Face is also leading BigScience, a collaborative workshop around large language models gathering more than 1,000 researchers of all backgrounds and disciplines. The community is now working towards training the world’s largest open-source multilingual language model.

Hugging Face started its life as a chatbot and has come a long way to become “the home of machine learning.” In the past 12 months, the company has grown from 30 to 120+ members and is actively hiring.

The post Hugging Face raises USD 100 Mn in Series C appeared first on AIM.

]]>
What can we expect from GPT-4? https://analyticsindiamag.com/ai-origins-evolution/what-can-we-expect-from-gpt-4/ Thu, 21 Apr 2022 12:30:00 +0000 https://analyticsindiamag.com/?p=10065453

GPT-4 will not have 100 trillion parameters.

The post What can we expect from GPT-4? appeared first on AIM.

]]>

Going by the release cycle of the GPT franchise, the launch of the fourth generation is imminent, if not overdue. Last year, Sam Altman, the CEO of OpenAI, in a Q&A session at AC10 online meetup, spoke about the impending GPT-4 release. The release is probably on tap for July-August this year. However, OpenAI has kept a tight lid on the release date, and there is no definitive information available in the public domain on the same. But, one thing is for sure: GPT-4 will not have 100 trillion parameters.

GPT-3, released in May 2020, has 175 billion parameters. The third generation in the GPT-n series uses deep learning to produce human-like text. On September 22, 2020, Microsoft licensed the exclusive use of GPT-3. Based on the available information and Sam Altman’s statements at the Q&A session, we have compiled a list of improvements to expect in GPT-4.

Size doesn’t matter

Large language models like GPT-3 have achieved outstanding results without much model parameter updating. Though GPT-4 is most likely to be bigger than GPT-3 in terms of parameters, Sam Altman has clarified that size won’t be the differentiator for the next generation of OpenAI’s autoregressive language model. The parameter count is likely to fall between that of GPT-3 and Gopher, i.e., between 175 billion and 280 billion.

NVIDIA and Microsoft’s love-child Megatron-Turing NLG held the title of the largest dense neural network at 530 billion parameters (roughly 3x GPT-3) until Google’s PaLM (540 billion parameters) took the cake. Interestingly, smaller models such as Gopher (280 billion parameters) and Chinchilla (70 billion parameters) have outperformed MT-NLG across several benchmarks.

In 2020, OpenAI’s Jared Kaplan and the team claimed performance improved with the number of parameters. The PaLM model showed performance improvements from scale have not yet plateaued. However, Sam Altman has hinted that OpenAI is taking a different approach. He said OpenAI would no longer focus on making extremely large models but rather on getting the most out of smaller models. The AI research lab will look at other aspects — such as data, algorithms, parameterisation, or alignment — to bring significant improvements.

GPT-4 – a text-only model

Multimodal models are the deep learning models of the future. Because we live in a multimodal world, our brains are multisensory. Perceiving the world in only one mode at a time severely limits AI’s ability to navigate and comprehend it. Making GPT-4 a text-only model could be an attempt to push language models to their limits, adjusting parameters like model and dataset size before moving on to the next generation of multimodal AI.

Sparsity

Sparse models that use conditional computation in different parts of the model to process different inputs have been successful. Such models scale easily beyond the 1 trillion parameter mark without incurring high computing costs. However, the benefits of MoE approaches taper off on very large models. GPT-4, like GPT-2 and GPT-3, will be a dense model. In other words, all parameters will be used to process any given input.
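The conditional-computation idea behind sparse models can be sketched in a few lines of NumPy. The snippet below is a hypothetical top-1 mixture-of-experts layer, purely for illustration: each input is routed to a single expert, so only a fraction of the total parameters touch any given input, in contrast to a dense model like GPT-3 where every parameter is used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture-of-experts layer: 4 expert weight matrices,
# but each input is routed to only one of them (top-1 gating),
# so only a fraction of the parameters process any given input.
n_experts, d_in, d_out = 4, 8, 8
experts = rng.normal(size=(n_experts, d_in, d_out))
gate = rng.normal(size=(d_in, n_experts))

def moe_forward(x):
    # Route each input row to the expert with the highest gate score.
    scores = x @ gate                  # (batch, n_experts)
    chosen = scores.argmax(axis=1)     # top-1 expert index per input
    out = np.empty((x.shape[0], d_out))
    for i, e in enumerate(chosen):
        out[i] = x[i] @ experts[e]     # only one expert's weights used
    return out, chosen

x = rng.normal(size=(6, d_in))
y, routing = moe_forward(x)
print(y.shape, routing)  # one expert index per input
```

A dense model, by contrast, multiplies every input through every weight matrix, which is the design GPT-4 is expected to keep.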

Optimisation

Assuming that GPT-4 could be larger than GPT-3, the number of training tokens required to be compute-optimal (according to DeepMind’s findings) could be around 5 trillion, an order of magnitude greater than current datasets. The number of FLOPs required to train the model to minimal training loss would be 10-20x that of GPT-3. In the Q&A, Altman said GPT-4 would require more compute than GPT-3. OpenAI will focus on optimising other variables rather than scaling the model. 
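These figures can be sanity-checked with back-of-envelope arithmetic, assuming DeepMind's roughly 20-tokens-per-parameter compute-optimal rule and the common C ≈ 6·N·D FLOPs approximation. Both are assumptions for illustration, not OpenAI's numbers:

```python
# Back-of-envelope check of the compute-optimal figures above, assuming
# DeepMind's Chinchilla rule of thumb (~20 training tokens per parameter)
# and the standard approximation: FLOPs ≈ 6 * params * tokens.
gpt3_params = 175e9
gpt3_tokens = 300e9              # GPT-3's reported training token count

gpt4_params = 175e9              # lower end of the rumoured 175B-280B range
gpt4_tokens = 20 * gpt4_params   # ≈ 3.5 trillion tokens, order of 5T

gpt3_flops = 6 * gpt3_params * gpt3_tokens
gpt4_flops = 6 * gpt4_params * gpt4_tokens

print(f"tokens: {gpt4_tokens / 1e12:.1f}T, "
      f"compute ratio: {gpt4_flops / gpt3_flops:.1f}x")
```

At the low end of the parameter range this gives roughly a 12x increase in compute, consistent with the 10-20x estimate above.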

In alignment

OpenAI’s north star is beneficial AGI. The lab is likely to build on learnings from the InstructGPT models, which are trained with humans in the loop. InstructGPT was deployed as the default language model on OpenAI’s API and is much better at following user intentions than GPT-3, while also being more truthful and less toxic, using techniques developed through OpenAI’s alignment research. However, the alignment was limited to OpenAI employees and English-speaking labellers. GPT-4 is likely to be more aligned with humans than GPT-3.

The post What can we expect from GPT-4? appeared first on AIM.

]]>
Council Post: Transforming NLP capabilities into impactful business applications https://analyticsindiamag.com/ai-origins-evolution/council-post-transforming-nlp-capabilities-into-impactful-business-applications/ Wed, 30 Mar 2022 07:30:00 +0000 https://analyticsindiamag.com/?p=10063909 Council Post: Transforming NLP capabilities into impactful business applications

Integrating a core NLP functionality or solution with other modules is essential to make it impactful.

The post Council Post: Transforming NLP capabilities into impactful business applications appeared first on AIM.

]]>

The global Natural Language Processing (NLP) market size is expected to grow from USD 11.6 billion in 2020 to USD 35.1 billion by 2026, at a CAGR of 20.3% during the forecast period. NLP has existed in various shapes and forms for a while now, from digitising documents to OCR to text analytics to voice analytics. However, it has grown significantly over the years in terms of core capabilities.

For businesses, NLP applications are a good way to showcase potential. In the last few years, while technical competency has evolved exponentially through new algorithms, large-scale language models and higher-accuracy processes, a bigger gain for the field of NLP has been its proven value across business functions, achieved by converting core functionalities into use-case-driven applications.

Focusing on the business problem

Converting a PDF into digital form, extracting text from images, identifying top themes in a blob of text, classifying the sentiment of a statement with high accuracy, converting speech to text and so on are considered the core functionalities of NLP. We have made huge strides across these areas. However, to make an NLP solution impactful, one needs to start from the top: from framing the business problem down to the final insights driving business decisions.

For example, how one builds a sentiment classification model is different for a gaming company compared to a retail firm. In a gaming context, “I killed a lot of people” is actually a positive statement: it means the user is enjoying the game. Similarly, the relevant topics in social media data for a merchandising team will not be the same as for a customer support team. NLP has potential applications across business verticals to solve various problems and generate insights:

Sales merchandising

  • Do customers find value for money in your company’s products?
  • How does your brand perception stack up against competitors?

Product development

  • What features are liked by customers?
  • What pain points can be addressed?

Marketing

  • What was the reach, awareness and impact of the campaigns?
  • Did the campaign meet customer expectations?

Supply chain and delivery

  • Are customers happy with delivery quality and speed?
  • How efficiently did we improve on them?

Customer support

  • What are the top issues faced by the customers?
  • How to solve them?
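The gaming-versus-retail sentiment example above can be made concrete with a toy lexicon-based scorer. The lexicons below are entirely made up for illustration; a production system would fine-tune a model on domain data rather than use word lists:

```python
# Toy illustration of domain-dependent sentiment: the same phrase scores
# differently under a (hypothetical) gaming lexicon vs. a retail one.
GAMING_LEXICON = {"killed": +1, "lag": -1, "epic": +1, "crash": -1}
RETAIL_LEXICON = {"killed": -1, "refund": -1, "fast": +1, "broken": -1}

def lexicon_score(text, lexicon):
    # Sum the sentiment weights of any known words in the text.
    tokens = text.lower().split()
    return sum(lexicon.get(t, 0) for t in tokens)

review = "I killed a lot of people"
print(lexicon_score(review, GAMING_LEXICON))  # positive: the player is having fun
print(lexicon_score(review, RETAIL_LEXICON))  # negative: alarming in a retail context
```

The point is not the scorer itself but that the mapping from words to sentiment must be chosen per business vertical.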

Creating easy-to-consume outputs for end-users

Business stakeholders do not care about the sophisticated models or algorithms used to solve a problem; what they are concerned about is what’s in it for them. Innovators have to make technological innovations simple for the end-user, and for that, you need to know what kind of output is needed to solve the problem. We tend to make things jazzy by giving a ton of possible NLP-driven outputs that may not be valuable to the business.

Interdisciplinary and integrated solutions 

Integrating a core NLP functionality or solution with other modules is essential to make it impactful. NLP is the ability to transform natural language into a computational representation to extract insights from it. However, what one does with that output may require a few more steps to cover the last mile. This may be a predictive model (based on what the customer said on social media, how likely are they to purchase the product?), a recommender system (based on how the customer described their problem), or a competitive benchmarking insight (on the topic of pricing, how does your brand’s perception of value for money among customers compare with competitors’?). The key is to bridge the last mile between what an NLP solution produces and what is consumable for the business. This requires an integrated, end-to-end solution architecture and design approach instead of a stand-alone NLP functionality.

Keeping these three principles in mind is essential to ensure NLP solutions successfully drive the desired business impact. While there is no denying that the technical capability needs to progress at speed, the direction of progress should be determined by the above three principles. 

The future is exciting for NLP. What needs to be kept in mind is to augment it with use-case-driven design, customisations specific to problems, and transforming it into easy-to-consume outputs.

This article is written by a member of the AIM Leaders Council. AIM Leaders Council is an invitation-only forum of senior executives in the Data Science and Analytics industry. To check if you are eligible for a membership, please fill out the form here.

The post Council Post: Transforming NLP capabilities into impactful business applications appeared first on AIM.

]]>
Why does DistilBERT love movies filmed in India, not Iraq? https://analyticsindiamag.com/ai-news-updates/why-does-distilbert-love-movies-filmed-in-india-not-iraq/ Mon, 21 Mar 2022 13:10:21 +0000 https://analyticsindiamag.com/?p=10063190

Aurélien said DistilBERT’s bias might have developed during pretraining.

The post Why does DistilBERT love movies filmed in India, not Iraq? appeared first on AIM.

]]>

Aurélien Geron, an ML consultant, a former Googler and the author of Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow, highlighted a bias in DistilBERT, a small, fast, cheap and light Transformer model. Aurélien ran the model’s sentiment predictions for movies filmed in different countries and plotted the results on a map. The map showed the highest positive sentiment for movies filmed in India and the lowest for movies filmed in Iraq.

“I’m sure this model is used by analysts to measure the market’s sentiment in financial news feeds like Bloomberg’s. Are they compensating for the model’s country bias? I frankly doubt it,” Aurelien tweeted.

Aurelien’s tweet attracted a lot of comments from ML professionals. Nils Reimers, NLP researcher at huggingface.co said: “The issue is that the model does only have a positive and negative class, but no neutral class. Hence it has to predict some sentiment to this neutral statement which does not make much sense. So I mainly see an issue with the model design to only have positive/negative classes.”

Chidananda AV, another ML practitioner, asked: “Do you think the induced bias is due to the data involving film reviews which did not have a good distribution (while finetuning) or due to bias in data corpus while pretaining resulting in a negative/positive factor towards a topic(country in this subject).” 

“I don’t think it’s movie-related at all. I think it’s because of a strong bias built up during pre-training. For example, Germany is one of the very few countries in Western Europe to have a negative bias, and I’m pretty sure that’s WW2 related rather than movie-related,” Aurelien replied.
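The kind of probe behind Aurélien's map can be sketched as follows. The real experiment would use Hugging Face's `pipeline("sentiment-analysis")`, whose default model is a fine-tuned DistilBERT; here `score` is a stand-in stub (an assumption, clearly marked) so the sketch runs without downloading the model:

```python
# Sketch of a country-bias probe: score the same neutral template with
# every country substituted in, then compare the spread of the scores.
TEMPLATE = "This movie was filmed in {country}."
countries = ["India", "Iraq", "Germany", "France"]

def score(text):
    # Stand-in stub for: pipeline("sentiment-analysis")(text)[0],
    # which returns a dict with a "label" and a "score" in [0, 1].
    return {"label": "POSITIVE", "score": 0.5}

probes = {c: score(TEMPLATE.format(country=c)) for c in countries}
for country, result in probes.items():
    print(country, result["label"], round(result["score"], 2))
```

An unbiased model would give every country roughly the same score on such a neutral sentence; Aurélien's plot showed a wide spread instead.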

The post Why does DistilBERT love movies filmed in India, not Iraq? appeared first on AIM.

]]>
My journey in data science: Asha Vishwanathan, Verloop.io https://analyticsindiamag.com/intellectual-ai-discussions/my-journey-in-data-science-asha-vishwanathan-verloop/ Fri, 11 Mar 2022 11:30:00 +0000 https://analyticsindiamag.com/?p=10062586

I worked on changing the ride-sharing scene in Bengaluru using machine learning methodologies and data science.

The post My journey in data science: Asha Vishwanathan, Verloop.io appeared first on AIM.

]]>

Asha Vishwanathan leads the machine learning division at Verloop.io, a Conversational AI platform. Her expertise lies in NLP and computer vision and her career spans 15+ years. She got bitten by the data science bug in 2014 and made a pivot from analytics. “ML piqued my interest as it was the perfect blend of programming and data related technologies that I had been dealing with in my career,” said Asha.


Analytics India Magazine spoke to Asha to understand her data science journey. “Disregarding your past experience and starting afresh is a challenge that one must overcome while transitioning between different job profiles,” she added. 

Excerpts:

AIM: In spite of a successful career in analytics, what made you switch to the world of data science?

Asha Vishwanathan: After working for almost a decade with major IT brands, from corporates to budding startups, I took a sabbatical from work to explore my ‘Ikigai’ and went back to basics to understand what actually made sense to me. Upon realising I have an affinity towards programming, I moved to machine learning. Things got interesting when I took a course on data science through Coursera and my interest in the field developed.

AIM: How did you develop your data science skillset?

Asha Vishwanathan: I started with a lot of general reading on the basics. I started with research on business intelligence as it was something that I had worked on before. Having a background in business intelligence, data visualisation, reporting and dashboarding, I understood the nuances and found data science in this field to be a natural extension of what I had been doing. My own past experiences and understanding of the field guided me through the vast universe of data science. I started off by doing an introductory course through Coursera to gain a basic understanding of the field and later found a lot of online material. However, the online resources lacked a structured approach and one can easily get lost in the world of data science. Doing advanced courses on R and Big Data through Jigsaw gave me clarity on the path I wanted to take. In 2020, I did a business analytics course at the Indian Institute of Management, Bangalore which gave me a sweeping understanding of the data science scene.

AIM: How did you feel when Harvard Business Publishing chose your case study on skilling needs of the Indian ecosystem?

Asha Vishwanathan: The project was actually about identifying the skilling needs in the Indian ecosystem. The government provides a lot of schemes and initiatives to train and enable common people, but many drop out or are unable to complete such courses. We wanted to find the root cause of this discontinuation, and our research revealed that the skill set provided by the courses did not match the required job profiles. So we decided to build a system that would recommend the type of skills the target audience needs for specific job profiles. It was a challenging process, as many candidates were from a blue-collar segment and lacked a basic resume. Taking into account people from different backgrounds, like coconut vendors, labourers and plumbers, we created a model to predict the possibility of such individuals passing the aforementioned certifications using previous assessment data. We took data on the soft skills required for the job and the skills the individuals had, and married them all into a comprehensive recommendation model that caught the eye of Harvard Business Publishing. 

AIM: When did your data science journey actually begin?

Asha Vishwanathan: I was already working in data science before joining IIMB, but I decided to do more to understand the breadth of the field. I soon realised that I had stumbled into a niche of machine learning while working at Kernel Insights. The road took many turns after I got certifications from Jigsaw and joined a company called Poolcircle. There, I worked on changing the ride-sharing scene in Bengaluru using ML methodologies and data science. To change the way people commute, we looked into making a product that recommended routing strategies. After Poolcircle, I joined an early-stage startup called Kernel Insights, where I tackled computer vision problems. 

AIM: What kind of projects are you heading in Verloop.io?

Asha Vishwanathan: After joining Verloop.io, I tackled projects around NLP and conversational AI, as I realised there are a lot of similarities in how models work across NLP and computer vision. I faced challenges specific to chatbots, the whole ecosystem they cater to, and their implementation in typical B2B SaaS models. Working on the hosting and deployment of such models is especially interesting, as it is not just about machine learning algorithms but also about scaling the project to meet industry standards. Looking at a solution from an end-to-end standpoint is what excites me the most.

AIM: What’s your advice for data aspirants? 

Asha Vishwanathan: Data science is a detail-oriented and rigorous field. In order to make it in this market, you must get down to the basic principles and understand whether it is your cup of tea. You should also be able to persevere through algorithms day after day and work on similar problems without getting saturated. Don’t just blindly dive in because it’s trending; you might not like it later.

The post My journey in data science: Asha Vishwanathan, Verloop.io appeared first on AIM.

]]>
A guide to GluonNLP: Deep Learning framework for NLP https://analyticsindiamag.com/developers-corner/a-guide-to-gluonnlp-deep-learning-framework-for-nlp/ Tue, 01 Mar 2022 08:30:00 +0000 https://analyticsindiamag.com/?p=10061812

GluonNLP is a Natural language processing Deep learning-based toolkit. This toolkit includes cutting-edge pre-trained models, training scripts, and training logs to help with rapid prototyping and reproducible research.

The post A guide to GluonNLP: Deep Learning framework for NLP appeared first on AIM.

]]>

Natural language processing is one of the most explored and currently trending topics in machine learning. NLP addresses daily digital needs such as smart assistants, language translation and text prediction. Among the various libraries used in this field, in this post we are going to discuss GluonNLP, a deep-learning-based natural language processing toolkit. The toolkit includes cutting-edge pre-trained models, training scripts, and training logs to help with rapid prototyping and reproducible research. It also offers modular APIs with flexible building blocks for easy customisation. Following are the major points that we are going to discuss in this post.     

Table of contents

  1. The GluonNLP
  2. Design of the library 
  3. Generating text sequence with GluonNLP

Let’s first understand the library structure.

The GluonNLP

Deep learning has spurred rapid progress in artificial intelligence research, resulting in remarkable discoveries on long-standing problems in a wide range of natural language processing areas. Deep learning frameworks like MXNet, PyTorch, TensorFlow, Caffe, Apache, and Theano make this possible. 

These frameworks have been crucial in the transmission of ideas in the field. In particular, imperative tools, perhaps popularised by Chainer, are straightforward to develop, learn, read, and debug. These benefits have driven the adoption of imperative programming interfaces. 

Jian Guo et al. created and developed the GluonNLP toolkit for deep learning in natural language processing using MXNet’s imperative Gluon API. GluonNLP simultaneously provides modular APIs to allow customisation by reusing efficient building blocks; pre-trained state-of-the-art models, training scripts, and training logs to enable fast prototyping and promote reproducible research; and models that can be deployed in a wide variety of programming languages, including C++, Clojure, Java, Julia, Perl, Python, R, and Scala.

Features of the library

Here we’ll discuss the major highlights of this library. 

Modular API

Users may tailor their model design, training, and inference by reusing efficient components across various models with GluonNLP’s modular APIs. Data processing tools, models with individual components, initialization procedures, and loss functions are examples of common components.

Take the data API of GluonNLP, which is used to design efficient data pipelines over user-provided data, as an example of how the modular API supports efficient implementation. In natural language processing jobs, inputs frequently have varying shapes, such as sentences of different lengths. As a result, the data API includes a set of utilities for sampling inputs and converting them into mini-batches that can be computed efficiently.
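What such a batching utility does can be sketched in plain Python. This is a minimal illustration of the idea, not the actual GluonNLP data API: variable-length sentences are padded to a common length so they can be stacked into one mini-batch, with the true lengths kept alongside.

```python
# Minimal sketch of a pad-and-batch utility: bring variable-length
# token-ID sequences to a common length so they stack into a mini-batch.
def pad_batch(sequences, pad_val=0):
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_val] * (max_len - len(s)) for s in sequences]
    valid_lengths = [len(s) for s in sequences]  # true lengths, pre-padding
    return padded, valid_lengths

batch, lengths = pad_batch([[4, 7, 2], [9, 1], [3, 5, 6, 8]])
print(batch)    # [[4, 7, 2, 0], [9, 1, 0, 0], [3, 5, 6, 8]]
print(lengths)  # [3, 2, 4]
```

The valid lengths let downstream code (loss functions, attention masks) ignore the padding positions.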

Pre-trained models

Building on such modular APIs, GluonCV/NLP provides pre-trained state-of-the-art models, training scripts, and training logs via its model zoo, enabling fast prototyping and encouraging repeatable research. GluonNLP supplies over 200 models for natural language processing tasks such as word embedding, language modelling, machine translation, sentiment analysis, natural language inference, dependency parsing, and question answering.

Generating text sequence with GluonNLP

In this section, we will see how to sample and generate a text sequence using a pre-trained language model, by leveraging this library’s API. Using a language model, we can sample sequences based on the likelihood that they appear, for a particular vocabulary size and sequence length. 

Given the context from previous time steps, a language model predicts the likelihood of each word occurring at each time step. GluonNLP provides two samplers for generating from a language model: BeamSearchSampler and SequenceSampler. We will use SequenceSampler.

Let’s now quickly install the dependencies.  

# install dependencies
!pip install gluonnlp 
!pip install mxnet 

To begin, load an AWD LSTM language model, a state-of-the-art pre-trained RNN language model, from which we will sample sequences.

# loading the pre-trained model
import mxnet as mx
import gluonnlp as nlp
 
ctx = mx.cpu()
lm_model, vocab = nlp.model.get_model(name='awd_lstm_lm_1150',
                                      dataset_name='wikitext-2',
                                      pretrained=True,
                                      ctx=ctx)

A scorer function is required for SequenceSampler to work. We will use the BeamSearchScorer, which implements a scoring function with a length penalty.

# scorer
scorer = nlp.model.BeamSearchScorer(alpha=0, K=5, from_logits=False)

Next, we need to define a decoder based on the pre-trained language model.

#decoder
class LMDecoder(object):
    def __init__(self, model):
        self._model = model
    def __call__(self, inputs, states):
        outputs, states = self._model(mx.nd.expand_dims(inputs, axis=0), states)
        return outputs[0], states
    def state_info(self, *arg, **kwargs):
        return self._model.state_info(*arg, **kwargs)
decoder = LMDecoder(lm_model)

Now that we have a scorer and a decoder, we’re ready to construct a sampler. The example code below creates a sequence sampler with 5 beams, a maximum sample length of 100, and a temperature of 0.97 to control the sharpness of the softmax distribution.

# create sampler; the end-of-sequence token ID comes from the vocabulary
eos_id = vocab['<eos>']
seq_sampler = nlp.model.SequenceSampler(beam_size=5,
                                        decoder=decoder,
                                        eos_id=eos_id,
                                        max_length=100,
                                        temperature=0.97)

Next, we’ll produce sentences that begin with “I love to swim”. We feed the language model [‘I’, ‘love’, ‘to’] to retrieve the starting states and set the initial input to be the word ‘swim’. 

# generate samples
bos = 'I love to swim'.split()
bos_ids = [vocab[ele] for ele in bos]
begin_states = lm_model.begin_state(batch_size=1, ctx=ctx)
if len(bos_ids) > 1:
    _, begin_states = lm_model(mx.nd.expand_dims(mx.nd.array(bos_ids[:-1]), axis=1),
                               begin_states)
inputs = mx.nd.full(shape=(1,), ctx=ctx, val=bos_ids[-1])

All this can be combined with a helper function by which using a single line we can generate the sequence. 

# helper function
def generate_sequences(sampler, inputs, begin_states, num_print_outcomes):
 
    samples, scores, valid_lengths = sampler(inputs, begin_states)
    samples = samples[0].asnumpy()
    scores = scores[0].asnumpy()
    valid_lengths = valid_lengths[0].asnumpy()
    print('Generation Result:')
 
    for i in range(num_print_outcomes):
        sentence = bos[:-1]
 
        for ele in samples[i][:valid_lengths[i]]:
            sentence.append(vocab.idx_to_token[ele])
 
        print([' '.join(sentence), scores[i]])

Below now we can generate the sequence.

generate_sequences(seq_sampler, inputs, begin_states, 5)

The helper prints each sampled sentence along with its score. As we can see, the generated continuations fit the original prompt quite well.

Final words

Through this post, we have discussed GluonNLP, a deep-learning-based library that addresses various NLP tasks such as sentiment analysis, word embeddings and sequence generation. We can experiment with various natural language processing applications by leveraging its modular APIs and pre-trained models.   

The post A guide to GluonNLP: Deep Learning framework for NLP appeared first on AIM.

]]>
A guide to document embeddings using Distributed Bag-of-Words (DBOW) model https://analyticsindiamag.com/developers-corner/a-guide-to-document-embeddings-using-distributed-bag-of-words-dbow-model/ https://analyticsindiamag.com/developers-corner/a-guide-to-document-embeddings-using-distributed-bag-of-words-dbow-model/#respond Tue, 22 Feb 2022 04:30:00 +0000 https://analyticsindiamag.com/?p=10061235

There are different variants of the Doc2Vec model, and Distributed Bag-of-Words (DBOW) is one that performs better than its peers.

The post A guide to document embeddings using Distributed Bag-of-Words (DBOW) model appeared first on AIM.

]]>

In one of our previous articles, we introduced the Doc2Vec model, an important model for document embedding. The document embedding technique produces fixed-length vector representations of the given documents and makes complex NLP tasks easier and faster. There are different variants of the Doc2Vec model, and Distributed Bag-of-Words (DBOW) is one that performs better than its peers. In this article, our discussion focuses on document embeddings using the DBOW model, with a hands-on implementation in Gensim. The major points to be discussed in the article are listed below.

Table of contents 

  1. What are document embeddings?
  2. What are Doc2Vec models?
  3. Distributed Bag-of-Words (DBOW)
  4. Implementing DBOW using Gensim

Let’s start with understanding document embeddings,

What are document embeddings?

When it comes to real-world applications of NLP, machines are required to understand the context behind text that is longer than a single word. For example, say we want to find cricket-related tweets on Twitter. We can start by making a list of all the words related to cricket and then find tweets that contain any word from the list. 

This approach can work to an extent, but what if a cricket-related tweet does not contain any word from the list? Take the example of a tweet that contains the name of an Indian cricketer without mentioning that he is a cricketer. In our daily life, we find many applications and websites, like Facebook, Twitter and Stack Overflow, that use this approach and fail to return the right results. 

To cope with such difficulties, we can use document embeddings, which learn a vector representation of each document rather than of the whole corpus of word embeddings. This can be thought of as learning vector representations at the paragraph level instead of from the whole corpus.  

Document embedding can also be considered a discrete approximation of word embeddings. Since learning word embeddings converts the whole corpus into vectors, it is difficult to establish the contextual relationships between words in vector form; extracting a small piece of the corpus and converting it into a vector representation gives scope for establishing those relationships.

For example, homonyms have different meanings in different paragraphs, and a single vector representation struggles to differentiate between them. Discretising the corpus paragraph-wise or sentence-wise and then generating vector representations of each piece can give us more meaningful representations. 
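For contrast with what Doc2Vec learns, the simplest document embedding is an average of word vectors, which throws away exactly the word order and paragraph context discussed above. A toy sketch follows; the 3-dimensional word vectors are made up for illustration:

```python
import numpy as np

# Baseline document embedding: average the word vectors of a document.
# The toy 3-d word vectors here are invented for illustration only.
word_vecs = {
    "cricket": np.array([0.9, 0.1, 0.0]),
    "bat":     np.array([0.8, 0.2, 0.1]),
    "stock":   np.array([0.0, 0.9, 0.3]),
}

def doc_embedding(tokens):
    # Average the vectors of the tokens we have embeddings for.
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0)

doc = doc_embedding(["cricket", "bat"])
print(doc)  # [0.85 0.15 0.05]
```

Because averaging is order-insensitive, "dog bites man" and "man bites dog" get the same vector; Doc2Vec's paragraph vector is one way around this limitation.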

What are Doc2Vec models?

As we have discussed above, document embeddings can be considered discrete vector representations derived from word embeddings. In the field of document embedding, we mainly find the Doc2Vec model used to make document embeddings. To sum up the theory behind Doc2Vec: it is a model for the vector representation of paragraphs extracted from whole text documents. A detailed explanation of the Doc2Vec model can be found in this article.

We can also say Doc2Vec models are similar to Word2Vec models. In Word2Vec, we contextualise words by learning from their surroundings; Doc2Vec adds the context of a paragraph to that vector representation of words. 

Also, Doc2Vec models have two variants similar to Word2Vec:

  • Distributed memory model
  • Distributed bag of words

These variants also have similarities with the variants of Word2Vec: the distributed memory model is similar to the continuous bag-of-words model, and the distributed bag-of-words model is similar to the skip-gram model. In this article, we focus on the Distributed Bag-of-Words model.

Distributed Bag-of-Words (DBOW)

The Distributed Bag-of-Words (DBOW) model is similar to the skip-gram variant of Word2Vec in that it also guesses context words from a target. The difference between DBOW and the distributed memory model is that the distributed memory model approximates a word using the context of surrounding words, while DBOW uses the paragraph as a whole to approximate its words. Compared with skip-gram, which uses a target word as input, DBOW takes a paragraph ID as input and predicts randomly sampled words from the document. The figure below explains the working of the DBOW model:

In this image, we can see that the paragraph ID is used to predict words that are randomly sampled from the paragraph, mirroring the skip-gram architecture. In many comparisons, we find that the distributed bag-of-words model produces better results than the distributed memory model. Let's see how we can implement a distributed bag of words model.

Implementing DBOW model using Gensim

To implement a distributed bag of words model, we are going to use Python and the Gensim library. A detailed introduction to the Gensim library can be found here. We can install the library using the following line of code.

!pip install --upgrade gensim

Now let’s move forward to the next step where we are required to import some modules.

from pprint import pprint

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

Using the above lines of code, we imported a sample dataset named common_texts and the Doc2Vec model. We also imported the TaggedDocument module so that we can process our data as the model requires. Let's see what the data looks like:

pprint(common_texts)

Output:

Here we can see that the sample data contains 9 sentences, which we can also consider paragraphs, since the Doc2Vec model works on paragraphs. Let's process the dataset and assign paragraph IDs using the TaggedDocument module.

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]

pprint(documents)

Output:

Here we can see the tags alongside the words in our tagged documents. Let's train our model on these documents.

model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4, dm=0)

In the above code, we instantiated a model on the tagged documents with feature vectors of dimensionality 5, a window of 2 (the maximum distance between the current and predicted word within a sentence), and 4 worker threads, while ignoring words with a total frequency lower than 1. One thing to note here is that the distributed bag of words variant is used only when dm=0 is set at instantiation; the default dm=1 selects the distributed memory variant. Let's check the details of the model.

# the vocabulary is already built when documents are passed to the constructor
print(model)

Output:

As we can see, we have created the DBOW variant of the Doc2Vec model. Let's check how it performs document embedding by inferring the vector for a new document.

vector = model.infer_vector(["human", "interface"])

pprint(vector)

Output:

The above output can be compared with other vectors via cosine similarity. Because the model is trained and queried through iterative approximation, repeated inference on the same text may yield slightly different vectors. The output above is the vector representation of the document we passed to the infer_vector method.

Final words

In this article, we discussed document embedding and saw that Doc2Vec models let us build document embeddings from a corpus of words. We also discussed the distributed bag of words model, a variant of Doc2Vec that can deliver better performance on our NLP tasks.

The post A guide to document embeddings using Distributed Bag-of-Words (DBOW) model appeared first on AIM.

Getting started with Gensim for basic NLP tasks https://analyticsindiamag.com/ai-mysteries/getting-started-with-gensim-for-basic-nlp-tasks/ https://analyticsindiamag.com/ai-mysteries/getting-started-with-gensim-for-basic-nlp-tasks/#respond Sat, 19 Feb 2022 10:44:05 +0000 https://analyticsindiamag.com/?p=10061075


Gensim is an open-source python package for natural language processing with a special focus on topic modelling. It is designed as a topic modelling library, allowing users to apply common academic-based models in production or projects. So, in this article, we will talk about this library and its main functions and features, as well as various NLP-related tasks. Below are the major points that we are going to discuss throughout this post. 

Table of contents

  1. What is Gensim?
  2. Features of Gensim
  3. Hands-on NLP with Gensim
    1. Creating a dictionary from a list of sentences
    2. Bag-of-words
    3. Creating Bi-gram
    4. Creating TF-IDF matrix

Let’s first discuss the Gensim library.

What is Gensim?

Gensim is open-source software that performs unsupervised topic modelling and natural language processing using modern statistical machine learning. Gensim is written in Python and Cython for performance. It is designed to handle large text collections using data streaming and incremental online algorithms, which sets it apart from most other machine learning software packages that are only designed for in-memory processing. 

Gensim is not an all-encompassing NLP research library (like NLTK); rather, it is a mature, targeted, and efficient collection of NLP tools for topic modelling. It also includes tools for loading pre-trained word embeddings in a variety of formats, as well as for using and querying a loaded embedding.

Features of Gensim

Following are some of the features of Gensim.

Gensim provides efficient multicore implementations of common techniques, including Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projections (RP), and Hierarchical Dirichlet Process (HDP), to speed up processing and retrieval on machine clusters.

Using its incremental online training algorithms, Gensim can easily process massive and web-scale corpora. It is scalable since there is no need for the entire input corpus to be fully stored in Random Access Memory (RAM) at any given time. In other words, regardless of the size of the corpus, all of its methods are memory-independent.

Gensim is a strong system that has been used in a variety of systems by a variety of people. Our own input corpus or data stream can be easily plugged in. It’s also simple to add other Vector Space Algorithms to it.

Hands-on NLP with Gensim

In this section, we’ll address some of the basic NLP tasks by using Gensim. Let’s first start with creating the dictionary. 

1. Creating a dictionary from a list of sentences

Gensim requires that words (aka tokens) be translated to unique ids in order to work on text documents. To accomplish this, Gensim allows you to create a Dictionary object that maps each word to a unique id. We may do this by transforming our text/sentences to a list of words and passing it to the corpora.Dictionary() method. 

In the following part, we’ll look at how to really do this. The dictionary object is often used to generate a Corpus of ‘bag of words.’ This Dictionary, as well as the bag-of-words (Corpus), are utilized as inputs to Gensim’s topic modelling and other models.

Here is the snippet that creates the dictionary for a given text.

from pprint import pprint
from gensim import corpora

text = [
   "Gensim is an open-source library for",
   "unsupervised topic modeling and",
   "natural language processing."
]
# get the separate words
text_tokens = [[tok for tok in doc.split()] for doc in text]
# create dictionary
dict_ = corpora.Dictionary(text_tokens)
# get the tokens and ids
pprint(dict_.token2id)

2. Bag-of-words

The Corpus is the next important item to learn if you want to use gensim effectively (a Bag of Words). It is a corpus object that contains both the word id and the frequency with which it appears in each document. 

To create a bag-of-words corpus, all that is required is to feed the tokenized lists of words to Dictionary.doc2bow(), with allow_update=True so that the dictionary is extended as new tokens appear. To generate the BoW, we'll continue from the tokenized text of the previous example.

# tokens
text_tokens = [[tok for tok in doc.split()] for doc in text]
# create dict
dict_ = corpora.Dictionary()
#BOW
BoW_corpus = [dict_.doc2bow(doc, allow_update=True) for doc in text_tokens]
pprint(BoW_corpus)

The (0, 1) in line 1 indicates that the id=0 word appears just once in the first sentence. Similarly, the (10, 1) in the third list item indicates that the word with the id 10 appears in the third phrase once. And so forth.

3. Creating Bi-gram 

Certain words in paragraphs invariably appear in pairs (bigrams) or in groups of three (trigrams), because the joined terms form an actual entity. Forming bigrams and trigrams from phrases is critical, especially when working with bag-of-words models. It's simple and quick with Gensim's Phrases model. Because the built Phrases model supports indexing, simply pass the original token list to it to generate the bigrams.

from gensim.models.phrases import Phrases
# Build the bigram model
bigram = Phrases(text_tokens, min_count=3, threshold=10)
# Construct bigrams for the first sentence
pprint(bigram[text_tokens[0]])

4. Creating TF-IDF matrix

Like the regular corpus model, the Term Frequency – Inverse Document Frequency (TF-IDF) model reduces the weight of tokens (words) that appear frequently across texts. TF-IDF is calculated by multiplying a local component, the term frequency (TF), by a global component, the inverse document frequency (IDF), and then normalizing the result to unit length. As a result, terms that appear frequently across documents will receive less weight.
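In one common formulation (close to what the 'ntc' scheme used below computes), the weight of a term t in a document d over a corpus of N documents is:

weight(t, d) = tf(t, d) × log2(N / df(t))

where tf(t, d) is the number of times t occurs in d and df(t) is the number of documents containing t; each document's weight vector is then normalized to unit length.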

There are various formula modifications for TF and IDF. Below is the way we can obtain the TF-IDF matrix. The below snippets first obtain the frequencies given by the BoW, and later the TF-IDF weights.

from gensim.utils import simple_preprocess
from gensim import corpora, models
import numpy as np
# data to be processed
doc = [
   "Gensim is an open-source library for  ",
   "unsupervised topic modeling and",
   "natural language processing."]
 
# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in doc])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in doc]
 
# Show the Word Weights in Corpus
for doc in corpus:
    print([[mydict[id], freq] for id, freq in doc])

Now moving on to TF-IDF, we just need to fit the model and access the weights by looping over each document and word.

# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')
 
# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

Here is the output.

Final words

Through this article, we have discussed the Python-based library called Gensim, a modular library that gives us the facility to build SOTA algorithms and pipelines for NLP-related problems. This post is all about getting started with Gensim, where we practically addressed some of the basic NLP tasks.

The post Getting started with Gensim for basic NLP tasks appeared first on AIM.

How algorithm understands text in NLP https://analyticsindiamag.com/ai-origins-evolution/how-algorithm-understands-text-in-nlp/ https://analyticsindiamag.com/ai-origins-evolution/how-algorithm-understands-text-in-nlp/#respond Sun, 06 Feb 2022 12:30:00 +0000 https://analyticsindiamag.com/?p=10059947


Machine learning (ML) and other approaches used in natural language processing (NLP) usually work with numerical arrays known as vectors, where one vector represents each instance (also known as an observation, entity, or row) in the data set. The collection of all these arrays is referred to as a matrix, and each row in the matrix represents a single instance. Viewed by its columns, each column of the matrix indicates a feature (or attribute).

The initial step in NLP is to turn the collection of text occurrences into a matrix, with each row being a numerical representation of a text instance (a vector). However, there are a few terms to understand before getting started with NLP.

Step by Step NLP process 

A document is a single instance in NLP, whereas a corpus is a collection of instances. A document might be as simple as a short phrase or name or as complex as a complete book, depending on the problem at hand.

A decision must be made regarding how to decompose a document into smaller parts through a process known as tokenisation. Tokens are created as a result of this operation. They are the smallest units of meaning that the algorithm can take into account. The vocabulary is the collection of all tokens found in the corpus.

Taking words as a token is a typical choice; in this example, a document is represented as a bag of words (BoW). The BoW model searches the entire corpus for vocabulary at the word level, which means that the vocabulary is the set of all the words found in the corpus. The algorithm then counts the number of times each term appears in the corpus for each document. 

Most terms in the corpus will not appear in most documents, resulting in a lot of zero counts for a lot of tokens in a document. That's essentially it in terms of concept, but when a data scientist generates the vectors from these, they must verify that the columns line up in the same way for each row.

Hashing 

Permuting the rows of this matrix, or of any other design matrix (a matrix that represents instances as rows and features as columns), has no effect on its meaning. The same goes for column permutations. Depending on how they map a token to a column index, data scientists get a different ordering of the columns, but no meaningful change in the representation. Hashing is the process of mapping tokens to indexes, ideally in such a way that no two tokens map to the same index. A specific implementation is called a hash, hashing function, or hash function.

Vocabulary based Hashing 

NVIDIA constructed an implicit hash function while vectorising by hand. Assuming a 0-indexing scheme, they allocated the initial index, 0, to the first word that had not been seen. The index was then incremented, and the operation repeated for each new word. Using this hash function, “This” was mapped to the 0-indexed column, “is” to the 1-indexed column, and “the” to the 3-indexed column. There are benefits and drawbacks to using a vocabulary-based hash function.
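The vocabulary-based scheme described above can be sketched in a few lines of Python (an illustrative sketch, not the actual implementation referenced): each previously unseen token is assigned the next free column index.

```python
def build_vocab_index(tokens):
    """Map each token to a unique, incrementing 0-based column index."""
    index = {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(index)  # next free index
    return index

print(build_vocab_index(["this", "is", "a", "sample", "this", "is"]))
# → {'this': 0, 'is': 1, 'a': 2, 'sample': 3}
```

One drawback is visible here: the mapping depends on the order tokens are seen, and the full vocabulary must be stored and shared between workers.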

Mathematical Hashing 

Fortunately, there is another way to hash tokens: use a non-cryptographic mathematical hash function for each instance. This form of hash function maps objects (represented by their bits) to a defined range of integers using a combination of arithmetic, modular arithmetic, and algebra. The maximum value defines how many columns are in the matrix, because the range is known. The range is rather big in general; however, for most rows, the majority of columns will be 0. As a result, a sparse representation reduces the amount of memory needed to hold the matrix, and algorithms can efficiently execute sparse matrix-based operations.

Furthermore, because there is no vocabulary, vectorisation with a mathematical hash function does not necessitate any vocabulary storage overhead. As a result, parallelisation is not limited, and the corpus can be broken across any number of processes, allowing each section to be vectorised independently. Once each process has finished vectorising its part of the corpus, the generated matrices can be stacked to form the final matrix. By reducing bottlenecks, this parallelisation, which is facilitated by the use of a mathematical hash function, can substantially speed up the training pipeline.
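A minimal sketch of this mathematical-hashing approach follows. It is illustrative only: production systems typically use a stable non-cryptographic hash such as MurmurHash, whereas Python salts its built-in str hash per process, so the indices below vary between runs unless PYTHONHASHSEED is fixed.

```python
N_COLUMNS = 1024  # the known range fixes the matrix width

def hashed_bow(tokens, n_columns=N_COLUMNS):
    """Sparse bag-of-words via the hashing trick: the column index comes
    from a hash function, so no vocabulary needs to be stored or shared."""
    counts = {}
    for tok in tokens:
        col = hash(tok) % n_columns  # map token -> column in [0, n_columns)
        counts[col] = counts.get(col, 0) + 1
    return counts  # sparse: only non-zero columns are kept

print(hashed_bow("this is the sample and this is another".split()))
```

Because each process can compute column indexes independently, document chunks can be vectorised in parallel and the resulting sparse matrices stacked afterwards.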

The post How algorithm understands text in NLP appeared first on AIM.

Council Post: How AI & NLP are driving the digital transformation in insurance https://analyticsindiamag.com/ai-origins-evolution/council-post-how-ai-nlp-are-driving-the-digital-transformation-in-insurance/ https://analyticsindiamag.com/ai-origins-evolution/council-post-how-ai-nlp-are-driving-the-digital-transformation-in-insurance/#respond Wed, 26 Jan 2022 10:30:00 +0000 https://analyticsindiamag.com/?p=10059175 How AI & NLP are driving the digital transformation in insurance


The insurance industry is undergoing a massive, tech-driven shift. The next decade will be crucial in deciding the future of the insurance sector. Industry leaders have a massive role to play, particularly in terms of adopting disruptive technologies throughout the value chain, starting from underwriting to policy servicing and claim settlement. 

According to Data Bridge Market Research, the AI in the insurance market is expected to touch $6.92 billion by 2028, growing at a CAGR of 24.05 percent for the forecast period of 2021 to 2028. The sector’s growth is expected to be fueled by AI technologies, including machine learning, deep learning, natural language processing (NLP) and robotic automation. 

Below, we discuss how insurance companies are leveraging AI, along with some use cases, challenges, and solutions.

AI, NLP adoption 

For any business working in the insurance space, the first and foremost step is to list all the sub-processes within the value chain, instead of trying to solve the complete value chain or a chunk of processes together. Each sub-process should then be assessed on parameters such as size, wider applicability, and complexity. Based on these parameters, the right processes should be prioritised for a minimum viable product (MVP).

For example, a use case that involves extraction from two-three document types can give you volume, complexity, and wider applicability, such as email submission in underwriting. 

It is important to ensure the first use case is successful as it paves the path for other use cases. Once the first successful MVP implementation is set, a roadmap should be created for multiple AI-based proof-of-value (POVs) and integrate these use cases to deliver enhanced efficiency, effectiveness and customer experience. 

Challenges in deploying AI-at-scale 

Many global insurance companies’ technology and data science teams are exploring multiple generic products to solve structural problems. However, such products tend to reach a saturation point after a few easy, quick wins. Due to the limited capabilities of these generic products, some of the leading companies are struggling to deploy AI at scale, and are now looking at solving the next set of business challenges related to unstructured, handwritten, video and voice data.  

The major roadblocks in deploying AI at scale include: 

  • Continuous upgrades, and modifications in dependent systems
  • Limited business domain knowledge of tech teams
  • Lack of human-in-the-loop concept

Building comprehensive solutions to address these challenges is easier said than done. An end-to-end AI implementation leverages many tech systems, including ingestion from document management systems, to final posting into business applications such as policy admin system (PAS). While developing solutions, it is best to plan and accommodate all dependent systems upgrades or changes to avoid last-minute hurdles. Thus, timing and system flexibility are critical for smooth AI implementation. 

Moreover, successful AI implementation requires contributions from various resources, such as AI/NLP data scientists, data and tech engineers, and business and project managers. However, as we move to solve the next level of challenges, it is important that tech teams upskill themselves and learn business nuances (for example, understanding underwriters' instructions). A deep understanding of business nuances will enable solutions that can address business complexities and multi-user functionality.

Today, the market expects 100 percent automation or straight-through processing from AI solutions, and current generalised products have been able to deliver this, albeit for simple problems. In my opinion, expecting 100 percent automation is the reason why these products are limited to straightforward cases.

The way out of this problem is to accept the fact that machines cannot independently learn and solve problems and require human assistance. A well-known example that elucidates this better is self-driving cars or autonomous vehicles. 

While an AI/NLP solution does its job with high precision, some instances are far too complex for machines to interpret. A common example of this is underwriting risk for customers who have submitted either partial or contradictory information. Human intervention is required in such cases to process contextual information. Thus, human-in-the-loop enables 'assisted' ingestion of outputs, with a human augmenting the machine's output with business judgement.

Use cases to consider 

There are multiple use cases that can be considered, including invoices, contracts, statements of values, endorsements, etc. 

Business submissions in the underwriting space is one such use case. It provides size, wider applicability and moderate to high complexity and can be prioritised over other use cases. However, the process requires interpretations from email and various unstructured documents (application quote, proposal, etc.). To extract information from multiple documents, numerous NLP models are required. Once these NLP models are created, they can be applied to a wider canvas for delivering AI at scale. 

Also, the submissions process for a transaction may stretch to a few months. However, AI can automate the process and reduce cycle time to a few days. In addition, the AI solution enables interpretation from emails and attached documents and provides underwriting assistants with the requisite information to review or modify to complete the transaction. 

Final thoughts 

To successfully implement AI, NLP solutions in the insurance segment, companies should adopt a case prioritisation framework based on size, wider applicability, and complexity.

The companies should first re-draft their AI at scale roadmap, as generic products have limited scope. The tech teams, including AI data scientists, and data and tech engineers, should upskill their domain understanding. In addition to this, the AI-at-scale solution designs should be flexible and well thought through along with dependent systems. Lastly, human-in-the-loop is essential for any AI implementation.

This article is written by a member of the AIM Leaders Council. AIM Leaders Council is an invitation-only forum of senior executives in the Data Science and Analytics industry. To check if you are eligible for a membership, please fill the form here.

The post Council Post: How AI & NLP are driving the digital transformation in insurance appeared first on AIM.

Happy birthday PyTorch! The open-source ML library completes 5 years since its public launch https://analyticsindiamag.com/ai-news-updates/happy-birthday-pytorch-the-open-source-ml-library-completes-5-years-since-its-public-launch/ https://analyticsindiamag.com/ai-news-updates/happy-birthday-pytorch-the-open-source-ml-library-completes-5-years-since-its-public-launch/#respond Thu, 20 Jan 2022 07:32:09 +0000 https://analyticsindiamag.com/?p=10058826


Today, PyTorch is celebrating the fifth anniversary of its public launch. “We didn’t expect to come this far, but here we are ☺. We are now at 2K Contributors, 90K Downstream Projects, 3.9M lines of “import torch” on GitHub. But more importantly, we’re still receiving lots of love and having a great ride,” said PyTorch’s social media handles.

PyTorch, an open-source ML library based on the Torch library, is used for applications like computer vision and natural language processing (NLP). PyTorch is primarily developed by Facebook’s AI Research lab (FAIR).

Wishes all around

Logan Kilpatrick, Developer Community Advocate at the Julia Language and a member of the Board of Directors at NumFOCUS, wrote: “Congrats to the PyTorch community and Soumith Chintala for leading the project to such success! Looking forward to seeing what happens in the next 5 years”

Lambert Rosique, Artificial Intelligence and Software Manager at Touch Sensity, said, “What’s awesome with PyTorch is that it went from ‘used mostly by researchers’ to ‘used in industry too’ in only two-three years :o Congratulations!”

Stefan Ojanen, Product Manager at Genesis Cloud, posted on LinkedIn: “Well done! With ML starting to benefit all aspects of life, the whole world benefits from the healthy competition between PyTorch and TensorFlow 👏 It’s also good that they have stayed differentiated in meaningful ways, so both will have their place for many years to come.”

Péter Salamon, AI Product Owner & Machine Learning Engineer, said: “Congratulation to the team! Thank you for making it available for the public! Your work is a big contribution to AI developers and beyond!”

Shubham Shrivastava, Machine Learning and Computer Vision Research Scientist – Autonomous Vehicles at Ford Greenfield Labs, wrote: “Happy birthday to one of the most helpful libraries for AI applications in the recent history.”

Jeswanth G, PMP, an AI Team Lead at Elevate Tech, said, “Eagerly waiting for compatible PyTorch version for Mac M1 to use GPU.”

On a lighter note, developers also used the opportunity to mock recruiters who ask for unreasonable work experience: 

Raviteja Kolapalli, Data Scientist || Data Engineer said, “Recruiters: 7+ years of experience using Pytorch.

Me: o_o”

James Horine, Quantitative Research Lead also said, “Average number of years of PyTorch experience required by recruiters? 7?”

PyTorch is free and open-source software released under the Modified BSD license. Adam Paszke, Soumith Chintala, Sam Gross, and Gregory Chanan built the initial system; the project now has core maintainers and a broader set of developers who directly merge pull requests and own various parts of the core codebase.

According to the official website, “PyTorch adopts a governance structure with a small set of maintainers driving the overall project direction with a strong bias towards PyTorch’s design philosophy where design and code contributions are valued.”

The post Happy birthday PyTorch! The open-source ML library completes 5 years since its public launch appeared first on AIM.

Natural Language Processing for Absolute Beginners https://analyticsindiamag.com/ai-mysteries/natural-language-processing-for-absolute-beginners/ Mon, 03 Jan 2022 11:30:00 +0000 https://analyticsindiamag.com/?p=10057540 Natural Language Processing


Before diving into the definition of natural language processing, it is important to explore why it came into existence. Our personal computers communicate in a language known as machine language. Unlike human natural language, a machine language uses a series of zeroes and ones, often called bits, to communicate with the outer world, and is largely unintelligible to humans. To bridge this gap, machines needed to act and talk like humans, and hence NLP was invented to provide intelligent human-to-machine interaction.

WHAT IS NLP?

Natural language processing is a subset of Artificial Intelligence, Computer science and Human linguistic processing providing the ability for computers/ machines to understand, process and acquire significant insights from human natural languages.

In this era, NLP is scaling exponentially in vast areas such as defence, finance, entertainment, healthcare, automation, etc.

SIGNIFICANCE

Huge amounts of data are generated in our systems in real time through transactions, sensors, devices, dialogue, etc. This generated data is highly unstructured. NLP allows us to make this data manageable and less ambiguous, and to analyse it at a much faster rate than humans can.

NLP proffers an advancement in interpreting, reading and hearing the human-oriented data, to extract sentiments and outcomes from it.

APPLICATIONS

The significance of NLP is better understood by exploring its real-life applications. 

Social media analysis measures and classifies sentiments in online data so that we can target toxic comments or cyberbullying, and forecast and analyse the major influences driving social media communities. Companies also use sentiment to track how their product or service is received in the market.

Text analytics is employed in organizations to identify customer dissatisfaction and segment customers, so that the company can implement improvements and serve its consumers even better.

Fraud analytics is used in document search and banking systems to restrict and detect fraudulent behaviour. Multinational companies use NLP to detect and classify unusual activities; for example, this is how mailing systems such as Gmail and Rediffmail categorize spam emails in your inbox.

Speech interfacing is used to understand and act on speech prompts given by humans. Virtual agents like Microsoft’s Cortana and Amazon’s Alexa use speech recognition and NLU techniques to create your shopping list, give you weather updates, etc.

Machine translation using NLP auto-translates a source language into another desired language, as in Google Translate.

Digital marketing/campaigning exploits data-driven NLP methods to automate and dig into customer/target-audience portfolios and social media market research to build content strategies, for example a chatbot assistant acting as a recommender.

NLP LIBRARIES AND TECHNIQUES

LIBRARIES

NLP is effectively implemented using programming languages such as Python and R. Let’s look at some common, ready to use packages/libraries used for NLP in R and Python.

R libraries

tm  is a text mining library which is used extensively to perform data preprocessing and mapping techniques. tm houses various functions for text and metadata analysis.

languageR contains functions to perform statistical analysis on the textual data. It provides built-in functions to implement correlation, regression techniques.

dplyr is a data manipulation library that offers functions to filter, arrange and sample datasets, which smooths out data manipulation and processing.

Lsa library is used to perform latent semantic analysis in R. Decomposing a document feature matrix is an exceptional feature of lsa which clearly represents semantic behaviour of the data.

Python Libraries

NLTK natural language toolkit as the name suggests is an essential toolkit for numerous NLP tasks ranging from parsing, stemming up to classification and clustering.

CoreNLP is one of the fastest libraries for NLP, written in Java. It performs major tasks such as POS tagging and dependency parsing, and supports several other languages besides English.

spaCy ships with ready-to-use statistical models and an NLP pipeline that automates preprocessing activities such as tagging, tokenizing, NER, etc.

TextBlob is a beginner-friendly Python library that can be used to implement basic analysis tasks.

Now, let’s learn and implement some NLP techniques provided by such libraries.

TECHNIQUES

Data Collection

To perform linguistic processing on any problem statement, you should first acquire a dataset relevant to the problem. Data collection involves retrieving specific information from large volumes of data. NLP practitioners commonly gather data for processing via APIs, web scrapers, plugins and existing datasets/databases.

Data Segmentation

NLP is a combination of preprocessing and analysis. To attain near-accurate results, it is essential to preprocess the data to remove anomalies and refine it. Segmentation is a technique for fragmenting the dataset into units using separators, giving a clearer picture; the default separators are commas or full stops.

Tokenization

It is the process of further segmenting the data by converting sentences into sequences of individual words, referred to as tokens, and discarding punctuation characters.

from nltk import word_tokenize, sent_tokenize

# Requires the NLTK tokenizer data: run nltk.download('punkt') once
data = "It is an amazing story about the journey to one's destiny. It inspires you to achieve your dreams, work hard for it and see how the entire universe conspires to make it happen."

print("word tokenization")
print(word_tokenize(data))

print("segmentation")
print(sent_tokenize(data))

Stop Words Removal

Raw datasets often contain words that carry meaning in conversation but add little or nothing during processing. NLP uses a technique called stop-word removal to strip such non-essential grammar words, e.g. "or", "and", "the", "are". This sanitizes the dataset, keeping it concise and accurate.
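As a toy illustration of this filter (the stop-word list below is hand-picked for the example; a real project would use a fuller list such as NLTK's nltk.corpus.stopwords):

```python
# Hand-picked stop-word list; real projects use a fuller list,
# e.g. nltk.corpus.stopwords.words("english")
STOP_WORDS = {"or", "and", "the", "are", "is", "a", "an", "to", "it", "of"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["It", "is", "an", "amazing", "story", "about", "the", "journey"]
print(remove_stop_words(tokens))  # ['amazing', 'story', 'about', 'journey']
```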

Stemming

Stemming is another preprocessing technique, used to reduce a word to its stem by removing affixes. Stemming is important because affixes increase the dimensionality of the data, which is undesirable.
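A minimal sketch of the idea behind suffix stripping (a deliberately naive toy; production stemmers such as NLTK's PorterStemmer apply ordered rule sets with extra conditions):

```python
# Toy suffix-stripping stemmer; real stemmers (e.g. nltk.stem.PorterStemmer)
# apply ordered rules with measure conditions instead of this naive strip.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "played", "plays", "play"]])
# ['play', 'play', 'play', 'play']
```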

Lemmatization

Lemmatization is an NLP technique for reducing a word to its dictionary root (lemma). It maps different inflected forms to a common base, e.g. 'was', 'are' and 'is' are all reduced to 'be'.
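A toy lookup-based lemmatizer illustrating the mapping (the dictionary below is hand-built for the example; real lemmatizers such as NLTK's WordNetLemmatizer consult a full dictionary plus part-of-speech information):

```python
# Toy lookup-based lemmatizer; the mapping here is hand-built
# for illustration only.
LEMMA_MAP = {"was": "be", "are": "be", "is": "be", "were": "be",
             "better": "good", "mice": "mouse"}

def lemmatize(word):
    """Return the lemma if known, otherwise the word itself."""
    return LEMMA_MAP.get(word.lower(), word)

print([lemmatize(w) for w in ["was", "are", "is", "story"]])
# ['be', 'be', 'be', 'story']
```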

Dependency Detection

This method is applied after preprocessing and is used to establish and identify possible relationships between the words in a dataset.

Chunking

Chunking is an NLP method that takes the words of unstructured data as input and groups them into chunks, or clusters of parts of speech. It is used to retrieve phrases from a description. For the sentence "We had chicken and curry for dinner", chunking the nouns outputs the set "chicken curry dinner".
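The noun-chunking example above can be sketched in plain Python over hand-tagged (word, tag) pairs (the tags are supplied by hand here to keep the sketch self-contained; a real pipeline would obtain them from a POS tagger such as NLTK's):

```python
# Hand-tagged (word, POS) pairs for the example sentence; libraries such as
# NLTK's RegexpParser chunk with full grammar rules over real POS tags.
tagged = [("We", "PRP"), ("had", "VBD"), ("chicken", "NN"), ("and", "CC"),
          ("curry", "NN"), ("for", "IN"), ("dinner", "NN")]

def chunk_nouns(tagged_tokens):
    """Collect every noun-tagged word into one chunk."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

print(chunk_nouns(tagged))  # ['chicken', 'curry', 'dinner']
```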

CONCLUSION

In this article, we grasped the basics of Natural Language Processing, its importance in computational linguistics, and its use in handling voluminous unstructured textual data.

NLP has gained a lot of credibility in data science and analytics thanks to its rapid development, making it an essential technology to master.


The post Natural Language Processing for Absolute Beginners appeared first on AIM.

]]>
Interesting Algorithms Released By Meta AI In 2021 https://analyticsindiamag.com/ai-origins-evolution/interesting-algorithms-released-by-meta-ai-in-2021/ Fri, 24 Dec 2021 09:30:00 +0000 https://analyticsindiamag.com/?p=10057019

Let us take a look at a few of such interesting algorithms that came from Meta AI this year

The post Interesting Algorithms Released By Meta AI In 2021 appeared first on AIM.

]]>

Facebook has been in the spotlight this year after undergoing a major brand change and renaming itself Meta. As one of the world's leading innovation companies, Meta came out with really interesting algorithms and models in areas ranging from computer vision, robotics and 3D simulation to NLP and more.


Let us look at a few of them.

Habitat 2.0

Meta came out with Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. It works at all levels of the stack: data, simulation, and benchmark tasks. Among its contributions is ReplicaCAD, an artist-authored, annotated, reconfigurable 3D dataset of apartments with articulated objects, said Meta. H2.0 comes with a physics-enabled 3D simulator reaching speeds above 25,000 simulation steps per second (850× real-time) on an 8-GPU node. The Home Assistant Benchmark (HAB) is a suite of common tasks for assistive robots that exercises mobile manipulation capabilities.


Animating children’s hand-drawn figures of humanlike characters

Just days back, Meta came out with something really unique, calling it "a first-of-its-kind method for automatically animating children's hand-drawn figures of people and humanlike characters" in minutes using AI. The prototype system it has built lets users upload their children's drawings and even download the animated versions. Meta said it wanted to build an AI system that can identify and automatically animate the humanlike figures in children's drawings with a high success rate and without any human guidance.


Ego4D

Meta introduced Ego4D, a massive-scale egocentric video dataset and benchmark suite. It said that Ego4D comes with 3,025 hours of daily life activity video spread over hundreds of scenarios, such as outdoor, workplace, leisure, home, etc. It added that parts of the videos come with audio, 3D meshes of the environment, eye gaze, stereo, and synchronized videos from multiple egocentric cameras at the same event. 


Few-Shot Learner to take on harmful content

Meta came out with a new AI technology called Few-Shot Learner (FSL) that can adapt to take action on new or evolving types of harmful content within weeks instead of months. FSL works in more than 100 languages and can learn from different kinds of data (both images and text). In addition, it can work with AI models that are already being used to detect harmful content.

“Few-shot learning” starts with a large, general understanding of many different topics, then uses much fewer, and in some cases zero, labelled examples to learn new tasks, said Meta. 


XLS-R: Self-supervised speech processing for 128 languages

XLS-R is a self-supervised model for speech tasks. It improves upon previous multilingual models by training on nearly ten times more public data in more than twice as many languages. Meta said that it fine-tuned XLS-R to perform speech recognition, speech translation, and language identification, setting a new state of the art on a diverse set of benchmarks. This includes BABEL, CommonVoice, and VoxPopuli for speech recognition, CoVoST-2 on foreign-to-English translation, and VoxLingua107 for language identification.

It is trained on more than 436,000 hours of publicly available speech recordings and is based on wav2vec 2.0. In addition, Meta has expanded the model to cover 128 languages, nearly two and a half times as many as its predecessor.


The post Interesting Algorithms Released By Meta AI In 2021 appeared first on AIM.

]]>
A Guide to Term-Document Matrix with Its Implementation in R and Python https://analyticsindiamag.com/developers-corner/a-guide-to-term-document-matrix-with-its-implementation-in-r-and-python/ Sun, 19 Dec 2021 10:30:00 +0000 https://analyticsindiamag.com/?p=10056160

For text data, the term-document matrix is a kind of representation that helps in converting text data into mathematical matrices

The post A Guide to Term-Document Matrix with Its Implementation in R and Python appeared first on AIM.

]]>

In natural language processing, we are required to perform various text preprocessing tasks so that mathematical operations can be applied to the data. Before that, the data must be represented in a mathematical format. For text data, the term-document matrix is one such representation, converting text into a mathematical matrix. In this article, we discuss the term-document matrix and see how to build one, with hands-on implementations in R and Python for better understanding. The major points to be discussed in this article are listed below.

Table of Contents

  1. What is a Term-Document Matrix?
  2. Term-Document Matrix in R
  3. Term-Document Matrix in Python
    1. Using Pandas
    2. Using Text Mining
  4. Application of Term-Document Matrix       

Let’s start the discussion by understanding what the term-document matrix is.

What is a Term-Document Matrix?

In natural language processing, we see many methods of representing text data. The term-document matrix is one such method, in which the text data is represented in the form of a matrix. Here, the rows of the matrix represent the sentences from the data that need to be analyzed and the columns represent the words; the cells of the matrix hold the number of occurrences of each word. Let's understand it with an example.

Index | Sentence
1     | I love football
2     | Messi is a great football player
3     | Messi has won seven Ballon d'Or awards

Here, we can see a set of text responses. The term-document matrix of these responses will look like this:

Sentence                                | I | love | football | Messi | is | a | great | player | has | won | seven | Ballon d'Or | awards
I love football                         | 1 | 1    | 1        | 0     | 0  | 0 | 0     | 0      | 0   | 0   | 0     | 0           | 0
Messi is a great football player        | 0 | 0    | 1        | 1     | 1  | 1 | 1     | 1      | 0   | 0   | 0     | 0           | 0
Messi has won seven Ballon d'Or awards  | 0 | 0    | 0        | 1     | 0  | 0 | 0     | 0      | 1   | 1   | 1     | 1           | 1

The above table is a representation of the term-document matrix. From this matrix, we can get the total number of occurrences of any word in the whole corpus, and by analyzing these counts we can reach many fruitful results. Term-document matrices are one of the most common representations used when processing and analyzing text data. More formally, a term-document matrix is a way to represent the relationship between the words and the sentences (documents) in a corpus.
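Before reaching for a library, the matrix above can be built in a few lines of plain Python. This is only a sketch using whitespace splitting; library-based implementations handle tokenization and vocabulary construction properly:

```python
from collections import Counter

docs = ["I love football",
        "Messi is a great football player",
        "Messi has won seven Ballon d'Or awards"]

# Build the vocabulary in first-seen order
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# One count row per document: matrix[d][w] = occurrences of vocab[w] in docs[d]
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
print(matrix[0])  # row for "I love football"
```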

Since R and Python are two common languages used for NLP, we will see how to implement a term-document matrix in both. Let's start with the R language.

Implementation in R

In this section of the article, we will see how to create a term-document matrix using the R language. For this purpose, we need to install the tm (text mining) library in our environment.

Installing library:

install.packages("tm")

The above line installs the text-mining library. Besides term-document and document-term matrices, the library provides various other facilities from the field of text mining.

Importing the library:

library(tm)

The above line loads the library.

Importing data:

For making a term-document matrix in R, we use the crude dataset that ships with the tm library; it is a volatile corpus (VCorpus) of 20 news articles dealing with crude oil.

data("crude")

Let's inspect the crude VCorpus:

inspect(crude[1:2])

Output:

Here is the output. We can see the character counts and metadata information in the VCorpus. For more detailed information, we can use R's help function.

help(crude)

Output:

Here we could also use a plain corpus for making the term-document matrix, but we are using a VCorpus because it is easier to interpret once converted to a term-document matrix.

Making Term-Document Matrix:

tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
tdm

Output:

Here we can see the details of the term-document matrix. Let’s inspect some values from it.

inspect(tdm[100:110, 1:9])

Output:

Here in the output, we can see some of the values of the term-document matrix and some of the information regarding these values. We can also inspect the values using our chosen words from the documents.

inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")])

Output:

We can also make the document-term matrix using the functions provided by the tm library:

dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                         function(x)
                                         weightTfIdf(x, normalize =
                                                     FALSE),
                                         stopwords = TRUE))
dtm

Output:

Let’s inspect the document term matrix.

inspect(dtm)

Output:

The basic difference between a term-document matrix and a document-term matrix is orientation: a term-document matrix has terms as rows and documents as columns, while a document-term matrix is its transpose. The weighting scheme is a separate choice; in our example the term-document matrix uses raw term frequency (TF), whereas we built the document-term matrix with term frequency-inverse document frequency (TF-IDF) weighting.
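As a sketch of the TF-IDF idea, here is the textbook formula idf(t) = log(N / df(t)) in plain Python (note that implementations such as tm's weightTfIdf and scikit-learn's TfidfVectorizer use slightly different smoothed variants):

```python
import math

# Three tiny pre-tokenized documents, invented for illustration
docs = [["messi", "loves", "football"],
        ["messi", "plays", "football"],
        ["ballon", "dor", "awards"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                       # term frequency in this doc
    df = sum(1 for d in docs if term in d)     # documents containing the term
    return tf * math.log(N / df)

# "messi" appears in 2 of 3 docs, so its weight is damped;
# "ballon" appears in only 1, so it is boosted.
print(round(tf_idf("messi", docs[0]), 3))
print(round(tf_idf("ballon", docs[2]), 3))
```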

The below image is a word cloud built from the document-term matrix we made earlier. We can produce it with the following code:

library(wordcloud)      # provides wordcloud()
library(RColorBrewer)   # provides brewer.pal()
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wordcloud(names(freq), freq, min.freq=400, max.words=Inf, random.order=FALSE, colors=brewer.pal(8, "Accent"), scale=c(7,.4), rot.per=0)

The image shows that we would need to clean the data further to get better results. Since the aim of this article is the basic implementation of the document-term matrix, we will stay focused on that. Let's see how we can do the same in the Python programming language.

Implementation in Python

In this section of the article, we will see how to make the document-term matrix using Python and its libraries. There are various ways to do this in Python. Before going through any of them, let's define the documents; here we take the sentences from the table given above.

sentence1 = "I love football"
sentence2 = "Messi is a great football player"
sentence3 = "Messi has won seven Ballon d’Or awards "

As mentioned, Python offers various ways to do this; here we discuss two of the simplest. The first is to use functions from the pandas and scikit-learn libraries. Let's see how.

Using Pandas

Importing the libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Adding the sentences 

docs = [sentence1, sentence2, sentence3]
print(docs)

Output:

Defining and fitting the count vectorizer on the document.

vec = CountVectorizer()
X = vec.fit_transform(docs)

Converting the vector on the DataFrame using pandas 

df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # on scikit-learn < 1.0, use get_feature_names()
df.head()

Output:

Here we can see the document-term matrix of the documents we defined. Now let's look at the second way, using a library named textmining, which has a function for making the document-term matrix from text data.

Using Text Mining 

Installing the library:

pip install textmining3

Output:

Initializing the term-document matrix object:

import textmining
tdm = textmining.TermDocumentMatrix()
print(tdm)

Output:

The output shows the type of object we have created for building the term-document matrix.

Adding the documents to the matrix:

tdm.add_doc(sentence1)
tdm.add_doc(sentence2)
tdm.add_doc(sentence3)

Converting the term-document matrix into a pandas DataFrame:

tdm=tdm.to_df(cutoff=0)
tdm

Output:

Here we can see the document term matrix which we have created using the text mining library.

Application of Term-Document Matrix 

Making a term-document matrix from text data is one of the intermediate steps of a typical NLP project. The term-document matrix can be used in various NLP tasks; some of them are as follows:

  • Performing singular value decomposition on the term-document matrix can improve search results to an extent. Used in a search engine, it helps disambiguate polysemous words and match synonyms of the query.
  • Many NLP processes focus on mining behavioural signals from a corpus of text, and term-document matrices are very helpful for extracting them. Performing multivariate analysis on the matrix can reveal the different themes of the data.
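The SVD idea in the first bullet can be sketched with NumPy on a tiny term-document matrix (the matrix values are invented for illustration):

```python
import numpy as np

# Rows = terms, columns = documents (a tiny illustrative term-document matrix)
tdm = np.array([[2.0, 1.0, 0.0, 0.0],
                [1.0, 2.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 2.0],
                [0.0, 0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)

k = 2  # keep the top-k latent "topics"
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # each document in k-dim latent space

# Documents 0 and 1 share terms, so they land close together in latent space
print(doc_vectors.round(2))
```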

Final Words

In this article, we have seen what a term-document matrix is, with an example, and how to build one using the R and Python programming languages. In the end, we also discussed some major applications of the term-document matrix.


The post A Guide to Term-Document Matrix with Its Implementation in R and Python appeared first on AIM.

]]>
NeuSpell: A Neural Net Based Spelling Correction Toolkit https://analyticsindiamag.com/ai-mysteries/neuspell-a-neural-net-based-spelling-correction-toolkit/ Sat, 18 Dec 2021 12:30:00 +0000 https://analyticsindiamag.com/?p=10056157

Spell check features, or spell checkers, are software applications that check words against a digital dictionary to ensure they are correctly spelled. Words that are identified as misspelled by the spell checker are usually highlighted or underlined.

The post NeuSpell: A Neural Net Based Spelling Correction Toolkit appeared first on AIM.

]]>

Spell check features, or spell checkers, are software applications that check words against a digital dictionary to ensure they are correctly spelled. Words that are identified as misspelled by the spell checker are usually highlighted or underlined. Among the numerous spell-checking tools and applications available, this post will focus on NeuSpell, a neural-network-based, Python spell-checking toolkit. The following are the key points that will be addressed in this article.

Table of Contents

  1. How Does a Spell Checker Work?
  2. Under the Hood of NeuSpell
  3. Models in NeuSpell
  4. Implementation details of NeuSpell
  5. Implementing NeuSpell

Let’s start the discussion by understanding how various tools work for spell correction.

How Does a Spell Checker Work?

When presenting a document to clients, professors, or any other audience, saying something smart and valuable is crucial. However, if your content is riddled with typos, misspellings, and errors, most people are likely to overlook it. Perfect copy is a sign of professionalism, and most businesses expect nothing less from their documentation. A spell checker program or the spell checking functions provided by a word processor are two useful tools that computer users can use to edit their documents.

The most common type of error in written text is misspelt words. As a result, spell checkers are commonplace, appearing in a variety of applications such as search engines, productivity and collaboration tools, messaging platforms, and so on. Many high-performing spelling correction systems, on the other hand, are developed by businesses and trained on massive amounts of proprietary user data. 

Many freely available off-the-shelf correctors, such as Enchant, GNU Aspell, and JamSpell, on the other hand, do not make effective use of the misspelt word's context. For example, they fail to use context to decide whether thaught should be taught or thought: "Who thaught you calculus?" vs. "I never thaught I'd be given the fellowship."

Under the Hood of NeuSpell

In their paper, Sai Muralidhar et al. propose NeuSpell, a spelling-correction toolkit consisting of several neural models that accurately capture the context around misspellings. To train these neural spell correctors, they curate synthetic training data for spelling correction in context using several text-noising strategies.

For word-level noising, these strategies use a lookup table, and for character-level noising, a context-based character-level confusion dictionary. Isolated misspelling-correction pairs harvested from various publicly available sources populate this lookup table and confusion dictionary.

NeuSpell is an open-source toolkit for English spelling correction. It includes ten different models, tested against naturally occurring misspellings from a variety of sources. When the models are trained on these synthetic examples, correction rates improve by 9% (absolute) compared to training on randomly sampled character perturbations.

The correction rate is increased by another 3% when richer contextual representations are used. This toolkit allows users to use proposed and existing spelling correction systems through a unified command line and a web interface.

Models in NeuSpell

This toolkit includes ten different spelling-correction models: (i) two commercially available non-neural models, (ii) four published neural models for spelling correction, and (iii) four extensions of those models. The following are the details of the first six systems:

SC-LSTM

It uses semi-character representations fed through a bi-LSTM network to correct misspelt words. The semi-character representations combine one-hot embeddings for the first, last, and bag of internal characters.

CHAR-LSTM-LSTM

The model creates word representations by feeding each character into a bi-LSTM. These representations are then fed into a second biLSTM that has been trained to predict the corrective action.

CHAR-CNN-LSTM

This model, like the previous one, uses a convolutional network to create word-level representations from individual characters.

BERT

A pre-trained transformer network is used in the model. The word representations are obtained by averaging the sub-word representations, which are then fed to a classifier to predict its correction.

GNU Aspell

To score candidate words, it employs a combination of the Metaphone phonetic algorithm, Ispell’s near-miss strategy, and a weighted edit distance metric.

They enhanced the SC-LSTM model with deep contextual representations from pre-trained ELMo and BERT to better capture the context around a misspelt token. Because the best point to integrate such embeddings varies by task, the embeddings are appended either to the semi-character embeddings before the bi-LSTM or to the bi-LSTM's output. The toolkit currently includes four such trained models: ELMo or BERT coupled with a semi-character bi-LSTM model at the input or the output.

Implementation Details of NeuSpell

In NeuSpell, neural models are trained by treating spelling correction as a sequence labelling task: a correct word is labelled as itself, and a misspelt word is labelled with its correction. The label UNK is used for corrections that aren't in the vocabulary. Models are trained, via a softmax layer, to output a probability distribution over a finite vocabulary for each word in the input text sequence.
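A toy view of this labelling scheme (the vocabulary and correction table below are invented for illustration; the real models predict the labels with neural networks over large vocabularies):

```python
# Toy view of spelling correction as sequence labelling: each input token
# is labelled with its correction (itself if already correct), and tokens
# we cannot map into the vocabulary get the UNK label.
vocab = {"i", "look", "forward", "to", "receiving", "your", "reply"}
corrections = {"luk": "look", "foward": "forward", "receving": "receiving"}

def label_sequence(tokens):
    labels = []
    for tok in tokens:
        if tok in vocab:
            labels.append(tok)               # correct word: labelled as itself
        elif corrections.get(tok) in vocab:
            labels.append(corrections[tok])  # misspelling: labelled with fix
        else:
            labels.append("UNK")             # out of vocabulary
    return labels

print(label_sequence(["i", "luk", "foward", "to", "receving", "your", "reply"]))
# ['i', 'look', 'forward', 'to', 'receiving', 'your', 'reply']
```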

During training, they used 50, 100, 100 and 100 convolution filters of lengths 2, 3, 4 and 5 in the CNNs, and set the hidden size of the bi-LSTM network in all models to 512. A dropout of 0.4 was applied to the bi-LSTM outputs, and the models were trained using cross-entropy loss.

For models with a BERT component, they used the BertAdam optimizer, and the Adam optimizer for the rest, both with default parameter settings. They used a batch size of 32 examples and trained with a patience of 3 epochs.

During inference, UNK predictions are replaced with their corresponding input words before evaluation. The models are then assessed on accuracy (percentage of correct words among all words) and word correction rate (percentage of misspelt tokens corrected).

To use ELMo and BERT, the libraries AllenNLP and Huggingface were used. The Pytorch library is used to implement all of the neural models in this toolkit, and they are compatible with both CPU and GPU environments.

Now let’s see how we can implement NeuSpell.

Implementing NeuSpell

To move further, we need to install NeuSpell, either by cloning its official repository and installing the dependencies from the requirements.txt file as described there, or directly with the pip command pip install neuspell.

Import all the dependencies 

import neuspell
from neuspell import BertChecker, CnnlstmChecker

Now instantiate the BertChecker Class and download the pre-trained model.

checker_bert = BertChecker()
# Download BERT Pre-trained model
checker_bert.from_pretrained()

Now let’s take some samples of incorrectly spelled sentences and see how the model can correct them. 

checker_bert.correct("I luk foward to receving your reply")

And here is the output.

Let’s take another example,

checker_bert.correct_strings(["Thee wors are often used together. You can go to the defition of spellig or the defintion of mistae. Or, see other combintions with mistke.", ])

A beautiful thing about this toolkit is that we can pass a text file directly and get back a cleaned version as a text file, using just a single line of code:

checker_bert.correct_from_file(src="/content/History_100.txt")

The above code returns a clean_version.txt in the local directory.

We can also evaluate the corrections on our text files: pass the original clean file and the corrupted file to checker_bert.evaluate(), as shown below.

checker_bert.evaluate(clean_file="clean_version.txt", corrupt_file="History_100.txt")

Final Words 

Through this post, we have seen the vital role spell-checking tools can play. Among the various toolkits available, we discussed NeuSpell, a spelling-correction toolkit with ten different models, and how, unlike popular open-source spell checkers, its models accurately capture the context around misspelt words.


The post NeuSpell: A Neural Net Based Spelling Correction Toolkit appeared first on AIM.

]]>
Most Popular NLP Papers Of 2021 https://analyticsindiamag.com/ai-mysteries/most-popular-nlp-papers-of-2021/ Fri, 17 Dec 2021 11:30:00 +0000 https://analyticsindiamag.com/?p=10056151 NLP, NLP Papers

Natural Language Processing includes the analysing of data to extract and process meaningful information.

The post Most Popular NLP Papers Of 2021 appeared first on AIM.

]]>
NLP, NLP Papers

Natural Language Processing or NLP is a technique to teach computers to process and comprehend human/natural languages. NLP is a part of data science and includes the analysis of data to extract, process, and output meaningful information. Some of the important applications of NLP include: 

  • Text mining 
  • Text and sentiment analysis 
  • Speech generation 
  • Speech classification 
  • Text classification 

In this article, Analytics India Magazine lists the top NLP papers of 2021 that one must read. These papers are information repositories that can help one stay at the top of their NLP game.

(Note that the list is in no particular order.)

Dynabench: Rethinking Benchmarking in NLP 

This year, researchers from Facebook and Stanford University open-sourced Dynabench, a platform for model benchmarking and dynamic dataset creation. Dynabench runs on the web and supports human-and-model-in-the-loop dataset creation. It addresses how contemporary models quickly achieve performance on benchmark tasks but fail on simple examples or real-world scenarios. Dynabench helps in dataset creation, model development, and model assessment which leads to more robust and informative benchmarks.

Causal Effects of Linguistic Properties 

This paper on Causal Effects of Linguistic Properties deals with the problem of using observational data. The paper addresses challenges related to the problem before developing a practical method. Based on the results, it introduces TextCause, an algorithm to estimate the causal effects of linguistic properties. It leverages distant supervision to improve the quality of noisy proxies, and BERT, the pre-trained language model, to adjust for the text. Finally, it presents an applied case study to investigate the effects. The paper was presented at NAACL 2021.

Transformer-based Binary Word Sense Disambiguation 

Released at the second International Conference on NLP and Big Data, this paper treats the word-sense disambiguation problem as a classification task and presents a transformer-based model for text ambiguity. Transformers have shown improvements in recent solutions for NLP tasks; here, the goal is to find the correct meaning of every word in a given text. The paper further shows how using pre-trained transformer models improves the accuracy of the architecture, and its experiments showcase how NLP task performance can be improved with the help of data augmentation techniques.

Single Headed Attention RNN: Stop thinking with your head 

Published by Harvard University graduate Steven Merity, the paper 'Single Headed Attention RNN: Stop thinking with your head' introduces a state-of-the-art NLP model called the Single Headed Attention RNN, or SHA-RNN. The author demonstrates it by pairing an LSTM model with single-headed attention to achieve state-of-the-art, byte-level language-model results on enwik8.

NLP applied on issue trackers 

The NLP applied on issue trackers paper discusses various NLP techniques, including topic analysis, similarity algorithms (n-grams, Jaccard, the LSI algorithm) and descriptive statistics, along with machine learning (ML) algorithms such as support vector machines (SVMs) and decision trees. These techniques are typically used to better understand the characteristics, classification and lexical relations of development tasks and to predict duplicates. By tuning the features used to predict development tasks with a fidelity loss function, a system can identify duplicate tasks with almost 100 percent accuracy.

Attention in Natural Language Processing

Attention is a popular mechanism in neural architectures and has been realised in various formats. However, owing to the fast-paced advances in this domain, a systematic overview of attention is still missing. This paper defines a unified model for attention architectures in NLP, focusing on those designed to work with vector representations of textual data. The authors propose a taxonomy of attention models along four dimensions:

  • Representation of input 
  • Compatibility function 
  • Distribution function 
  • Multiplicity of the input and output 

Additionally, the paper provides instances of how prior information can be exploited in attention models, discusses ongoing research efforts and open challenges, and gives an extensive categorisation of the huge body of literature.

The post Most Popular NLP Papers Of 2021 appeared first on AIM.

]]>
Google Adds Fast Wordpiece Tokenization To Tensorflow https://analyticsindiamag.com/ai-news-updates/google-adds-fast-wordpiece-tokenization-to-tensorflow/ Tue, 14 Dec 2021 07:17:45 +0000 https://analyticsindiamag.com/?p=10055885

Google’s LinMaxMatch approach improves performance, makes computation faster and reduces complexity

The post Google Adds Fast Wordpiece Tokenization To Tensorflow appeared first on AIM.

]]>

Google presented 'Fast WordPiece Tokenization' at EMNLP 2021, introducing an improved end-to-end WordPiece tokenisation system. It speeds up the tokenisation process, saving computing resources and reducing overall model latency. In comparison to traditional algorithms, the approach performs up to 8x faster, as it reduces the computational complexity by an order of magnitude. Google has applied it in a number of systems and has released it in TensorFlow Text.

Using the greedy longest-match-first strategy to tokenise a single word, WordPiece iteratively picks the longest prefix of the remaining text that matches a word in the model’s vocabulary. This approach is called MaxMatch and has been used in Chinese word segmentation since the 1980s. Despite its wide use in NLP, it is still computation intensive. Google has proposed an alternative, LinMaxMatch, whose tokenisation time is strictly linear with respect to the input length n.
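The greedy longest-match-first loop can be sketched as follows. The ‘##’ continuation prefix and ‘[UNK]’ fallback follow WordPiece conventions, but this is an illustrative sketch rather than Google's implementation; note how the inner loop re-scans shrinking spans, which is the quadratic worst-case behaviour LinMaxMatch removes:

```python
def wordpiece_maxmatch(word: str, vocab: set) -> list:
    """Greedy longest-match-first (MaxMatch) WordPiece tokenisation of a
    single word. Non-initial pieces carry the '##' continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate  # longest matching prefix found
                break
            end -= 1               # shrink the span and retry
        if piece is None:
            return ["[UNK]"]       # no prefix matched: word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##ably"}
print(wordpiece_maxmatch("unaffable", vocab))  # ['un', '##aff', '##able']
```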

In other words, if trie matching cannot match an input character at a given node, the standard algorithm backtracks to the last character where a token was matched and restarts the trie matching procedure from there, resulting in repetitive and wasteful iterations. Instead of backtracking, Google’s method triggers a ‘failure transition’ in two steps: it first collects the precomputed tokens stored at that node, known as ‘failure pops’, and then follows the precomputed ‘failure link’ to a new node, from where trie matching continues. Since n operations are required just to read the entire input, LinMaxMatch is asymptotically optimal for the MaxMatch problem.

Existing systems pre-tokenise the input text by splitting it into words on whitespace, punctuation and special characters, and then call WordPiece tokenisation on each resulting word. Google has instead proposed an end-to-end WordPiece tokeniser that combines pre-tokenisation and WordPiece into a single, linear-time pass. It uses LinMaxMatch trie matching and failure transitions, and checks for whitespace and punctuation only among the few input characters not handled by the loop. This makes it more efficient: it traverses the input only once, performs fewer whitespace/punctuation checks, and skips the creation of intermediate words.
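A simplified sketch of the single-pass idea, flushing each word into WordPiece as soon as it ends rather than building an intermediate word list, might look like this. It omits the trie and failure transitions that carry the real speed-up, and all names and the tiny vocabulary are illustrative assumptions:

```python
import string

PUNCT = set(string.punctuation)

def end_to_end_tokenize(text: str, vocab: set) -> list:
    """One linear scan over the text: accumulate word characters, and run
    WordPiece on a word the moment whitespace/punctuation ends it."""
    def wordpiece(word):
        tokens, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = ("##" if start else "") + word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
                end -= 1
            else:
                return ["[UNK]"]   # no prefix matched at this position
        return tokens

    out, word = [], []
    for ch in text + " ":          # trailing sentinel flushes the last word
        if ch.isspace() or ch in PUNCT:
            if word:
                out.extend(wordpiece("".join(word)))
                word = []
            if ch in PUNCT:
                out.append(ch if ch in vocab else "[UNK]")
        else:
            word.append(ch)
    return out

vocab = {"un", "##aff", "##able", "hello", ","}
print(end_to_end_tokenize("hello, unaffable", vocab))
```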

Creating A ML Solution That Accurately Extracts Quotes From News Articles https://analyticsindiamag.com/ai-news-updates/creating-a-ml-solution-that-accurately-extracts-quotes-from-news-articles/ Tue, 30 Nov 2021 12:15:08 +0000 https://analyticsindiamag.com/?p=10054578


The post Creating A ML Solution That Accurately Extracts Quotes From News Articles appeared first on AIM.


The Guardian recently announced that it has joined forces with Agence France-Presse (AFP) to work on a machine learning solution that accurately extracts quotes from news articles and matches them with the right source. The company says that the existing solutions did not work that well on their content, and the models struggled to recognise quotes that did not match a classic pattern. Some models were returning too many false positives and identifying generic statements as quotes.

Co-referencing, or the process of establishing the source of a quote by finding the correct reference in the text, was also an issue, especially when the source’s name was mentioned in several sentences or even paragraphs before the quote itself. 

To train a model to identify quotes in text, the company used two tools created by Explosion: spaCy, one of the leading open-source libraries for advanced natural language processing using deep neural networks, and Prodigy, an annotation tool that provides an easy-to-use web interface for quick and efficient labelling of training data.

Together with AFP, the team manually annotated more than 800 news articles with three entities: content (the quote, in quotation marks), source (the speaker, which might be a person, an organisation, etc), and cue (usually a verb phrase, indicating the act of speech or expression).
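As a toy illustration of this three-entity schema (content, source, cue), a rule-based extractor for the classic quote patterns might look like the following. This is not The Guardian's spaCy model; the regexes, cue list, and sample sentence are assumptions for demonstration, and exactly the kind of rigid pattern that the learned model was built to move beyond:

```python
import re

# Two classic patterns: "quote," said Source  /  Source said: "quote"
QUOTE_PATTERNS = [
    re.compile(r'"(?P<content>[^"]+?)[,.]?"\s*(?P<cue>said|told|added)\s+'
               r'(?P<source>[A-Z]\w+(?:\s+[A-Z]\w+)*)'),
    re.compile(r'(?P<source>[A-Z]\w+(?:\s+[A-Z]\w+)*)\s+(?P<cue>said|told|added)'
               r':?\s*"(?P<content>[^"]+)"'),
]

def extract_quotes(text: str) -> list:
    """Return (content, cue, source) triples for quotes matching a pattern."""
    triples = []
    for pat in QUOTE_PATTERNS:
        for m in pat.finditer(text):
            triples.append((m.group("content"), m.group("cue"), m.group("source")))
    return triples

sample = '"The model still struggles with nested quotes," said Jane Doe.'
print(extract_quotes(sample))
```

Quotes that break these patterns, or whose speaker is named paragraphs earlier, are precisely where such rules fail and an annotated training set pays off.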

The main challenge in building the training dataset was navigating the ambiguity of different journalistic styles. The first batch of annotations turned out to be quite noisy and inconsistent, but the team got better with each iteration.

The model correctly identified all three entities (content, source, cue) in 89% of cases. Considering each entity separately, content scored highest (93%), followed by cue (86%) and source (84%).

The company says that it looks forward to building a robust co-reference resolution system and exploring further deep learning. Challenges such as identifying meaningful quotes and content will also be addressed. 
