Two years ago, OpenAI released GPT-3, a model with 175 billion parameters. Since then, large language models (LLMs) have been all the rage.
On Wednesday, Meta AI and Papers with Code announced the release of Galactica, an open-source large language model with 120 billion parameters, trained on scientific knowledge. The generative AI tool is intended to aid academic researchers by producing extensive literature reviews, generating Wiki articles on any topic, generating lecture notes on scientific texts, answering questions, solving complex mathematical problems, annotating molecules and proteins, and more.
Galactica is trained on a large corpus of scientific papers, research material, knowledge bases and numerous other sources, spanning scientific text as well as modalities such as proteins and compounds.
Outputs drawing on Galactica's vast knowledge base can be generated by simply entering a prompt at galactica.org.
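For readers who want to experiment outside the web demo, below is a minimal sketch of prompting one of the smaller publicly released Galactica checkpoints with the Hugging Face transformers library. The checkpoint name (facebook/galactica-125m) and the API usage shown are assumptions based on the public release, not something described in this article.

```python
# Minimal sketch: prompting a small Galactica checkpoint locally.
# Assumption: the "facebook/galactica-125m" checkpoint is available on the
# Hugging Face Hub and loads through the standard OPT model classes.
from transformers import AutoTokenizer, OPTForCausalLM

MODEL_NAME = "facebook/galactica-125m"  # assumed smallest released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = OPTForCausalLM.from_pretrained(MODEL_NAME)

# An open-ended scientific prompt; the model continues the text.
prompt = "The Schwarzschild radius of a black hole is defined as"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))
```

Larger checkpoints (up to the 120-billion-parameter model) would load the same way, given sufficient hardware.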
The new model is designed to tackle the problem of information overload when accessing scientific information through search engines, which do not organise scientific knowledge properly. In contrast, Galactica is built with the mission of organising science: a model that can store, combine and reason about scientific knowledge.
The published research shows that Galactica outperforms other models on several metrics:
(i) It beats the latest GPT-3 on technical knowledge probes such as LaTeX equations, scoring 68.2% versus 49.0%.
(ii) On reasoning, it surpasses Chinchilla on mathematical MMLU, scoring 41.3% to Chinchilla's 35.7%, and PaLM 540B on MATH, with 20.4% versus 8.8%.
(iii) It also outperforms BLOOM and OPT-175B on BIG-bench despite not being trained on a general corpus.
The paper can be accessed here.
However, the AI community has been quick to point out issues with the model. David Chapman took to Twitter to explain how poor the generated output was, drawing on examples from the Hacker News discussion forum.
Still, issues with the model aside, the scientific community has lauded Meta's efforts in collating and indexing scientific works, databases and code bases.
Large language model breakthroughs
Besides GPT-3 and Galactica, LLMs such as YaLM have 100 billion parameters, while models such as BLOOM and PaLM have 176 billion and 540 billion parameters, respectively. We have also seen the rise of protein language models tackling the decades-old protein folding problem and, in the most recent development, the GenSLM model, which is able to predict Covid variants.
Moreover, we are in the midst of an age of ‘text-to-anything’ tools built on massive language models and developed by companies like OpenAI, Microsoft and Google. To that long list, we can now add ‘text-to-science-research’ as the latest AI tool disrupting existing processes of scientific research and publication.