
Behind Hugging Face’s BigScience project, which mobilizes research on large language models

Companies like Google and Facebook are deploying large language models (LLMs) for translation and content moderation. Meanwhile, OpenAI’s GPT-2 and GPT-3 are among the most powerful language models, capable of writing exceptionally convincing passages of text in a range of styles (as well as complete musical compositions and working computer code). Start-ups, too, are creating dozens of their own LLM products and services based on the models built by these tech giants.

Very soon, all of our digital interactions will likely be filtered through LLMs, which is bound to have a fundamental impact on our society.

Yet despite the proliferation of this technology, very little research is being done on the environmental, ethical and societal concerns it raises. Today, the tech giants hold all the power to determine how this transformative technology develops: AI research is expensive, and they are the ones with the deep pockets. That gives them the power to censor or obstruct research that casts them in a bad light.

The dangers of LLMs

There are a number of concerns about the rapid growth of LLMs that many leaders in the AI community say are under-researched by big tech companies. These include:

  • The data used to build these models is often collected unethically and without consent.
  • The models are conversationally fluent and can pass for human, but they do not understand what they are saying and often propagate racism, sexism, self-harm, and other dangerous views.
  • Today, many of the advanced features of LLMs are only available in English, which makes their use for content moderation unsafe in non-English-speaking countries.
  • When fake news, hate speech, and death threats are not filtered out of the dataset, they become training data for the next generation of LLMs, perpetuating (or escalating) toxic language on the internet.

What is the BigScience project?

The BigScience project, led by Hugging Face, is a one-year research workshop that draws on earlier models of scientific collaboration (such as CERN in particle physics) to tackle the lack of ongoing research on multilingual models and datasets. Project leaders don’t believe they can stop the hype surrounding large language models, but they hope to push it in a direction that will make it more beneficial to society.

The idea is for program participants (all of whom are volunteers) to study the capabilities and limitations of these datasets and models from all angles. The central question they seek to answer is how and when LLMs should be developed and deployed so that we can reap their benefits without having to face the challenges they pose.

To do this, the group of researchers aims to create a very large multilingual neural network language model and a very large multilingual text dataset on a supercomputer provided to them by the French government.

How does BigScience do a better job than tech companies?

Unlike research conducted in tech companies, where researchers have primarily technical expertise, BigScience has drawn on researchers from a much wider range of countries and disciplines. It includes researchers specializing in AI, NLP, social sciences, law, ethics and public policy, in order to make the process of building models a truly collaborative effort.

At present, the program consists of 600 researchers from 50 countries and more than 250 institutions. They are divided into a dozen working groups, each addressing a different facet of the development and investigation of the model: one group measures the environmental impact of the model, another develops and assesses the “multilingualism” of the model, another is developing responsible means of sourcing training data, and yet another transcribes historical radio archives or podcasts.

If things work out, the project could inspire people in the industry (many of whom are involved in the project) to incorporate some of these approaches into their own LLM strategy and create new standards within the NLP community.