Code completion has been one of the most prominent use cases of large language models (LLMs). GitHub Copilot, the popular AI tool, has been used by over a million developers and 200,000 enterprises.
However, widely used code generation tools like GitHub Copilot, AWS CodeWhisperer or Google Duet AI are not open source. Enterprises do not know what code these models were trained on, which is a significant concern, especially for those in highly scrutinised industries.
Thus, project BigCode, an open scientific collaboration run by Hugging Face and ServiceNow Research, was born. It recently released StarCoder 2, which is trained on a larger dataset (7.5 terabytes) than its predecessor and covers 619 programming languages.
StarCoder 2 comes in three sizes – 3-billion, 7-billion and 15-billion-parameter models.
While a few open-source code LLMs already exist, the 15-billion-parameter StarCoder 2 model, trained by NVIDIA, matches and at times even surpasses 33-billion-parameter models like Code Llama on many evaluations.
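For developers who want to try the models, they are distributed through the Hugging Face Hub and can be loaded with the transformers library. The snippet below is a minimal code-completion sketch using the 3-billion-parameter checkpoint; the bigcode/starcoder2-3b model ID and the generation settings are assumptions based on the standard Hugging Face workflow, not official guidance from the BigCode team.

```python
# Minimal code-completion sketch with Hugging Face transformers.
# Assumes the checkpoint is published as "bigcode/starcoder2-3b" on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumed model ID; swap in the 7B or 15B variant if needed

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # half precision so the model fits on a single GPU
    device_map="auto",
)

prompt = "def fibonacci(n: int) -> int:\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Base code models complete text rather than follow instructions, so we simply continue the prompt.
outputs = model.generate(inputs.input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```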
How do enterprises benefit from open source?
According to Leandro von Werra, machine learning engineer at Hugging Face and co-lead of the BigCode project, StarCoder 2 will empower the developer community to build a wide range of applications more efficiently with full data and training transparency.
Besides being free to use, StarCoder 2 brings further benefits for developers and enterprises, according to Werra.
“For many companies, using GitHub Copilot is tricky from a security perspective, because it requires employing the endpoint that Copilot uses, which is not retained in their environments. You’re sending parts of your code to that endpoint and you have no control over where exactly that code goes.
“Given that code represents a crucial aspect of intellectual property for many companies, we’ve received numerous inquiries requesting an open version to utilise such services securely,” Werra told AIM.
Moreover, enterprises don’t know what code went into the model during training. This lack of transparency poses a potential liability, especially if the model generates copyrighted code.
However, Werra adds that this is a problem that even his team has not been able to solve fully. “We’re doing licence detection, but it’s not 100% accurate. It’s nearly impossible to do it at that scale 100% correctly, but at least we provide full transparency in what went into it and how we filter data,” he said.
Fine-tuning StarCoder 2
While the points above concern security, the biggest benefit of StarCoder 2 for enterprises is that they can take the model and fine-tune it on their own data.
Many enterprises, for instance, have their own coding styles or internal standards, which may differ from the codebases used to train code LLMs.
“By leveraging their own codebase, they streamline processes, avoiding the need for extensive rewriting, such as fixing styles or updating docstrings, often accomplished effortlessly.
Alternatively, they can fine-tune the model for specific use cases, catering to tasks like text-to-SQL code conversion or translating legacy COBOL code to modern languages. This ability to fine-tune models based on their data enables companies to address specialised needs effectively,” Werra said.
For example, while a dedicated model may be more comprehensive for a specific SQL use case, fine-tuning allows for customisation, giving companies the flexibility to tackle a variety of scenarios, a prospect that excites enterprises.
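As an illustration of what such customisation might look like in practice, the sketch below fine-tunes a StarCoder 2 checkpoint on an internal code dataset using parameter-efficient LoRA adapters via the peft and transformers libraries. The model ID, dataset path, target modules and hyperparameters are illustrative assumptions, not a recipe published by BigCode.

```python
# Hypothetical LoRA fine-tuning sketch for StarCoder 2 on an internal codebase.
# Dataset path, model ID and hyperparameters are placeholders, not official values.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "bigcode/starcoder2-3b"           # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token      # causal LMs usually ship without a pad token
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Wrap the model with small trainable LoRA adapters instead of updating all weights.
# The attention projection names below are assumed from the architecture; verify against the checkpoint.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Internal code files collected as {"text": ...} records in a JSON Lines file (placeholder path).
dataset = load_dataset("json", data_files="internal_code.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="starcoder2-internal", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("starcoder2-internal-lora")  # only the small adapter weights are written
```

Because only the adapter weights are trained and saved, this kind of fine-tuning can run on modest hardware and keeps the proprietary code entirely inside the company's own environment.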
StarCoder 2 is already being used by ServiceNow, which trained the 3-billion-parameter model. According to Werra, a dozen other enterprises have also started leveraging StarCoder 2.
Previously, VMware, an American cloud computing and virtualisation company, successfully deployed a fine-tuned version of the original StarCoder.
Businesses subject to stringent security regulations, such as those in the financial or healthcare sectors, would most likely adopt open-source models. These companies face challenges in sharing data with third parties due to heightened scrutiny.
It is important to note that other code LLMs, like Code Llama, can also be fine-tuned. However, Meta has not released the training dataset, beyond stating that the model was trained on widely available public data.
Will enterprises pivot to open source?
Using open-source technologies comes with its own set of challenges. Despite the promised benefits of StarCoder 2 and its adoption by a handful of enterprises, the question is: will we see wider enterprise adoption?
Werra believes that it is probable, as many enterprises initially opt for closed LLMs due to their accessibility and ease of use. However, as companies mature and streamline their use cases, there is a growing desire for models that offer total control. This trend holds true for code LLMs as well.
“Decades ago, software development primarily relied on off-the-shelf solutions. However, the landscape has changed, with many companies, especially IT firms, crafting their own software solutions at the core of their operations.
“Similarly, a parallel trend is emerging with LLMs. While off-the-shelf models serve a broad range of tasks competently, for more specialised or dedicated applications, fine-tuning an open model remains the preferred approach,” Werra said.
Based on open-source principles
The BigCode team has open-sourced the model weights and dataset. “We released The Stack v1 a year ago, and now we have released The Stack v2,” Werra said.
However, even though the models are released under an OpenRAIL licence, there are some restrictions.
For example, “You can’t use the model to extract Personally Identifiable Information (PII) from the pretraining data or generate potentially malicious code,” Werra warned. Nonetheless, StarCoder 2 is available for commercial use.