Yesterday, OpenAI announced GPT-4, its long-awaited next-generation AI language model. The system’s capabilities are still being assessed, but as researchers and experts pore over its accompanying materials, many have expressed disappointment at one particular feature: that despite the name of its parent company, GPT-4 is not an open AI model.
OpenAI has shared plenty of benchmark and test results for GPT-4, as well as some intriguing demos, but has offered essentially no information on the data used to train the system, its energy costs, or the specific hardware or methods used to create it.
Many in the AI community have criticized this decision, noting that it undermines the company’s founding ethos as a research org and makes it harder for others to replicate its work. Perhaps more significantly, some say it also makes it difficult to develop safeguards against the sort of threats posed by AI systems like GPT-4, with these complaints coming at a time of increasing tension and rapid progress in the AI world.
“I think we can call it shut on ‘Open’ AI: the 98 page paper introducing GPT-4 proudly declares that they’re disclosing *nothing* about the contents of their training set,” tweeted Ben Schmidt, VP of information design at Nomic AI, in a thread on the topic.
Here, Schmidt is referring to a section in the GPT-4 technical report that reads as follows:
“Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
Speaking to The Verge in an interview, Ilya Sutskever, OpenAI’s chief scientist and co-founder, expanded on this point. Sutskever said OpenAI’s reasons for not sharing more information about GPT-4 — fear of competition and fears over safety — were “self evident”:
“On the competitive landscape front — it’s competitive out there,” said Sutskever. “GPT-4 is not easy to develop. It took pretty much all of OpenAI working together for a very long time to produce this thing. And there are many many companies who want to do the same thing, so from a competitive side, you can see this as a maturation of the field.”
“On the safety side, I would say that the safety side is not yet as salient a reason as the competitive side. But it’s going to change, and it’s basically as follows. These models are very potent and they’re becoming more and more potent. At some point it will be quite easy, if one wanted, to cause a great deal of harm with those models. And as the capabilities get higher it makes sense that you don’t want to disclose them.”
The closed approach is a marked change for OpenAI, which was founded in 2015 by a small group including current CEO Sam Altman, Tesla CEO Elon Musk (who resigned from its board in 2018), and Sutskever. In an introductory blog post, Sutskever and others said the organization’s aim was to “build value for everyone rather than shareholders” and that it would “freely collaborate” with others in the field to do so. OpenAI was founded as a nonprofit but later became a “capped profit” in order to secure billions in investment, primarily from Microsoft, with whom it now has exclusive business licenses.
When asked why OpenAI changed its approach to sharing its research, Sutskever replied simply, “We were wrong. Flat out, we were wrong. If you believe, as we do, that at some point, AI — AGI — is going to be extremely, unbelievably potent, then it just does not make sense to open-source. It is a bad idea… I fully expect that in a few years it’s going to be completely obvious to everyone that open-sourcing AI is just not wise.”
Opinions in the AI community on this matter vary. Notably, the launch of GPT-4 comes just weeks after another AI language model developed by Facebook owner Meta, named LLaMA, leaked online, triggering similar discussions about the threats and benefits of open-source research. Most initial reactions to GPT-4’s closed model, though, were negative.
Speaking to The Verge via DM, Nomic AI’s Schmidt explained that not being able to see what data GPT-4 was trained on made it hard to know where the system could be safely used and to come up with fixes for its problems.
“For people to make informed decisions about where this model won’t work, they need to have a better sense of what it does and what assumptions are baked in,” said Schmidt. “I wouldn’t trust a self-driving car trained without experience in snowy climates; it’s likely there are some holes or other problems that may surface when this is used in real situations.”
William Falcon, CEO of Lightning AI and creator of the open-source tool PyTorch Lightning, told VentureBeat that he understood the decision from a business perspective. (“You have every right to do that as a company.”) But he also said the move set a “bad precedent” for the wider community and could have harmful effects.
“If this model goes wrong, and it will, you’ve already seen it with hallucinations and giving you false information, how is the community supposed to react?” said Falcon. “How are ethical researchers supposed to go and actually suggest solutions and say, this way doesn’t work, maybe tweak it to do this other thing?”
Another reason suggested by some for OpenAI to hide details of GPT-4’s construction is legal liability. AI language models are trained on huge text datasets, with many (including earlier GPT systems) scraping information from the web — a source that likely includes material protected by copyright. AI image generators also trained on content from the internet have found themselves facing legal challenges for exactly this reason, with several firms currently being sued by independent artists and stock photo site Getty Images.
When asked if this was one reason why OpenAI didn’t share its training data, Sutskever said, “My view of this is that training data is technology. It may not look this way, but it is. And the reason we don’t disclose the training data is pretty much the same reason we don’t disclose the number of parameters.” Sutskever did not reply when asked if OpenAI could state definitively that its training data does not include pirated material.
Sutskever did agree with OpenAI’s critics that there is “merit” to the idea that open-sourcing models helps develop safeguards. “If more people would study those models, we would learn more about them, and that would be good,” he said. But OpenAI provided certain academic and research institutions with access to its systems for these reasons.
The discussion about sharing research comes at a time of frenetic change for the AI world, with pressure building on multiple fronts. On the corporate side, tech giants like Google and Microsoft are rushing to add AI features to their products, often sidelining previous ethical concerns. (Microsoft recently laid off a team dedicated to making sure its AI products follow ethical guidelines.) On the research side, the technology itself is seemingly improving rapidly, sparking fears that AI is becoming a serious and imminent threat.
Balancing these various pressures presents a serious governance challenge, said Jess Whittlestone, head of AI policy at UK think tank The Centre for Long-Term Resilience — and one that she said will likely need to involve third-party regulators.
“We’re seeing these AI capabilities move very fast and I am in general worried about these capabilities advancing faster than we can adapt to them as a society,” Whittlestone told The Verge. She said that OpenAI’s reasons for not sharing more details about GPT-4 were good, but that there were also valid concerns about the centralization of power in the AI world.
“It shouldn’t be up to individual companies to make these decisions,” said Whittlestone. “Ideally we need to codify what are practices here and then have independent third-parties playing a greater role in scrutinizing the risks associated with certain models and whether it makes sense to release them to the world.”