Nuances of Build-or-Buy Decisions
COSTS FOR BUYING
As Krug alluded to, cost is one of the most prominent factors in choosing between building and buying advanced machine learning models, particularly at enterprise scale. The primary difference is transparency: the costs of buying, which are largely based on API calls to the model provider, are far more transparent than the costs of building.
Organizations can decrease the cost of transmitting data through API calls via token compression, which involves techniques for reducing the number of tokens while still preserving their meaning. According to Krug, this approach “comes with trade-offs in terms of time and accuracy. If you’re only trying to optimize for cost, you do certain things. If you’re trying to optimize for speed, you do other things at the trade-off of cost. And, if you’re trying to do things for accuracy, you can trade off for cost or performance.”
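To make the trade-off concrete, the following is a minimal sketch of how token-based API spend might be estimated, with and without compression. The function name, the per-million-token prices, and the compression ratio are all illustrative assumptions rather than any provider's actual pricing, and the sketch ignores the time and accuracy costs Krug describes.

```python
def monthly_api_cost(requests_per_month, input_tokens, output_tokens,
                     price_in_per_million, price_out_per_million,
                     compression_ratio=1.0):
    """Estimate monthly spend for a token-priced model API.

    compression_ratio is the fraction of input tokens that remain after
    compression (1.0 means no compression). All prices are hypothetical.
    """
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    billed_input = input_tokens * compression_ratio
    # Cost of one request, in units of "dollars per million tokens" x tokens.
    per_request = (billed_input * price_in_per_million
                   + output_tokens * price_out_per_million)
    return requests_per_month * per_request / 1_000_000


# Hypothetical workload: 1M requests, 2,000 input and 500 output tokens each.
baseline = monthly_api_cost(1_000_000, 2_000, 500, 3.0, 15.0)
compressed = monthly_api_cost(1_000_000, 2_000, 500, 3.0, 15.0,
                              compression_ratio=0.6)
print(f"baseline:   ${baseline:,.0f} per month")
print(f"compressed: ${compressed:,.0f} per month")
```

In this sketch, compression only reduces the input side of the bill; output tokens are unaffected, which is one reason the savings are smaller than the compression ratio alone suggests.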
COSTS FOR BUILDING
The costs for the build approach are not as well-defined as those for the buy approach. Typically, they emerge over time. Here are some of the many costs organizations may not fully account for before choosing this option:
- Talent Acquisition and Retention: Whether an organization is merely fine-tuning an open source model for domain specificity or building an entire model from scratch, finding and keeping people with the necessary technical expertise can be pricey. According to Bradford, “A lot of times you have to find talent.” Doing so isn’t easy or guaranteed to be successful.
- Infrastructure: Part of what organizations pay for when they access models through a managed service provider's API is the provider hosting those models. Organizations forgoing that option must maintain their own infrastructure. “That’s quantifiable,” Krug noted. “This workload runs on this hardware. These cloud instances, therefore, cost me this much.”
- Time-to-Market: The time it takes to operationalize a model is a less tangible cost of do-it-yourself options. “How fast can you bring something out to market if you’re spending half of your time getting it to work?” Krug asked. “Scaling and upgrades and failure management and monitoring and all of the operational overhead is not trivial.”
- Scalability: Whether building bespoke models or adapting open source ones for specific use cases, organizations must account for the cost required to make them scalable for production settings. There are also costs for running them in production. “We’ve got a lot of Ph.D. research scientists; we run our own data centers and invest quite a large sum of dollars into the compute in those data centers to do the model training and the inference,” Rutgers remarked. “It’s a big investment.”
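The quantifiable part of these costs can be roughed out with simple arithmetic. The sketch below follows Krug's framing (this workload, on these instances, costs this much) and then asks how many requests per month a managed API would have to serve before it costs as much as self-hosting. Every figure, including the instance count, hourly rate, staffing cost, and per-request API price, is a hypothetical placeholder, and intangible costs such as time-to-market are deliberately left out.

```python
def monthly_self_hosting_cost(instances, hourly_rate,
                              hours_per_month=730,
                              staff_cost_per_month=0.0):
    """Infrastructure plus staffing cost for running a model in-house."""
    return instances * hourly_rate * hours_per_month + staff_cost_per_month


def break_even_requests(self_hosting_cost, api_cost_per_request):
    """Monthly request volume at which API spend equals self-hosting spend."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return self_hosting_cost / api_cost_per_request


# Hypothetical inputs: 4 GPU instances at $10/hour, $50,000/month in staff,
# and an API that costs $0.0135 per request.
hosting = monthly_self_hosting_cost(4, 10.0, staff_cost_per_month=50_000)
volume = break_even_requests(hosting, 0.0135)
print(f"self-hosting: ${hosting:,.0f} per month")
print(f"break-even:   {volume:,.0f} requests per month")
```

Below the break-even volume the API is cheaper under these assumptions; above it, self-hosting starts to pay off, provided the in-house deployment can actually handle that load.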
CONTROL
One of the chief benefits of deciding to employ open source models (or build models from scratch) is that organizations can control every aspect of them. Such control spans everything from the model’s individual parameters and weights to what data is involved in fine-tuning them and how models are deployed. Here are some of the important ramifications of having such autonomy:
- Security, Governance, and Data Privacy: Many LLM providers are mandated to preserve the data users transmit while accessing their models for a defined period of time, which could potentially span months. Consequently, organizations in highly regulated industries with stipulations about who is privy to their data “are more apt to build out some of these capabilities themselves,” Bradford mentioned. Interestingly, some third-party providers offer measures in which customers pay a fee “so they don’t retain the end users’ data beyond the time of inference,” Bradford indicated. For example, OpenAI has a zero data retention option that substantially decreases how long the provider keeps users’ data. Keeping data in virtual private clouds is another way organizations can reinforce their security and governance when they have complete control over their own AI models.
- Availability: Many third-party vendors have service level agreements that specify availability of approximately 99%. According to Bradford, “These guarantees may not be good enough for some customers, who are going to be sure that 100% of what they have is available. A lot of third-party vendors are just not going to be able to provide that level of availability, just because the cost of doing that is a lot more expensive.”
- Autonomy: The level of autonomy and specificity available when building AI systems vastly outpaces what the buying paradigm offers. “You can run it on whatever you want to run it on,” Krug said. “You can configure it the way you want to and pick and choose how to wire things together.”
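To put the availability figures above in perspective, an availability percentage can be converted into the amount of downtime it permits. The short sketch below does that conversion; the function name is illustrative, and the percentages other than 99% are added here only for comparison.

```python
def allowed_downtime_hours(availability_pct, period_hours=365 * 24):
    """Hours of downtime permitted over a period at a given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

A 99% guarantee permits roughly 87.6 hours of downtime a year, a little over three and a half days, which helps explain why some customers find it insufficient.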