
The Multimodal World of Enterprise AI


LLMS VS. OPEN SOURCE MODELS

LLMs make more effective multimodal models than smaller, open source language models for several reasons. The former category “embeds a lot of the knowledge about the world we live in that goes well beyond what you would think you would want to train your system on,” Allen indicated. “It’s surprising how general knowledge about the world helps these models perform better on mundane tasks, like pulling data out of some form.”

Open source models aren’t as big and don’t contain as much knowledge, which, for LLMs, is said to encompass the entire content of the internet. Consequently, open source models are far less accomplished at multimodality.

“I think it’s a human resource issue,” Capobianco mused. “If you’re training a model to recognize images, imagine how many images Google has access to when training their models versus someone trying to make an open source model. It’s a matter of access to data.” The advantage of open source models is cost: there’s a huge disparity between these free resources and pricey LLMs. Although it’s possible to find some that specialize in modalities other than text, “You kind of get what you pay for,” Capobianco warned. “It’s not just about the size of the model; it’s the intense mathematics that have to go into making something multimodal.”

AUDIO AND VOICE AI

As both Allen and Gupta alluded to, there are numerous textual applications, including document extraction, that are enhanced by the image modality. Another pervasive modality across industries involves voice and audio capabilities.

According to Allen, this modality may prove the most useful, given its numerous applications. For “customer service recordings, if you’ve got a problem with a product you bought or a service, they can record those interactions and analyze them to find or detect your tone of voice or other markers that they can use to identify trends,” Allen revealed.

The most ubiquitous deployments for this modality entail the following:

  • Quality Assurance: For this application, speech recognition is employed to ensure that employees in a contact center, for example, are performing as expected (see the sketch after this list). Implementation involves “all the transcriptions being recorded, converted to text, and [that] analytics is happening to train the call center,” remarked Deepgram chief strategy officer Anoop Dawar.
  • Agent Assistance: This sophisticated use case centers on low-latency speech recognition inputs and spoken—or textual—outputs to help human agents with customers or prospects. “AI is assisting in real time,” Dawar explained. “It’s listening in parallel to the call center representative and providing cues and hints to the representatives to improve the outcome.”
  • Autonomous Voice Agents: This advanced application employs multimodal agents as customer service representatives without human involvement. “You’re taking some or most of your IVR [interactive voice response] and fully offloading it to an autonomous agent that takes care of the whole conversation,” Dawar added.
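
A minimal sketch of the quality assurance pattern Dawar describes appears below. The `transcribe` helper and the phrase-checking rule are hypothetical stand-ins for a speech-to-text service and a contact center’s own analytics, not any vendor’s actual API.

```python
# Hypothetical sketch of a contact-center QA pipeline: recordings are
# transcribed, converted to text, and analyzed to coach the call center.
# `transcribe()` stands in for any speech-to-text service; it is not a
# specific vendor's API.
from dataclasses import dataclass

@dataclass
class QAResult:
    call_id: str
    transcript: str
    flagged: bool
    reason: str

# Illustrative rule: phrases agents are expected to use on every call.
REQUIRED_PHRASES = ["thank you for calling", "is there anything else"]

def transcribe(audio_path: str) -> str:
    """Placeholder for a speech-to-text call to a hosted STT provider."""
    raise NotImplementedError("wire up your speech-to-text provider here")

def review_call(call_id: str, audio_path: str) -> QAResult:
    text = transcribe(audio_path).lower()
    # Simple rule-based analytics over the transcript.
    missing = [p for p in REQUIRED_PHRASES if p not in text]
    return QAResult(
        call_id=call_id,
        transcript=text,
        flagged=bool(missing),
        reason=f"missing phrases: {missing}" if missing else "ok",
    )
```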

Additional use cases include employing voice AI agents to take orders at restaurants or drive-thru windows. The aforesaid speech recognition applications are noteworthy for two reasons. First, many operate on real-time voice inputs and produce textual, or voice, outputs without an intermediate speech-to-text transcription step, which was formerly the standard means of facilitating speech recognition. “There are things in audio that are never captured in transcription,” Dawar pointed out. “I can say hello in an excited way; as if I was tired; sarcastically; or angrily. When you transcribe it, you just get five letters, and you’ve lost all that meaning.” Second, some companies facilitate these applications with numerous multimodal models that may be selected or combined to produce the best result. Thus, not all multimodal AI applications are based on a single model.
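
Dawar’s point about meaning lost in transcription can be made concrete with prosodic features. The following is a rough, illustrative sketch using the open source librosa library; the feature set and the interpretations in the comments are assumptions, not a production tone detector.

```python
# Rough illustration: pitch and energy carry cues (excitement, fatigue,
# anger) that a plain transcript discards. Uses the open source librosa
# library; the interpretive comments are illustrative assumptions.
import numpy as np
import librosa

def prosody_features(audio_path: str) -> dict:
    y, sr = librosa.load(audio_path, sr=None)
    # Fundamental frequency (pitch) track; NaN where the signal is unvoiced.
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    rms = librosa.feature.rms(y=y)[0]  # frame-level energy
    return {
        "mean_pitch_hz": float(np.nanmean(f0)),
        "pitch_variability": float(np.nanstd(f0)),  # flat pitch may read as tired
        "mean_energy": float(rms.mean()),           # high energy may read as excited or angry
    }
```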

VISION LANGUAGE MODELS

Vision language models (VLMs) have gained traction for their facility with image recognition. Whereas organizations can employ LLMs and other models to generate images, “VLMs are for when you’re processing images or some other visual modality as input,” Allen clarified. “You could have a VLM that identifies, say, billboards on pictures or images that are taken from Google Maps, and you want to identify the billboards and extract text from them.” In some cases, VLMs have replaced traditional optical character recognition (OCR) and intelligent character recognition (ICR) for a wealth of document processing applications, including labeling, tagging, data extraction, and classification.
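
As an illustration of the input-side role Allen describes, the sketch below asks a VLM to read text out of an image, in the style of an OpenAI-compatible chat endpoint. The model name is a placeholder, and the choice of endpoint is an assumption, not something the speakers endorsed.

```python
# Hedged sketch: asking a vision language model to extract text from an
# image (e.g., a billboard photo) via an OpenAI-compatible chat endpoint.
# The model name below is a placeholder, not a recommendation.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY (or a compatible gateway) is configured

def read_text_from_image(image_path: str, model: str = "your-vlm-model") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract all legible text from this image, one item per line."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```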

Gupta cited a financial services use case in which VLMs were employed to extract information from each line of financial documents such as balance sheets, invoices, and financial statements. This approach is vital for reconciling discrepancies among documents between two or more parties. Using the models he mentioned earlier, Gupta said, “We were able to extract about 90% quite well, and it required post-processing in order to make it near 100%.” Post-processing steps typically required human intervention to make manual corrections. However, that input then served as a means of improving the learning capacity of the underlying model, which produced better outputs when similar situations arose. The human effort served as a “few-shot learning example,” Gupta said.
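
A rough sketch of the human-in-the-loop pattern Gupta outlines: corrected extractions are stored and replayed as few-shot examples in later prompts. The prompt format, field names, and JSONL store are illustrative assumptions, not a specific product’s workflow.

```python
# Sketch of a few-shot feedback loop: human corrections to extracted line
# items are saved, then prepended to future prompts as worked examples.
# The fields, file store, and prompt wording are illustrative assumptions.
import json

FEWSHOT_STORE = "corrections.jsonl"

def record_correction(raw_line: str, corrected: dict) -> None:
    """Persist a human-corrected extraction for reuse as a few-shot example."""
    with open(FEWSHOT_STORE, "a", encoding="utf-8") as f:
        f.write(json.dumps({"input": raw_line, "output": corrected}) + "\n")

def build_prompt(new_line: str, max_examples: int = 5) -> str:
    examples = []
    try:
        with open(FEWSHOT_STORE, encoding="utf-8") as f:
            examples = [json.loads(line) for line in f][-max_examples:]
    except FileNotFoundError:
        pass  # no corrections yet; fall back to zero-shot
    shots = "\n".join(
        f"Input: {ex['input']}\nOutput: {json.dumps(ex['output'])}" for ex in examples
    )
    return (
        "Extract amount, date, and counterparty as JSON from each line.\n"
        f"{shots}\nInput: {new_line}\nOutput:"
    )
```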
