The Multimodal World of Enterprise AI

Multimodal applications of AI are gaining traction throughout the enterprise at an increasingly rapid rate, and for good reason. First, they come close to realizing the full potential of foundation models and large language models (LLMs), which, by default, are multimodal.

Although commonly deployed for textual use cases, several of these models are just as well-versed in image and video applications and, to a lesser extent, audio ones.

Second, the possibilities of multimodal applications transcend the ability to simply “process different types of inputs beyond text,” commented Michael Allen, Laserfiche CTO. Today’s multimodal use cases don’t just involve video or audio inputs. They combine these modalities, and any number of others, in a single application, such as analyzing a patient’s (or customer’s) emotions to optimize treatment or sales interactions with them.

When you consider that many such multimodal use cases can be implemented with just a single LLM or foundation model, the cost benefits and efficiency of multimodal AI become even more attractive. However, its grandest promise may be the distinct possibility that, when carefully applied, multimodal AI may “transform the way we as humans interact with technology, broadly speaking,” said John Capobianco, head of AI and developer relations at Itential. By providing pliant, multifaceted interfaces through which people can engage with IT systems that rely on multimodal AI on the back end, these technologies may very well accomplish that transformation.

MODEL SELECTION

As is the case with most contemporary statistical AI applications, organizations can primarily access multimodal models in two ways. The first is via startups or specialist vendors concentrating on one particular modality, such as audio. The second is through “the big cloud providers,” Allen mentioned. “They have models that can be used for image processing, video processing, and audio processing—including OpenAI.”

Other LLM providers, including Anthropic, can also give organizations a single model that accommodates the foregoing modalities Allen described.

Nonetheless, these models don’t perform equally well for different modalities or the range of use cases for which organizations employ them. According to Abhishek Gupta, head of data science at Talentica, “When one competitor starts to see it’s lagging behind, they work harder to get better.” Here are some of the pros and cons of multimodal models for the two most popular modalities, text and static images:

  • OpenAI: As one of the more popular LLM providers, OpenAI is considered a credible choice for general-purpose, multitask models. For a document extraction use case, “We used GPT-4o from OpenAI,” Gupta said. “It provides both vision and text, and for vision it performs much better on cost compared to GPT-4o mini, which is more optimized for text and costs more for images.” (A brief sketch of this single-request pattern appears below.)
  • Google: There are several iterations of Google’s popular Gemini models for multimodal use cases. When contrasting OpenAI’s GPT-5.1 with Gemini 2.5, “The response time was much better for Gemini,” Gupta observed. “It was able to extract each page in less than 10–15 seconds, whereas GPT-5.1 took more time.”
  • Anthropic: Claude models such as Claude Sonnet are currently the most widely used multimodal models from Anthropic, which also provides agents powered by those models. “People understand that Claude Code and Gemini can write code,” Capobianco acknowledged. “I don’t think people understand that it can then test that code in a real browser, like a real user would, using multimodality.”

This capability has significant consequences for application development, testing, and deployment, as well as for how networks are architected and deployed.
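
To make the single-model pattern Gupta describes more concrete, here is a minimal sketch of a document extraction call that sends one page image and a text instruction to a multimodal model in a single request. It assumes OpenAI’s Python SDK; the file name, prompt wording, and extracted fields are illustrative placeholders rather than details from the use case above.

# A hypothetical sketch, not the exact setup described by Gupta: one
# multimodal request that pairs a scanned page with a text instruction.
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# "invoice_page_1.png" is a placeholder file name for a scanned document page.
with open("invoice_page_1.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # one multimodal model handles both the image and the text
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the vendor name, invoice number, and "
                            "total amount from this page as JSON.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{page_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)

Swapping in another provider’s multimodal model mainly means changing the client library, the model identifier, and the exact content-part format; the overall pattern of one request carrying both text and an image stays the same.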
