The Multimodal World of Enterprise AI

VIDEO DATA

Many of the same use cases for image data can be applied to video data with multimodal models. Models can classify and label videos based on their content. Other techniques identify specific actions and the points in time at which they occur in a video. These capabilities are useful for employee training and for annotating meetings between parties.

According to Allen, the govtech space is one of the largest users of video processing technologies. One of the most helpful ways in which multimodal models can assist this vertical is with their capacity for redaction.

“They want to identify all the minors in an image or video so they can blur their faces,” Allen commented. “There’s a lot of bodycam footage; you may want to identify minors or people holding a weapon. This can speed up all sorts of things, like Freedom of Information Act requests.” Prior to the adoption of video processing models, such redaction efforts were largely manual. Organizations can greatly increase the throughput and efficacy of those traditional processes by employing today’s multimodal models. “It was done by officers who went through the images frame by frame,” Allen said. “There are tools they use, but it’s still slow. So, AI’s helping with that.”
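The redaction step itself is simple once a model has flagged the regions to hide. The following is a minimal sketch, not the tooling Allen describes: it assumes a detector (not shown, and hypothetical here) has already returned bounding boxes per frame, and it represents each frame as a plain grid of grayscale values. The names `pixelate_region` and `redact_frames` are illustrative only.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]          # grayscale frame: rows of 0-255 intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def pixelate_region(frame: Frame, box: Box, block: int = 2) -> Frame:
    """Return a copy of `frame` with `box` replaced by block-averaged pixels."""
    top, left, bottom, right = box
    out = [row[:] for row in frame]  # copy so the input frame is not modified
    for y0 in range(top, bottom, block):
        for x0 in range(left, right, block):
            ys = range(y0, min(y0 + block, bottom))
            xs = range(x0, min(x0 + block, right))
            # Average the block, then overwrite every pixel in it with the mean.
            mean = sum(frame[y][x] for y in ys for x in xs) // (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def redact_frames(
    frames: List[Frame],
    detections: Dict[int, List[Box]],  # assumed detector output: frame index -> boxes
    block: int = 2,
) -> List[Frame]:
    """Pixelate every flagged box in every frame; unflagged frames are copied as-is."""
    redacted = []
    for index, frame in enumerate(frames):
        current = [row[:] for row in frame]
        for box in detections.get(index, []):
            current = pixelate_region(current, box, block)
        redacted.append(current)
    return redacted
```

In this sketch the detector does the slow part that officers once did frame by frame; the pixelation is mechanical. A real pipeline would use a video library and a larger block size, and would still need human review of the flagged regions.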

COMBINING MODALITIES

The ultimate expression of multimodal models may be in combining their modalities to support a single application.

As Capobianco mentioned, in some use cases models serve as the interface between humans and sophisticated IT systems for network architecture, cloud architecture, and more. Other applications are more everyday, though still helpful to the enterprise, such as using the visual capabilities of LLMs to produce images from text (or speech, when supported).

However, Capobianco noted, “The reverse is really exciting. On a napkin, I drew a four-node topology, and I described the interfaces. I took a picture of the napkin and sent it to Gemini 3. It gave me not close, not OK, but perfect configurations for the four devices.” The nodes were a pair of routers and a pair of switches, for which Capobianco described the ports and their connections. This use case combined a modest amount of text with a single image and produced network configurations that were usable as generated.

What’s truly impressive is that this same multimodal approach works for configuring resources in the cloud, a task that is painstaking and time-consuming to do correctly. Certain low-code application builders combine such models, these modalities, and any requisite data (including data from legacy systems) to devise new applications or modernize existing ones. Another compelling use case is a video game Capobianco created with Gemini 3 Pro and Gemini CLI that simulates a drone exploring the contents of Google Maps.

“From that prompt, it used a multimodal model to generate the drone asset, and it wrote all the code,” Capobianco recalled. “I asked it to test the code using a real browser, and take screenshots as you go, and make adjustments to the code based on your usage of the code in the browser. It went back to the CLI, and it changed 200–300 lines of code based on what it saw in the browser.”

EVEN BIGGER

There are few limitations to the capabilities of multimodal models. However, their greatest achievement may be in combining modalities (and models) so that humans can interface directly with back-end systems.

“What will be even bigger is VibeOps, where I say, ‘Please deploy this app into AWS’ and press enter, and the agent builds the app and deploys it to the cloud,” Capobianco predicted.

“Not using Python, not using a REST API, not using a GUI. I’m using my own words to test, secure, or deploy infrastructure.” As multimodal applications multiply and new possibilities are explored, moving beyond text for enterprise AI is becoming not only a reality but also an important future direction.
