Multimodal AI
Multimodal AI is artificial intelligence that can understand and work with more than one type of input - such as text, images, and audio - rather than just one.
In this guide
What Multimodal AI means
A "mode" is a type of information: written text, a photo, a sound, a video. Older AI tools handled just one mode. Multimodal AI can take in several at once and connect them, much closer to how people perceive the world.
For example, you could show a multimodal AI a photo of your fridge and ask, in text, "what can I cook with this?" It understands the image and your written question together, then replies with recipe ideas.
Why Multimodal AI matters
Multimodal AI greatly expands what you can build, because real tasks rarely involve text alone. Knowing it is possible helps you design more useful tools.
Frequently asked questions
More AI terms
Ready to build the AI skills your future depends on?
Take the free 5-minute quiz and get a personalized learning plan built around your goals, schedule, and experience.