Gemma 3: Understanding Multimodality in AI

Google for DevelopersSeptember 17, 20256 min5,946 views

20 connections·21 entities in this video→

Capture as you watch

Save any video to veridive in one click.

The free veridive Chrome extension pulls the transcript from any YouTube video or podcast you're watching — ready to ask, cite, and connect.

What is Multimodality in AI?

💡 Multimodality in AI refers to a system's ability to understand and integrate information from multiple data types, such as text, images, and short videos, simultaneously.
🧠 Humans naturally process information from various sources, like diagrams and text, to gain understanding; AI aims to replicate this integrated processing.

Gemma 3's Multimodal Capabilities

🚀 Gemma 3, including its 4B, 12B, and 27B parameter models, possesses both vision and language capabilities.
🖼️ It can analyze images to understand content, describe them, answer questions, identify objects, and extract text.
🎬 Gemma 3 can also interpret short videos (up to a few minutes), enabling analysis of instructional clips, advertisements, or social media content.
💬 The model excels at understanding and generating high-quality text, crucial for providing context and rich outputs.
📄 Leveraging long context capabilities, Gemma 3 can engage in multi-turn conversations about documents, screenshots, and other extended content.

Applications of Gemma 3's Multimodal Skills

📚 As an interactive textbook assistant, it can explain diagrams, answer questions about highlighted content, and summarize figures.
🏛️ In museums and galleries, it can provide information on artists, themes, and translate inscriptions.
🌍 For language learners, Gemma 3 aids vocabulary building and cultural understanding by identifying objects and describing scenes in up to 140 languages.
🌿 Nature enthusiasts can use it to identify species and translate information about local flora and fauna.
💻 Developers can use Gemma 3 to generate alt text for images, improving accessibility and SEO, or assist game developers in designing quests based on visual concepts.

Underlying Technology

🔬 A powerful vision encoder allows Gemma 3 to process images and convert them into a format the language model can use, handling high resolution and non-square images.
🌐 The combination of multilingual and multimodal capabilities stems from its strong tokenizer and joint multimodal multilingual training, learning from diverse data across many languages.
🛠️ As open models, Gemma 3 allows developers and researchers to build upon and fine-tune it for specific tasks, fostering innovation.

Ask, don't scrub

Discover the spoken web.

veridive answers questions with exact timestamps and citations — across every podcast, video, and article you've saved.

Knowledge graph21 entities · 20 connections

How they connect

An interactive map of every person, idea, and reference from this conversation. Hover to trace connections, click to explore.

Hover · drag to explore

21 entities

Chapters3 moments

Key Moments

Transcript22 segments

Full Transcript

Follow the thread

Find every place these ideas show up.

veridive maps the same people, claims, and topics across thousands of sources — so you can trace an idea from one conversation to the next.

Topics14 themes

What’s Discussed

MultimodalityGemma 3Artificial IntelligenceVision EncoderMultilingual ProcessingImage AnalysisVideo AnalysisLong ContextText GenerationAccessibilitySEOOpen ModelsAI ResearchDeveloper Tools

Smart Objects21 · 20 links

Products· 3

Concepts· 12

People· 4

Company· 1

Media· 1

Hours of content, seconds to the answer.

Save what you listen to. Ask it anything. Watch the threads between sources surface on their own.

Get started free