Multimodal CoT prompt samples
Here are several concrete Multimodal CoT prompt samples you can plug into a model that supports images, audio/video, and files. Each one makes the perception → semantics → analysis flow explicit.
Marketing performance prompts
- Instagram vs TikTok campaign
“You are a multimodal marketing analyst. I will provide:
- 1 product image from the campaign,
- 1 spreadsheet with Instagram and TikTok performance metrics,
- 1 short video (or audio) customer review,
- and my question.
Question: ‘Why did our Instagram campaign perform better than our TikTok one?’
Follow this reasoning process step by step:
1. Perception:
   - Describe what you see in the campaign image (colors, composition, emotions).
   - Summarize key numeric trends in the spreadsheet (CTR, CPC, conversions).
   - Extract the main sentiments and phrases from the review.
2. Semantic understanding:
   - Explain what the visual style suggests about brand mood and audience fit.
   - Interpret what the metrics say about engagement quality for each platform.
   - Interpret how the review reflects user expectations or reactions.
3. Analytical reasoning:
   - Connect visuals, sentiments, and metrics to explain why Instagram outperformed TikTok.
   - State at least two evidence-backed hypotheses.
4. Synthesis:
   - Give a concise conclusion and 3 concrete recommendations for the next campaign.”
- Emotional drivers of engagement
“I am uploading:
- Several ad creatives (images),
- A spreadsheet with click-through and conversion rates,
- A CSV of social comments.
Task: ‘Which visual elements triggered the strongest emotional reaction, and how did they influence engagement?’
Think in a Multimodal Chain-of-Thought:
- Perceptual level: Describe recurring visual patterns (faces vs no faces, warm vs cool colors, text density).
- Semantic level: Map these patterns to likely emotions (trust, excitement, confusion) and summarize common themes in comments.
- Analytical level: Link specific visual patterns and emotional themes to the metrics in the spreadsheet.
End with:
- A short explanation of which visual/emotional combinations worked best,
- 2 suggested design changes for the next A/B test.”
Design and UX prompts
- Homepage UX analysis
“I am providing:
- A screenshot of my homepage,
- A heatmap image from an eye-tracking test,
- A short feedback transcript from 5 users.
Question: ‘How do layout, colors, and copy influence user attention and conversion potential on this page?’
Reason step by step:
1. Perception:
   - Describe the visual hierarchy of the homepage (what stands out first, second, third).
   - Interpret the heatmap (where users look most/least).
   - Summarize the key phrases and complaints from user feedback.
2. Semantic understanding:
   - Explain what this hierarchy suggests about users’ mental path (what they think the page is about).
   - Interpret what the heatmap and comments imply about confusion or friction.
3. Analytical reasoning:
   - Connect visuals + attention patterns + feedback into a narrative of why users do or do not convert.
   - Propose 2–3 specific design changes that could increase clarity and conversion, and justify each using the multimodal evidence.”
- Comparing two designs
“I will upload two versions of a landing page (Design A and Design B), along with:
- A spreadsheet of A/B test results,
- Short user interview notes.
Prompt: ‘Compare Designs A and B using Multimodal Chain-of-Thought.
Follow this structure:
- Perception: Describe the main layout, color, typography, and imagery differences between A and B.
- Semantics: Explain what each design “communicates” in terms of trust, urgency, and clarity.
- Analytics: Use the A/B metrics and interview notes to reason about which elements likely caused the performance difference.
Conclude with:
- Which design is more effective and why,
- One “hybrid” version combining the strongest elements from both designs.’”
Product analytics / usability prompts
- Usability testing synthesis
“You are a multimodal UX researcher. I am giving you:
- Short video clips from usability tests,
- A spreadsheet with task completion times and error rates,
- A text log of user comments.
Goal: ‘Identify the top 3 usability issues and how they affect behavior.’
Reason using Multimodal CoT:
- Step 1 (Perceptual): Describe what you observe in the video clips (where users hesitate, where they hover, facial expressions).
- Step 2 (Semantic): Interpret these behaviors in terms of confusion, frustration, or satisfaction, and connect them to specific UI elements.
- Step 3 (Analytical): Cross-reference these observations with the metrics and comments to pinpoint the most impactful usability issues.
Finally, list the 3 key issues with:
- Evidence from video + metrics + comments,
- A concrete fix for each.”
Education / explanation prompts
- Teaching with a chart and video
“I will provide:
- An infographic or chart image,
- A short video explanation of the same topic,
- A text question.
Question: ‘Explain this concept as if teaching a university student who is new to the topic.’
Use Multimodal Chain-of-Thought:
1. Perception:
   - Describe the axes, legends, colors, and main shapes in the chart.
   - Summarize the key verbal points from the video.
2. Semantic understanding:
   - Explain what each key part of the chart represents in words.
   - Map the narrative from the video onto the visual elements.
3. Analytical reasoning:
   - Combine both to give a step-by-step explanation of the underlying concept and the trend shown.
End with:
- A simple verbal summary,
- One analogy or metaphor to deepen understanding.”
- Student self-check prompt
“I am uploading a graph from my textbook and my own written explanation of it.
Prompt: ‘Act as a multimodal tutor.
- First, interpret the graph step by step (perception → semantics → analysis).
- Then read my explanation and compare it to your own.
- Point out where I am correct, where I am missing details, and where I am mistaken.
- Finally, rewrite my explanation in a more accurate but still student-friendly way.’”
General Multimodal CoT template
- Generic template you can reuse
“Act as a Multimodal Chain-of-Thought reasoning assistant. I will provide one or more of the following: images, audio/video clips, spreadsheets, and text.
For any question I ask, follow this fixed reasoning protocol:
- Perception: Describe the key elements you detect in each modality (visual patterns, sounds, emotions, numbers, text themes).
- Semantic understanding: Explain what these elements mean in context for the task.
- Analytical reasoning: Connect patterns across all modalities to answer the question step by step.
- Synthesis: Produce a concise final answer, followed by 2–3 practical recommendations.
Wait for my inputs, then start the Perception step explicitly.”
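If you reuse the template often, it can help to keep the four stages as data and render the prompt text from them. The following is only a minimal sketch in plain Python: `Stage`, `PROTOCOL`, and `build_prompt` are hypothetical names made up for this illustration, the wording is condensed from the template above, and nothing here calls a model or depends on any particular vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    instruction: str


# The four stages of the generic template, in order (wording condensed).
PROTOCOL = (
    Stage("Perception", "Describe the key elements you detect in each modality."),
    Stage("Semantic understanding", "Explain what these elements mean in context for the task."),
    Stage("Analytical reasoning", "Connect patterns across all modalities to answer the question step by step."),
    Stage("Synthesis", "Produce a concise final answer, followed by 2-3 practical recommendations."),
)


def build_prompt(inputs, question, stages=PROTOCOL):
    """Render a Multimodal CoT prompt as plain text (hypothetical helper)."""
    lines = ["Act as a Multimodal Chain-of-Thought reasoning assistant. I am providing:"]
    # One bullet per attached input (image, spreadsheet, transcript, ...).
    lines += [f"- {item}" for item in inputs]
    lines.append(f"Question: '{question}'")
    lines.append("Follow this fixed reasoning protocol:")
    # Number the stages so the model can label each step in its answer.
    lines += [f"{n}. {s.name}: {s.instruction}" for n, s in enumerate(stages, start=1)]
    lines.append(f"Start the {stages[0].name} step explicitly.")
    return "\n".join(lines)


prompt = build_prompt(
    inputs=["a screenshot of my homepage", "a heatmap image from an eye-tracking test"],
    question="How do layout, colors, and copy influence user attention?",
)
print(prompt)
```

The rendered string is what you would paste or send alongside the actual files; attaching the images, spreadsheets, or clips themselves depends on the tool you use and is left out of this sketch.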
You can adapt these to your own research by swapping “campaign” for “experiment,” “CTR” for whatever metric you care about, and tightening or loosening the reasoning protocol depending on how heavy you want the CoT to be.
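As a rough illustration of that kind of adaptation, the sketch below swaps domain terms in a prompt and trims the list of stages. Both `swap_terms` and `loosen` are hypothetical helpers written for this example, and the choice of "accuracy" as the replacement metric is just an assumption.

```python
import re


def swap_terms(prompt, swaps):
    """Replace whole-word terms in one pass so swaps cannot chain into each other."""
    if not swaps:
        return prompt
    # Longest keys first so a longer term wins over a shorter one at the same position.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(0)], prompt)


def loosen(stages, keep):
    """Keep only the named stages, preserving their original order."""
    return [s for s in stages if s in keep]


base = "Summarize key numeric trends for the campaign (CTR, CPC, conversions)."
print(swap_terms(base, {"campaign": "experiment", "CTR": "accuracy"}))

stages = ["Perception", "Semantic understanding", "Analytical reasoning", "Synthesis"]
print(loosen(stages, keep={"Perception", "Synthesis"}))
```

Matching on word boundaries keeps a swap for "campaign" from touching words such as "campaigns", so check plural forms separately if your prompt uses them.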