Multimodal CoT prompt samples

 

Below are several concrete Multimodal Chain-of-Thought (CoT) prompt samples for any model that accepts images, audio/video, and files. Each one makes the perception → semantics → analysis flow explicit.

Marketing performance prompts

  1. Instagram vs TikTok campaign
    “You are a multimodal marketing analyst. I will provide:
  • 1 product image from the campaign,
  • 1 spreadsheet with Instagram and TikTok performance metrics,
  • 1 short video (or audio) customer review,
  • and my question.

Question: ‘Why did our Instagram campaign perform better than our TikTok one?’

Follow this reasoning process step by step:

  1. Perception:
    • Describe what you see in the campaign image (colors, composition, emotions).
    • Summarize key numeric trends in the spreadsheet (CTR, CPC, conversions).
    • Extract the main sentiments and phrases from the review.
  2. Semantic understanding:
    • Explain what the visual style suggests about brand mood and audience fit.
    • Interpret what the metrics say about engagement quality for each platform.
    • Interpret how the review reflects user expectations or reactions.
  3. Analytical reasoning:
    • Connect visuals, sentiments, and metrics to explain why Instagram outperformed TikTok.
    • State at least two evidence-backed hypotheses.
  4. Synthesis:
    • Give a concise conclusion and 3 concrete recommendations for the next campaign.”
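The Perception step asks the model to summarize CTR, CPC, and conversions. It can help to compute those figures yourself first, so you can check the model's summary against known values. The following is a minimal sketch using the standard definitions (CTR = clicks / impressions, CPC = spend / clicks, conversion rate = conversions / clicks). The `PlatformStats` fields and all numbers are illustrative assumptions, not data from a real campaign.

```python
from dataclasses import dataclass


@dataclass
class PlatformStats:
    """Raw totals for one platform; field names are illustrative assumptions."""
    platform: str
    impressions: int
    clicks: int
    conversions: int
    spend: float


def summarize(stats: PlatformStats) -> dict:
    """Derive CTR, CPC, and conversion rate from raw totals."""
    ctr = stats.clicks / stats.impressions if stats.impressions else 0.0
    cpc = stats.spend / stats.clicks if stats.clicks else 0.0
    cvr = stats.conversions / stats.clicks if stats.clicks else 0.0
    return {
        "platform": stats.platform,
        "ctr": round(ctr, 4),
        "cpc": round(cpc, 2),
        "conversion_rate": round(cvr, 4),
    }


def as_prompt_lines(summaries: list) -> str:
    """Format the summaries as plain text to paste next to the prompt."""
    return "\n".join(
        f"{s['platform']}: CTR {s['ctr']:.2%}, CPC {s['cpc']:.2f}, "
        f"conversion rate {s['conversion_rate']:.2%}"
        for s in summaries
    )


# Made-up numbers, purely for illustration.
rows = [
    PlatformStats("Instagram", impressions=10_000, clicks=500, conversions=50, spend=250.0),
    PlatformStats("TikTok", impressions=20_000, clicks=400, conversions=20, spend=300.0),
]
summaries = [summarize(r) for r in rows]
print(as_prompt_lines(summaries))
```

Pasting these lines alongside the spreadsheet gives you a reference point for judging whether the model read the numbers correctly before it moves on to interpretation.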

  2. Emotional drivers of engagement
    “I am uploading:
  • Several ad creatives (images),
  • A spreadsheet with click-through and conversion rates,
  • A CSV of social comments.

Task: ‘Which visual elements triggered the strongest emotional reaction, and how did they influence engagement?’

Reason using a Multimodal Chain-of-Thought:

  • Perceptual level: Describe recurring visual patterns (faces vs no faces, warm vs cool colors, text density).
  • Semantic level: Map these patterns to likely emotions (trust, excitement, confusion) and summarize common themes in comments.
  • Analytical level: Link specific visual patterns and emotional themes to the metrics in the spreadsheet.

End with:

  • A short explanation of which visual/emotional combinations worked best,
  • 2 suggested design changes for the next A/B test.”

Design and UX prompts

  1. Homepage UX analysis
    “I am providing:
  • A screenshot of my homepage,
  • A heatmap image from an eye-tracking test,
  • A short feedback transcript from 5 users.

Question: ‘How do layout, colors, and copy influence user attention and conversion potential on this page?’

Reason step by step:

  1. Perception:
    • Describe the visual hierarchy of the homepage (what stands out first, second, third).
    • Interpret the heatmap (where users look most/least).
    • Summarize the key phrases and complaints from user feedback.
  2. Semantic understanding:
    • Explain what this hierarchy suggests about users’ mental path (what they think the page is about).
    • Interpret what the heatmap and comments imply about confusion or friction.
  3. Analytical reasoning:
    • Connect visuals + attention patterns + feedback into a narrative of why users do or do not convert.
    • Propose 2–3 specific design changes that could increase clarity and conversion, and justify each using the multimodal evidence.”

  2. Comparing two designs
    “I will upload two versions of a landing page (Design A and Design B), along with:
  • A spreadsheet of A/B test results,
  • Short user interview notes.

Prompt:
‘Compare Design A and B using Multimodal Chain-of-Thought.

Follow this structure:

  • Perception: Describe the main layout, color, typography, and imagery differences between A and B.
  • Semantics: Explain what each design ‘communicates’ in terms of trust, urgency, and clarity.
  • Analytics: Use the A/B metrics and interview notes to reason about which elements likely caused the performance difference.

Conclude with:

  • Which design is more effective and why,
  • One “hybrid” version combining the strongest elements from both designs.’”
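Before asking a model to explain why one design won, it is worth checking that the A/B difference is unlikely to be noise. The sketch below uses a two-proportion z-test, which is a standard statistical check and not something the prompt itself prescribes. It assumes independent visitors and reasonably large samples; the counts are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Made-up counts: Design A converts 120 of 2400 visitors, Design B 156 of 2400.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

If the p-value is large, the Analytics step should be read as hypothesis generation only, since the designs may not differ in any measurable way.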

Product analytics / usability prompts

  1. Usability testing synthesis
    “You are a multimodal UX researcher. I am giving you:
  • Short video clips from usability tests,
  • A spreadsheet with task completion times and error rates,
  • A text log of user comments.

Goal: ‘Identify the top 3 usability issues and how they affect behavior.’

Reason using Multimodal CoT:

  • Step 1 (Perceptual): Describe what you observe in the video clips (where users hesitate, where they hover, facial expressions).
  • Step 2 (Semantic): Interpret these behaviors in terms of confusion, frustration, or satisfaction, and connect them to specific UI elements.
  • Step 3 (Analytical): Cross-reference these observations with the metrics and comments to pinpoint the most impactful usability issues.

Finally, list the 3 key issues with:

  • Evidence from video + metrics + comments,
  • A concrete fix for each.”
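For the cross-referencing in Step 3, a rough pre-ranking of tasks by completion time and error rate can tell you which video clips deserve the closest look. The sketch below is one possible heuristic: the equal weighting and the `TaskMetrics` fields are assumptions chosen for illustration, and the task data is invented. It expects a non-empty list of rows.

```python
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    """Per-task usability metrics; field names are illustrative assumptions."""
    task: str
    median_seconds: float
    error_rate: float  # fraction of attempts with at least one error


def rank_tasks(rows: list, top_n: int = 3) -> list:
    """Rank tasks by an illustrative severity score in [0, 1].

    Each metric is scaled by its maximum across tasks, then the two are
    averaged with equal weight (an arbitrary choice, not a standard).
    """
    max_time = max(r.median_seconds for r in rows) or 1.0
    max_err = max(r.error_rate for r in rows) or 1.0
    scored = [
        (r.task, round(0.5 * r.median_seconds / max_time + 0.5 * r.error_rate / max_err, 3))
        for r in rows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented example data.
rows = [
    TaskMetrics("Find pricing", median_seconds=40, error_rate=0.10),
    TaskMetrics("Create account", median_seconds=95, error_rate=0.35),
    TaskMetrics("Change password", median_seconds=60, error_rate=0.50),
    TaskMetrics("Export report", median_seconds=120, error_rate=0.20),
]
ranked = rank_tasks(rows)
for task, score in ranked:
    print(f"{task}: {score}")
```

The ranking is only a starting point; the prompt still asks the model to confirm each issue with evidence from the video and the comments.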

Education / explanation prompts

  1. Teaching with a chart and video
    “I will provide:
  • An infographic or chart image,
  • A short video explanation of the same topic,
  • A text question.

Question: ‘Explain this concept as if teaching a university student who is new to the topic.’

Use Multimodal Chain-of-Thought:

  1. Perception:
    • Describe the axes, legends, colors, and main shapes in the chart.
    • Summarize the key verbal points from the video.
  2. Semantic:
    • Explain what each key part of the chart represents in words.
    • Map the narrative from the video onto the visual elements.
  3. Analytical:
    • Combine both to give a step-by-step explanation of the underlying concept and the trend shown.

End with:

  • A simple verbal summary,
  • One analogy or metaphor to deepen understanding.”

  2. Student self-check prompt
    “I am uploading a graph from my textbook and my own written explanation of it.

Prompt:
‘Act as a multimodal tutor.

  • First, interpret the graph step by step (perception → semantics → analysis).
  • Then read my explanation and compare it to your own.
  • Point out where I am correct, where I am missing details, and where I am mistaken.
  • Finally, rewrite my explanation in a more accurate but still student-friendly way.’”

General Multimodal CoT template

  1. Generic template you can reuse
    “Act as a Multimodal Chain-of-Thought reasoning assistant. I will provide one or more of the following: images, audio/video clips, spreadsheets, and text.

For any question I ask, follow this fixed reasoning protocol:

  • Perception: Describe the key elements you detect in each modality (visual patterns, sounds, emotions, numbers, text themes).
  • Semantic understanding: Explain what these elements mean in context for the task.
  • Analytical reasoning: Connect patterns across all modalities to answer the question step by step.
  • Synthesis: Produce a concise final answer, followed by 2–3 practical recommendations.

Wait for my inputs, then start the Perception step explicitly.”
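If you reuse the template often, you can assemble it programmatically. The following is a minimal sketch in plain Python that builds the prompt text from a list of attachments, a question, and an ordered set of steps. The function name `build_prompt` and its parameters are hypothetical, and the sketch only produces text; how you attach the files depends on the model or tool you use.

```python
DEFAULT_STEPS = {
    "Perception": "Describe the key elements you detect in each modality "
                  "(visual patterns, sounds, emotions, numbers, text themes).",
    "Semantic understanding": "Explain what these elements mean in context for the task.",
    "Analytical reasoning": "Connect patterns across all modalities to answer "
                            "the question step by step.",
    "Synthesis": "Produce a concise final answer, followed by 2-3 practical recommendations.",
}


def build_prompt(inputs, question, steps=None):
    """Assemble a Multimodal CoT prompt as plain text.

    inputs:   short descriptions of the attached files, one per attachment
    question: the question the model should answer
    steps:    ordered mapping of step name -> instruction (defaults to all four)
    """
    steps = DEFAULT_STEPS if steps is None else steps
    lines = ["Act as a Multimodal Chain-of-Thought reasoning assistant.", "", "I am providing:"]
    lines += [f"  • {item}" for item in inputs]
    lines += ["", f"Question: {question}", "", "Follow this reasoning protocol:"]
    lines += [f"  {i}. {name}: {text}" for i, (name, text) in enumerate(steps.items(), start=1)]
    first_step = next(iter(steps))
    lines += ["", f"Start with the {first_step} step explicitly."]
    return "\n".join(lines)


prompt = build_prompt(
    inputs=["a screenshot of the homepage", "a heatmap image from an eye-tracking test"],
    question="How do layout, colors, and copy influence user attention?",
)
print(prompt)
```

Passing a shorter `steps` mapping is one way to lighten the protocol for simple questions.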

You can adapt these to your own research by swapping “campaign” for “experiment” and “CTR” for whatever metric you care about, and by tightening or loosening the reasoning protocol depending on how detailed you want the reasoning to be.
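The term swapping described above can be done by hand, or with a small helper if you maintain many prompt variants. This sketch replaces whole words only, so “campaign” is swapped but “campaigns” is left alone; the function name `swap_terms` and the example replacement “accuracy” are illustrative assumptions.

```python
import re


def swap_terms(prompt: str, swaps: dict) -> str:
    """Replace whole-word, case-sensitive terms in a prompt."""
    if not swaps:
        return prompt
    # Longest terms first so that a longer term wins over a shorter prefix.
    terms = sorted(swaps, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(1)], prompt)


base = "Summarize key numeric trends for the campaign (CTR, CPC, conversions)."
adapted = swap_terms(base, {"campaign": "experiment", "CTR": "accuracy"})
print(adapted)
```

Because matching is case-sensitive and whole-word, review the adapted prompt once to catch plurals or capitalized variants the helper skipped.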
