Not all content is equally hard to understand. Many factors affect how comprehensible something is, and you can manipulate these factors to find content at the right difficulty level for your current ability.
Visual context: Can you see what's happening? TV shows and movies provide visual cues (facial expressions, gestures, settings) that help you understand even when you miss words. Podcasts and audiobooks provide none.
Narrative predictability: Is the story easy to follow? Slice-of-life dramas have simple, predictable plots. Mystery thrillers have complex, surprise-driven plots. Predictability helps comprehension.
Domain familiarity: Are you familiar with the topic from your native language? Content in a domain you know well is much easier to understand than content in an unfamiliar domain.
Prior knowledge of the story: Have you seen this before (in another language, or read a summary)? Knowing what happens makes the language much easier to process.
Audience level: Content made for children uses simpler language than content for adults, and novels for adolescents are easier than novels for adults. Content made for learners (comprehensible input) is designed to be easy.
Speech speed and clarity: News anchors speak clearly and at moderate speed. Casual conversation between friends is fast, mumbled, and full of slang.
Scripted vs. unscripted: Scripted dialogue (TV, movies) is cleaner and more predictable than unscripted speech (vlogs, podcasts, real conversation).
Dubbed vs. native: Dubbed content often uses simpler, more standard language than native content, which may contain cultural references or wordplay.
When choosing immersion content, stack the factors in your favor at first: visual context + familiar domain + scripted + moderate speed = highly comprehensible. As you improve, gradually remove supports: unfamiliar content + no visual context + fast casual speech = much harder, but you'll be ready for it.
The key insight: you can make almost any content more comprehensible by manipulating these factors. If something is too hard, find a version with more support (add subtitles, watch a recap first, choose a simpler genre). If something is too easy, remove support (turn off subtitles, try unfamiliar content).
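The stacking idea can be sketched as a simple count of supports. This is only an illustration, not a validated difficulty measure: the field names, the sample titles, and the choice to weight every factor equally are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Supports:
    """Which comprehension supports a piece of content offers (True = present).

    Field names are hypothetical labels for the factors listed above.
    """
    visual_context: bool = False
    predictable_plot: bool = False
    familiar_domain: bool = False
    known_story: bool = False
    made_for_learners_or_children: bool = False
    clear_moderate_speech: bool = False
    scripted: bool = False
    dubbed: bool = False


def support_count(s: Supports) -> int:
    """Count the supports present. Equal weighting is an assumption."""
    return sum(getattr(s, f.name) for f in fields(s))


def rank_easiest_first(catalog: dict[str, Supports]) -> list[str]:
    """Order titles from most supported to least supported."""
    return sorted(catalog, key=lambda t: support_count(catalog[t]), reverse=True)


# Made-up catalog entries, purely for illustration.
catalog = {
    "casual podcast, unfamiliar topic": Supports(),
    "scripted drama, familiar topic": Supports(
        visual_context=True,
        familiar_domain=True,
        clear_moderate_speech=True,
        scripted=True,
    ),
    "dubbed cartoon you already know": Supports(
        True, True, True, True, True, True, True, True
    ),
}

for title in rank_easiest_first(catalog):
    print(support_count(catalog[title]), title)
```

Start near the top of the ranking and move down as your comprehension improves. In practice some factors probably matter more than others for you, so treat the count as a rough guide rather than a score.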
Krashen's (1982) Input Hypothesis proposed that acquisition occurs when learners understand input containing language just beyond their current level. But what makes input land at the right level isn't just the grammar or vocabulary. It's the full set of contextual and linguistic supports surrounding it.
The role of visual context is supported by research on multimedia learning. Mayer's (2009) Cognitive Theory of Multimedia Learning holds that combining verbal information with relevant visual information improves comprehension and retention by distributing processing across two channels rather than overloading one. In L2 contexts, Peters and Webb (2018) found that learners acquired vocabulary incidentally through watching a full-length TV documentary, a result consistent with visual context helping learners infer the meaning of unknown words. This is why TV and video are recommended for earlier phases: they provide a scaffolding layer that audio-only content does not.
The distinction between scripted and unscripted speech reflects real differences in how language is produced and processed. Scripted dialogue tends to be more clearly articulated, to use more standard grammar, and to contain fewer disfluencies, while natural conversation is faster, includes false starts and overlapping turns, and uses non-standard forms. Munro and Derwing (2001) demonstrated that speaking rate significantly affects how comprehensible L2 speech is perceived to be, and their broader body of work suggests that both rate and clarity are major factors in whether listeners can successfully process what they hear. This is why the roadmap recommends progressing from scripted to unscripted content as comprehension ability grows: doing so lets learners build processing speed gradually rather than being overwhelmed.