Pronunciation in the Refold method develops organically through massive input, with targeted practice layered on top in the speaking phases. You don't need to drill pronunciation from day one — in fact, too much early focus on production can cement bad habits before your ears are trained.
When you grow up speaking a language, your brain learns to sort all the sounds you hear into a fixed set of categories — like sorting colors into bins. Sounds that are slightly different but fall into the same bin all get treated as identical.
This is efficient for your native language, but it becomes a problem when learning a new one: your brain "snaps" unfamiliar sounds to the closest category it already knows. This is why English speakers pronounce the final vowel in "sombrero" as an English "oh" sound rather than the pure Spanish "o" — they're snapping to the closest vowel their brain recognizes.
Hundreds of hours of listening help your brain start to recognize new categories instead of forcing everything through your native-language filter. If you try to produce sounds before your brain can distinguish them, you'll just produce the closest native-language equivalent, and then practice that mistake over and over.
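The snapping idea can be pictured as a nearest-neighbor lookup. The sketch below is only a toy illustration, not a model of real perception: the two-dimensional "vowel space", the coordinates, and the names `ENGLISH_BINS` and `snap_to_category` are all assumptions made up for this example.

```python
from math import dist

# Toy "vowel space": each category is a point (jaw openness, tongue backness).
# The numbers are made-up illustrations, not acoustic measurements.
ENGLISH_BINS = {
    "ee": (0.1, 0.1),
    "ah": (0.9, 0.6),
    "oh": (0.4, 0.9),
    "oo": (0.1, 0.9),
}

def snap_to_category(sound, categories):
    """Return the label of the category closest to the heard sound."""
    return min(categories, key=lambda label: dist(sound, categories[label]))

spanish_o = (0.45, 0.85)  # hypothetical position of the pure Spanish "o"

# Before ear training: the sound lands in the nearest English bin.
before = snap_to_category(spanish_o, ENGLISH_BINS)

# After lots of listening: the listener has a separate bin for the new sound.
trained_bins = {**ENGLISH_BINS, "spanish o": spanish_o}
after = snap_to_category(spanish_o, trained_bins)

print(before)  # oh
print(after)   # spanish o
```

The only thing that changes between the two calls is the set of available categories, which is the point of the "ears first" approach: the incoming sound is the same, but a trained ear has somewhere new to put it.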
A little technical knowledge about how sounds are produced can dramatically speed up your pronunciation practice. You don't need to memorize the International Phonetic Alphabet (IPA), but knowing the basics helps you understand what to do with your mouth when a sound isn't coming out right.
Vowels are open, unobstructed sounds that exist on a spectrum — like colors. They're defined by three factors: how open your jaw is, how forward or back your tongue is, and how rounded your lips are. Learning a new vowel means adjusting these dimensions to hit a different point than what your mouth is used to. 
A useful exercise is to change one dimension at a time. For example, say an "ee" sound, then round your lips while keeping your tongue in place: the result is the vowel written "ü" in German and "u" in French.
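If it helps to see the three dimensions written down, here is a minimal sketch that treats a vowel as a bundle of three settings and changes exactly one of them. The `Vowel` class, its field names, and the coarse labels ("close", "front") are assumptions chosen for illustration; real vowels sit on a continuous spectrum.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Vowel:
    jaw: str       # how open the jaw is: "close", "mid", or "open"
    tongue: str    # tongue position: "front", "central", or "back"
    rounded: bool  # whether the lips are rounded

def changed_dimensions(a, b):
    """List the dimensions on which two vowels differ."""
    return [name for name in ("jaw", "tongue", "rounded")
            if getattr(a, name) != getattr(b, name)]

# English "ee": jaw nearly closed, tongue forward, lips spread.
ee = Vowel(jaw="close", tongue="front", rounded=False)

# Keep the jaw and tongue where they are and round the lips only.
u_umlaut = replace(ee, rounded=True)

print(changed_dimensions(ee, u_umlaut))  # ['rounded']
```

Thinking this way turns "this vowel sounds wrong" into a smaller question: which one of the three settings is off?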
Consonants are sounds where airflow is blocked or restricted. They're more concrete than vowels — you can usually tell more clearly if you're producing one correctly. They're defined by where the blockage happens (lips, teeth, roof of mouth, throat), how it happens (full stop, friction, trill), and whether your voice box is buzzing. 
A simple test: put your hand on your throat and compare "s" (no buzz) with "z" (buzz). Knowing whether a sound should be voiced or unvoiced can instantly fix certain errors.
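The same three-part description can be sketched in code. This is an illustrative toy, not a phonetics library: the `Consonant` type, the small `CONSONANTS` table, and the helper `differs_only_in_voicing` are hypothetical names, and the place and manner labels are informal.

```python
from typing import NamedTuple

class Consonant(NamedTuple):
    place: str    # where the airflow is blocked or narrowed
    manner: str   # how it is blocked: "full stop", "friction", "trill"
    voiced: bool  # True if the voice box buzzes

# A tiny hand-written table covering a few English consonants.
CONSONANTS = {
    "s": Consonant("ridge behind the teeth", "friction", False),
    "z": Consonant("ridge behind the teeth", "friction", True),
    "p": Consonant("lips", "full stop", False),
    "b": Consonant("lips", "full stop", True),
}

def differs_only_in_voicing(a, b):
    """True if two consonants share place and manner but not voicing."""
    x, y = CONSONANTS[a], CONSONANTS[b]
    return x.place == y.place and x.manner == y.manner and x.voiced != y.voiced

print(differs_only_in_voicing("s", "z"))  # True
print(differs_only_in_voicing("s", "b"))  # False
```

Pairs like "s"/"z" and "p"/"b" share everything except the buzz, which is why switching your voice on or off is sometimes the whole fix.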
There is a LOT more to sounds and pronunciation, but a little knowledge goes a long way.
Learning a new sound is a lot like learning to whistle — you need some knowledge of what your mouth should do, plus consistent daily practice, and eventually it clicks.
The progression goes: learn how the sound is produced, practice it in isolation, try it in words, then work on using it in fluid speech. Being able to produce a sound correctly once doesn't mean you can use it naturally mid-sentence — that takes time.
The "ears first" approach is grounded in models of L2 speech perception. Best & Tyler (2007) proposed the Perceptual Assimilation Model for L2 learners (PAM-L2), which explains how adults perceive unfamiliar sounds by assimilating them to the closest native-language categories: the perceptual snapping described above. Flege & Bohn (2021) presented the revised Speech Learning Model (SLM-r), an update of Flege's earlier Speech Learning Model, which predicts that L2 sounds similar to native categories are the hardest to learn, precisely because learners keep assimilating them rather than forming new categories. Both models tie accurate production closely to accurate perception, although SLM-r describes the two as developing together rather than strictly one after the other.
Research on perceptual training supports this. Sakai & Moorman (2018) conducted a meta-analysis of perception training studies, including High Variability Phonetic Training (HVPT), in which listeners learn to distinguish new L2 sound contrasts through intensive exposure to varied speakers. They found that these perceptual gains transfer to production, even without explicit pronunciation practice.
The practical guidance on vowels and consonants reflects standard articulatory phonetics as described in Ladefoged & Johnson (2014). A small amount of metalinguistic knowledge helps learners make targeted adjustments rather than relying on trial and error alone.