Abstract: | Cross‐situational statistical learning of words involves tracking co‐occurrences of auditory words and objects across time to infer word‐referent mappings. Previous research has demonstrated that learners can infer referents across sets of very phonologically distinct words (e.g., WUG, DAX), but it remains unknown whether learners can encode fine phonological differences during cross‐situational statistical learning. This study examined learners’ cross‐situational statistical learning of minimal pairs that differed on one consonant segment (e.g., BON–TON), minimal pairs that differed on one vowel segment (e.g., DEET–DIT), and non‐minimal pairs that differed on two or three segments (e.g., BON–DEET). Learners performed above chance for all pairs, but performed worse on vowel minimal pairs than on consonant minimal pairs or non‐minimal pairs. These findings demonstrate that learners can encode fine phonetic detail while tracking word‐referent co‐occurrence probabilities, but they suggest that phonological encoding may be weaker for vowels than for consonants. |