Magnitude Estimation of disfluency by stutterers and nonstutterers.

Melanie Russell,1 Martin Corley,2 and Robin Lickley1

Speech and Language

“Until stuttering can be identified qualitatively, we have no way of knowing what it is we have studied. Empirical evidence is needed to determine the best appropriate measures” (Perkins, 1995). The technique of Magnitude Estimation promises to be an extremely useful way of accessing fine judgements about the severity of disfluency in speech. This method was developed by psychophysicists to make the best use of participants’ ability to make fine judgements about physical stimuli, and has since been used in a number of linguistic acceptability tasks (Bard et al., 1996; Keller, 2000). Participants are instructed to assign any number to a given stimulus (the Modulus), and rate the following stimuli proportionately. This can be compared to traditional ‘Likert Scale’ measures, where participants are asked to assign a number on a discrete scale (often 1-7). The disadvantage of such interval scaling is that there is no way of knowing in advance if people’s sensitivities to the data provided are limited to a seven-way distinction any more than to a four-way one (Bard et al., 1996). In contrast, in Magnitude Estimation, raters’ responses are unconstrained; categorical judgements can be revealed rather than imposed. This method has been demonstrated to result in robust but fine distinctions. In previous research on stuttering, it has been argued that Magnitude Estimation has greater construct validity than other methods (Schiavetti et al., 1983). Experience with internet studies using Magnitude Estimation (e.g., Keller & Alexopoulou, 2001) demonstrates that it can be used consistently by untrained readers and listeners.


1.Speech Corpora

All stimuli used in the experiment were unedited samples of spontaneous speech taken from task-oriented dialogues. The HCRC Map Task Corpus (Anderson et al., 1991) was used as a model. In the map task, both speakers have a similar map and one speaker (instruction giver) has a route marked on their map, which they have to describe to the other (follower). Discrepancies between the two maps provide occasions for discussion and negotiation. The HCRC Map Task Corpus has proven to be a rich source of disfluent speech in nonstutterers, both as instruction giver and as follower (Branigan, Lickley and McKelvie, 1999; Lickley, 2001).

To provide natural samples of speech by stutterers, two dialogues involving two pairs of speakers who stutter were recorded. The stuttering speakers were recruited with the help of a local speech and language therapist and a self-help group in Edinburgh.

Recordings took place in a quiet studio, with speakers sitting at tables facing each other about 5 metres apart, their maps raised on easels at an angle so that neither participant’s map was visible to the other. Speakers were fitted with clip-on microphones and recorded onto separate channels on digital audio tape and SVHS video tapes.

Nonstuttering control stimuli came from two sources. The first source was the speech of two speakers from the HCRC corpus itself, which involved speakers with Scottish accents and was recorded in very similar conditions to the new corpus. These two speakers provided matches for the stimuli produced by the two Scottish stuttering speakers. Since the other two stuttering speakers were not Scottish speakers, nonstuttering speakers with very similar accents were recruited to record another dialogue, so as to counter any biasing effects of regional accent in the experiment.

The HCRC Map Task Corpus has full transcriptions and disfluency annotation time-aligned with the digitised speech signal. The new dialogues were transcribed and annotated for disfluency using signal processing software on Unix workstations.

Disfluency annotation was performed with reference to the HCRC disfluency coding manual (Lickley, 1998), which was adapted to include disfluencies associated with stuttering (multiple repetitions, prolongations and blocks). The same software was used to excise the experimental stimuli from the dialogues into separate files.

2.Stimulus selection

For the purposes of the current study, we attempted to match the stimuli produced by stutterers with similar stimuli produced by nonstutterers. This strategy meant that the type of disfluency we could use in stimuli was restricted to a small subset of the types of disfluency that are produced by people who stutter: single repetitions of part words, rather than multiple repetitions. While they are a common characteristic of the speech of people who stutter, multiple repetitions are somewhat rare in the speech of nonstutterers. In the HCRC Map Task Corpus (described in Anderson et al., 1991), we find nearly 2000 disfluent repetitions, only 161 of which consist of more than one repetition and only 19 of more than two. Of these, only 1 is a part-word repetition, consisting of progressively shorter repetitions of the onset of a 3 syllable word (undernea- under- und- un- no underneath).

Perceptual studies on non-stuttered speech using non-stuttering listeners suggest that minor disfluencies such as single part-word repetitions are harder to detect and more often missed altogether by listeners than other types of disfluencies (Bard and Lickley, 1998): non-stutterers, at least, appear to find such disfluencies unobtrusive.

Restricting the stimuli in our study to this type of disfluency has a bearing on our interpretation of the results. If stutterers are more sensitive even to such minor disruptions than are nonstutterers, this will serve to emphasise their over-sensitivity and support the notion that their acceptability threshold for errors is significantly higher. In addition, if we find that listeners judge these minor disfluencies differently for stutterers and nonstutterers, we will have evidence that contradicts the continuity hypothesis, suggesting that there is a qualitative difference even between the “normal” disfluencies for the two sets of speakers.


A total of 64 stimuli were selected from the corpora described above so as to include sets of 32 disfluent and 32 fluent stimuli. Half of these came from the 4 stuttering speakers and the other half from 4 nonstutterers. All the disfluent stimuli contained single repetitions of word onsets. Each stimulus produced by a stutterer was matched as closely as possible with a stimulus from a nonstutterer with the same regional accent. Disfluent stimuli were matched for phonetic content of the repeated segment wherever possible (e.g. “that s-section” was matched with “going s-straight up”).

Fluent stimuli were matched for their lexical and syntactic content, as far as possible (e.g. “then you go up” was matched with “then you go straight up”). However, finding precisely matched controls from a small corpus of spontaneous speech is virtually impossible. Where such a precise match was not possible, the most liberal criterion used was that speech segments should be of equivalent length. No patterns likely to bias experimental outcomes could be detected in the less precisely-matched stimuli.

One stimulus, a disfluent item produced by a nonstutterer, was selected as modulus, and headed each of 3 blocks of 21 other stimuli. Apart from this stimulus, the items were presented in different random orders for each subject.
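The block structure described above can be illustrated with a minimal sketch. It assumes the 63 non-modulus stimuli are shuffled once per subject and then split evenly across the 3 blocks; the function name `presentation_order`, the stimulus labels, and the seed are hypothetical and are not taken from the original Psyscope script.

```python
import random


def presentation_order(modulus, others, n_blocks=3, seed=None):
    """Return one list of trials per block.

    Each block starts with the modulus and continues with an equal
    share of the randomly ordered remaining stimuli.
    """
    if len(others) % n_blocks != 0:
        raise ValueError("stimuli do not divide evenly into blocks")
    rng = random.Random(seed)
    shuffled = list(others)
    rng.shuffle(shuffled)  # a fresh random order for each subject
    size = len(shuffled) // n_blocks
    return [
        [modulus] + shuffled[i * size:(i + 1) * size]
        for i in range(n_blocks)
    ]


# Hypothetical labels: 1 modulus + 63 other items = 64 stimuli in total.
others = [f"stim{i:02d}" for i in range(1, 64)]
blocks = presentation_order("modulus", others, seed=1)
```

Under these assumptions each subject hears 3 blocks of 22 trials, with the modulus always in first position.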

4.Subjects

Subjects in the listening experiment consisted of 16 nonstutterers (9 female, 7 male) and 6 stutterers (1 female, 5 male), with an age range of 20-45. None reported having hearing deficits. None had previous experience of the task of giving fluency judgments.

5.Procedure

The experiment was carried out using Psyscope (Cohen et al., 1993) on an Apple Macintosh computer. Stimuli were played over headphones to subjects seated in sound-proofed listening booths.

Instructions were presented on the computer screen in several short sections. Subjects were told that their task was to give a numerical response which matched their perception of the severity of speech disfluency for each segment of speech that they heard. They were asked to rate more disfluent segments with higher numbers and less disfluent segments with lower numbers and to relate their judgments to their score for the modulus segment. They were encouraged not to base their ratings on anything other than fluency (e.g. speaker accent, grammaticality) and to respond as quickly as possible. Subjects responded by typing their responses on a computer keyboard. The presentation of stimuli was self-paced: a new stimulus was played when the subject hit the “return” key on the keyboard.

The experiment was preceded by a practice session to familiarise the subjects with the magnitude estimation task. The practice session consisted of judgments of tone duration, rather than line length, which is the measure usually used in magnitude estimation, in order to maintain the auditory aspect of the experiment.

Following the practice session, subjects performed the experiment without interruption, typically completing the task in about 15 minutes. Responses, consisting of typed numbers corresponding to the three repetitions of the Modulus, together with 63 other comparative ratings, were recorded in data files generated by Psyscope.


Each participant’s ratings were divided by the value they had given to the modulus stimulus, to make the scores comparable. Since the ratings were ratios (“how much more or less fluent than the modulus”) they were then log-transformed. A transformed rating of zero thus indicated that the participant had judged a stimulus to be equivalently fluent to the modulus; scores greater than zero indicated increased disfluency, and scores less than zero indicated that the stimulus had been rated as relatively fluent.
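The transformation described above can be sketched as follows. The sketch assumes natural logarithms (the text does not state the base) and strictly positive ratings; the function name `transform_ratings` and the example numbers are hypothetical.

```python
import math


def transform_ratings(ratings, modulus_rating):
    """Divide each rating by the modulus rating, then log-transform.

    Assumes natural logs; ratings and modulus must be positive.
    """
    if modulus_rating <= 0 or any(r <= 0 for r in ratings):
        raise ValueError("magnitude estimates must be positive")
    return [math.log(r / modulus_rating) for r in ratings]


# Hypothetical participant who gave the modulus a rating of 50.
scores = transform_ratings([100, 50, 25], modulus_rating=50)
# 100 is twice the modulus (score above zero), 50 equals it (score of
# zero), and 25 is half of it (score below zero).
```

Because the scale is logarithmic, a stimulus rated twice as disfluent as the modulus and one rated half as disfluent lie equally far from zero, on opposite sides.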

The analysis of the transformed scores was, however, made more difficult by a design flaw in the study. Participants rated the modulus three times, but no attention was drawn by the experimenters to the fact that the two repetitions should be given the initial modulus rating. This lack of ‘anchoring’ resulted in an appreciable drift in participants’ scoring throughout the experiment; of 22 participants in total, only 5 gave the modulus item the same score on all three occasions. In other words, the results from 17 participants introduced additional, non-systematic, error variance into the study (and because the modulus ratings did not appear to change in predictable ways, there is no obvious way to compensate for this). The analysis by participants reflects these problems, and will not be reported here. However, because the experimental stimuli were randomised, each stimulus had an equal chance of occurring early in the experiment (before the onset of drift). This means that the error variance due to drift should be approximately equally partitioned across items, and a by-items analysis can be used to give a clearer picture of the outcome of the experiment.
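A check for this kind of drift can be sketched as below, assuming each participant's three modulus ratings are available as a list. The participant labels and ratings are invented for illustration and are not the study's data.

```python
def modulus_is_stable(modulus_ratings):
    """True if every presentation of the modulus received the same score."""
    return len(set(modulus_ratings)) == 1


# Hypothetical data: three modulus ratings per participant.
modulus_by_participant = {
    "p01": [50, 50, 50],
    "p02": [50, 60, 40],
    "p03": [10, 10, 12],
}

stable = [p for p, r in modulus_by_participant.items() if modulus_is_stable(r)]
drifted = [p for p in modulus_by_participant if p not in stable]
```

In this invented example only one of the three participants anchors consistently; applied to the study's data, the same check would correspond to the 5 stable and 17 drifting participants reported above.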

The analysis reported here included the (matched) stimuli as a random factor, and explored the effects of rater (with or without stutter), speaker (with or without stutter), and type of utterance (fluent or disfluent) as within-item factors. All means reported are of log-transformed adjusted ratings.
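As a rough illustration of the data layout behind a by-items analysis, the sketch below averages transformed ratings over raters within each combination of item, rater group, speaker group, and utterance type. The record format, the item labels, and the scores are assumptions made for the example; the sketch stops at the cell means and does not perform the analysis of variance itself.

```python
from collections import defaultdict
from statistics import mean


def cell_means(records):
    """Average scores within each (item, rater, speaker, utterance) cell.

    `records` is an iterable of
    (item, rater_group, speaker_group, utterance_type, score) tuples.
    """
    cells = defaultdict(list)
    for item, rater, speaker, utterance, score in records:
        cells[(item, rater, speaker, utterance)].append(score)
    return {cell: mean(scores) for cell, scores in cells.items()}


# Hypothetical records: two stuttering raters and one nonstuttering
# rater judging the same disfluent stimulus from a stuttering speaker.
records = [
    ("item01", "stutterer", "stutterer", "disfluent", 0.5),
    ("item01", "stutterer", "stutterer", "disfluent", 1.5),
    ("item01", "nonstutterer", "stutterer", "disfluent", 0.25),
]
means = cell_means(records)
```

Each item then contributes one mean per cell, so rater group, speaker group, and utterance type can all be treated as within-item factors.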

Note that we can consider the stimuli used in this experiment to be a subset of the infinite population of comparable disfluencies. Thus a by-items analysis does not fall subject to the criticism of Raaijmakers, Schrijnemakers and Gremmen (1999).

Only two of the variables had independent effects: unsurprisingly, disfluent utterances were judged to be more disfluent than fluent utterances (0.10 vs. -0.57; F(1,15) = 153.17, p < .001); and speakers with stutters were rated slightly less fluent overall (0.13 vs. -0.34; F(1,15) = 7.29, p = 0.003). There was no independent effect of rater (that is, raters appeared to use similar ranges of scores, whether or not they had stutters themselves). Interestingly, there was no interaction between speaker and utterance type, suggesting that disfluent or fluent utterances from speakers with stutters were perceived equivalently to similar utterances from nonstuttering speakers; the interaction between speaker and rater, and the three-way interaction, also failed to reach significance.

However, the interaction between rater and utterance type did reach significance (F(1,15) = 23.41, p < 0.001). As can be seen from figure 1, this reflects the fact that raters with stutters differentiated more between disfluent and fluent utterances than did raters without stutters, suggesting that people with stutters discriminate more sensitively between fluent and disfluent speech. We return to this point in the discussion.

9.Discussion

It is widely agreed that despite the inclusiveness of the label, people who are described as, or describe themselves as, stutterers often display very different symptoms and coping strategies. In this context, results from a small-scale study such as that reported here need to be treated with caution: it is too early to make any claims about a single cause of stuttering. However, taken together with the studies reported by Vasic and Wijnen, the findings from the present study converge to implicate the self-monitor in stuttering. In a direct test of sensitivity to disfluency, stutterers were found to differentiate more between disfluent and fluent speech than nonstutterers, regardless of whether that speech had been originally uttered by someone considered to have a stutter or someone who was a nonstutterer. This evidence is consistent with one interpretation of Vasic and Wijnen's hypothesis. It would be premature however to conclude that people who stutter do not rate fluent speech as worse; given the small numbers of participants, comparisons of absolute ratings between groups must be treated with caution. However, the evidence clearly indicates a difference in relative ratings, consistent with either version of the hypothesis; further, we can assume that since participants were explicitly instructed to rate the recordings for fluency, the focus and cognitive effort devoted to the task were maximised, and have little role to play in the outcome.

