«Journal of Phonetics (2002) 30, 139–162 doi:10.1006/jpho.2002.0176 Available online at on The phonetics of phonological ...»
While transcription is too coarse a means of encoding the phonetic details of an utterance, these instrumental studies have yet to explore how these gradient errors fit into the broader pattern of speech production. It is possible that other muscles were also gradiently activated in a coordinated fashion during these errors. Patterns of muscular coordination in speech errors would provide evidence of organization at the level of the feature or segment in speech production. If physically independent articulations for a segment are found to coordinate in gradient errors, such as lip rounding and tongue blade raising for /ʃ/, then the most plausible explanation would be the partial activation of a higher-order segmental planning unit. It is also possible that the behavior of other muscles in speech errors is independent. The available articulatory evidence to date demonstrates that gradient speech errors can and do frequently occur at the sub-segmental level, but these data do not rule out a role for higher levels of organization in speech production as a constraint on speech errors.
It is also worth noting that deﬁning an error as unexpected muscle activity provides no allowance for what might be normal variation, by assuming that the speaker’s intention excludes such activity. These articulatory studies have found many of these gradient errors to be imperceptible, even upon repeated listening with high-quality equipment. It may be that these gestures have no signiﬁcant acoustic eﬀect. A possibility that should be considered is that one goal when talking is to convey a distinct, contrastive message to an observer. Thus, random variation that is noncontrastive may not be monitored as strictly and for this reason such productions might not be considered erroneous by the producer or detected as erroneous by the monitoring mechanisms in the production system of the producer.
144 S. A. Frisch and R. Wright
1.4. Overview of the study: acoustic analysis of speech errors
Phonetic transcription provides a coarse coding of speech, filtered through the perceptual system of the listener. EMG analysis of a single muscle provides detailed information about activity levels of an individual articulator but little information about the coordination of articulators or the acoustic effects of the articulation. We propose to analyze experimentally elicited speech errors using acoustic measures.
Acoustic analysis can examine both contrastive and sub-contrastive dimensions together, providing an intermediate level of analysis between transcription and EMG. In addition, acoustic analysis may reveal systematic biases in the perception of speech errors. Acoustic analysis has the further beneﬁt that larger numbers of experimental subjects can be studied. The following section contains a description of the data and analysis techniques used in our acoustic analysis of phonological speech errors.
2.1. Data
The data for our analysis come from a recorded corpus of utterances from a speech error experiment that induced errors using tongue twisters (Frisch, 1996, Chapter 9).
In the experiment, 21 participants each produced 6 repetitions of 88 diﬀerent tongue twisters. The participants were monolingual American English speaking undergraduate students at Northwestern University who were paid for their participation.
The experiment was self-paced and the speech rate of the participants was not controlled. The tongue twisters were printed individually in large type on index cards. The participants read each tongue twister aloud three times, and then repeated the tongue twister from memory three times. If they forgot the correct words, participants were allowed to consult the index card between repetitions during the repeat-from-memory portion. Each participant had a break after half of the experiment was completed in which they were engaged in normal conversation for about 5 min. The experiment lasted about 45 min in total. The entire session was recorded on audiocassette using a Marantz portable cassette recorder. The participants wore an electret condenser microphone attached to their shirt in the upper chest area.
The original experiment was a psycholinguistic study designed to induce word onset consonant errors for 22 diﬀerent consonant pairs. Each consonant pair was used in four diﬀerent tongue twisters. Each tongue twister consisted of four monosyllabic words. We examined the /s/-/z/ tongue twisters from the experiment, given in (3). Two of the tongue twisters created target errors that were existing words (3a, c), and two created nonword target errors (3b, d).
(3) a. sit zap zoo sip
    b. sung zone Zeus seem
    c. zit sap sue zip
    d. zig suck sank zilch
The words in the tongue twister and their error targets (if they were words) were balanced for lexical frequency within each tongue twister. Thus, lexical frequency effects should not differentially influence the relative error rate between the consonants, from intended /s/ to [z] or intended /z/ to [s] in this particular case.¹
The acoustic analysis focuses on the single consonant pair, /s/-/z/. Data from nine participants in the original experiment were analyzed. They were the first nine that participated in the experiment. Twisters containing /s/-/z/ were selected for this study as the first author found some of the tokens in these twisters to be perceptually ambiguous when scoring the original experiment. It was assumed that these stimuli had good potential for finding a variety of error types and examining the distribution of both categorical and gradient speech errors. The choice of the first nine participants in the experiment was arbitrary. Five of these participants were male and four were female. These participants produced a total of 448 /s/ tokens and 446 /z/ tokens. The number of productions of each segment in each word by each participant is not necessarily equal, as the participants sometimes repeated individual words or the entire tongue twister after producing an error. All productions were measured, including repetitions.
It should be noted that the use of nonsense tongue twisters to generate speech errors may introduce systematic patterns not observed in naturally occurring speech, and thus the results of this paper may not generalize to normal speech production.
Shattuck-Hufnagel (1992) compared perceived error rates in nonsense tongue twisters with those in spontaneously generated utterances using the same words.
There was a diﬀerence in the number of errors observed in the two conditions, but the relative patterns of errors were the same for both conditions. Since quantitative but not qualitative diﬀerences were found, it appears that speech in tongue twisters does not diﬀer in important ways from spontaneously generated utterances. Note that Shattuck-Hufnagel’s (1992) study used transcribed speech error evidence. It may be that tongue twisters enhance the likelihood of observing gradient speech errors, given that they involve an unusually high level of repetition and alternation.
However, we assume that this entails quantitative rather than qualitative diﬀerences in the action of the production mechanism, and that the same phonological encoding and phonetic implementation is involved in producing both tongue twisters and natural speech.
2.2. Measurements
The onset /s/ or /z/ of each word was measured in a variety of ways, with the overall goal of finding quantitative differences between /s/ and /z/ along acoustic dimensions that can be plausibly associated with independent aspects of the articulation. In describing the data acoustically, tokens are categorized as intended /s/ or /z/ based on the word the speaker was supposed to say if the tongue twister was to be produced correctly. This was assumed to be the speaker’s intended utterance. Both perceptually normal and erroneous tokens are included in the figures and tables that follow.
For convenience, we borrow the slash and bracket notation of phonology to diﬀerentiate the intended production from the recorded utterance. The ‘‘underlying’’ /s/ and /z/ represent the intended productions of the participants, and the ‘‘surface’’ [s] and [z] (or other transcriptions in square brackets) represent the actual output.
Three measures are reported in this paper. These measures differentiate /s/ and /z/ tokens produced by our talkers and have articulatory bases that are sufficiently physiologically independent for our analysis. In addition to measuring each token, we transcribed each token based on repeated careful listening to digitized recordings of the individual repetitions of the tongue twisters. In most cases, the result was a clear /s/ or /z/ percept. However, there were several tokens of /z/ that were transcribed as devoiced.
The twister sung zone Zeus seem contained a sequence of coda /s/ followed by onset /s/. In 40 of the 55 repetitions of this twister, the participants did not produce two distinct /s/ segments (though in most cases there were two distinct amplitude peaks). These ‘‘geminate’’ /s/ tokens were not included in the analysis as they could not be unambiguously measured. In addition, there were 10 instances of ‘‘concatenation’’ errors where the participant produced a long [sz] or [zs] sequence, three instances where [W] was produced in place of [z], and three instances that were perceptually bizarre and untranscribable. We discuss these tokens in Section 3.3, and they are excluded from our analysis in order to provide the clearest possible picture of normal /s/ and /z/ production by our participants. Given these criteria, data for 397 /s/ productions and 435 /z/ productions are reported in this section.
The ﬁrst measure is the duration of the fricative. We considered the duration of the fricative to be the total duration of fricative noise, including any measurable overlap of the frication noise with the formant structure of the following vowel and the preceding vowel or sonorant, if any. In general, this overlap was common for /z/ but lacking for /s/. We determined whether or not fricative noise was present by examining the waveform for the fuzziness of aperiodic noise and a spectrogram for evidence of broadband noise at high frequencies. The expected pattern was for /s/ to be longer than /z/ (Klatt, 1976), which was true across all nine talkers in our study.
Fig. 1 shows means and error bars representing one standard deviation for each participant for our three measures of /s/ and /z/. Average duration is given in the top panel of Fig. 1, with the average for /s/ shown by the shaded bar, and the average for /z/ shown by the clear bar.
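The per-participant summaries plotted in Fig. 1 (a mean with error bars of one standard deviation, separately for intended /s/ and /z/) can be sketched as below. This is only an illustration: the function name `summarize_by_participant`, the tuple layout, and the sample values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_participant(tokens):
    """Return {(participant, intended): (mean, sd)} for one measure.

    `tokens` is an iterable of (participant, intended, value) tuples,
    where `intended` is "s" or "z" (the target consonant, not the percept).
    """
    groups = defaultdict(list)
    for participant, intended, value in tokens:
        groups[(participant, intended)].append(value)

    summary = {}
    for key, values in groups.items():
        # Assumption: sample standard deviation; a single token gets sd 0.0.
        sd = stdev(values) if len(values) > 1 else 0.0
        summary[key] = (mean(values), sd)
    return summary


# Made-up durations in seconds, not values from the study.
example_tokens = [
    ("P1", "s", 0.18), ("P1", "s", 0.20), ("P1", "s", 0.22),
    ("P1", "z", 0.10), ("P1", "z", 0.12),
]
for key, (avg, sd) in sorted(summarize_by_participant(example_tokens).items()):
    print(key, round(avg, 3), round(sd, 3))
```

Grouping by intended category rather than by percept mirrors the convention stated above, under which tokens are classified by the word the speaker was supposed to say.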
The second measure is the amplitude of frication noise. Amplitude of frication noise was calculated based on the RMS amplitude of the signal after highpass ﬁltering at 2 kHz. For each fricative, RMS amplitude was computed for 10 ms windows across the entire fricative. The window containing the peak RMS amplitude was identiﬁed, and the amplitude of frication was deﬁned to be the RMS amplitude of the fricative in a 50 ms window around the peak.
This generally corresponded to the middle of the fricative segment. The expected pattern was for /s/ to have more frication noise than /z/ (Strevens, 1960; Pickett, 1980), which was true across all nine participants in our study (see Fig. 1, middle panel).
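The frication amplitude procedure described above (highpass filter at 2 kHz, scan in 10 ms windows for the RMS peak, then take the RMS over a 50 ms window around that peak) can be sketched as follows. Several details are assumptions rather than the authors' method: the text does not name the filter, so a simple first-order highpass stands in for it; the 10 ms windows are taken as non-overlapping; the 50 ms window is centered on the peak window and clipped at the fricative edges; and all function names are hypothetical.

```python
import math


def highpass_one_pole(signal, sample_rate, cutoff_hz=2000.0):
    """First-order RC highpass; an assumed stand-in for the unspecified filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    filtered, prev_in, prev_out = [], 0.0, 0.0
    for x in signal:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        filtered.append(prev_out)
    return filtered


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def frication_amplitude(fricative, sample_rate, scan_ms=10.0, measure_ms=50.0):
    """Return (rms_amplitude, peak_time_s) for one segmented fricative.

    `fricative` holds only the samples of the fricative itself.
    """
    scan = int(round(sample_rate * scan_ms / 1000.0))
    if scan < 1 or len(fricative) < scan:
        raise ValueError("fricative is shorter than one scan window")
    filtered = highpass_one_pole(fricative, sample_rate)

    # Assumption: non-overlapping 10 ms windows; keep the one with peak RMS.
    starts = range(0, len(filtered) - scan + 1, scan)
    peak_start = max(starts, key=lambda s: rms(filtered[s:s + scan]))
    center = peak_start + scan // 2

    # 50 ms window centered on the peak, clipped to the fricative boundaries.
    half = max(1, int(round(sample_rate * measure_ms / 1000.0)) // 2)
    lo = max(0, center - half)
    hi = min(len(filtered), center + half)
    return rms(filtered[lo:hi]), center / sample_rate
```

Because the measurement window follows the loudest stretch of the fricative instead of a fixed position, a /z/ with a devoiced middle portion is measured where its frication is strongest.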
The ﬁnal measure is the percent of the fricative that contained voicing. We used the waveform to determine when voicing was present, and considered portions of the signal that had evidence of clear periodicity as voiced. The percent of voicing was the fraction of the total duration that contained voicing. In most cases, there was clear evidence of a glottal closure phase that indicated the presence of voicing.
In some cases there was very breathy voicing resulting in a sinusoidal waveform overlaid with frication noise. Breathy voicing was considered voicing until the point where the pressure valleys marking each glottal cycle were too obscured by frication noise to be reliably identified. The expected pattern was for /z/ to have a larger percent of voicing than /s/, which was true across all nine talkers in our study (see Fig. 1, bottom panel).
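Once the voiced stretches have been marked by hand on the waveform, the percent voicing measure reduces to simple interval arithmetic. The sketch below assumes the annotations are available as (start, end) times in seconds; the function name `percent_voiced` and the example times are hypothetical and are not taken from the study.

```python
def percent_voiced(fric_start, fric_end, voiced_intervals):
    """Percent of the fricative's total duration that is voiced.

    `voiced_intervals` is a list of (start, end) times in seconds. Intervals
    are clipped to the fricative, and overlapping stretches count only once.
    """
    total = fric_end - fric_start
    if total <= 0:
        raise ValueError("fricative must have positive duration")

    clipped = sorted(
        (max(start, fric_start), min(end, fric_end))
        for start, end in voiced_intervals
    )
    voiced = 0.0
    cursor = fric_start  # end of the voiced time already counted
    for start, end in clipped:
        start = max(start, cursor)
        if end > start:
            voiced += end - start
            cursor = end
    return 100.0 * voiced / total


# A 100 ms /z/ voiced for its first 30 ms and last 20 ms, devoiced in
# between, comes out at about 50 percent voiced.
print(percent_voiced(0.0, 0.100, [(0.0, 0.030), (0.080, 0.100)]))
```

Accepting several voiced intervals per token accommodates fricatives in which voicing stops and then resumes before the vowel onset.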
The measurement of duration, frication amplitude, and percent voicing for a sample case of /z/ is shown in Fig. 2. This is a production of zone where fricative noise overlaps with the coda nasal of the previous word (sung). The fricative is devoiced in the middle, but voicing begins again before the vowel onset. This pattern is relatively common for /z/ (Haggard, 1978; Smith, 1997). The top panel in Fig. 2 shows the original signal, with brackets indicating the overall length of frication and the portions that are voiced. The bottom panel shows the signal ﬁltered at 2 kHz, which has removed most of the inﬂuence of periodicity, and the 50 ms window around the peak over which the RMS amplitude of frication was computed.
Figure 2. Example measurement of /z/ in zone showing waveform and highpass filtered waveform. Relevant regions for measuring duration, frication amplitude, and percent voicing are indicated by braces.
While our goal was to consider measurements that are articulatorily independent, it is clear that duration, amplitude of frication, and percent voicing are not unrelated aerodynamically (Ohala, 1983). During voicing, amplitude of frication is necessarily reduced, as the closed phases of the glottal cycle stop the airﬂow that is crucial for maintaining frication. However, many speakers produce /z/ with a voiceless portion in the middle of the fricative. During this devoiced portion, airﬂow is not constrained by glottal closure. Our measure of frication amplitude over the loudest 50 ms window would allow the frication amplitude of /z/ to be determined by the devoiced portion, if there is one. In addition, talkers also actively increase the pulmonic airﬂow in /s/ to produce a noisier fricative (Shadle, 1985). In an error articulation, the oral and glottal gestures for /s/ may be in place, but if this is not combined with increased pulmonic airﬂow, the resulting frication amplitude will be less than normal for /s/.