Speech data acquisition: the underestimated challenge

Oliver Niebuhr, Alexis Michaud


Preprint version of paper to appear in KALIPHO - Kieler Arbeiten zur Linguistik und Phonetik, vol.. 2014.

halshs-01026295v1

Kieler Arbeiten in Linguistik und Phonetik (KALIPHO)

Skandinavistik, Frisistik und Allgemeine Sprachwissenschaft (ISFAS), Christian-Albrechts-Universität zu Kiel 2

Oliver Niebuhr
Analyse gesprochener Sprache, Allgemeine Sprachwissenschaft, Christian-Albrechts-Universität zu Kiel
niebuhr@isfas.uni-kiel.de

Alexis Michaud
International Research Institute MICA, Hanoi University of Science and Technology, CNRS, Grenoble INP, Vietnam
Langues et Civilisations à Tradition Orale, CNRS/Sorbonne Nouvelle, France
michaud.cnrs@gmail.com

The second half of the 20th century was the dawn of information technology, and we now live in the digital age. Experimental studies of prosody develop at a fast pace, in the context of an “explosion of evidence” (Janet Pierrehumbert, Speech Prosody 2010, Chicago). The ease with which anyone can now make recordings should not veil the complexity of the data collection process, however. This article aims at sensitizing students and scientists from the various fields of speech and language research to the fact that speech-data acquisition is an underestimated challenge. Eliciting data that reflect the communicative processes at play in language requires special precautions in devising experimental procedures and a fundamental understanding of both ends of the elicitation process: the speaker and the recording facilities. The article compiles basic information on each of these requirements and recapitulates some pieces of practical advice, drawing many examples from prosody studies, a field where the thoughtful conception of experimental protocols is especially crucial.

1. Introduction: Speech Data Acquisition as an Underestimated Challenge

The second half of the 20th century was the dawn of information technology, and we now live in the digital age. This results in an “explosion of evidence” (Janet Pierrehumbert, Speech Prosody 2010, Chicago), offering tremendous opportunities for the analysis of spoken language. Phoneticians, linguists, speech therapists, speech technology specialists, anthropologists, and other researchers routinely record speech data the world over. There remains no technological obstacle to collecting speech data on all languages and dialects, and to sharing these data over the Internet. The ease and the speed with which recordings can now be conducted and shared should not veil the complexity of the data collection process, however.

Phonetics “calls on the methods of physiology, for speech is the product of mechanisms which are basically there to ensure survival of the human being; on the methods of physics, since the means by which speech is transmitted is acoustic in nature; on methods of psychology, as the acoustic speech-stream is received and processed by the auditory and neural systems; and on methods of linguistics, because the vocal message is made up of signs which belong to the codes of language” (Marchal 2009:ix). In addition to developing at least basic skills in physiology, physics, linguistics, and psychology, each of which has complexities of its own, people conducting phonetic research are expected to have a good understanding of statistical data treatment, combined with a command of one or more specific exploratory techniques, such as endoscopy, ultrasonography, palatography, aerodynamic measurements, motion tracking, electromagnetic articulography, or electroencephalography (for a description of the many components of a multisensor platform see Vaissière et al. 2010). As a result, it tends to be difficult to maintain a link between the phonetic sciences and fields of the humanities that are highly relevant for phonetic studies, and in particular for the study of prosody. Phoneticians’ training does not necessarily include disciplines that would develop their awareness of the complexity and versatility of language, such as translation studies, languages, literature and stylistics, historical phonology, and sociolinguistics/ethnolinguistics. Moreover, the increasing use of digital and instrumental techniques in phonetic research is, taken by itself, a welcome development. But more and more phoneticians neglect explicit and intensive ear training, forgetting that an attentive, trained ear is the key to observations and hypotheses and hence the prerequisite for any analysis by digital and instrumental techniques. 
For example, we do not think that successful research on prosody can be done without the ability to produce and identify the prosodic patterns that one would like to analyse. As Barbosa (2012:33) puts it: “The observation of a prosodic fact is never naïve, because formal instruction is necessary to see and to select what is relevant”.

In summary, advances in phonetic technologies impose many challenges on modern phoneticians, and they tend to replace rather than complement traditional skills.

This has a direct bearing on data collection procedures. To a philologist studying written documents, it is clear that every detail potentially affects interpretation and analysis (the complexities of Greek and Latin texts are perfect examples; see, e.g., Probert 2009; Burkard 2014). Carrying the same standards into the field of speech data collection, it goes without saying that every speaker is unique, that no two recording situations are fully identical, and that human subjects participating in the experiments are no “vending machines” that deliver the desired speech signals once the experimenter has paid and pressed a button. An experience of linguistic fieldwork, or of immersion learning of a foreign language, entails similar benefits in terms of awareness of the central importance of communicative intention (see in particular Bühler 1934, passim; Culioli 1995:15; Barnlund 2008), and of the wealth of expressive possibilities and redundant encoding strategies open to the speaker at every instant (as emphasized, e.g., by Fónagy 2001). Researchers working on language and speech are no “signal hunters”, but hunt for functions and meanings¹ as reflected in the speech signal, which itself is only one of the dimensions of expression, together with gestures and facial expressions. The definition of tasks, their contextualization, and the selection of speakers are at the heart of the research process.

The diversification of the phonetic sciences is likely to continue, together with technological advances; the literature within each subfield is set to become more and more extensive, making it increasingly impractical for an individual to develop all the skills that would be useful as part of a phonetician’s background. This results in modular approaches, as against a holistic approach to communication. What is at stake is no less than a cumulative approach to research. The quality of data collection is inseparable from the validity and depth of research results; and data sharing is indispensable to allow the community to evaluate the research results and build on them for further studies.

Against this background, the present article is primarily intended for an audience of advanced students of phonetics. However, it is hoped that it can also serve as a source of information for phonetic experts and researchers who have a basic understanding of phonetics but work in other linguistic disciplines, including speech technology. The present article summarizes some basic facts, methods, and problems concerning the three pillars of speech data acquisition: the speaker (§2), the task (§3), and the recording (§4). The discussion of these central topics builds on our own experiences in the field and in the lab. Together, the sections aim to convey to the reader in what sense data acquisition is an underestimated challenge. Readers who are pressed for time may want to jump straight to the Summary in section 5, which provides tips and recommendations on how to meet the demands of specific research questions and achieve results of lasting value for the scientific community.

Given its aim, our article is both more comprehensive and more introductory than other methodologically oriented papers such as those by Mosel (2006), Himmelmann (2006), Ito and Speer (2006), Xu (2011), Barbosa (2012), and Sun and Fletcher (2014), which are all highly recommended as further reading. Most readers are likely to know much if not most of what will be said. Different readers obviously have different degrees of prior familiarity with experimental phonetics; apologies are offered to any reader for whom nothing here is new.

¹ The two terms ‘meaning’ and ‘function’ tend not to be clearly separated in the literature – including in the present article, in which we simply use both terms in combination. In the long run, a thorough methodological discussion should address the issue of the detailed characterization of ‘meaning’ and ‘function’. To venture a working definition, meanings refer to concrete or abstract entities or pieces of information that exist independently of the communication process and are encoded into phonetic signs. Functions, on the other hand, are conveyed by phonetic patterns that are attached to these phonetic signs; they refer to the rules and procedures of speech communication. If meanings are the driving force of speech communication, then functions are the control force of speech communication.


2. The speaker

2.1 Physiological, social, and cognitive factors

Individual voices differ from one another. Physiological differences are part of what Laver (1994:27–28) refers to as the “organic level”; they are extralinguistic, but are nevertheless of great importance to analyzing and interpreting speech data. Age and body size are perfect examples of this (cf. Schötz 2006), affecting, among other things, F0, speaking rate (or duration) and spectral characteristics such as formant frequencies.

Physiological variables are intertwined with social variables. For instance, there are physiological and anatomical differences between the male and female speech production apparatus, which lend female speakers a higher and breathier voice as well as higher formant values and basically allow them to conduct more distinct articulatory movements than their male counterparts within the same time window (Sundberg 1979; Titze 1989; Simpson 2009, 2012). So, “if we randomly pick out a group of male and female speakers of a language, we can expect to find several differences in their speech” (Simpson 2009:637).

However, Simpson (2009) also stresses in his summarizing paper that gender differences in speech do not merely have a biophysical origin. Some differences are also due to learned, i.e. socially evoked, behaviour, and the dividing line between these two sources of gender-related variation cannot always be easily determined. The social phenomenon of “doing gender” is well documented; it is an object of attention on the part of speakers themselves, and ‘metalinguistic’ awareness of gender differences in speech is widespread, particularly with respect to grammar and lexicon (cf. Anderwald 2014). Gender-related phonetic differences are less well documented. The frequent cross-linguistic finding that women speak more slowly and more clearly than men is probably at least to some degree attributable to “doing gender” (cf. Simpson 2009). Further, more well-defined differences between the speech of men and women are documented by Haas (1944) for Koasati, a Native American language. Sometimes women have exclusive mastery of certain speaking styles: mastering whispered speech, including the realization of tonal contrasts without voicing, used to be part of Thai women’s traditional education (Abramson 1972). In languages where the differences are less codified, they are nonetheless present: Ambrazaitis (2005) found gender differences in the realization of terminal F0 falls at the ends of utterances in German and – more recently – also in English and Swedish (see also Peters 1999:63). Compared with male speakers, female speakers prefer pseudo-terminal falls that end in a deceleration and a slight, short rise at a relatively low intensity level (Ambrazaitis 2005). This pseudo-terminal fall reduces the assertiveness/finality of the statement, as compared with a terminal fall. In extreme cases, this pattern might be mistaken for an actual falling-rising utterance-final intonation pattern, which has a different communicative function. Phonetically, the difference is not considerable: a rise on the order of 2 to 4 semitones for the pseudo-terminal fall, as against 6 semitones for a falling-rising utterance-final pattern.
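To make the size of these rises concrete, the following sketch converts between F0 values in Hz and intervals in semitones, using the standard logarithmic definition (12 semitones per doubling of frequency). It is an illustration only and not part of the original study: the function names and the 180 Hz starting value are hypothetical, chosen merely to show how close the two patterns are in absolute terms.

```python
import math


def semitone_interval(f_start_hz: float, f_end_hz: float) -> float:
    """Interval from f_start_hz to f_end_hz in semitones (positive = rise)."""
    if f_start_hz <= 0 or f_end_hz <= 0:
        raise ValueError("F0 values must be positive")
    # One octave (a doubling of frequency) corresponds to 12 semitones.
    return 12.0 * math.log2(f_end_hz / f_start_hz)


def hz_after_rise(f_start_hz: float, semitones: float) -> float:
    """F0 value reached after rising by the given number of semitones."""
    if f_start_hz <= 0:
        raise ValueError("F0 values must be positive")
    return f_start_hz * 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    # Hypothetical F0 minimum at the end of the fall (an assumption,
    # not a value reported in the text).
    low_point_hz = 180.0
    # Rise sizes mentioned in the text: 2 to 4 semitones vs. 6 semitones.
    for rise_st in (2, 4, 6):
        end_hz = hz_after_rise(low_point_hz, rise_st)
        print(f"{rise_st} st rise from {low_point_hz:.0f} Hz ends at {end_hz:.1f} Hz")
```

Under these assumptions, a final rise starting at 180 Hz would end at roughly 202 Hz (2 semitones), 227 Hz (4 semitones) or 255 Hz (6 semitones). Expressing the rise in semitones rather than Hz keeps the comparison meaningful across speakers with different F0 ranges.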

Another socially-related phenomenon is the so-called ‘phonetic entrainment’ or ‘phonetic accommodation’. That is, when two speakers are engaged in a dialogue, they become phonetically more similar to each other, particularly when the interaction is cooperative and/or when the two dialogue partners are congenial with each other (cf.

