In the first part of the paper, we summarize the linguistic factors that shape speech timing patterns, including the prosodic structures which govern them, and suggest that speech timing patterns are used to aid utterance recognition. In the spirit of optimal control theory, we propose that recognition requirements are balanced against requirements such as rate of speech and style, as well as movement costs, to yield (near-)optimal planned surface timing patterns; additional factors may influence the implementation of that plan. In the second part of the paper, we discuss theories of timing control in models of speech production and motor control. We present three types of evidence that support models of speech production that involve extrinsic timing. These include (i) increasing variability with increases in interval duration, (ii) evidence that speakers refer to and plan surface durations, and (iii) independent timing of movement onsets and offsets.
Timing is an integral part of every aspect of speech production: individual movements of the rib cage, oral articulators and laryngeal structures; their coordinated motor activity and the speech sounds they produce. Understanding speech production therefore requires understanding timing: what it is used for, and how it is controlled. In this paper, we first review our current understanding of what speakers use timing for, and how this understanding was acquired by researchers, and then we focus on two different views of how timing is controlled: with and without an extrinsic timekeeping mechanism. We then present evidence that seems to require an extrinsic timekeeping mechanism. Space prevents us from detailing the methods involved in measuring timing, but see Turk et al.  for measurement methods based on acoustic landmarks  and Perkell et al.  for a method based on landmarks in movement traces.
2. What is speech timing used for?
The traditional way of determining what speakers use timing for is to conduct controlled experiments in which a factor of interest is systematically varied, keeping other factors constant. For example, in experiments testing whether vowel type has a systematic effect on duration, different vowels can be embedded in a constant carrier phrase, e.g. Say dad again versus Say did again. Such experiments have shown systematic differences between different speech sounds (e.g. ), which are therefore hypothesized to have a characteristic ‘intrinsic’ duration . Analogously, experiments that vary higher-level prosodic structure have shown systematic effects of prominence and constituent boundaries on duration. For example, a comparison of dad in Say DAD again versus in SAY dad again shows that DAD is systematically longer when phrasally prominent (see  for a review). Moreover, depending on how the speaker chooses to prosodically produce a syntactic string, words before major constituent boundaries are often systematically longer than constituent-medial words, e.g. cousin is longer in Mary GEORGE's cousin] [baked the cake, where it is at the end of a phrase, when compared with cousin in Mary's cousin GEORGE] [baked the cake, where it is medial.
Experiments conducted from the 1950s through the 1980s established a long list of factors that appeared to affect speech timing. These include
(i) vowel and consonant type
(ii) contextual factors, e.g.
— prominence (word stress, phrasal stress)
— adjacent segment type
(iii) global factors, e.g.
— speech rate
— speech style (e.g. clear versus relaxed)
([7–10] inter alia; for reviews, see [4,6,11]). In addition, there are many other possible factors not yet integrated into current models that may also influence speech timing under special circumstances, such as speaking to an external beat.
However, since the late 1970s and 1980s (e.g. [12,13]), it has become clear that the view that each factor has a separate, direct effect on timing is problematic. Syntax is problematic because it has only an indirect influence on phonetic form, and predictability is problematic because many of its effects appear to be shared with other factors. In the following sections, we address these two problems and show how these factors relate to prosodic structure, which we see as a central aspect of the interface between language and speech. In our view, prosodic structure, segmental identity and segmental context are the factors that have a direct effect on the speaker's surface phonetic plan, including speech timing. When planning speech production, speakers balance these factors against non-grammatical factors such as speech rate and other stylistic requirements, clarity requirements and movement costs (e.g. energy, time) to yield a specification of the desired temporal patterns for a spoken utterance.
(a) The problem with syntax
Although it is clear that some syntactic manipulations have a measurable effect on duration (and other phonetic parameters), not all do. Consider for example
— Mary George's cousin]? ate a piece of cake
— Her cousin]? ate a piece of cake
— She]? ate a piece of cake.
In these examples, where ]? is used to indicate a possible site of boundary-related cues, the likelihood of these cues decreases for shorter subject noun phrases. That is, the longer subject noun phrase (Mary George's cousin) is more likely than the shorter ones (Her cousin and She) to show boundary-related phonetic cues such as pre-boundary lengthening and pause, even though they all share the same syntactic structure [14–17]. There are also some phonetic indicators of constituent boundaries that occur where syntax would not predict them, as in Sesame Street is brought to you by] … the Children's Television Workshop, where a break occurs within a prepositional phrase . Finally, levels of embedding found in syntax are often absent in speech : for example, the utterance above has a right-branching syntactic structure (figure 1, top), whereas its spoken phrasing is flatter (figure 1, bottom).
(b) Prosodic structure as a solution
(i) Prosodic constituent structure
Along with other findings from segmental phonology  and intonational phonology , these findings suggest that a structure that is influenced by syntax, but not isomorphic to it, directly defines the groupings observed in speech. This structure, called prosodic constituent structure, is hierarchical, and includes constituents such as words and perhaps feet or syllables at lower levels, and phrases of various sizes at higher levels. Although there are debates about many aspects of the prosodic hierarchy, e.g. about the number of levels in the hierarchy, and about the name and definition of each constituent type, there is general agreement about its hierarchical nature, and about the fact that it is flatter and more symmetric than syntactic structure [12,21]. An example of prosodic structure is shown in figure 2.
Prosodic constituent structure is a likely linguistic universal, although different languages may elect different sets of levels from the universal hierarchy . It has measurable effects on durational phenomena such as initial lengthening, final (or pre-boundary) lengthening, polysyllabic shortening (the shortening of syllables when more occur in a constituent), polysegmental shortening (the shortening of segments when more occur in a constituent) and pause (see [5,23] for reviews). Support for the universality of prosodic structure comes from the ubiquitous occurrence of final and initial lengthening patterns that reflect a structural hierarchy in languages of the world [24,25].
Phrasally related initial and final lengthening affect specific parts of initial and final words, respectively. Initial lengthening appears to be primarily localized on the initial C in phrase–initial CV and CCV sequences [26,27]. In final position, most of the lengthening occurs on the rhyme of the final word. Smaller, but significant amounts of lengthening have also been observed on lexically stressed syllable rhymes when the lexically stressed syllable is pre-final, as in Michigan or Trinidad (see  for Dutch and  for American English). Lengthening at other sites, e.g. the onset consonant of the phrase–final syllable rhyme, has also been observed, but these effects are sporadic in the sense that they appear to be study- or material-dependent, and may possibly be speaker-dependent. For both initial and final lengthening, the magnitude of the durational effects varies with boundary strength: stronger boundaries (e.g. phrases) are generally associated with greater degrees of lengthening [30,31] but interestingly not with a longer string in the domain of lengthening  (for discussions of polysyllabic shortening, see [32–34]).
Prosodic constituent structure also affects non-durational phonetic parameters, such as constituent-initial and final voice quality modifications [35–39], supralaryngeal articulatory modifications (e.g. phrase–initial strengthening, syllable–final lenition [25,40,41]), the use of word- or phrasal-prominence near the beginnings or ends of constituents [16,42], as well as intonational phenomena, e.g. phrase–final lowering, phrase–initial reset (cf. [20,43] among others).
(ii) Prosodic prominence structure
Prosodic structure also includes prosodic prominence structure, which describes different degrees of stress/accent found in words and phrases. For example, in one prosodification of the phrase Mary's cousin George, George is the most prominent word in the phrase, and is said to bear phrasal stress (also called sentence stress, or accent). In the words Mary and cousin, the word–initial syllables Ma(r)- and cou- are more prominent than the second syllables in these words, and are said to bear word- or lexical stress. Figure 3 shows a grid-like representation of prominence structure [44–46], illustrated for this phrase.
Like prosodic constituent structure, prosodic prominence structure is hierarchical, with word-stress near the bottom of the hierarchy, and phrasal stress at higher levels . It also has measurable effects on duration, but the effects of prominence on duration appear to be different from those related to prosodic constituent boundaries [32,48–50]. For example, monosyllabic words show different effects of phrasal prominence versus final lengthening: prominence increases the nucleus duration most, followed by the syllable onset, then optionally the coda, whereas final lengthening increases the nucleus duration most, followed by the coda, then (optionally) the onset. Prosodic prominence structure not only affects duration, but also affects other articulatory parameters such as articulatory distinctiveness and voice quality, and their acoustic consequences (e.g. formant structure and spectral balance) [51,52].
(iii) Prosodic structure as the interface between language and speech
The proposal that prosodic structure serves as an interface between language on the one hand, and speech on the other hand is illustrated in figure 4 (based on a similar figure in , see also , inter alia). The figure illustrates the indirect effects of factors such as syntax, utterance length, focus, etc., on surface phonetics, via prosody. Prosodic structure has direct influence on the phonetic plan. During speech planning, prosodic effects on phonetic parameters such as duration are balanced against the effects of segmental identity and context, as well as non-grammatical factors (e.g. rate and style of speech, clarity requirements, movement costs, etc.), on those same parameters.
Several aspects of figure 4 are worthy of comment. First, we assume that the non-grammatical factors have a direct influence on the plans for surface phonetic form, rather than influencing the phonological plan. Although factors such as rate and style of speech have been described as directly affecting aspects of prosody (e.g. fewer ‘breaks' at faster rates of speech, cf. ), our view is that a speaker plans the same prosodic structure (i.e. same relative prominence and relative boundary strength structure) for a given utterance at different rates of speech, but that the planned phonetic manifestation of this structure is different at different rates. This is because the rate-of-speech requirement must be balanced against the prosodic structure requirement in determining optimum surface phonetic characteristics that meet the competing demands. Second, the factors mentioned in figure 4 are intended to be a preliminary indicator of factors that might be at work, and may not be exhaustive. Related to this, there are other factors that are known to influence phonetics that remain to be investigated, for example, the adjustments that might be made in response to an interlocutor (possibly including non-speech input), a noisy environment or intense emotion. These adjustments might relate to phonological planning, e.g. choices of prosodic structure, or might be non-grammatical, e.g. reflected in specifications of rate or clarity, and would therefore be balanced against prosodic structure requirements in influencing the phonetic plan. And there are other candidate factors, such as cognitive processing costs and constraints, whose effects are not yet well-understood. Figure 4 is therefore intended as a tool for identifying and thinking about factors that influence phonetic planning and as a proposal for how they might interact.
(c) The problem with predictability
If we accept that prosodic structure has a measurable effect on duration, then another factor in the list becomes problematic: predictability. What we refer to as ‘predictability’ is the likelihood of a word given its context (linguistic and pragmatic/real-world) and frequency of use, i.e. the likelihood that a word can be guessed from its context. It has long been observed that more predictable words are produced with shorter durations than less predictable words [56–59]. For example, Lieberman  observed that more predictable words are shorter and less acoustically salient; he found that the word nine in A stitch in time saves nine (highly predictable context) was shorter than the word nine in The number that you will hear is nine.
The problem with predictability as a factor affecting duration is that it is unclear whether prosodic structure and predictability are both motivated as separate factors affecting duration. This is because prosodic structure and predictability are not independent. When predictability is low, syllables are more likely to be prosodically prominent, and words are more likely to be demarcated using prosodic boundary correlates such as initial and final lengthening and pause. For example, the word operas in the phrase health operas is more likely to bear phrasal stress than the word issues in the phrase health issues, possibly because issues in this context is more predictable [60,61]. In addition, the word nine may be longer in the phrase The number that you will hear is nine than in the phrase A stitch in time saves nine, because the nine in the former sentence is less predictable, and therefore the word boundary will be more saliently marked by lengthening on the word–initial /n/.
(i) Prosodic structure as the interface between predictability and acoustic salience: a solution to the predictability problem
Earlier studies [60,62] proposed that prosodic structure is the interface between predictability and acoustic salience, that is, prosodic structure is used to control acoustic salience in order to signal relative predictability . Aylett  proposed that in this way prosodic structure makes all words in an utterance equally easy to recognize. This proposal was termed the smooth signal redundancy hypothesis (figure 5, based on a similar figure in ).
In the sentence Who's the author?, Who's in its context (___the author?) is more predictable than author in its (full) context (Who's the ___?); the is even more predictable (context: Who's__author?); and furthermore, the word–initial syllable au(th)- is relatively unpredictable compared with the second syllable –(th)or. The smooth signal redundancy hypothesis states that an utterance's predictability profile (also called language redundancy) is inversely reflected in the prosodic structure of the elements (e.g. syllables and words) in the utterance. Prosodic structure is used to control the acoustic salience of surface phonetics (through prosodic prominence and boundary strength), so that the recognition likelihood of each element in the utterance is approximately equal, i.e. signal redundancy is smooth. As discussed in Aylett & Turk , the smooth signal redundancy profile is advantageous because it increases the likelihood of recognizing all of the elements in the utterance. The p(recognition) of the entire sequence corresponds to the product of the p(recognition) of each element in the sequence, and will therefore be greater if the p(recognition) of each element is equal than if the p(recognition) of different elements is different.
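The last step of this argument can be made explicit with a short worked inequality. The sketch below adds one assumption that the text does not state: the comparison is made between profiles that have the same average per-element recognition probability. The symbols n, p_i and P_utt are introduced here for illustration only.

```latex
% Assumption (not in the source): the mean of the p_i is held fixed.
% p_i   : recognition probability of element i (syllable or word)
% P_utt : recognition probability of the whole sequence
\[
P_{\mathrm{utt}} \;=\; \prod_{i=1}^{n} p_i
\;\le\; \left( \frac{1}{n} \sum_{i=1}^{n} p_i \right)^{\! n},
\qquad \text{with equality if and only if } p_1 = p_2 = \dots = p_n .
\]
% Two-element example with mean 0.8:
\[
0.8 \times 0.8 = 0.64 \;>\; 0.9 \times 0.7 = 0.63 .
\]
```

This is the inequality of arithmetic and geometric means: for a fixed average, the product is largest when all elements are equally recognizable, which is the sense in which a smooth profile is advantageous.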
As discussed in Turk , the idea that prosodic structure reflects predictability provides an explanation for the effect of utterance length on the likelihood of boundary occurrence and on boundary strength. This is because, all other things being equal, words are harder to guess (less predictable) in longer utterances. To understand why this is, consider a two-syllable utterance. All things being equal, there are two possible ways to parse such an utterance: as a sequence of two monosyllabic words, or as a single disyllabic word.
parsing option 1: [ syl ]word [ syl ]word
parsing option 2: [ syl syl ]word
For a three-syllable utterance, the number of possible parsings increases to four:
parsing option 1: [ syl ]word [ syl ]word [ syl ]word
parsing option 2: [ syl ]word [ syl syl ]word
parsing option 3: [ syl syl ]word [ syl ]word
parsing option 4: [ syl syl syl ]word
And for a four-syllable utterance, the number of possible parsings is even larger, i.e. eight. However, when a phrase boundary is inserted anywhere in the utterance, the number of possible parsings is halved. As this example illustrates, when predictability is relatively low, because an utterance is long, prosodic structure can be used to increase recognition likelihood by signalling constituent boundaries.
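The counting argument above can be checked by brute-force enumeration. The following is a minimal sketch, assuming (as the example does) that every gap between adjacent syllables is equally free to be a word boundary; the function names and the `fixed_boundaries` parameter are hypothetical and introduced only for illustration.

```python
from itertools import product


def parsings(n_syllables, fixed_boundaries=()):
    """List every way of grouping n syllables into consecutive words.

    Each of the n-1 gaps between syllables either is or is not a word
    boundary. Gap i (1-based) is the gap that follows syllable i.
    Gaps in fixed_boundaries are forced to be boundaries, as when a
    phrase boundary is signalled at that position.
    Each parsing is returned as a list of word lengths in syllables.
    """
    gaps = range(1, n_syllables)
    results = []
    for choice in product([False, True], repeat=n_syllables - 1):
        cuts = {gap for gap, is_cut in zip(gaps, choice) if is_cut}
        if not set(fixed_boundaries) <= cuts:
            continue  # this parsing ignores a signalled boundary
        words, start = [], 0
        for cut in sorted(cuts) + [n_syllables]:
            words.append(cut - start)
            start = cut
        results.append(words)
    return results


def count_parsings(n_syllables, fixed_boundaries=()):
    return len(parsings(n_syllables, fixed_boundaries))


for n in (2, 3, 4):
    print(n, "syllables:", count_parsings(n), "parsings")

# A phrase boundary after the second syllable of a four-syllable utterance
print("with boundary:", count_parsings(4, fixed_boundaries=(2,)), "parsings")
```

Under this assumption an utterance of n syllables has 2^(n-1) parsings, and fixing one gap as a boundary removes one binary choice, which is why the count is halved.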
Aylett & Turk  proposed that predictability is a composite factor that directly influences prosodic structure, and thereby indirectly controls acoustic salience . That is, all of the factors at the top of figure 6 contribute to the predictability of elements in an utterance. For example, a word's lexical frequency, together with its syntactic and semantic context, its real-world context (pragmatics) and utterance length, combine to predict how likely a particular word would be (i.e. how easily a word could be guessed) in that particular context. Aylett  refers to this predictability as ‘language redundancy’. Our current hypothesis is that the predictability of each element in an utterance relates to its predictability on the basis of both preceding and following elements (i.e. the full context), as well as its frequency of use and likelihood on the basis of real-world context, but note that it is an important research question to determine exactly what contributes to an element's predictability/language redundancy. As discussed in Turk , the speaker can compute predictability (language redundancy) on the basis of his/her own language and real-world experience. The speaker can incorporate information about the listener's knowledge, but need not do so.
As noted above, our hypothesis is that language redundancy is used to plan prosodic structure in order to make the recognition likelihood of each element equal. This goal of even recognition likelihood (or smooth signal redundancy) is balanced against other goals, such as speaking clearly, quickly or in rhythm as well as movement costs (e.g. time, energy) when speakers plan the surface phonetic properties of a spoken utterance.
Aylett & Turk  provide supporting evidence for the view that prosodic prominence structure reflects predictability: both prosodic prominence structure and predictability (word frequency, syllable transitional probability and first versus second mention of a word) largely accounted for the same variance in syllable duration in a large corpus study of spontaneous speech . Further supporting evidence includes findings that word durations are longer, and pauses and intonational boundaries more likely, in less predictable sequences [15,63], discussed in Turk .
(d) Summary of section 2
What is speech timing used for? We propose that one of its main purposes is to make utterances easier to recognize, by signalling the identity of individual speech sounds (e.g. did versus dad), and also signalling (and compensating for) the relative predictability of syllables and words in larger utterances. Because timing effects are implemented on very specific stretches of speech that relate to prosodic constituents (e.g. final lengthening occurs primarily on the rhyme of the final syllable; prominence-related lengthening occurs primarily on the stressed syllable nucleus and onset), it appears that predictability does not have a direct effect on surface phonetics (including timing), but rather its effects are mediated by prosodic structure (see other supporting arguments in ). We propose that the goal of making speech easier to recognize by smoothing signal redundancy is balanced against other goals and costs when planning surface durations in speech.
3. How is speech timing controlled?
Here, we address two different views of speech timing control: with and without an extrinsic timekeeper. Both approaches assume that surface timing patterns result from processes available for general non-speech motor control, but they propose very different mechanisms to generate those surface phenomena. Extrinsic timing approaches involve the use of a system-extrinsic timekeeper, which tracks, represents and specifies time in units that are not defined within the system (in the case of speech, the system would be the speech motor control system). By contrast, intrinsic timing systems do not involve system-extrinsic timekeepers. In such systems, all aspects of surface timing emerge from within-system characteristics. Any within-system timing specification is made in terms of within-system units, e.g. within-system oscillator periods or phasing. We note that we will call extrinsic any system that involves at least some timing computation by an extrinsic timing mechanism. However, we suspect that in many, if not all, extrinsic timing systems there may be aspects of surface timing that are emergent and do not need to be specified by the extrinsic timekeeper.
We first present the two approaches, and then three types of timing phenomena that suggest extrinsic timekeeper control.
(a) Timing with an extrinsic timekeeper
Extrinsic timekeepers can be used in motor control for a variety of functions, including tracking the passage of time, measuring time, representing time as well as specifying time as a parameter of movement. Theories of speech and non-speech motor control that assume an extrinsic timekeeper include Directions Into Velocities of Articulators (DIVA) [64,65], based on Vector Integration To Endpoint (VITE) , and many optimal control theory models (e.g. ). These models assume that desired movement durations can be specified as part of the plan for an utterance, and that the passage of time (and/or the time remaining) within a movement can be continuously tracked during the implementation of that plan. Within these models, state (e.g. spatial) information is also tracked continuously, and timing information is integrated with state information to generate appropriate movement velocities at each time point. For example, DIVA  and VITE  assume that at each point in time, a temporal GO signal is multiplied by the difference vector (distance remaining to the target assuming a straight line path) to give instantaneous movement speed. In Bullock & Grossberg , GO is a function of time that is proportional to 1 divided by the time-to-target-attainment at the current instantaneous movement speed (cf. Lee's tau ). Because the GO signal in Bullock & Grossberg  is an increasing function of time, and the distance to the target decreases as a function of time, multiplying GO by the distance remaining until the target at each point in time yields a bell-shaped velocity profile . The same GO for two different movement distances leads to equal movement durations for both, with higher peak velocities for the movement involving a greater distance. A larger GO for a given movement distance will yield a faster speed and therefore a shorter movement duration.
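The qualitative behaviour described above can be reproduced with a very reduced simulation. This is a sketch under stated assumptions, not an implementation of VITE or DIVA: the GO signal is assumed to be a linear ramp, `GO(t) = go_gain * t`; movement speed is simply GO multiplied by the distance remaining; and a movement is counted as complete once 95% of the distance has been covered. The names `simulate_reach`, `go_gain` and `done_fraction` are hypothetical.

```python
def simulate_reach(distance, go_gain, dt=0.001, done_fraction=0.95, max_time=10.0):
    """Integrate speed = GO(t) * (distance remaining) with a ramping GO signal.

    Returns (duration, speeds), where duration is the time at which
    done_fraction of the distance has been covered and speeds is the
    speed at every time step up to that point.
    """
    position, t = 0.0, 0.0
    speeds = []
    while t < max_time:
        go = go_gain * t                    # assumed: linearly increasing GO signal
        speed = go * (distance - position)  # GO times distance remaining
        speeds.append(speed)
        position += speed * dt              # simple Euler integration step
        t += dt
        if position >= done_fraction * distance:
            return t, speeds
    raise RuntimeError("target not approached within max_time")


short_duration, short_speeds = simulate_reach(distance=1.0, go_gain=10.0)
long_duration, long_speeds = simulate_reach(distance=2.0, go_gain=10.0)
fast_duration, _ = simulate_reach(distance=1.0, go_gain=40.0)

print("same GO, two distances:", round(short_duration, 3), round(long_duration, 3))
print("peak speeds:", round(max(short_speeds), 3), round(max(long_speeds), 3))
print("larger GO, same distance:", round(fast_duration, 3))
```

In this reduced form the speed starts at zero, rises as GO grows and falls again as the remaining distance shrinks, giving a single-peaked profile. The same `go_gain` gives the same duration for both distances with a higher peak speed for the longer one, and a larger `go_gain` shortens the movement.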
Optimal control theory models assume that we generate movements that are optimal in the sense that they meet task requirements at minimum cost. Many models of motor control in the optimal control theory framework are like DIVA in that they assume that we continuously monitor the states of our effectors (e.g. their position and velocity) in relation to the task goals, continuously updating our motor commands on the basis of state information to accomplish goals in a near-optimal way (but see [70,71] for an exception). In these models, movements are generated via a control policy that determines the optimal movement from any current state given the task goals and costs of movement. The control policy (which can be a solution to a set of equations) is determined by minimizing a cost function defining the task goals, costs of movement and their relative weightings in the current situation. Cost function minimization leads to the specification of values for all of the parameters in the model.
Although optimal control theory models do not necessarily require the use of an extrinsic timekeeper, many models developed within this framework use time as a parameter of movement and/or as a cost, and therefore assume one [67,69,70]. In many optimal control theory models that use extrinsic timekeepers, cost function minimization leads to the specification of movement parameters, including movement duration, where the optimal movement duration is the one that best satisfies the task requirements and minimizes movement costs. This movement duration results from several aspects of the cost function, including the specification of time as a task requirement, the cost of time, the cost of temporal inaccuracy and the temporal consequences of other movement costs, e.g. spatial inaccuracy at the movement target, or endpoint [69,72–74]. The goals of a movement will determine whether all of these aspects are included in the cost function. For example, if a movement must be produced within a certain time (as in tasks with a periodic rhythm), time would be an explicit task requirement, and spatial inaccuracy would be included in the costs.
In contrast to tasks that require a specified duration as a task goal, purely spatial tasks might not involve an explicit goal for movement duration, but there would be temporal consequences of other task requirements, e.g. of spatial accuracy at target achievement, because faster movements can be produced when there are less stringent spatial accuracy requirements. In addition, empirical findings show that movements are usually produced in the minimum time consistent with other task requirements, suggesting that time itself is a cost [73,74].
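How a movement duration can fall out of cost minimization, rather than being stipulated, can be illustrated with a toy cost function. The functional forms here are assumptions chosen for simplicity and are not taken from any of the cited models: time is charged linearly, and endpoint error is assumed to grow in proportion to average speed (distance divided by duration). The names `movement_cost`, `optimal_duration`, `time_weight`, `accuracy_weight` and `noise_scale` are hypothetical.

```python
def movement_cost(duration, distance, time_weight, accuracy_weight, noise_scale=0.05):
    """Toy cost: a linear cost of time plus a cost of squared endpoint error."""
    # Assumed: endpoint error is proportional to average movement speed.
    endpoint_error = noise_scale * distance / duration
    return time_weight * duration + accuracy_weight * endpoint_error ** 2


def optimal_duration(distance, time_weight, accuracy_weight, noise_scale=0.05):
    """Grid search over candidate durations from 0.010 s to 3.000 s."""
    candidates = [i / 1000 for i in range(10, 3001)]
    return min(
        candidates,
        key=lambda d: movement_cost(d, distance, time_weight, accuracy_weight, noise_scale),
    )


baseline = optimal_duration(distance=10.0, time_weight=1.0, accuracy_weight=1.0)
lax_accuracy = optimal_duration(distance=10.0, time_weight=1.0, accuracy_weight=0.1)
costly_time = optimal_duration(distance=10.0, time_weight=4.0, accuracy_weight=1.0)

print("baseline:", baseline)
print("less stringent accuracy:", lax_accuracy)
print("higher cost of time:", costly_time)
```

With these assumed forms, relaxing the accuracy requirement or raising the cost of time both shorten the optimal duration, which mirrors the two observations in the text: faster movements are possible when accuracy demands are lower, and durations are minimized when time itself is costly.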
Why should time be a cost? One possibility is that longer movements have more temporal variability . This could be explained by the view that the mechanism that meters out time is variable, and hence more variability is expected to accumulate for longer duration intervals. However, this would not explain minimized durations observed in tasks where temporal accuracy is not an issue. Shadmehr and co-workers [69,72], following Harris & Wolpert , offer an explanation that relates movement speed to reward. That is, moving fast is desirable, because we get to a rewarding state quickly; moving slowly is suboptimal because it delays the next desirable state. Evidence in the literature supports the view that getting to a rewarding state more quickly is preferred. For example, Jimura et al.  found that thirsty undergraduates preferred to receive a small amount of water now, rather than more later (see  for additional evidence).
The optimal control theory framework is particularly attractive for speech timing, which appears to involve the influence of many different prioritizable factors. It has been used successfully to model simple movements, and to model aspects of speech timing [70,71]. We note, however, that although many if not most optimal control theory models of motor control assume an extrinsic timekeeper, this theoretical framework is a theory of parameter value optimization, and can also be used in intrinsic timing models that do not use extrinsic timekeepers.
Simko & Cummins' embodied task dynamics model [70,71] is an interesting case: an example of a theory of speech motor control in which time is used only as a cost (where surface utterance duration is penalized), but not as a parameter of movement. In avoiding the use of time as a parameter of movement, this model is similar to the articulatory phonology/task dynamics (AP/TD) approach, discussed in more detail below. However, even though time is not a parameter of movement in this model, an extrinsic timekeeping mechanism is nevertheless required to specify and represent the utterance duration quantity that it penalizes. On the definition that we presented at the start of §3, we would therefore classify it as an extrinsic timing model, even though it makes less extensive use of an extrinsic timekeeper than other types of extrinsic models.
In summary, many models of motor control use extrinsic timekeepers and many of these are optimal control theory models. In §3b, we discuss a different approach, that is, timing without an extrinsic timekeeper in AP/TD. Although this model currently provides the most comprehensive account of timing effects in speech production, we believe extrinsic models should be considered, for reasons laid out in §3c below.
(b) Timing without an extrinsic timekeeper in articulatory phonology/task dynamics
The main theory of speech production that assumes that surface timing phenomena can be produced without an extrinsic timekeeper is AP/TD [77–83]. This theory is particularly important because it currently provides the most comprehensive account of timing phenomena observed in speech, and has led to a number of significant insights into the nature of speech production, such as the understanding that coarticulation between adjacent sounds is often a matter of articulatory overlap rather than of changes in the phonemic features that define the words. The model is based on oscillators; this key feature enables it to produce surface timing patterns without an extrinsic timekeeper.
AP/TD is unlike traditional phonological theories which assume that units of phonological contrast are symbolic, i.e. do not contain quantitative specifications for how articulatory movement should unfold. In AP/TD, units of phonological contrast are gestures, defined as equations of motion that determine how constrictions will be formed in the vocal tract; constriction releases are modelled as movement back to a neutral vocal tract position. In this framework, each dimension of gestural movement towards a constriction goal is modelled as movement towards an equilibrium position in a damped mass-spring system (analogous to the movement of a mass attached to a spring). The gesture's starting position is analogous to the position to which the mass attached to the spring is stretched, and the equilibrium position is the target position that is approached by the mass after releasing the spring. Because the system is critically damped, the mass does not oscillate, but rather asymptotes towards (approaches, but never quite reaches) the equilibrium position. It can thus be described as having point-attractor dynamics. The time required to approximate a constriction target (gestural settling time) is intrinsic to the system because it is dictated by the parameters of the mass-spring oscillator, i.e. its stiffness, mass and damping coefficients.
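The dependence of settling time on the system's own parameters can be seen in a short calculation. This is a sketch rather than the AP/TD implementation: it assumes a single gestural dimension that starts at rest, uses the standard closed-form solution for a critically damped mass-spring system, and treats a target as 'approximated' when 5% of the initial distance remains. The function names and the 5% criterion are hypothetical choices.

```python
import math


def remaining_fraction(t, stiffness, mass=1.0):
    """Fraction of the initial distance to equilibrium still remaining at time t.

    Closed-form solution for a critically damped mass-spring system
    released from rest: (1 + w*t) * exp(-w*t), with w = sqrt(stiffness / mass).
    """
    omega = math.sqrt(stiffness / mass)
    return (1.0 + omega * t) * math.exp(-omega * t)


def settling_time(stiffness, mass=1.0, tolerance=0.05):
    """Time at which the remaining fraction first falls to the tolerance."""
    lo, hi = 0.0, 1.0
    while remaining_fraction(hi, stiffness, mass) > tolerance:
        hi *= 2.0                      # widen the bracket until it contains the answer
    for _ in range(60):                # bisection on a monotonically decreasing function
        mid = (lo + hi) / 2.0
        if remaining_fraction(mid, stiffness, mass) > tolerance:
            lo = mid
        else:
            hi = mid
    return hi


for k in (100.0, 400.0):
    print("stiffness", k, "-> settling time", round(settling_time(k), 4))

# The remaining fraction shrinks but never reaches exactly zero
print(remaining_fraction(1.0, 100.0) > 0.0)
```

No clock or duration value appears among the inputs: the settling time follows entirely from stiffness and mass (with damping fixed at its critical value), and it does not depend on how far the gesture has to travel.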
Other aspects of timing within AP/TD are also determined by oscillators. As we explain below, point-attractor oscillators are additionally used to adjust the timing of gestures at positions defined by prosodic structure, i.e. for final lengthening and prominence-related lengthening [81,82]. AP/TD also uses two types of freely oscillating oscillators (i.e. oscillators with limit cycle rather than point-attractor dynamics): (i) gestural planning oscillators, and (ii) a hierarchy of coupled suprasegmental planning oscillators (syllable, foot and phrase oscillators). These oscillators are used during utterance planning to determine (i) relative timing among gestures (intergestural coordination), (ii) the amount of time that each gesture shapes the vocal tract (gestural activation) and (iii) some aspects of timing attributed to suprasegmental (i.e. prosodic) structure.
In this framework, intergestural timing is determined by the relative phasing among gestural planning oscillators assigned to each gesture, and does not need to be specified by an extrinsic timekeeper. For example, if two gestural planning oscillators entrain in-phase during utterance planning, then the physical gestures that correspond to each planning oscillator will begin at the same time. Other phasing relationships are also possible, but the most stable entrainment patterns are predicted to be the most common, i.e. in-phase and anti-phase. For a more complete discussion of intergestural timing, see Nam et al. .
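The idea that relative timing can fall out of entrainment, rather than being specified directly, can be sketched with two generic phase oscillators. This is not the coupling scheme used in AP/TD; the sinusoidal coupling term, the function name and all parameter values are assumptions chosen only to show how a target relative phase (in-phase or anti-phase) can be reached from an arbitrary starting relationship.

```python
import math


def entrain(target_relative_phase, initial_relative_phase=1.0,
            freq_hz=5.0, coupling=1.0, dt=0.001, n_steps=10000):
    """Simulate two coupled phase oscillators (Euler integration) and return
    their final relative phase, wrapped to the interval (-pi, pi]."""
    omega = 2.0 * math.pi * freq_hz  # shared natural frequency (rad/s)
    phi1, phi2 = initial_relative_phase, 0.0
    for _ in range(n_steps):
        # Each oscillator is nudged so that phi1 - phi2 approaches the target.
        error = math.sin(phi1 - phi2 - target_relative_phase)
        phi1 += dt * (omega - coupling * error)
        phi2 += dt * (omega + coupling * error)
    diff = phi1 - phi2
    return math.atan2(math.sin(diff), math.cos(diff))


in_phase = entrain(target_relative_phase=0.0)
anti_phase = entrain(target_relative_phase=math.pi)
print(f"in-phase target:   final relative phase = {in_phase:.4f} rad")
print(f"anti-phase target: final relative phase = {abs(anti_phase):.4f} rad")
```

Once the pair has settled, the relative phase fixes when one oscillator's cycle begins relative to the other's, without any reference to clock time.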
The amount of time that each gesture is active (i.e. its activation interval) is derived from other parameters within the system and does not need to be specified extrinsically. Gestural activation intervals specify the amount of time that a gesture actively shapes the vocal tract. Gestures whose activation intervals are as long as their settling times will have enough time to approximate their targets. On the other hand, if gestural activation intervals are shorter than gestural settling times, targets will not be approximated and undershoot will occur. If gestural activation intervals are longer than gestural settling times, then gestures will continue to asymptote towards their targets for the length of the activation interval (and will thus appear to be in a quasi-steady state for the duration of the activation interval).
Gestural activation interval timing is intrinsic, because activation intervals are specified within the model as a fixed proportion of each planning oscillator's cycle. Because gestural planning oscillations are coupled to the oscillations of the suprasegmental hierarchy of syllable-, foot- and phrase-oscillations, the physical duration of activation will depend on the frequency of oscillation of this whole planning oscillator ensemble, i.e. on overall speech rate. When speech rate (i.e. planning oscillator ensemble frequency) increases, activation intervals will be physically shorter, and undershoot will be more likely, although gestural activation intervals will still correspond to the same gestural planning oscillator proportion. Likewise, when speech rate decreases, activation intervals will be longer, and more time will be spent asymptoting (getting closer and closer) to the gesture's target.
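The dependence of undershoot on speech rate can be illustrated numerically. The sketch below assumes a critically damped gesture released from rest, a hypothetical stiffness value, and a hypothetical activation proportion of half a planning-oscillator cycle; none of these values comes from the AP/TD literature.

```python
import math


def activation_interval(oscillator_hz, proportion):
    """Activation interval (s) as a fixed proportion of one
    planning-oscillator cycle."""
    return proportion / oscillator_hz


def fraction_of_distance_covered(duration, stiffness, mass=1.0):
    """Fraction of the start-to-target distance covered after `duration`
    seconds by a critically damped mass-spring gesture released from rest."""
    omega = math.sqrt(stiffness / mass)
    return 1.0 - (1.0 + omega * duration) * math.exp(-omega * duration)


# Hypothetical planning-oscillator frequencies standing in for speech rate.
for rate_hz in (2.0, 4.0, 8.0):
    interval = activation_interval(rate_hz, proportion=0.5)
    covered = fraction_of_distance_covered(interval, stiffness=400.0)
    print(f"{rate_hz:.0f} Hz: active for {interval * 1000:.1f} ms, "
          f"covers {covered:.0%} of the distance to target")
```

The proportion of the cycle stays constant across rates; only the physical duration of the interval changes, and with it the degree of undershoot.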
Temporal aspects of prosodic structure are also intrinsic to the system and do not need to be extrinsically specified. There are two aspects of prosodic timing in this framework: first, interactions among higher-level organizing oscillators (e.g. syllable, foot, phrase) specify the rates of syllable, foot and phrase production. These oscillation rates, in turn, affect the rates of planning oscillators for individual gestures, which determine gestural activation intervals, because each activation interval corresponds to a proportion of a planning oscillator cycle. The second aspect of prosodic timing has to do with adjustments that are made to all gestures that are concurrently active within a specified interval (mentioned briefly above). For example, the lengthenings that commonly occur at prosodically privileged positions in an utterance, e.g. boundary-related and prominence-related lengthening, are generated by proportionally stretching the activation intervals of boundary-adjacent or prominent gestures [81,82].
Global timing, i.e. overall speech rate, is specified by the utterance-specific oscillation rate of the ensemble of suprasegmental and gestural planning oscillators, but again does not involve the specification of surface duration .
In the current form of AP/TD, surface timing characteristics cannot be specified, nor is there a mechanism that can keep track of the output durations while they are being produced, or measure them after they are produced. These features are not required in the model, because once speakers have chosen a rate of speech and have imposed prosodic boundaries and prominences on an ordered sequence of gesturally specified words, surface timing patterns emerge from the interacting mechanisms of the system.
Simko & Cummins' embodied task dynamic model [70,71] is similar to AP/TD in that it uses mass-spring oscillator systems for gestures. In this model, some aspects of surface timing are emergent, i.e. they result from the stiffness specification of the mass-spring system, and other aspects result from the coordination of these oscillators in terms of their phasing. However, as discussed above, Simko & Cummins' model cannot be considered a strictly intrinsic timing model because it uses an extrinsic timekeeping mechanism to represent an utterance duration cost and therefore the surface duration of each utterance.
Although the use of intrinsic timing has the advantage of minimizing the planning required for each utterance, the three kinds of evidence we present in the next section are difficult to reconcile with the intrinsic timing approach adopted in AP/TD, and are suggestive of extrinsic timekeeper control.
(c) Evidence for extrinsic timing
In §3c(i),(ii) which follow, we provide evidence that challenges the intrinsic timing aspect of the AP/TD model, because it supports the use of an extrinsic timekeeping mechanism in speech and non-speech motor control. In §3c(iii), we present evidence which is difficult to explain in mass-spring systems, although it may be implementable. These lines of evidence motivate us to consider extrinsic timing models that include time as a parameter of movement.
(i) Increasing variability with increases in interval duration: evidence for an extrinsic timekeeper
Patterns of variability in the timing of intervals support an extrinsic timing mechanism. Many studies show more variability in interval duration for longer intervals defined by movement [84–91], and as explained in Schmidt et al. [85, p. 422], these findings are expected in extrinsic timing models: ‘the mechanism that meters out intervals of time … is variable, and the amount of variability is directly proportional to the length of the interval of time to be metered out.’ The relationship of variability to mean duration follows Weber's law, with an approximately constant coefficient of variation (standard deviation/mean) across a range of intervals (from tens of milliseconds to seconds and possibly longer), for both humans and animals, consistent with an extrinsic timing mechanism [83,84,89–93]. Support for the view that the same timekeeping mechanisms are used in perception comes from a Weber relationship between difference threshold and interval duration in perceptual discrimination tasks .
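The Weber relationship described above can be stated compactly. The symbols here are our own shorthand rather than notation taken from the cited studies: $\mu_T$ is the mean of a produced interval, $\sigma_T$ its standard deviation, and $k$ an approximately constant Weber fraction.

```latex
\[
  \sigma_T \approx k\,\mu_T
  \quad\Longleftrightarrow\quad
  \mathrm{CV} = \frac{\sigma_T}{\mu_T} \approx k
\]
```

On this formulation, doubling the mean interval is expected to roughly double the standard deviation, while leaving the coefficient of variation unchanged.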
The Weber relationship between standard deviation and interval duration, suggestive of noise in a timing process and therefore of an extrinsic timing mechanism, is observed in many production tasks, including
2. Movements made to a metronome: for moving a stylus to and from a target, with interbeat intervals from 200 to 500 ms .
3. Movements made to an internally recalled rhythm in a continuation paradigm [88,90,94] among others: participants first produce a movement (e.g. tapping) in synchrony with a metronome (pacing phase), and continue the rhythm after the metronome is turned off (continuation phase). Typically, interval duration measurements are made from the continuation phase; standard deviations and mean interval durations are computed over a series of trials. Ivry & Hazeltine  found patterns of increased variability for longer tapping interval duration for intervals ranging from 325 to 500 ms. Spencer & Zelaznik  observed increased variability for longer tapping and continuous circle drawing intervals, as well as back-and-forth line drawing intervals, for intervals ranging from 300 to 500 ms (see also ).
4. Speech movements and intervals. Byrd & Saltzman  found that variability increased with movement duration for measured durations of lip aperture closings associated with a trans-boundary /m/-schwa-/m/ sequence. Movements of different durations were elicited in conditions designed to systematically vary the prosodic boundary strength before the second /m/. For example, the target sequence –mam- in mommamia was described as having no word boundary before its second /m/, whereas in momma mimi the second /m/ was separated by a word boundary from the preceding vowel. In other cases, the second /m/ was separated from the preceding vowel by a stronger boundary, and was either phrase- or utterance-initial. Movement durations were generally longer for stronger boundaries, because of constituent-final lengthening, whose magnitude increases with boundary strength (cf.  for acoustic measures). Data from Turk & Shattuck-Hufnagel  show a similar pattern for phrase-final versus phrase-medial word-final syllable rhyme measures, based on landmarks in the acoustic signal. Rhyme duration means and standard deviations were considerably higher for phrase-final words when compared with phrase-medial words; that is, monosyllabic words (e.g. Tom) had phrase-final mean durations of 346 ms (82 ms s.d.) versus phrase-medial mean durations of 193 ms (47 ms s.d.).
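As a back-of-the-envelope check on the rhyme duration figures just quoted, the coefficient of variation can be computed for each condition. This calculation is ours, not an analysis reported in the cited study, and the function name is illustrative.

```python
def coefficient_of_variation(mean_ms, sd_ms):
    """Standard deviation divided by the mean (dimensionless)."""
    return sd_ms / mean_ms


# Rhyme durations for monosyllabic words, as quoted in the text above.
phrase_final = coefficient_of_variation(mean_ms=346, sd_ms=82)
phrase_medial = coefficient_of_variation(mean_ms=193, sd_ms=47)

print(f"phrase-final CV:  {phrase_final:.3f}")
print(f"phrase-medial CV: {phrase_medial:.3f}")
```

Both values come out at roughly 0.24: the standard deviation grows approximately in proportion to the mean, as expected under a Weber relationship.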
In AP/TD, longer movement durations at phrase boundaries arise by stretching the activation intervals in the vicinity of the boundary, that is, by decreasing the oscillation rate of a planning oscillator ensemble in a specified interval, while leaving the number of oscillations the same. Within this framework, therefore, there are no additional ‘ticks’ of an utterance-specific clock that could be used to explain the source of the additional temporal variability. Thus, the substantial body of evidence supporting increased variability with longer-duration movements is inconsistent with the AP/TD model of motor timing.
(ii) Surface timing constraints and goal specifications: evidence that surface durations are part of the phonetic plan for an utterance
Within AP/TD, desired surface durations cannot be specified as part of the utterance plan. For example, gesture durations in phrase-final position reflect the settling-time of their mass-spring system, their gestural activation interval and an adjustment which lengthens the gestural activation intervals at the boundary . But in AP/TD, the surface duration emerges from these mechanisms alone, and cannot be specified in the original utterance plan.
However, Nakai et al. suggest that a constraint on surface durations of phonemically short vowels in phrase-final position may be required to preserve the short versus long phonemic contrast in Northern Finnish. In Northern Finnish disyllabic words with a phonemically short vowel in the word-final syllable (CVCV(C)), the final-syllable vowel is described as phonetically half-long because its duration is intermediate between that of the short vowel in other contexts and that of the contrasting long vowel (VV). The authors observed that the magnitude of final, accentual and combined lengthening on the half-long vowel was restricted (e.g. 17% combined accentual + final lengthening on the half-long vowel versus 68% on the long vowel in the same context). Support for a surface duration constraint also comes from observations that lengthening magnitudes were smaller for half-long vowels with longer phrase-medial durations; Nakai et al. found a negative correlation between phrase-medial half-long vowel durations and the magnitude of phrase-final lengthening. These results are consistent with the view that the surface durations of the (phonemically short) half-long vowel are restricted in order to avoid endangering the phonemic short versus long vowel quantity contrast in this language. Although it is possible to implement this type of effect in AP/TD, the effect is difficult to explain within the theory, because surface durations cannot be measured, represented or referred to as motivating factors.
Additional support for the representation of surface durations can be found in studies of rate of speech effects and durational correlates of prosodic structure and quantity [97–99]. These studies find that there is considerable variability in the strategies that different speakers use to implement these factors, but that nevertheless speakers all achieve a common surface duration pattern of relatively long surface durations, e.g. in phrase-final position, at slow speech rates and for phonemically long vowels. These findings challenge intrinsic timing in AP/TD because they suggest the equivalence of different strategies that result in similar surface duration patterns, and therefore support the specification of surface duration goals.
In summary, the two types of evidence we presented in §3c(i),(ii) strongly support the use of extrinsic timekeepers to measure, represent and specify surface movement and/or interval durations in speech. This evidence therefore supports models like DIVA/VITE in which duration is a planned parameter of movement. Although duration is not a parameter of movement in Simko & Cummins' [70,71] model, this model could probably be modified to account for these data, because Simko & Cummins use an extrinsic timekeeper to specify an utterance duration cost. However, their model might need to be amended to measure, represent and specify durations of constituents smaller than the utterance. Currently, in their model, although whole-utterance duration cost specification requires an extrinsic timer, surface timing of constituents smaller than the utterance (e.g. syllables, individual gestures) arises from phasing relations among gestures and from gestural stiffness, and is not specified directly. In §3c(iii), we present evidence which challenges this approach as well as AP/TD's approach, because it is difficult to account for in mass-spring systems. This evidence therefore motivates the consideration of extrinsic timing models of speech production which include time as a parameter of movement (and not simply as a cost).
(iii) Independent planning of the timing of movement onset versus target attainment: evidence difficult to account for in mass-spring models
Lee commented that ‘it is frequently not critical when a movement starts—just so long as it does not start too late. For example, an experienced driver who knows the car and road conditions can start braking safely for an obstacle a bit later than an inexperienced driver…’ This type of example suggests that timing variability differs between target attainment and movement onset. Such a difference is difficult to explain in mass-spring models such as AP/TD, but easier to explain in extrinsic timing models, because they can allow separate timing specification and prioritization for target attainment versus other parts of movement.
Several studies have confirmed the finding of differential variability in the timing of target attainment, compared with the timing of other movement events such as movement onset ([91,101–104], for non-speech motor activity;  for speech). For example, Bootsma & van Wieringen  showed that the timing of initiating forehand drives in table tennis was more than twice as variable as the timing of paddle contact with the ball. Forehand drives in this experiment had average movement times that ranged between 92 and 178 ms. Timing accuracy at paddle–ball contact was estimated on the basis of the ratio of standard deviation of the direction of travel of the paddle and its mean rate of change, and was calculated to be within 2–5 ms. By contrast, movement time standard deviations ranged from 5 to 21 ms, depending on the player, showing that movement initiation times were much more variable.
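The estimate of timing accuracy at paddle–ball contact described above can be written as a ratio. This is our reading of that description, and the symbols are our own: $\sigma_\theta$ is the standard deviation of the paddle's direction of travel at contact, $\overline{\dot{\theta}}$ is the mean rate of change of that direction, and $\hat{\sigma}_t$ is the resulting estimate of temporal variability.

```latex
\[
  \hat{\sigma}_t \approx \frac{\sigma_\theta}{\bigl|\,\overline{\dot{\theta}}\,\bigr|}
\]
```

The logic is that a given spread in direction at contact corresponds to a smaller spread in time when the direction is changing quickly.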
Perkell & Matthies  showed a similar pattern of timing variability for upper lip protrusion movements during spoken /i_u/ sequences, where the number of intervocalic consonants varied systematically. They observed lower variability in the timing of target attainment (maximum protrusion) relative to voicing onset for /u/, when compared with the timing of a point shortly after movement onset (maximum acceleration), relative to voicing onset for the same vowel. This pattern suggests a tighter temporal coordination of maximum lip protrusion with voicing onset than of lip protrusion movement onset with voicing onset. These findings suggest that target attainment timing is controlled independently of movement onset timing, and that target attainment timing takes higher priority. These findings are not predicted by mass-spring models in which the timing of movement onset is not independent from the timing of target achievement. That is, while AP/TD does provide a mechanism for separately adjusting the timing of the beginning and the end of an activation interval (by applying its prosodic ‘stretching’ mechanism to a proportion of the interval), it does not provide a mechanism by which these timings could be differently variable.
By contrast, an extrinsic timing mechanism can, in principle, (i) plan the timing of movement onset independently of the timing of target attainment, and (ii) account for the possibility of different degrees of variability in these two time points, as would be the case if the timing of target attainment has a higher priority than the timing of the movement onset, resulting in online adjustments to achieve high priority goals.
The separate control of different parts of a movement is also supported by evidence from spatial variability at target achievement versus other parts of movement. The first line of evidence for differential degrees of variability at different points in a movement trajectory comes from work by Todorov & Jordan, who found lower spatial variability at target achievement compared with elsewhere in movements, for a task in which participants moved a pointer through a series of circular targets on a flat table. When analysing their results, they sampled each movement trajectory at 100 equally spaced points along the path. They computed the average movement path, and determined spatial deviations from the average path at each of the 100 points. Results showed that spatial deviations from the average path were lowest at the circular targets, and higher in between. Paulignan et al. report similar results for shorter-than-a-second reaching movements (variability greater for the first half of the reaching movement than for the second half, as the hand approached the target), as do Liu & Todorov for two reaching tasks. They found that spatial variability was lowest at the beginning and end of each movement, and highest in between. Presumably the variability was low at the beginning of movement, because the movements started from a fixed point, and was low at the end of movement, because the end was the target.
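The path-based analysis described above (resampling each trajectory at 100 equally spaced points, averaging, and measuring deviations from the average path) can be sketched as follows. This is an illustrative reconstruction under our own assumptions, not the original authors' code: the function names are hypothetical, paths are taken to be two-dimensional, and the three demonstration trajectories are synthetic.

```python
import math


def resample_by_arc_length(path, n_points=100):
    """Resample a 2-D path [(x, y), ...] at n_points equally spaced
    positions along its length, using linear interpolation."""
    cum = [0.0]  # cumulative arc length at each original sample
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    resampled, seg = [], 0
    for i in range(n_points):
        s = total * i / (n_points - 1)  # target arc length for this point
        while seg < len(path) - 2 and cum[seg + 1] < s:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        frac = 0.0 if span == 0 else (s - cum[seg]) / span
        frac = min(max(frac, 0.0), 1.0)  # guard against rounding error
        (x0, y0), (x1, y1) = path[seg], path[seg + 1]
        resampled.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return resampled


def spatial_deviation_profile(paths, n_points=100):
    """Mean distance from the average path at each resampled point."""
    resampled = [resample_by_arc_length(p, n_points) for p in paths]
    profile = []
    for i in range(n_points):
        mean_x = sum(r[i][0] for r in resampled) / len(resampled)
        mean_y = sum(r[i][1] for r in resampled) / len(resampled)
        profile.append(sum(math.hypot(r[i][0] - mean_x, r[i][1] - mean_y)
                           for r in resampled) / len(resampled))
    return profile


# Synthetic trajectories sharing a start point and an end point (a 'target').
paths = [[(k / 50, a * math.sin(math.pi * k / 50)) for k in range(51)]
         for a in (-0.1, 0.0, 0.1)]
profile = spatial_deviation_profile(paths)
print(f"deviation at start {profile[0]:.4f}, "
      f"mid-path {profile[50]:.4f}, end {profile[-1]:.4f}")
```

With these synthetic paths, the deviation profile is near zero at the shared start and end points and largest mid-path, which is the qualitative pattern reported for the reaching tasks.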
These results suggest that actors are able to identify parts of a trajectory that relate most closely to task performance, and are able to prioritize spatial accuracy in these parts of the trajectory. The results are also consistent with the view that actors make use of a feedback-based error correction system to implement error correction in the parts of the trajectory whose accuracy has been prioritized. On this view, errors in planned movement trajectories (as evidenced by deviations from the mean) can be left uncorrected if they do not interfere with task performance. In addition, these data suggest that separate parts of movement are identified, so that spatial accuracy can be prioritized, something that would be straightforward if these same points were also identified for differential timing prioritization in an extrinsic timing model. Different degrees of spatial (as well as timing) variability at different parts of movement are difficult to explain in mass-spring models, though perhaps not impossible to implement.
(d) Summary of section 3
Here, we reviewed two types of timing control theory: (i) without an extrinsic timekeeper, exemplified by AP/TD and (ii) with an extrinsic timekeeper, exemplified by DIVA and many types of optimal control theory models. Several findings challenge models such as AP/TD that do not make use of an extrinsic timekeeper. These findings include greater timing variability for longer duration intervals when compared with shorter duration intervals, the apparent use of a durational constraint in Northern Finnish, as well as the use of different strategies to achieve the same duration patterns as a speech planning goal. Additionally, we presented evidence of differential timing variability at movement end when compared with movement onset. This evidence is difficult to explain in intrinsic timing (mass-spring) models, but is more straightforward to account for in extrinsic timing models. Models of speech timing control that involve an extrinsic timekeeper are therefore worth investigating, although they will require extensive development to account for the range of phenomena currently captured by AP/TD.
Understanding speech timing requires an understanding of both what timing is used for, and how it is controlled. We propose that one goal of speech timing is to make speech understandable, and that this goal is balanced against other goals, such as speaking quickly, to give the surface timing properties of speech. This view is based on findings from controlled experiments, as well as from analyses of relationships among factors proposed to account for surface timing patterns. We also presented two alternative ways of modelling surface timing patterns: (i) as an emergent property of motor control, without the involvement of an extrinsic timekeeper (as in AP/TD) and (ii) as the result of desired durational specifications made possible by an extrinsic timekeeper (as in DIVA/VITE and many optimal control theory models, where desired durational specifications are balanced against other task requirements and costs to generate (near-)optimal movements). Although the AP/TD framework currently exceeds other models in its ability to account for speech timing phenomena, several findings present challenges for this framework, and raise the possibility that models of motor control that involve an extrinsic timekeeper may ultimately provide a simpler and more comprehensive account of speech timing behaviour.
While there are aspects of what timing is used for, and of the structures that govern it, that still remain to be discovered, our current understanding of these two aspects of speech timing is more advanced than our understanding of the mechanisms that are used to control it. It is hoped that advances in experimentation, modelling and neuroscience will eventually lead to a better match between our understanding of speech timing patterns and our models of how these patterns arise.
This work was supported by an Arts and Humanities Research Council fellowship (AH/1002758/1) to the first author, and NIH R01-DC008780 to the second author.
We thank Jelena Krivokapic and two anonymous reviewers for useful comments on previous versions of this manuscript, Elliot Saltzman and Louis Goldstein for tutorial discussions on articulatory phonology/task dynamics, and Dave Lee for discussions of General Tau theory. Any errors are ours.
- © 2014 The Author(s) Published by the Royal Society. All rights reserved.