Quality of Service and Temporal Fidelity

This page discusses special quality-of-service issues for MIDI and other audio-related media and defines relevant terminology. Conventional network quality-of-service criteria are mentioned briefly: it is assumed that any transport carrying MIDI messages will provide adequate quality-of-service as measured by conventional criteria.

The MIDI 1.0 transport layer (as defined by the "hardware" specification on pages 1 and 2 of the Complete MIDI 1.0 Detailed Specification) implicitly provides certain real-time service guarantees. These guarantees characterize the basic temporal fidelity (rhythmic integrity) of MIDI musical performance data. They are significant to musical performance and contribute to the current success of MIDI in various application areas. For many (but not all) applications, they must be maintained (and improved where feasible) so that MIDI technology remains useful to its core constituency: musicians.

Standard layered network models often downplay issues related to temporal behavior. The quality of network service is usually defined in terms of reliability, capacity, and, sometimes, latency (see Terminology below).

These measures are inadequate for MIDI, audio and other types of audio-related media. MIDI was designed for the purpose of conveying musical performance data. As with audio, the ear is extremely sensitive to small variations in the timing of MIDI messages used to trigger or modify audible events such as musical notes. Temporal fidelity (preservation of rhythmic integrity) is affected most strongly by jitter, but latency is also important.

Jitter

A number of perceptual studies have shown that for streams of individual audio events, timing jitter on the close order of one millisecond can be audible, particularly in the context of rhythmically complex and syncopated ensemble music. [Iyer, Bilmes et al., Lunney, Michon, Schloss, Van Noorden]

[Moore] argues convincingly that time intervals on the close order of 1.5 milliseconds are both audibly significant and controllable by human performers in common musical situations. Consider a sequence of sounds, each consisting of a pair of clicks separated by a short delay (1,2,3,4… msec). Each successive sound has a distinctly different and predictable pitch. The ability to identify musical timbres is strongly linked with their attack transients. If a paired click, as discussed above, were used as the attack transient for sound with much longer duration, the delay between the two clicks would play an important role in determining the timbral identity of the sound.

This phenomenon is particularly significant when grace notes, flams and other musical decorations are played. For example, a pianist plays grace notes by extending one finger slightly before another, and moving the wrist as the hand descends so that the first finger strikes the keyboard slightly sooner. A skilled pianist can reliably control his or her hand geometry so that one finger is about 1 millimeter lower than the other. This corresponds to a time interval on the close order of 1.5 milliseconds under typical playing conditions. While the absolute time position of a particular gesture may vary 10-20 milliseconds (or more) from one performance to another, the relative interval between the grace note and the associated note is far smaller, and quite repeatable. As explained in the preceding paragraph, even small variations in such inter-note delays are quite audible under these circumstances.
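
As a quick check of the arithmetic above, the following sketch derives the descent speed implied by a 1 millimeter offset and a 1.5 millisecond interval. The function names, and the assumption that both fingers descend at the same constant speed over the final stretch of travel, are illustrative only and are not taken from [Moore].

```python
def implied_speed_m_per_s(offset_mm: float, interval_ms: float) -> float:
    """Descent speed at which a height offset yields the given time interval.

    Assumes (for illustration only) constant, equal speed for both fingers.
    """
    return (offset_mm / 1000.0) / (interval_ms / 1000.0)


def interval_for_offset_ms(offset_mm: float, speed_m_per_s: float) -> float:
    """Time interval produced by a height offset at a given descent speed."""
    return offset_mm / speed_m_per_s  # mm divided by m/s gives ms


speed = implied_speed_m_per_s(1.0, 1.5)
print(f"implied descent speed: {speed:.2f} m/s")
# Effect of a 0.1 mm change in hand geometry at that speed:
print(f"interval shift for 0.1 mm: {interval_for_offset_ms(0.1, speed):.2f} ms")
```

Under these assumptions, a change of a tenth of a millimeter in hand geometry shifts the inter-note interval by a small fraction of a millisecond, which suggests why the interval is so repeatable.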

In order to preserve the rhythmic integrity of grace notes and other musical decorations, timing accuracy of 1 millisecond or less is needed. Jitter above this threshold can audibly degrade the reproduction of a musical performance.

Within a continuous audio stream, much smaller amounts of jitter are perceptible. High-quality digital audio equipment and digital-to-analog converters provide jitter levels on the order of ten picoseconds or lower.

Jitter may be caused by a number of factors:

  • Rate-limit jitter results when a bursty message stream exceeds the capacity of a data transport channel.
  • Bus contention jitter results from the need to wait a variable length of time for a shared bus to be available.
  • Clock jitter results from phase shift or other variations in the clock used to transport bits over some underlying transmission media.
  • Software processing jitter results from variations in execution time for software associated with transport stream processing. Such variations can be due to differences in processing for different kinds of events, varying resource availability within a computer system (e.g. from multitasking or virtual memory) or other factors.

It is important to note that rate-limit jitter is deterministic, whereas bus-contention and most other forms of jitter are not. Since rate-limit jitter is caused by characteristics of the source message stream, it is possible to reorder or otherwise modify the source stream to ensure that high-priority events (such as drum notes) are transmitted at predictable times. In many cases, it is also possible to inspect a given source stream, determine the necessary transmission rate, and request a data transport channel with the appropriate characteristics. Since bus contention jitter and other kinds of jitter are non-deterministic (and cannot easily be bounded a priori), it is impossible for the sender to compensate for these forms of jitter.
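
The deterministic nature of rate-limit jitter can be illustrated with a minimal sketch. It models a MIDI 1.0 serial link (31250 bits per second, 10 bits on the wire per byte) and a burst of five 3-byte messages that are all intended for the same instant. The message names, the helper functions, and the decision to ignore running status are assumptions made for illustration.

```python
# MIDI 1.0 serial link: 31250 bits/s, 10 bits on the wire per byte
# (start bit + 8 data bits + stop bit), so one byte takes 320 microseconds.
BYTE_TIME_US = 10 * 1_000_000 // 31250


def delivery_times_us(messages):
    """Return (name, intended_us, delivered_us) for each message.

    `messages` is a list of (name, intended_us, length_bytes) tuples in
    transmission order. A message starts once the link is free and its
    intended time has arrived; it is delivered when its last byte is sent.
    """
    link_free_at = 0
    result = []
    for name, intended, length in messages:
        start = max(link_free_at, intended)
        link_free_at = start + length * BYTE_TIME_US
        result.append((name, intended, link_free_at))
    return result


def with_priority_first(messages, is_priority):
    """Stable reorder: priority messages go first among equal time stamps."""
    return sorted(messages, key=lambda m: (m[1], not is_priority(m[0])))


# A burst: a four-note chord and a drum hit, all intended for t = 0.
burst = [("chord C", 0, 3), ("chord E", 0, 3), ("chord G", 0, 3),
         ("chord B", 0, 3), ("drum", 0, 3)]
drum_first = with_priority_first(burst, lambda name: name == "drum")

for label, stream in (("original order", burst), ("drum first", drum_first)):
    print(label)
    for name, intended, delivered in delivery_times_us(stream):
        print(f"  {name:8s} delivered {delivered - intended} us after intended time")
```

Because the delay of each message depends only on the contents of the stream, the sender can compute it in advance. In this sketch, moving the drum hit to the front of the burst reduces its delay from 4800 to 960 microseconds, at the cost of delaying each chord note by one message time.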

Latency

Compared with jitter, a fixed amount of latency is much less likely to cause problems, provided that two conditions hold:

  1. such latencies are applied equally to all audible sources, to avoid relative timing skew between sources;
  2. the total latency does not cause a perceptible and/or objectionable delay between the time a message is sent and the time a response is perceived.

Sound travels at approximately one foot per millisecond. The distance between the sound radiating elements of most acoustic instruments and the performer's ears is in the range of one to four feet (about 0.5 - 1.5 meters). This corresponds to a latency of 1 to 4 milliseconds between the moment the performer initiates a note and the time the first acoustic results are heard. In small ensembles such as string quartets, performers are generally located within five to seven feet (about 2 meters) of each other. This corresponds to a maximum inter-performer latency of about 7 milliseconds. Rock groups and other amplified ensembles generally place speakers so that similar inter-performer latencies are produced, even when there are larger physical distances between performers on stage or in a recording studio. Headphones, of course, afford very low-latency acoustic sound reproduction (< 1 millisecond). On the other hand, music as heard by concert audiences is subject to much greater latencies.
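
The distance-to-latency figures above can be reproduced with a short calculation. This sketch assumes a speed of sound of 343 meters per second (dry air at about 20 degrees C), a value not stated in the text; the function name is illustrative.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at about 20 degrees C
METERS_PER_FOOT = 0.3048


def acoustic_latency_ms(distance_ft: float) -> float:
    """Time for sound to travel `distance_ft` feet, in milliseconds."""
    return distance_ft * METERS_PER_FOOT / SPEED_OF_SOUND_M_PER_S * 1000.0


for feet in (1, 4, 7):
    print(f"{feet} ft -> {acoustic_latency_ms(feet):.1f} ms")
```

With this assumed speed, one foot corresponds to slightly under one millisecond, which is consistent with the rule of thumb used above.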

The threshold for tolerable latency depends on the specific application. Passive listening applications such as a song player can clearly tolerate significant latencies. Game applications (where musical events are tied to game actions) are generally more demanding. Interactive performance and music composition applications (where an end user is directly involved with triggering musical events and music is the primary focus) require even lower latency. Professional users, of course, have the most stringent needs.

Temporal Fidelity

Good temporal fidelity requires fixed latency with bounded jitter. Perceptible latency varies according to the application, while perceptible jitter levels are largely independent of the application.

Temporal fidelity is a system-level property. It is characterized by specific, system-level bounds for latency and jitter. Each system component should perform within better-than-system-level tolerances in order to ensure that system-level bounds are met. Therefore, each system component should be allocated a specific, proportionate share of the system-level bounds in order to maintain temporal fidelity at a given level. A complete system comprises three distinct types of components:

[Figure: system components (source, transport, sink)]

Each of these component types should be allocated an appropriate share of the total system-level jitter and latency budget. The MIDI 1.0 hardware specification implicitly defines the performance bounds of the middle (transport) components. Source and sink entities (controllers, sequencers, sound generators) from different manufacturers have varying performance characteristics.

For music performance, recommended system-level bounds are 10 milliseconds total latency and +/- 1.0 milliseconds peak jitter. Preferred system-level bounds are 5 milliseconds total latency and +/- 0.5 milliseconds peak jitter. In order to maintain overall temporal fidelity, the jitter and latency contributions from the media transport components should be significantly less than these system-level bounds.
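
One way to reason about such a budget is to sum per-component contributions and compare the totals against the system-level bounds quoted above. The sketch below does this under a simple worst-case assumption that latency and peak jitter add linearly along the chain. The component figures are hypothetical and do not describe any real device.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    latency_ms: float      # fixed delay contributed by this component
    peak_jitter_ms: float  # worst-case +/- deviation contributed


# System-level bounds quoted in the text: (total latency, peak jitter) in ms.
BOUNDS = {"recommended": (10.0, 1.0), "preferred": (5.0, 0.5)}


def system_totals(components):
    """Worst-case totals, assuming contributions simply add along the chain."""
    latency = sum(c.latency_ms for c in components)
    jitter = sum(c.peak_jitter_ms for c in components)
    return latency, jitter


def meets(components, level):
    """True if the chain stays within the named system-level bounds."""
    max_latency, max_jitter = BOUNDS[level]
    latency, jitter = system_totals(components)
    return latency <= max_latency and jitter <= max_jitter


# Hypothetical figures, for illustration only.
chain = [
    Component("source (controller)", 2.0, 0.3),
    Component("transport", 1.0, 0.1),
    Component("sink (sound generator)", 3.0, 0.4),
]

for level in BOUNDS:
    print(level, "bounds met:", meets(chain, level))
```

In this hypothetical chain the totals fall inside the recommended bounds but outside the preferred ones, even though the transport contributes only a small share. This illustrates why each component needs to perform well within the system-level figures.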

Terminology

Asynchronous serial
A signaling method for encoding (and recovering) the serial clock and serial data over a physical-layer communication link or bus. Asynchronous serial data transmission uses an implicit clock (inferred from the intervals between data pulse transitions), with potentially variable spacing between successive data blocks. See Synchronous serial.
Asynchronous service
A service providing variable-rate message or data delivery within a transport (or higher) layer. The length of time needed to deliver a given message can never be guaranteed, although statistical assurances may be provided in certain networks and configurations. Asynchronous service may be reliable or unreliable. For example, in TCP/IP networks, TCP service is reliable while UDP service is not. With reliable asynchronous service, message delivery is guaranteed. If an error occurs, the message or data will be resent but will necessarily arrive at a later time. When errors occur, subsequent messages or data are often also delayed.
Asynchronous transport layers used in common networks often do not preserve the order in which messages were sent. In this case, additional logical mechanisms are needed to store and re-sequence messages in order to reconstruct the original transmission order. This introduces additional delays. See Isochronous service.
Capacity
The maximum number of bytes of message data that can be delivered per second over some communications channel measured over a specified time interval. Capacity measurements for a given network layer take into account any overhead imposed by lower layers. See Quality of Service, Throughput.
Delta time
The interval between delivery of two successive messages or groups of messages. See Jitter.
Isochronous service
A service providing constant-rate message or data delivery within a transport (or higher) layer. With this service, the time of delivery is guaranteed, but actual delivery is not: if an error occurs, the data is not resent. Subsequent messages are never delayed when an error occurs in a preceding message. Since isochronous delivery lacks a retry mechanism, it is said to be "unreliable". An isochronous transport layer also preserves the order in which messages are sent. Note that every isochronous connection has a fixed capacity (transmission rate). If a sender attempts to send too many messages at a given moment, one or more messages will be delayed. Isochronous connections can only guarantee delivery time for properly rate-limited message streams. For example, the MIDI 1.0 transport layer (unidirectional MIDI DIN connection) provides isochronous service. See Asynchronous service, Rate-limited.
Jitter
The deviation between the intended and actual time intervals (delta times) between two events over a given transport. Jitter can also be characterized as the amount of variation in latency. Peak jitter is the largest such deviation experienced by any pair of events as a result of transport over a given stream. Average jitter is the mean deviation from average latency. See Delta time, Quality of Service.
Latency
The average (mean) end-to-end transit time for a single message. Latency is measured from the time that the message (or first byte of a partial message) is submitted for delivery, up to and including the time when the last byte of the complete message is delivered to the destination. Also called 'delay' or 'transit delay.' See Quality of Service.
Rate-limited
Having a bounded transmission rate which never exceeds the momentary capacity of a given data transport service. If too much data is submitted for delivery at a given moment, some or all of the data will be delayed, introducing jitter. MIDI message streams are bursty, often containing clusters of closely-spaced events separated by long intervals containing few if any events. A message stream is said to be rate-limited if its contents never exceed the momentary capacity of the data transport carrying the stream.
Reliable
Delivery of data is guaranteed. Reliable delivery is usually accomplished using a protocol whereby the receiver acknowledges the receipt of each message. The integrity of each message is usually also verified. This enables the sender to retransmit lost or corrupted data. Reliability criteria include the frequency of errors, the means of error detection and correction, and whether error correction is automatic (provided by some lower layer) or requires additional action by the client of a given service. See Unreliable.
Quality of Service (QOS)
The set of parameters that characterize the performance of a particular transport-layer data flow (which might use either connection-oriented or connectionless data transfer). Reliability, capacity, jitter and latency (transit delay) are among the most important QOS parameters.
Synchronous serial
A signaling method for transmitting serial data and an associated clock over a physical-layer communication link or bus. Synchronous serial data transmission uses an explicit clock transmitted separately from the data, with predetermined spacing between successive data blocks. See Asynchronous serial.
Temporal fidelity
The degree to which a message stream maintains the rhythmic integrity of the original musical performance. Jitter is the primary factor degrading temporal fidelity, because it distorts the delta-times between successive MIDI messages. Excessive latency can also impair temporal fidelity, by making it harder for a musician to perform events with the desired timing in the context of other musical elements. See Jitter, Latency.
Throughput
The actual number of bytes of message data delivered per second over some communications channel measured over a particular time interval. See Quality of Service, Capacity.
Unreliable
Data is sent without acknowledgement. If a message is lost or corrupted, the sender is not informed. See Asynchronous service, Isochronous service, Reliable.
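
The definitions of latency and jitter given above can be turned into a small measurement routine. This is a minimal sketch of one reading of those definitions: it takes peak jitter to be the largest deviation in delta time over any pair of messages, which equals the spread between the longest and shortest transit times. The function name and the sample time stamps are hypothetical.

```python
from statistics import mean


def temporal_metrics(sent_ms, received_ms):
    """Return (latency, peak_jitter, average_jitter) for one stream, in ms.

    `sent_ms[i]` is when message i was submitted for delivery and
    `received_ms[i]` is when its last byte arrived, on the same clock.
    """
    transit = [r - s for s, r in zip(sent_ms, received_ms)]
    latency = mean(transit)
    # The deviation in delta time between messages i and j equals the
    # difference of their transit times, so the largest deviation over any
    # pair is the spread between the slowest and the fastest message.
    peak_jitter = max(transit) - min(transit)
    average_jitter = mean(abs(t - latency) for t in transit)
    return latency, peak_jitter, average_jitter


# Hypothetical time stamps: four messages sent 10 ms apart.
sent = [0.0, 10.0, 20.0, 30.0]
received = [2.0, 12.5, 21.5, 32.0]
latency, peak, average = temporal_metrics(sent, received)
print(f"latency {latency} ms, peak jitter {peak} ms, average jitter {average} ms")
```

Note that this peak figure is a total spread; a bound quoted as "+/-" around the mean would be compared against the deviation from the average latency instead.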

References

Bilmes, J. 1993. "Timing is of the Essence: Perceptual and Computational Techniques for Representing, Learning, and Reproducing Expressive Timing in Percussive Rhythm." Masters thesis, MIT Media Lab. http://www.icsi.berkeley.edu/~bilmes/mitthesis/index.html

Brandt, E. and Dannenberg, R. 1998. "Low-latency music software using off-the-shelf operating systems." Proceedings of the International Computer Music Conference. http://www.cs.cmu.edu/~rbd/papers/latency98/latency98.pdf

Freed, A., Chaudhary, A. and Davila, B. 1997. "Operating Systems Latency Measurement and Analysis for Sound Synthesis and Processing Applications." Proceedings of the International Computer Music Conference, Thessaloniki, Greece.

Iyer, Bilmes et al. 1997. "A Novel Representation for Rhythmic Structure." Proceedings of the International Computer Music Conference.

Lunney, H. M. W. 1974. "Time as heard in speech and music." Nature (249):592.

Michon, J. A. 1964. "Studies on subjective duration 1. Differential sensitivity on the perception of repeated temporal intervals." Acta Psychologica (22): 441-450.

Moore, F.R. 1988. "The Dysfunctions of MIDI." Computer Music Journal 12(1):19-28.

Schloss, A. 1985. "On The Automatic Transcription of Percussive Music From Acoustic Signal to High-Level Analysis." Ph. D. thesis, Stanford University, CCRMA.

Van Noorden, L. P. A. S. 1975. "Temporal coherence in the perception of tone sequences." Unpublished doctoral thesis, Technische Hogeschool, Eindhoven, Holland.

Wessel, D. and Wright, M. 2000. "Problems and Prospects for Intimate Musical Control of Computers." ACM SIGCHI, CHI '01 Workshop New Interfaces for Musical Expression (NIME'01).

Wright, J. and Brandt, E. 2000. "MidiWave Analysis of Windows 98 Second Edition MIDI Performance." Presented at Windows Audio Professionals Roundtable (Winter NAMM 2000) and at the 2000 Annual General Meeting of the MIDI Manufacturers Association.

Wright, J. and Brandt, E. 2001. "System-Level MIDI Performance Testing." Proceedings of the International Computer Music Conference, Havana.