Audio WAV file format

The RIFF (.wav) file format has been around, essentially unchanged, since the early 1990s and is still in common use today. It dates from a time when CD audio was king, and RIFF .wav follows the Red Book convention pretty closely, except that the number of channels, bits per sample and samples per second can vary, as described in the file header.

As an example file, I have selected TADA.WAV from Windows 10, \windows\media\tada.wav. This file is 285,228 bytes; the important part is the first 44 bytes, shown here.

* OFFSET +0       +4         +8       +C
00000000 52494646 245A0400 - 57415645 666D7420 *RIFF$Z..WAVEfmt *
00000010 10000000 01000200 - 44AC0000 10B10200 *........D¬...±..*
00000020 04001000 64617461 - 005A0400 00000000 *....data.Z......*

The first thing to notice is that .wav files, and indeed all RIFF files, always have “RIFF” as the first 4 characters. The RIFF header is immediately followed by a “WAVE” identifier. RIFF is the “Resource Interchange File Format”, and it organizes file data into “chunks”. A chunk is a collection of data that starts with a 4 character code identifier, followed by a 32-bit length giving the amount of data in the chunk, not including the 8-byte chunk header. These and all other numbers in RIFF are stored little endian. Chunk sizes are usually even; where a size is odd, a pad byte follows the data and the parser looks for the next chunk at the next even address. That is, chunks are padded to an even size, although the stored length does not count the pad byte.
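Here is a minimal Python sketch of reading one chunk header and finding where the next chunk begins. The function names `read_chunk_header` and `next_chunk_offset` are my own, made up for illustration; the bytes are the first 16 bytes of the hex dump above.

```python
import struct

def read_chunk_header(buf: bytes, offset: int):
    """Return (fourcc, size) for the chunk header found at `offset`."""
    # "<4sI": 4 raw bytes, then an unsigned 32-bit little endian integer.
    fourcc, size = struct.unpack_from("<4sI", buf, offset)
    return fourcc.decode("ascii"), size

def next_chunk_offset(offset: int, size: int) -> int:
    """Offset of the chunk following one at `offset` holding `size` data bytes."""
    # 8 header bytes, then the data, then one pad byte if the size is odd.
    return offset + 8 + size + (size & 1)

# First 16 bytes of TADA.WAV as shown in the dump.
header = bytes.fromhex("52494646245A040057415645666D7420")

print(read_chunk_header(header, 0))   # ('RIFF', 285220)
print(next_chunk_offset(0x0C, 16))    # 36: "fmt " chunk at 0x0C with 16 data bytes
print(next_chunk_offset(100, 7))      # 116: an odd-sized chunk gets one pad byte
```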

The RIFF chunk is at the start of the file. For TADA.wav, the size field is stored as the bytes 24 5A 04 00, which read as a little endian number is 0x00045A24, or 285,220 in decimal. The total file size equals the size of the RIFF chunk plus the 8 bytes of chunk header that identify it as a RIFF chunk, and for TADA.wav these add up and match.

  • Filesize = 285,228 = 8 + 285,220
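The same arithmetic can be checked in a couple of lines of Python. This is only a sketch working from the four size bytes in the dump, not from the file itself; the variable names are mine.

```python
# The four size bytes exactly as they appear in the file, at offset 4.
size_field = bytes.fromhex("245A0400")

# Little endian: the least significant byte comes first.
riff_size = int.from_bytes(size_field, "little")

# Add the 8-byte RIFF chunk header ("RIFF" + the size field itself).
file_size = 8 + riff_size

print(hex(riff_size), riff_size, file_size)   # 0x45a24 285220 285228
```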

Next we dig into the sub-chunks of the RIFF chunk.  The RIFF chunk contains other chunks as its data.

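Right away we see “WAVE”. Strictly speaking this is not a chunk of its own, since it has no length field; it is the form type of the RIFF chunk. This is a .WAV file! We knew that from the file extension, but now we really know it.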

“WAVE” identifies the start of wave formatted data. In nearly all files a “fmt ” chunk follows immediately, and yes, that is a space at the end. Four character codes always have 4 characters, even if the name is only 3 characters long. Four character codes do not include line feeds, just the ASCII text (not Unicode). The format chunk describes the format of the audio data.

The format chunk for this file, dissected, is…

Field Type  Description        Little endian  Big endian  Meaning
TAG         TagIdentifier      666D7420                   “fmt ”
ULONG       FormatChunkSize    10000000       00000010    16
USHORT      Format             0100           0001        PCM
USHORT      Channels           0200           0002        Stereo
ULONG       SamplesPerSecond   44AC0000       0000AC44    44,100
ULONG       AvgBytesPerSecond  10B10200       0002B110    176,400
USHORT      BlockAlign         0400           0004        4
USHORT      BitsPerSample      1000           0010        16
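To show how the table maps onto code, here is a small Python sketch that unpacks the 16 data bytes of the format chunk (offsets 0x14 through 0x23 in the dump). The `FmtChunk` tuple and its field names are my own naming, chosen to mirror the table.

```python
import struct
from collections import namedtuple

# Hypothetical container; field names follow the table's Description column.
FmtChunk = namedtuple(
    "FmtChunk",
    "format_tag channels samples_per_sec avg_bytes_per_sec block_align bits_per_sample",
)

# The 16 data bytes of the "fmt " chunk, copied from the hex dump.
fmt_data = bytes.fromhex("01000200" "44AC0000" "10B10200" "04001000")

# "<HHIIHH": USHORT, USHORT, ULONG, ULONG, USHORT, USHORT, all little endian.
fmt = FmtChunk(*struct.unpack("<HHIIHH", fmt_data))
print(fmt)

# For PCM the last fields are derivable from the first ones.
derived_block_align = fmt.channels * fmt.bits_per_sample // 8
derived_bytes_per_sec = fmt.samples_per_sec * derived_block_align
print(derived_block_align, derived_bytes_per_sec)   # 4 176400
```

For PCM data a parser can use these derived values as a sanity check against what the header claims.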

PCM (format “1”) is the most common value for format. A number of other formats are defined, primarily for compressed audio; these include Microsoft ADPCM as “2” and ITU G.711 a-law and u-law as 6 and 7. It is entirely possible for a “valid” wav file to be handed to an audio player that reports it has been given an audio format it does not understand, or one that no audio device in the machine knows how to process. PCM is “Pulse Code Modulation”: sound pressure levels measured by an analog to digital converter and stored into memory or a file with no compression.

After the format chunk is USUALLY the “data” chunk. For TADA.wav this is true: the data chunk starts at offset 0x24 and has the following contents.

Little endian  Big endian  Meaning
64617461                   “data”
005A0400       00045A00    Data size: 285,184
data           data        285,184 bytes of PCM data

Since the data chunk started at offset 0x24 (36 decimal), we can add 36, plus the size of the data chunk header (8), plus 285,184 (data size) to find the start of the first chunk beyond the data. Add those up and we get 36 + 8 + 285,184 = 285,228, which is the size of the file, so parsing is complete.

Everything inside the data chunk is PCM formatted data. In this case, that is 16 bits per sample, stereo, at 44,100 samples per second. This is CD audio format, stored in a WAV file.

Notice that the audio is stereo. “Block alignment” is the count of bytes needed to store one full sample for all channels. Here, 16 bits per sample = 2 bytes per sample, times 2 channels, equals a block alignment of 4 bytes.

The first PCM sample starts at offset 0x2C and it is 00000000. By convention, the left channel comes at the lower address, so in this example the left channel is the 16 bits (2 bytes) at 0x2C, which is 0x0000, and the right channel is 2 bytes later at 0x2E and is also 0x0000.

For PCM, 16-bit audio data is stored little endian (Intel format).

  • 16-bit PCM data is “signed” and zero represents silence (no sound pressure)
  • 8 bit PCM data is “unsigned” and the half way point at 0x80 represents silence (no sound pressure)
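The two conventions can be sketched in Python as follows. The helper names `decode_16bit` and `decode_8bit` are made up for this example, and the 8-bit helper recenters the values around zero purely for illustration.

```python
import struct

def decode_16bit(raw: bytes):
    """Decode little endian signed 16-bit PCM into a tuple of ints."""
    count = len(raw) // 2
    # "h" is a signed 16-bit integer; "<" selects little endian byte order.
    return struct.unpack("<%dh" % count, raw)

def decode_8bit(raw: bytes):
    """Decode unsigned 8-bit PCM, shifting so that silence (0x80) becomes 0."""
    return tuple(b - 0x80 for b in raw)

# The first stereo frame of TADA.WAV: left = 0, right = 0 (silence).
print(decode_16bit(bytes.fromhex("00000000")))    # (0, 0)

# Extremes of the signed 16-bit range, plus a small negative value.
print(decode_16bit(bytes.fromhex("FF7F008007FF")))   # (32767, -32768, -249)

# 8-bit: 0x80 is silence, 0xFF the loudest positive, 0x00 the loudest negative.
print(decode_8bit(bytes([0x80, 0xFF, 0x00])))     # (0, 127, -128)
```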

With a little bit of programming, you can plot this out and see sound waves, even sine waves if you look at a file containing a pure tone.
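As a rough sketch of that idea, the following Python writes a short pure tone into an in-memory WAV using the standard `wave` module, reads the samples back, and draws a crude text plot. The tone frequency, amplitude and plot width are arbitrary choices of mine; 441 Hz is used only because it gives a period of exactly 100 samples at 44,100 samples per second.

```python
import io
import math
import struct
import wave

RATE = 44100        # samples per second
FREQ = 441          # tone frequency in Hz (assumed; period is 100 samples)
AMPLITUDE = 10000   # peak value, well inside the signed 16-bit range

# Two periods of a sine wave as signed 16-bit integers.
samples = [round(AMPLITUDE * math.sin(2 * math.pi * FREQ * n / RATE))
           for n in range(200)]

# Write a mono, 16-bit WAV into memory instead of onto disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(RATE)
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read the PCM data back and decode it.
buf.seek(0)
with wave.open(buf, "rb") as r:
    raw = r.readframes(r.getnframes())
decoded = struct.unpack("<%dh" % (len(raw) // 2), raw)

# Crude plot: one line per fifth sample, '*' column follows the sample value.
for n in range(0, 100, 5):
    column = 30 + decoded[n] * 30 // AMPLITUDE
    print(" " * column + "*")
```

Turned on its side, the column of asterisks traces out one period of the sine wave.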

RIFF includes additional chunk definitions, and if a parser encounters a chunk type it does not understand, it should skip it and continue at the next chunk, located using the chunk length. Some wav file editors include provisions for adding copyright text chunks, for example, and these would be skipped by parsers during audio playback. The TADA.wav shipped with Windows contains no such chunks.
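Here is one way a chunk walker with that skipping behavior might look in Python. This is a sketch under stated assumptions, not a full parser: the helpers `make_chunk` and `walk_chunks` are my own, and the test file is synthetic, with a made-up “note” chunk of odd length standing in for a chunk type the player does not understand.

```python
import struct

def make_chunk(fourcc: bytes, data: bytes) -> bytes:
    """Build a chunk: id, little endian size, data, pad byte if size is odd."""
    pad = b"\x00" if len(data) % 2 else b""
    return fourcc + struct.pack("<I", len(data)) + data + pad

def walk_chunks(buf: bytes):
    """Return (fourcc, offset, size) for each sub-chunk of a RIFF/WAVE buffer."""
    riff, riff_size, form = struct.unpack_from("<4sI4s", buf, 0)
    if riff != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    end = 8 + riff_size          # first byte past the RIFF chunk
    offset = 12                  # sub-chunks start after "RIFF", size, "WAVE"
    found = []
    while offset + 8 <= end:
        fourcc, size = struct.unpack_from("<4sI", buf, offset)
        found.append((fourcc.decode("ascii"), offset, size))
        offset += 8 + size + (size & 1)   # skip header, data and pad byte
    return found

# Synthetic file: fmt chunk, a hypothetical 5-byte "note" chunk, 4 data bytes.
fmt_data = bytes.fromhex("0100020044AC000010B1020004001000")
body = (b"WAVE" + make_chunk(b"fmt ", fmt_data)
        + make_chunk(b"note", b"hello")
        + make_chunk(b"data", b"\x00" * 4))
wav = b"RIFF" + struct.pack("<I", len(body)) + body

chunks = walk_chunks(wav)
print(chunks)

# A player only needs these two; everything else is stepped over.
known = {"fmt ", "data"}
skipped = [name for name, _, _ in chunks if name not in known]
print(skipped)   # ['note']
```

Because the walker advances by the stored length plus the pad byte, it lands on “data” correctly even though the chunk before it has an odd size.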

Joe Nord

Originally posted Oct 27 2017

Comments

Comment from: Jake Visitor

Your text says 16-bit PCM data is “signed”. So, for example, looking at one channel of a 16-bit stereo PCM file, let's say I had two bytes (little endian) that were “07 FF” hex. I would reverse these two to get “FF 07” which would be 65527 in decimal. If it is “signed”, would not the max be +32768? I am confused as to how it is changed to signed.

11/25/17 @ 04:49 pm

Comment from: joe Member

> “FF 07″ which would be 65527
Close! Read as an unsigned number, FF 07 is 65,287; read as a signed 16-bit value, it is -249. This is pretty close to zero on a 15-bit scale, which means reasonably close to quiet. By 15-bit scale, I mean 15 bits of positive numbers and 15 bits of negative numbers. The top bit is the “sign”.

> I am confused as to how it is changed to signed.
The sign bit (most significant bit) is a 1 (negative) so that requires a bit more work.
Answer: take the 2’s complement of the data to find out how negative it is, that is, how far the value is from 0.

1) Invert all the bits. FF 07 becomes 00F8.
2) Add 1. 00F8 + 1 = 00F9
3) Convert to decimal. F = 15. 15*16 = 240. 240 + 9 = 249
4) Change the sign. -249
