Let’s assume you have two files you want to append, each with an audio & a video track. For the sake of simplicity let’s assume the video track is recorded at 25 FPS (meaning a duration of 40ms per frame), and the audio track has a frame duration of 33ms.
Let’s further assume that the first audio & video frames of the second file start at 0ms.
Now back to the first file: let’s assume the video track has 5 frames. This means that the last frame’s timestamp is (5 - 1) * 40ms = 160ms; the video track’s duration is the last frame’s timestamp + one frame’s duration → 160ms + 40ms = 200ms (or 5 * 40ms).
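The timestamp arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, assuming equally long frames and a first frame at 0ms; the function names are made up for this example and have nothing to do with mkvmerge’s internals.

```python
def frame_timestamps(frame_count: int, frame_duration_ms: int) -> list[int]:
    """Start timestamps in ms for equally long frames, first frame at 0ms."""
    return [index * frame_duration_ms for index in range(frame_count)]


def track_end(frame_count: int, frame_duration_ms: int) -> int:
    """End of the track: the last frame's timestamp plus one frame duration."""
    return frame_timestamps(frame_count, frame_duration_ms)[-1] + frame_duration_ms


# Video track of the first file: 5 frames of 40ms each.
video_timestamps = frame_timestamps(5, 40)
video_end = track_end(5, 40)

print(video_timestamps)  # [0, 40, 80, 120, 160]
print(video_end)         # 200
```

The same two functions work for any other track with a constant frame duration.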
Similarly for the audio track, let’s assume 6 frames, meaning the last one starts at 5 * 33ms = 165ms & ends at 6 * 33ms = 198ms. That leaves a gap of 2ms between the end of the audio track & the end of the video track.
Now the question is: how many ms does mkvmerge add to the timestamps of the audio & video frames from the second file?
With the default append mode, file, mkvmerge uses the duration of the whole file. This is the maximum of the end timestamps of all the frames in the first file; in this case, the per-track maximums are 198ms (audio track) & 200ms (video track), ergo the file’s duration is 200ms. This is the value mkvmerge will add to all timestamps coming from the second file.
For the video track this means that there’s no gap in the playback as the first frame from the second file will start right when the last frame of the first file ends (both at 200ms). This isn’t true for the audio track, though: the last frame of the first file ends at 198ms, but the first frame of the second file will be played at 200ms, leaving a gap of 2ms for which there’ll be no content to play.
Players handle such gaps differently, often by shortening the duration of the video frames displayed, as gaps in audio playback are much more noticeable than one sped-up video frame.
With the alternative append mode, track, mkvmerge will not use the first file’s duration for that offset. Instead it’ll ensure that each track has a continuous stream of frames. In other words, while the handling of the video track will stay the same, an offset of 198ms instead of 200ms will be used for the audio track, ensuring there’s no gap.
You might wonder why mkvmerge doesn’t do this by default. The answer is that users often append content that is synchronized properly within each source file. This synchronization is lost in track append mode. Put differently: if you watch the second file on its own, the first audio & the first video frames must start at the same timestamp in order for their content to appear synchronized. If you use track append mode with such a file, they would not start at the same timestamp but 2ms apart (audio earlier than video). Therefore the content of the second file would appear not to be synchronized when watching the appended file.
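To compare the two modes side by side, here’s a small Python model of the offset calculation described above. It’s a sketch of the arithmetic only, not mkvmerge’s actual implementation; the function names and the dictionary layout are assumptions made for this example. The inputs are the first file’s track ends: 5 * 40ms for video and 6 * 33ms for audio.

```python
def append_offsets(track_ends: dict[str, int], mode: str) -> dict[str, int]:
    """Offset in ms added to the second file's timestamps, per track."""
    if mode == "file":
        # One offset for all tracks: the duration of the whole first file.
        file_duration = max(track_ends.values())
        return {name: file_duration for name in track_ends}
    if mode == "track":
        # Each track continues exactly where it ended in the first file.
        return dict(track_ends)
    raise ValueError(f"unknown append mode: {mode}")


def gaps(track_ends: dict[str, int], offsets: dict[str, int]) -> dict[str, int]:
    """Silence/freeze in ms between the end of file 1 and the start of file 2."""
    return {name: offsets[name] - track_ends[name] for name in track_ends}


first_file = {"video": 5 * 40, "audio": 6 * 33}

file_offsets = append_offsets(first_file, "file")
track_offsets = append_offsets(first_file, "track")

file_gaps = gaps(first_file, file_offsets)
track_gaps = gaps(first_file, track_offsets)

# How far apart the second file's first audio & video frames end up (ms).
file_skew = file_offsets["video"] - file_offsets["audio"]
track_skew = track_offsets["video"] - track_offsets["audio"]

print(file_gaps, file_skew)
print(track_gaps, track_skew)
```

In this model file mode keeps the second file’s audio & video aligned with each other (skew 0) at the cost of a gap in the shorter track, while track mode removes the gap and introduces a skew of the same size instead.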
Files like that are appended far more often than files that are the result of a splitting operation done with mkvmerge, which is the case track mode is better suited for.
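If you do want to pick the mode explicitly, mkvmerge takes it via the --append-mode option, and a + between two source files tells it to append the second to the first. The following Python sketch only builds such a command line; the helper name and the file names are placeholders for this example.

```python
def build_append_command(output: str, first: str, second: str,
                         append_mode: str = "file") -> list[str]:
    """Build an mkvmerge argument list that appends `second` to `first`."""
    if append_mode not in ("file", "track"):
        raise ValueError(f"unknown append mode: {append_mode}")
    return [
        "mkvmerge",
        "--append-mode", append_mode,  # "file" is mkvmerge's default
        "-o", output,
        first, "+", second,            # "+" means: append the next file
    ]


# Placeholder file names, e.g. two parts created by an earlier split.
command = build_append_command("joined.mkv", "part1.mkv", "part2.mkv", "track")
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) on a system where mkvmerge is installed.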