Feature request / Request for comment: Additional append mode for mkvmerge

Dear all …

Prologue

According to the mkvmerge documentation, there are currently two possible append modes: file and track. I would like to ask whether it would be possible to create an additional mode that works like described below.

Video production and file formats are not my field, so please don’t hesitate to let me know if my idea is bad. Since this post might become a bit lengthy, I’ll explain my motivation separately if somebody is interested; in any case, I believe that I have a good reason to ask for it.

Definitions

Before explaining how it should actually work, we need some definitions, similar to the explanation of --append-mode in the mkvmerge documentation:

We have two input files (e.g. M2TS), called infile1 and infile2, and we want to append infile2 to infile1. Each input file contains the same number and types of tracks. Let’s say that this is one video track with a frame rate of 24000/1001 frames per second and one audio track with a frame rate of 31.25 frames per second. The video and audio tracks in infile1 are called in1_v and in1_a, respectively, and the video and audio tracks in infile2 are called in2_v and in2_a.

The timestamps of the last frames in in1_v and in1_a are called ts_in1_vlast and ts_in1_alast, respectively. The timestamps of the first frames in in2_v and in2_a are called ts_in2_vfirst and ts_in2_afirst.

The duration of one video frame is called v_dur, and the duration of one audio frame is called a_dur.

As everybody knows, the problem with joining multimedia files is that audio and video in the first file almost never end at the same time. For this example, let’s assume that in1_a ends 20 ms later than in1_v. That is, if we calculate (ts_in1_vlast + v_dur) and (ts_in1_alast + a_dur), the latter result is 20 ms greater than the former.
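To make the 20 ms figure concrete, here is a small worked computation in Python. The frame durations follow directly from the frame rates defined above; the two last-frame timestamps are hypothetical values chosen only for illustration:

```python
# Frame durations implied by the frame rates defined above, in ms:
v_dur = 1000 * 1001 / 24000   # one video frame, ≈ 41.7083 ms
a_dur = 1000 / 31.25          # one audio frame, exactly 32 ms

# Hypothetical last-frame timestamps (ms), chosen so that the audio
# track of infile1 ends 20 ms after its video track:
ts_in1_vlast = 9990.0
ts_in1_alast = 10019.708333333334

gap = (ts_in1_alast + a_dur) - (ts_in1_vlast + v_dur)
print(round(gap, 6))          # 20.0
```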

The MKV output file is called outfile.

Request

We would like to ask for a new, additional append mode, possibly called videoref, that works as follows:

  1. Completely convert infile1 to MKV, writing it to outfile and copying the timestamps of each frame from infile1 without changing their values; nothing special happens here.
  2. Compute (ts_in1_vlast + v_dur).
  3. Append a new cluster to outfile and assign it the timestamp that has been computed in step 2.
  4. Get the first video frame from in2_v, put it into the new cluster created in step 3, and assign it the timestamp that has been computed in step 2. That video frame and the cluster containing it then have the same timestamp.
  5. Compute (ts_in2_afirst - ts_in2_vfirst) (this is usually 0).
  6. Get the first audio frame from in2_a, put it into the new cluster created in step 3, and assign it that cluster’s timestamp plus the offset computed in step 5. When that offset is 0, the new cluster, its first video frame, and its first audio frame all have the same timestamp.
  7. From then on, continue to copy the video and audio frames from infile2 into outfile, computing the new timestamps in outfile as follows:
    a) Compute the time offset of that frame relative to ts_in2_vfirst.
    b) The timestamp of that frame in outfile is then the timestamp computed in step 2 plus the offset computed in step 7a).
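The timestamp recomputation in the steps above could be sketched roughly like this (illustrative Python, not mkvmerge code; the frame representation and all names here are hypothetical):

```python
# Rough sketch of the proposed "videoref" append mode. Frames are
# (track, timestamp_ms) pairs; this is a toy model, not mkvmerge's
# actual data structures.

def append_videoref(in1_frames, in2_frames, v_dur):
    """Return infile1's frames followed by infile2's frames, with
    infile2 re-anchored to the end of infile1's video track."""
    out_frames = list(in1_frames)          # step 1: copy unchanged

    # Step 2: the new cluster's timestamp is where in1_v ends.
    ts_in1_vlast = max(ts for trk, ts in in1_frames if trk == "v")
    cluster_ts = ts_in1_vlast + v_dur

    # Steps 4-7: shift every frame of infile2 so that its first video
    # frame lands exactly on cluster_ts; each audio frame keeps its
    # original offset relative to that video frame (steps 5 and 6).
    ts_in2_vfirst = min(ts for trk, ts in in2_frames if trk == "v")
    for trk, ts in in2_frames:
        out_frames.append((trk, cluster_ts + (ts - ts_in2_vfirst)))
    return out_frames
```

Note that any audio/video end-time mismatch at the end of infile1 is simply left alone: only the video track determines where infile2 starts, which is exactly the point of the proposal.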

Of course, this works the same way when joining further input files or when there are more audio tracks.

The basic idea behind this is the following:

When we have multiple M2TS files that we want to concatenate into an MKV file, we have perfect audio/video synchronization within each single input file. The only way to keep that perfect synchronization is to re-synchronize the audio tracks with the video track, in the way shown above, each time a new M2TS input file begins.

With the method shown above, we can join an arbitrary number of input files (hundreds or thousands) without the risk of even the slightest desynchronization. I have never seen another approach at the demuxer/muxer level that can guarantee this.

The proposed name videoref for the new append mode stems from the fact that the video track is taken as the reference when the muxer timestamps are re-synchronized during the transition from the end of one input file to the beginning of the next.

Epilogue

I believe that it wouldn’t be too difficult to implement this request. Nearly everything needed should already be there: to implement the existing append modes, there must be code that reads timestamps from frames in the input files and processes them. It’s just another algorithm for computing the new timestamps that would need to be implemented.

Maybe too naive … what do you think about it?

Thank you very much in advance, and best regards!

You obviously don’t know the source code, otherwise you wouldn’t make such a claim. On the other hand, if it isn’t that difficult to implement, why not provide a patch that does so?

Yeah, I’m a bit salty when I read such statements. It isn’t that easy. And I’m not interested.

Thank you very much for the fast reply!

OK, thank you very much. That’s a clear answer.

And you are right, I don’t know the source code. Perhaps I’ll try to understand where the concatenation magic happens as a first step. But I fear that I wouldn’t be able to begin a patch without a bit of help. Currently, I’m probably not even able to spot the right place :slight_smile:

In any case, I didn’t want to say that the development of MKVToolnix was or is easy. If my statements gave that impression, I apologize. I just thought that, at the places where the current concatenation techniques are implemented, every required piece of information (notably every required timestamp) is eventually already accessible in the form of variables.

Best regards, and thanks again!

Apology accepted, and I apologize in turn for being testy. You just happened to trigger me, 'cause the phrase “this should be easy to implement” or some variant of it is something I’ve heard over and over again. It just gets tiresome to have to explain why things are rather difficult to implement, 'cause oftentimes the difficulty lies in the intricacies of a certain topic, and explaining those for audiences who don’t have the experience & insight into that topic almost becomes as difficult as implementing it in the first place.