Unmodified output is half the size of the input

I am just getting started ripping Blu-ray discs to back up my collection. I am trying to use mkvmerge to write the chapter data from the playlist into the MKV. When I do, the output file is about half the size of the input file.

I decided to simply “copy” the input like so.

mkvmerge -o copy.mkv original.mkv

The effect appears here too. So for now, I’m going to set the chapter discussion aside and focus only on the file size difference.

For reference, the input file is 1.5 GiB, the output file is 697.8 MiB.

I haven’t found any way to track down where the missing bytes have gone.

These are the routes I have pursued.

  1. The input file has four tracks. Each has one tag. “mkvmerge -i” reports no tags on the output. Could there be 800 MiB in those four tags? I doubt it, but I haven’t been able to confirm it. ffprobe only shows a “Duration” tag on the original, but that is present on the copy as well, so I don’t know what the tag is.
  2. I found the other thread here about decreased file sizes, but disabling compression yields an output file of 699.5 MiB. That’s not likely to be the issue.
  3. If I repeat the process on the copy, the output appears to be identical. It’s not throwing away half the data each time, there’s something about the input.
  4. I checked the stream statistics in VLC player and confirmed that the total bitrate of the copy is about half that of the original. Visually, I can’t see a difference.
  5. I am also planning to re-encode the video to a more efficient codec for storage. When I do this, I get a bunch of “Error submitting packet to decoder: Invalid data found when processing input” and “Not a valid DCA frame” errors, but the output works perfectly. When I use my encode command on the copy, the output appears to be the same, but I don’t get those errors. Is it possible the original has 800 MiB of junk data? How could I confirm that?
  6. In theory, if the stream is unchanged, the decoded frames should be byte-for-byte identical, right? Is there a way I could check this?
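One possible way to check point 6 is to have ffmpeg decode the video track of both files and print one MD5 hash per decoded frame (its framemd5 output format), then compare the hash lists. The following is only a rough sketch, assuming ffmpeg is installed and on the PATH; the helper names (frame_hashes, parse_framemd5, first_mismatch) are made up for illustration.

```python
import subprocess
import sys


def parse_framemd5(text):
    """Extract the hash column (last comma-separated field) from framemd5 output."""
    hashes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' header lines
        hashes.append(line.rsplit(",", 1)[-1].strip())
    return hashes


def frame_hashes(path, stream="0:v:0"):
    """Decode one stream with ffmpeg and return its per-frame MD5 hashes."""
    result = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", path, "-map", stream, "-f", "framemd5", "-"],
        capture_output=True, text=True, check=True,
    )
    return parse_framemd5(result.stdout)


def first_mismatch(a, b):
    """Return the index of the first differing frame, or None if the lists match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one file has extra frames
    return None


if __name__ == "__main__" and len(sys.argv) == 3:
    idx = first_mismatch(frame_hashes(sys.argv[1]), frame_hashes(sys.argv[2]))
    if idx is None:
        print("decoded video frames are identical")
    else:
        print(f"first difference at frame {idx}")
```

Saved under a name of your choice and run with original.mkv and copy.mkv as its two arguments, it reports either that the decoded frames match or the index of the first frame that differs. Only the hash column is compared, so differing timestamps between the two files would not cause a false mismatch.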

I’m kind of out of my depth here, I don’t know what to do next. My intuition says everything is fine. There’s no way that mkvmerge is re-encoding anything and visually, it looks identical. But I am deeply curious here, and there is a small nagging doubt that something is amiss.

Can anyone point me in the right direction?

Welcome!

Without access to the original file it’s pretty hard to accurately say why this is happening. I can offer you my best guess, but it’s still a guess.

Blu-rays and other formats based on physical rotating discs (primarily DVDs) have to deal with the fact that readers for those discs are slow to spin up the disc when they start to read & to spin down when they stop. This is not only slow (often more than one second for each operation), the change in rotation speed also causes a very noticeable change in noise level and noise quality. We humans notice these noise changes very easily and are usually disturbed by them, unlike the constant noise of a disc rotating continuously at constant speed.

For that reason players have to avoid spinning up/down all the time & try to maintain constant reading speed. The easiest way to achieve that is by having enough “content” (as in: file size) in the file so that the operating system has to keep the disc spinning just to read the amount of data requested by the software reading the stream. Unfortunately, often enough the encoded movie material does not actually require that much space. Therefore the authors of the Blu-ray specs created a way to keep garbage data in the bitstream just to meet a minimum size-per-duration requirement. This data is called “filler data”, or more precisely, “filler NALUs”, with a NALU (Network Abstraction Layer unit) being the basic unit of an H.264/H.265 video bitstream that a decoder operates on.

In Matroska these filler NALUs are usually not kept as they contain no data required for the actual decoding. They usually contain nothing but padding bytes. mkvmerge always throws away filler NALUs. For certain movies the amount of filler NALUs is very, very high, and 50% is not unheard of.

What makes your case somewhat strange is that you normally don’t see filler NALUs in Matroska files, as they usually aren’t copied over from the MPEG TS files, and your source file is a Matroska file. However, it’s possible that your source Matroska file was created by a different program, one that doesn’t throw out filler NALUs.

The reported bitrate is nothing more than the track size or file size divided by the track duration/file duration. It doesn’t distinguish between content required for decoding & content not required for decoding (filler NALUs), as to do that it would have to read the whole file & analyze for filler NALUs before displaying the statistics — something no player ever does.
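To illustrate that arithmetic, here is a small sketch that divides file size by duration. It uses the rounded figures from this thread (1.53 GiB, 698 MiB, 7 min 21 s) as inputs; the function name is made up for illustration, and a real player may round or measure slightly differently.

```python
def average_bitrate_mbps(size_bytes, duration_seconds):
    """Average bitrate in megabits per second: total bits divided by duration."""
    return size_bytes * 8 / duration_seconds / 1_000_000


GIB = 1024 ** 3
MIB = 1024 ** 2
duration = 7 * 60 + 21  # 7 min 21 s = 441 s

original = average_bitrate_mbps(1.53 * GIB, duration)
copy = average_bitrate_mbps(698 * MIB, duration)
print(f"original: {original:.1f} Mb/s, copy: {copy:.1f} Mb/s")
```

With these rounded inputs the result is roughly 29.8 Mb/s for the original and 13.3 Mb/s for the copy. Nothing in the calculation looks at what the bytes actually contain, so padding counts just as much as picture data.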

If you’re indeed dealing with filler NALUs, then it’s no wonder the player reports half the bitrate after throwing them out.

Without access to the source file I cannot even guess what kind of “tags” you’re talking about, to be honest.

If you want me to take a closer look at the source file & provide better answers, upload the file to my file server, please.

Thank you so much for your generous help! I uploaded my file “playlist4.mkv” under the “1526” directory.

I did not know that Blu-ray discs were encoded that way, but that does make a lot of sense.

I am new to this process, it’s very possible that I am ripping the files incorrectly. I am using ffmpeg like this.

$ ffmpeg -playlist {playlist_number} -i bluray:/dev/sr0 -map 0 -c copy -f matroska playlist{playlist_number}.mkv

ffmpeg doesn’t understand Blu-ray chapter markings, which is what led me to MKVToolNix in the first place.

Should I be using MKVToolNix to extract the playlist and skip ffmpeg altogether? That certainly would be easier. Right now, my plan is to rip with ffmpeg and then merge in the chapter information. If I can do it in one step, that seems better.

I’m interested in understanding this more fully. Could you point me to some resources on how to detect filler NALUs?


I think the tags are unlikely to be the cause, but for completeness’ sake, this is what I mean.

$ mkvmerge -i  playlist4.mkv
File 'playlist4.mkv': container: Matroska
Track ID 0: video (AVC/H.264/MPEG-4p10)
Track ID 1: audio (DTS-HD Master Audio)
Track ID 2: audio (DTS-HD Master Audio)
Track ID 3: subtitles (HDMV PGS)
Global tags: 1 entry
Tags for track ID 0: 1 entry
Tags for track ID 1: 1 entry
Tags for track ID 2: 1 entry
Tags for track ID 3: 1 entry
$ mkvmerge -i  copy.mkv
File 'copy.mkv': container: Matroska
Track ID 0: video (AVC/H.264/MPEG-4p10)
Track ID 1: audio (DTS-HD Master Audio)
Track ID 2: audio (DTS-HD Master Audio)
Track ID 3: subtitles (HDMV PGS)
Global tags: 1 entry

Thank you again for your time and effort, I really appreciate it!

Great. I’ll be away for a couple of days. Therefore I won’t be able to look further into the file before the weekend. I’ll postpone a more in-depth reply until then. Just to let you know.

No worries, thanks for sharing!

One thing that could point you in the right direction would be to look at each in MediaInfo; it may show what’s missing. One small point is mkvmerge will compress PGS subtitles by default.

Here’s a mediainfo output from the original and the copy.

$ mediainfo playlist4.mkv
General
Unique ID                                : 272360721755454291234801274342783604739 (0xCCE6C4582BA569614302F3C208885803)
Complete name                            : playlist4.mkv
Format                                   : Matroska
Format version                           : Version 4
File size                                : 1.53 GiB
Duration                                 : 7 min 21 s
Overall bit rate mode                    : Variable
Overall bit rate                         : 29.7 Mb/s
Frame rate                               : 23.976 FPS
Writing application                      : Lavf62.3.100
Writing library                          : Lavf62.3.100
ErrorDetectionType                       : Per level 1

Video
ID                                       : 1
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : High@L4.1
Format settings                          : CABAC / 4 Ref Frames
Format settings, CABAC                   : Yes
Format settings, Reference frames        : 4 frames
Format settings, GOP                     : M=3, N=12
Format settings, Slice count             : 4 slices per frame
Codec ID                                 : V_MPEG4/ISO/AVC
Duration                                 : 7 min 21 s
Bit rate mode                            : Constant
Nominal bit rate                         : 23.9 Mb/s
Width                                    : 1 920 pixels
Height                                   : 1 080 pixels
Display aspect ratio                     : 16:9
Frame rate mode                          : Constant
Frame rate                               : 23.976 (24000/1001) FPS
Standard                                 : NTSC
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Bits/(Pixel*Frame)                       : 0.481
Time code of first frame                 : 10:00:00:12
Default                                  : No
Forced                                   : No
Color range                              : Limited
Color primaries                          : BT.709
Transfer characteristics                 : BT.709
Matrix coefficients                      : BT.709

Audio #1
ID                                       : 2
Format                                   : DTS XLL
Format/Info                              : Digital Theater Systems
Commercial name                          : DTS-HD Master Audio
Codec ID                                 : A_DTS
Duration                                 : 7 min 21 s
Bit rate mode                            : Variable
Channel(s)                               : 6 channels
Channel layout                           : C L R Ls Rs LFE
Sampling rate                            : 48.0 kHz
Frame rate                               : 93.750 FPS (512 SPF)
Bit depth                                : 24 bits
Compression mode                         : Lossless
Default                                  : Yes
Forced                                   : No

Audio #2
ID                                       : 3
Format                                   : DTS XLL
Format/Info                              : Digital Theater Systems
Commercial name                          : DTS-HD Master Audio
Codec ID                                 : A_DTS
Duration                                 : 7 min 21 s
Bit rate mode                            : Variable
Channel(s)                               : 2 channels
Channel layout                           : L R
Sampling rate                            : 48.0 kHz
Frame rate                               : 93.750 FPS (512 SPF)
Bit depth                                : 24 bits
Compression mode                         : Lossless
Default                                  : No
Forced                                   : No

Text
ID                                       : 4
Format                                   : PGS
Codec ID                                 : S_HDMV/PGS
Codec ID/Info                            : Picture based subtitle format used on BDs/HD-DVDs
Duration                                 : 7 min 6 s
Default                                  : No
Forced                                   : No


$ mediainfo copy.mkv
General
Unique ID                                : 174756179175502441734281746257327198078 (0x8378D0DB285AC4C2F9D7FC7FF1208F7E)
Complete name                            : copy.mkv
Format                                   : Matroska
Format version                           : Version 4
File size                                : 698 MiB
Duration                                 : 7 min 21 s
Overall bit rate mode                    : Variable
Overall bit rate                         : 13.3 Mb/s
Frame rate                               : 23.976 FPS
Encoded date                             : 2026-01-27 05:21:12 UTC
Writing application                      : mkvmerge 96.0 ('It's My Life') 64-bit
Writing library                          : libebml v1.4.5 + libmatroska v1.7.1 / Lavf62.3.100

Video
ID                                       : 1
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : High@L4.1
Format settings                          : CABAC / 4 Ref Frames
Format settings, CABAC                   : Yes
Format settings, Reference frames        : 4 frames
Format settings, GOP                     : M=3, N=12
Format settings, Slice count             : 4 slices per frame
Codec ID                                 : V_MPEG4/ISO/AVC
Duration                                 : 7 min 21 s
Bit rate mode                            : Constant
Bit rate                                 : 7 417 kb/s
Nominal bit rate                         : 23.9 Mb/s
Width                                    : 1 920 pixels
Height                                   : 1 080 pixels
Display aspect ratio                     : 16:9
Frame rate mode                          : Constant
Frame rate                               : 23.976 (24000/1001) FPS
Standard                                 : NTSC
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Bits/(Pixel*Frame)                       : 0.149
Time code of first frame                 : 10:00:00:12
Stream size                              : 390 MiB (56%)
Default                                  : No
Forced                                   : No
Color range                              : Limited
Color primaries                          : BT.709
Transfer characteristics                 : BT.709
Matrix coefficients                      : BT.709

Audio #1
ID                                       : 2
Format                                   : DTS XLL
Format/Info                              : Digital Theater Systems
Commercial name                          : DTS-HD Master Audio
Codec ID                                 : A_DTS
Duration                                 : 7 min 20 s
Bit rate mode                            : Variable
Bit rate                                 : 3 730 kb/s
Channel(s)                               : 6 channels
Channel layout                           : C L R Ls Rs LFE
Sampling rate                            : 48.0 kHz
Frame rate                               : 93.750 FPS (512 SPF)
Bit depth                                : 24 bits
Compression mode                         : Lossless
Stream size                              : 196 MiB (28%)
Default                                  : Yes
Forced                                   : No

Audio #2
ID                                       : 3
Format                                   : DTS XLL
Format/Info                              : Digital Theater Systems
Commercial name                          : DTS-HD Master Audio
Codec ID                                 : A_DTS
Duration                                 : 7 min 20 s
Bit rate mode                            : Variable
Bit rate                                 : 2 056 kb/s
Channel(s)                               : 2 channels
Channel layout                           : L R
Sampling rate                            : 48.0 kHz
Frame rate                               : 93.750 FPS (512 SPF)
Bit depth                                : 24 bits
Compression mode                         : Lossless
Stream size                              : 108 MiB (15%)
Default                                  : No
Forced                                   : No

Text
ID                                       : 4
Format                                   : PGS
Muxing mode                              : zlib
Codec ID                                 : S_HDMV/PGS
Codec ID/Info                            : Picture based subtitle format used on BDs/HD-DVDs
Duration                                 : 7 min 4 s
Bit rate                                 : 92.2 kb/s
Frame rate                               : 0.792 FPS
Count of elements                        : 336
Stream size                              : 4.67 MiB (1%)
Default                                  : No
Forced                                   : No

I didn’t check every field, but they look pretty much the same except for the overall bit rate, but as discussed, that counts any junk data, so it doesn’t mean much. It is interesting that the “Nominal bit rate” is identical, I’m not sure how that’s calculated. It’s also interesting that the original does not report the size of each stream. There may be an answer in here, but it’s not immediately obvious. I will do some more research tonight.

I made reference to subtitle compression in my original post. I didn’t explicitly mention subtitles, though, my apologies. If I use --compression 4:none, the size increases by a few MiB, nothing that gets me close to the original size. I don’t think that’s the issue.

Sorry for the late reply; I’ve returned from my trip with quite the cold. Still recovering.

Anyway, a quick look at your file confirmed that it’s full of filler NALUs:

$ xyzvc_dump p.h264 | rg filler | wc -l
9357
$ xyzvc_dump p.h264 | rg filler | awk 'BEGIN { size = 0 } /filler/ { size += $6 } END { print size / 1024.0 / 1024.0 / 1024.0 }'
0.8448

You don’t have to understand the actual commands in detail. What they do is:

  1. The first one counts the number of filler NALUs present in the video bitstream
  2. The second sums up their total size & outputs the size in GiB

So yeah. Roughly 0.84 GiB (about 865 MiB) of your 1.53 GiB file is just junk/filler data.

All in all there’s nothing to worry about here; it’s all working as intended.

In case you’re interested what I did:

  1. Identified the file content (track types etc.) with mkvmerge -J playlist4.mkv
  2. Extracted the video track into an h.264 elementary stream with mkvextract playlist4.mkv tracks 0:p.h264
  3. Ran a helper tool for parsing H.264/H.265 streams & dumping their NALU types called xyzvc_dump p.h264; it’s part of MKVToolNix, but I’m not sure if it’s packaged. If you’re using any of my Linux packages then it is part of them; on Windows it’s in the tools sub-directory of the installation directory; not sure about the macOS DMG at the moment
  4. xyzvc_dump outputs a single line per NALU found with an easy-to-understand format
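For anyone without xyzvc_dump at hand, a similar count can be approximated with a short script. This is only a sketch, not how xyzvc_dump works internally. It assumes the extracted p.h264 is an H.264 elementary stream with start codes (Annex B), where each NALU follows a 00 00 01 start code and the low five bits of its first byte give the NALU type, type 12 being filler data. H.265 uses a different header layout and is not handled. The name filler_stats is made up for illustration.

```python
import mmap
import sys

FILLER_NAL_TYPE = 12        # H.264 nal_unit_type for filler data
START_CODE = b"\x00\x00\x01"


def filler_stats(data):
    """Return (count, total_bytes) of filler NAL units in an Annex B H.264 stream."""
    count = 0
    total = 0
    pos = data.find(START_CODE)
    while pos != -1:
        header = pos + 3                        # first byte after the start code
        nxt = data.find(START_CODE, header)     # where the next NAL unit begins
        end = len(data) if nxt == -1 else nxt
        # Zero bytes before the next start code belong to that start code,
        # not to this NAL unit, so leave them out of the size.
        while end > header and data[end - 1] == 0:
            end -= 1
        if header < len(data) and data[header] & 0x1F == FILLER_NAL_TYPE:
            count += 1
            total += end - header
        pos = nxt
    return count, total


if __name__ == "__main__" and len(sys.argv) == 2:
    # Memory-map the file so a multi-gigabyte stream is not read into RAM at once.
    with open(sys.argv[1], "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        count, total = filler_stats(data)
    print(f"{count} filler NAL units, {total / 1024 ** 3:.4f} GiB")
```

The sizes counted here include the one-byte NALU header but not the start code, so the total may differ slightly from what xyzvc_dump reports.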

That’s great to hear! Thank you so much for digging into this for me. I also appreciate the explanation of the process.

I do hope you feel better soon! Whenever you feel well, absolutely no rush at all, I have a couple of followup questions.

  1. I’m using Arch Linux, I installed the mkvtoolnix-cli package from the Extra repository. It doesn’t look like it includes xyzvc_dump. That looks like a super useful tool, I’d love to get more familiar with it. It doesn’t look like you directly maintain that. I may submit a request to the Arch maintainers to include this utility, or add a package like mkvtoolnix-debug to house it.

Am I missing something obvious?

  2. Is there a different way I should rip the files to avoid the junk? Does it matter?

Since it’s junk data, I expect that when I re-encode the video, it will be lost, so it probably doesn’t matter. I am curious if there is a better solution than what I currently use. As I mentioned in the OP, I stumbled into this problem while trying to mux in chapter markers, so if a different solution fixes both, that would be pretty cool.

My current understanding is that playlists should be ripped, not tracks. Playlists are what the players actually play to viewers, which can be comprised of many tracks, especially for movies. I’m using ffmpeg like so.

$ ffmpeg -playlist {playlist_number} -i bluray:/dev/sr0 -map 0 -c copy -f matroska playlist{playlist_number}.mkv

I’ve reviewed the mkvmerge docs and I don’t see any similar capability.

Is this the best way, or is there a different way I’m missing?

Thanks again!

Thanks! I already do, at least partially, otherwise I wouldn’t have been bored enough to look into it yet.

No, you’re correct. I do not maintain the Arch Linux packages, even though Arch Linux is my main development platform. I just don’t have to, as their maintainers are pretty quick with their updates, and as it’s a rolling-release distro, users don’t have to wait six months or more for new versions to come out.

I honestly cannot tell you the best way to rip stuff. For Blu-rays I usually use MakeMKV due to its ability to handle encrypted discs. I do not know off the top of my head if MakeMKV removes filler NALUs.

I definitely do not know if ffmpeg does — I never use it for ripping Blu-rays. Judging from your sample file, I’d say it doesn’t, at least not out of the box.

Does it matter? Only insofar as the resulting file will be larger or smaller. It doesn’t have any effect on players as all players that support the various H.26? codecs must support filler NALUs (by skipping them).

That’s correct. For MKVToolNix as well, not just for others. Therefore you should only use the MPLS files as inputs in MKVToolNix (from unencrypted Blu-rays only, though; you can create unencrypted variants with MakeMKV before processing them with MKVToolNix). MKVToolNix will always remove filler NALUs.

When in doubt run MKVToolNix GUI, add either the index.bdmv from the BDMV directory or any of the .mpls files from the PLAYLIST sub-directory, and it’ll offer to scan all existing playlists & show you a handy dialog listing all of the relevant ones with the files they reference, their respective length, number & types of tracks etc. You can then take a look at the command line MKVToolNix GUI calls mkvmerge with via the “Multiplexer” menu → “Show command line”.