Streamie, RTSP Fragmentation Unit Corruption and H.264 Dropped Frames

 

Overview

 

This article is the result of a customer support incident in which Streamie was being used with Ubiquiti UniFi Protect. It also covers an overlapping second issue involving Reolink Duo WiFi cameras.

 

History

 

A quick bit of history regarding Streamie. I wrote a while back about Streamie's so-called secret sauce, which explains the transition from Live555 to FFmpeg and finally to Krill (our home-grown RTSP streamer). With each change, the motivation was compatibility over standards compliance. There are many cheap (and even expensive) cameras out there that have their own interpretations of various standards. Ultimately, we just want video to appear on the screen. Telling the customer "your camera is broken" won't drive demand for our product.

 

Fragmentation Units

 

When transmitted, video frames (strictly speaking, the H.264 NAL units that carry them) are broken up into what are called fragmentation units whenever they are too large for a single RTP packet. The standard abbreviation, FU, is one I particularly enjoy. Each FU has two flags associated with it, start and end, which I'll abbreviate "S" and "E". An FU with neither flag set falls in the middle. A simple example:

 

Frame 1: S(1), E(0) <== start

Frame 1: S(0), E(0) <== middle

Frame 1: S(0), E(0) <== middle

Frame 1: S(0), E(1) <== end
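To make the S and E flags concrete, here is a minimal sketch of reading them from an FU-A packet as laid out in RFC 6184. This is not Krill's code; the names `FuHeader` and `parse_fu_a` are made up for illustration, and the sketch assumes the RTP header has already been stripped off.

```python
from dataclasses import dataclass

FU_A_TYPE = 28  # NAL unit type that marks an FU-A payload (RFC 6184)


@dataclass
class FuHeader:
    start: bool    # S flag: first fragment of a NAL unit
    end: bool      # E flag: last fragment of a NAL unit
    nal_type: int  # type of the NAL unit being fragmented (5 = IDR slice)


def parse_fu_a(payload: bytes) -> FuHeader:
    """Read the FU indicator and FU header bytes of an FU-A RTP payload."""
    if len(payload) < 2:
        raise ValueError("an FU-A payload needs at least 2 header bytes")
    indicator, header = payload[0], payload[1]
    if indicator & 0x1F != FU_A_TYPE:
        raise ValueError("not an FU-A packet")
    return FuHeader(
        start=bool(header & 0x80),  # top bit of the FU header
        end=bool(header & 0x40),    # second bit of the FU header
        nal_type=header & 0x1F,     # low five bits
    )


first = parse_fu_a(bytes([0x7C, 0x85, 0xAA]))
middle = parse_fu_a(bytes([0x7C, 0x05, 0xBB]))
last = parse_fu_a(bytes([0x7C, 0x45, 0xCC]))
```

Here `first` has only S set, `middle` has neither flag set, and `last` has only E set, matching the start / middle / end pattern in the example.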

 

One otherwise peaceful evening, I had just finished mowing an acre of grass (which is mostly green weeds), when I saw that I had a text from a Streamie customer.

 

“Do you have time for a phone call?”

 

Yes, our business customers can call or text to get engineering-level support.

 

During the phone call I learned that the latest beta of Streamie was broken. That's not overly surprising, but maybe a little surprising since that build had been performing well in testing. As an aside, we try to stress Streamie far more than a typical customer ever would, so that we can find the scalability problems first. Anyway, the customer went on to explain that a months-old version of Streamie that had been performing flawlessly was identically broken. That was actually good news, because I was confident that whatever was causing the problem wasn't directly a Streamie bug. I'm good at breaking stuff, but I can't retroactively break a working build. Maybe one day.... But even if the problem isn't a Streamie problem, I can't just tell the customer "it's not my problem." Something is broken. If we can't figure out how to fix it, maybe we can improve Streamie to handle it better.

 

Streamie has a built-in debugging feature that customers can enable. Once enabled, extensive logs are recorded and then securely transmitted after the problem is reproduced by the customer. We are able to access those logs and determine next steps.

 

The specific issue was frequent -12909 decoding errors produced by Apple's VideoToolbox framework while decoding the H.264 video streams from the UniFi Protect cameras. With logs in hand, I started scrolling through them in hopes of finding a smoking gun. Some very granular (per-frame) debugging features are disabled for production builds, so these logs gave no indication of what was causing the decoder to freak out.

 

Meanwhile, the customer discovered a way to reliably reproduce the issue. 1) Start streaming a bunch of cameras on one device. Everything looks fine. 2) On a 2nd device, start streaming a bunch of cameras. Almost immediately, streams on both devices would fail. I enabled some more debugging features (there are diminishing returns: too much debugging info obscures the problem), produced another build, waited impatiently for TestFlight to process it, and had the customer repeat the testing procedure. We went through this process more than a few times as I narrowed the scope of the issue.

 

Eventually though, I discovered the issue. With per-fragmentation-unit debugging enabled, I saw this:

 

Frame 1: S(1), E(0) <== start

Frame 1: S(0), E(0) <== middle

Frame 1: S(0), E(0) <== middle

Frame 2: S(1), E(0) <== unexpected start of frame 2

Frame 2: S(0), E(0) <== middle

Frame 2: S(0), E(0) <== middle

Frame 2: S(0), E(1) <== end

...

 

Krill had naively (although until now, safely) assumed fragmentation units would not be corrupt. I briefly looked into fixing the data, but quickly determined that Frame 1 was truncated. I also couldn't reproduce the issue on my Protect system, so each iteration of the investigation required a lot of assistance from the customer. So, without being able to fix the problem, I needed Streamie to toughen up and deal with it. To accomplish this, I changed Krill to check for broken FUs and signal that all subsequent frames, up to the next key frame, should be dropped. This causes a brief image "freeze", but within a second or two everything continues playing, without upsetting the video decoder and without alarming the user.

 

Code 001
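As a rough sketch of that idea (an illustration under my own assumptions, not the Krill implementation), the checker below watches the S and E flags, flags a truncated unit when a new start arrives before the previous end, and then suppresses output until a key frame completes. `FuReassembler` and `push` are hypothetical names, `data` is assumed to be the FU payload after its two header bytes, and only fragmented units are handled.

```python
class FuReassembler:
    """Joins FU fragments and drops output after a broken sequence."""

    IDR_TYPE = 5  # H.264 NAL unit type of a key frame (IDR) slice

    def __init__(self):
        self.fragments = None  # list of chunks while mid-unit, else None
        self.nal_type = None
        self.dropping = False  # True until the next key frame completes

    def push(self, start, end, nal_type, data):
        """Feed one FU; return (nal_type, bytes) for a usable unit, else None."""
        if start:
            if self.fragments is not None:
                # The previous unit never saw its end flag: it is truncated.
                self.dropping = True
            self.fragments, self.nal_type = [], nal_type
        elif self.fragments is None:
            # Middle or end fragment with no start: the start was lost.
            self.dropping = True
            return None
        self.fragments.append(data)
        if not end:
            return None
        unit, kind = b"".join(self.fragments), self.nal_type
        self.fragments = None
        if kind == self.IDR_TYPE:
            self.dropping = False  # a complete key frame ends the drop window
        return None if self.dropping else (kind, unit)


r = FuReassembler()
results = [
    r.push(True, False, 1, b"a"),   # frame 1 start
    r.push(False, False, 1, b"b"),  # frame 1 middle
    r.push(True, False, 1, b"c"),   # unexpected start of frame 2
    r.push(False, True, 1, b"d"),   # frame 2 end (dropped)
    r.push(True, False, 5, b"K"),   # key frame start
    r.push(False, True, 5, b"F"),   # key frame end (delivered)
]
```

Only the last call returns a unit. Frame 2 is complete, but it depends on the truncated frame 1, so it is withheld from the decoder.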

 

H.264 SPS & Dropped Frames

 

I've been on the hunt for a great wifi camera for ages. It appears to be one of those things humanity just can't get right. Yet. Plenty of companies make nearly perfect PoE cameras, but apparently once you mix in wifi, the B-Team runs the show.

 

As my hunt continued, I discovered (and wrote about) the Reolink Duo WiFi. It has a unique aesthetic and decent specs: RTSP, ONVIF, people/vehicle detection, 2.4GHz and 5GHz wifi and a 150 degree field of view (due to the dual lenses). Honestly, the only spec it is lacking is H.265, which is a pretty big shortcoming, but whatever. Could I finally have a wifi camera to recommend to customers?

 

I had previously discovered an annoying Reolink firmware bug related to the ONVIF unique identifiers their cameras used. I complained about it on Twitter and a few days later I was testing out beta firmware with all of my suggestions in place and working. Truly incredible. With that in mind, I documented everything I'm about to discuss and contacted them with the details. These problems are more difficult to investigate, so who knows what the result will be.

 

The setup process with the Reolink app is easier than that of most of their competitors. Using Streamie to discover the ONVIF service, connect to the camera and choose the left / right streams was similarly painless. And that's exactly where the fun ended. While the Reolink app itself (using its proprietary Baichuan protocol) was able to stream flawlessly from the camera, Streamie was not. In addition to the stream not being smooth, there were frequent -12909 decoder errors.

 

Once again, I gradually enabled various granular debugging points, but I couldn't find anything suspicious. I wanted to look deeper into the H.264 stream itself using [h264bitstream](https://github.com/aizvorski/h264bitstream), so I needed to write out a .264 file with the stream data. After a few tries, I got it working:

 

Code 002
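For readers who want to try something similar, here is a minimal sketch of dumping reassembled NAL units as an Annex B byte stream, which is the layout a raw .264 file normally uses. It is written under my own assumptions rather than copied from Streamie: `rebuild_nal_header` and `write_annex_b` are hypothetical names, and each item in `nal_units` is assumed to be a complete NAL unit including its one-byte header.

```python
import io

START_CODE = b"\x00\x00\x00\x01"  # Annex B start code placed before each NAL unit


def rebuild_nal_header(fu_indicator: int, fu_header: int) -> int:
    """Recreate the original NAL header byte from the two FU-A header bytes.

    The F and NRI bits come from the FU indicator and the NAL unit type
    comes from the FU header (RFC 6184).
    """
    return (fu_indicator & 0xE0) | (fu_header & 0x1F)


def write_annex_b(nal_units, out) -> int:
    """Write each NAL unit after a start code; return the bytes written."""
    written = 0
    for nal in nal_units:
        written += out.write(START_CODE)
        written += out.write(nal)
    return written


# In-memory stand-in for open("dump.264", "wb"), with two tiny fake units.
buf = io.BytesIO()
total = write_annex_b([bytes([0x67, 0x42]), bytes([0x65, 0x88])], buf)
```

A fragmented unit has to get its header byte back before it is written, which is what `rebuild_nal_header` is for: the FU-A bytes `0x7C, 0x85` turn back into the familiar `0x65` header of an IDR slice.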

 

What I saw right away was dropped frames.

 

!! Found NAL at offset 10105696 (0x9A3360), size 19522 (0x4C42)

1.1: sh->frame_num: 24

!! Found NAL at offset 10125222 (0x9A7FA6), size 19728 (0x4D10)

1.1: sh->frame_num: 25

No 26?

!! Found NAL at offset 10144954 (0x9ACCBA), size 19420 (0x4BDC)

1.1: sh->frame_num: 27

!! Found NAL at offset 10164378 (0x9B189A), size 19539 (0x4C53)

1.1: sh->frame_num: 28

 

Not knowing enough about H.264 encoding, I assumed dropped frames were at fault for the decoder issues. If I could parse each frame number, then maybe I could detect dropped frames? And maybe that would improve stability? It turns out you can't simply parse the frame number. Instead, you first have to find out how many bits are in the frame number, as specified by the Sequence Parameter Set (SPS). Specifically, we want log2_max_frame_num_minus4; the frame number is that value plus 4 bits wide:

 

Code 003
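The sketch below shows one way to pull that field out of an SPS, following the syntax order in the H.264 specification. It is an illustration rather than the code Streamie ships: `BitReader`, `strip_emulation_prevention`, `skip_scaling_list` and `frame_num_bits_from_sps` are hypothetical names, and the input is assumed to be a complete SPS NAL unit including its header byte.

```python
HIGH_PROFILES = {100, 110, 122, 244, 44, 83, 86, 118, 128, 138, 139, 134, 135}


class BitReader:
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0  # pos counts bits, not bytes

    def u(self, n: int) -> int:
        """Read n bits as an unsigned integer."""
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value

    def ue(self) -> int:
        """Read an unsigned Exp-Golomb value."""
        zeros = 0
        while self.u(1) == 0:
            zeros += 1
        return (1 << zeros) - 1 + self.u(zeros)

    def se(self) -> int:
        """Read a signed Exp-Golomb value."""
        k = self.ue()
        return (k + 1) // 2 if k % 2 else -(k // 2)


def strip_emulation_prevention(nal: bytes) -> bytes:
    """Remove the 0x03 byte that follows any two zero bytes."""
    out, zeros = bytearray(), 0
    for b in nal:
        if zeros >= 2 and b == 3:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def skip_scaling_list(r: BitReader, size: int) -> None:
    last_scale = next_scale = 8
    for _ in range(size):
        if next_scale != 0:
            next_scale = (last_scale + r.se() + 256) % 256
        last_scale = next_scale if next_scale != 0 else last_scale


def frame_num_bits_from_sps(sps_nal: bytes) -> int:
    """Return the width of frame_num: log2_max_frame_num_minus4 + 4."""
    r = BitReader(strip_emulation_prevention(sps_nal))
    if r.u(8) & 0x1F != 7:
        raise ValueError("not an SPS NAL unit")
    profile_idc = r.u(8)
    r.u(16)  # constraint flags, reserved bits and level_idc
    r.ue()   # seq_parameter_set_id
    if profile_idc in HIGH_PROFILES:
        chroma_format_idc = r.ue()
        if chroma_format_idc == 3:
            r.u(1)  # separate_colour_plane_flag
        r.ue()  # bit_depth_luma_minus8
        r.ue()  # bit_depth_chroma_minus8
        r.u(1)  # qpprime_y_zero_transform_bypass_flag
        if r.u(1):  # seq_scaling_matrix_present_flag
            for i in range(8 if chroma_format_idc != 3 else 12):
                if r.u(1):  # seq_scaling_list_present_flag[i]
                    skip_scaling_list(r, 16 if i < 6 else 64)
    return r.ue() + 4


# Hand-built Baseline SPS prefix with log2_max_frame_num_minus4 = 4.
bits = frame_num_bits_from_sps(bytes([0x67, 0x42, 0x00, 0x1E, 0x94]))
```

The awkward part is that the field sits behind a variable number of Exp-Golomb values, and for the High-family profiles behind optional scaling lists as well, so there is no fixed byte offset to read from.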

 

And with that, you can parse a dependent frame and find out its position in the Group of Pictures:

 

Code 004
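Here is a hedged sketch of that second step: reading `frame_num` from a slice header once its bit width is known, plus a simple gap check. The function names are made up for illustration, and the code assumes the stream does not use separate colour planes (otherwise a 2-bit colour_plane_id precedes frame_num). A gap is not necessarily an error, since consecutive pictures can legitimately share a frame number.

```python
def read_ue(bits: str, pos: int):
    """Read an unsigned Exp-Golomb value from a '0'/'1' string.

    Returns (value, position of the next unread bit).
    """
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def slice_frame_num(slice_nal: bytes, frame_num_bits: int) -> int:
    """Extract frame_num from a slice NAL unit (header byte included)."""
    # Only the first few bytes matter; undo emulation prevention on them.
    head = slice_nal[1:33].replace(b"\x00\x00\x03", b"\x00\x00")
    bits = "".join(f"{b:08b}" for b in head)
    pos = 0
    for _ in range(3):  # first_mb_in_slice, slice_type, pic_parameter_set_id
        _, pos = read_ue(bits, pos)
    return int(bits[pos:pos + frame_num_bits], 2)


def is_gap(prev: int, cur: int, frame_num_bits: int) -> bool:
    """True if cur is neither prev nor prev + 1 (modulo the wrap point)."""
    return cur not in (prev, (prev + 1) % (1 << frame_num_bits))


# Hand-built P-slice header: first_mb=0, slice_type=5, pps_id=0, frame_num=27.
frame_num = slice_frame_num(bytes([0x41, 0x9A, 0x36]), 8)
missing_26 = is_gap(25, frame_num, 8)
```

With an 8-bit frame number, going from 25 to 27 is reported as a gap, while 24 to 25 and the wrap from 255 to 0 are not.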

 

After all that work, it turns out that dropped frames are perfectly fine. Well, at least I got to learn some interesting things. Ultimately, it turned out that I needed more robust fragmentation unit corruption handling. In the back of my mind, despite years of reliable streaming, I can't shake the thought that maybe my IO code is to blame?

 

 
