[The following transcript is more for the techies of my readership. For those of a less technical inclination, feel free to wait for the next post on “Active Design Ideas” which I have separated out due to the length of this post.]
I am now going to ground this discussion in an example software architecture by considering some design problems that I have experienced in designing a multi-threaded video player pipeline. The issues I highlight apply generally to many video player designs.
The following image is a highly simplified top-level schematic; the original was just an A4 pdf printed from a whiteboard, which I find much better than trying to work out designs using a computer-based UML drawing tool. The gross motor movement of hand drawing “in the large” seems to help the thinking process.
There are three basic commands for controlling any video player that has random access along a video timeline:
- Play
- Stop
- Show a frame
In this example there is a main controller thread that handles the commands and controls the whole pipeline. I am going to conveniently ignore the hard problem of actually reading anything off a disk fast enough to keep a high-resolution, high frame-rate player fed with data!
The first operation for the pipeline is to render the display frames in parallel. The results of these parallel operations, since they will likely be produced out of order, need to be made into an ordered image stream that can then be buffered ahead to cope with any operating system latencies. The buffered images are then transferred into an output video card, which has only a relatively small amount of video frame storage. This of course needs to be modeled in the software so that (a) you know when the card is full; and (b) you know when to switch the right frame to the output without producing nasty image tearing artefacts.
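To make the “ordered image stream” stage concrete, here is a minimal sketch of how out-of-order rendered frames can be reassembled into sequence. This is my own illustration, not the original code; the class and method names are invented for the example.

```python
import heapq

class FrameReorderer:
    """Collects frames that may arrive out of order and releases
    them strictly in frame-number sequence (a sketch of the
    'ordered image stream' stage described above)."""

    def __init__(self, first_frame_no=0):
        self._next = first_frame_no   # next frame number to release
        self._pending = []            # min-heap of (frame_no, frame)

    def push(self, frame_no, frame):
        """Accept a rendered frame; return the (possibly empty) list
        of frames that are now ready, in order."""
        heapq.heappush(self._pending, (frame_no, frame))
        ready = []
        while self._pending and self._pending[0][0] == self._next:
            ready.append(heapq.heappop(self._pending)[1])
            self._next += 1
        return ready
```

So if frame 2 arrives first it is simply held back; once frames 0 and 1 turn up, the whole run is released in order and can be buffered ahead.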
These are all standard elements you will get with many video player designs, but I want to highlight three design issues that I experienced in order to get an understanding of what I will later term an “Organising Principle”.
First there was slow operation resulting in non real-time playout. Second, occasionally you would get hanging playout or stuttering frames. Third, you could very occasionally get frame jitter on stopping.
Slow Operation

Given what I said about Goethe and his concept of Delicate Empiricism, the very first thing to do was to reproduce the problem and collect data, i.e. measure the phenomenon WITHOUT jumping to conclusions. In this case it required the development of logging instrumentation software within the system – implemented in a way that did not disturb the real-time operation.
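As an illustration of instrumentation that barely disturbs the system under test, here is a minimal sketch (my own, not the original instrumentation): events go into a preallocated ring buffer on the hot path, with all formatting and output deferred until after the run.

```python
import time

class RingLog:
    """Low-disturbance instrumentation sketch: the hot path does one
    tuple store into a preallocated ring buffer -- no I/O, no string
    formatting -- so real-time behaviour is barely perturbed."""

    def __init__(self, capacity=4096):
        self._buf = [None] * capacity
        self._cap = capacity
        self._n = 0  # total events logged

    def log(self, tag, value):
        # Cheap on the hot path: overwrite the oldest slot.
        self._buf[self._n % self._cap] = (time.perf_counter(), tag, value)
        self._n += 1

    def dump(self):
        # Called after the run: format the retained events, oldest first.
        events = [e for e in self._buf if e is not None]
        events.sort(key=lambda e: e[0])
        return ["%.6f %s=%r" % e for e in events]
```

The ring buffer deliberately drops the oldest events when full; for a stuck pipeline it is usually the most recent few thousand events you want anyway.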
With this problem I initially found that the image processing threads were taking too long, although once they had their data they were completing their processing in time. So the slowdown was happening BEFORE they could start their processing.
The processing relied on some fairly large processing control structures that were built from some controlling metadata. Since this build process could take some time, these structures were cached, with access keyed by that metadata, which was a much smaller structure. Accessing this cache could occasionally take a long time, giving the slow operation that seemed to come from the image processing threads. The cache had only one mutex in its original design, and this mutex was taken both for accessing the cache key and for building the data structure item. Thus when thread A was reading the cache to get at an already built data item, it would occasionally block behind thread B, which was building a new data item: the single mutex stayed locked for too long while thread B built the new item and put it into the cache.
So now I knew exactly where the problem was. Notice the difference between the original assumption of the problem being with the image processing, rather than with the cache access.
It would have been all too easy to jump to an erroneous conclusion, a mistake especially prevalent in the Journeyman phase, and go in and change what was thought to be the problem. Although such a change would not actually fix the real problem, it could alter the behaviour and timing so that the problem no longer presented itself, thus looking like it was fixed. But then it might resurface 3 to 6 months later, a costly and damaging process for any business.
The solution here was to have finer-grained mutexes: one for the key access into the cache and a separate one for each data item, which was then lazily built on first access.
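The shape of that fix can be sketched as follows. This is an illustrative reconstruction, not the original code: a short-lived lock guards only the key lookup, while each entry carries its own lock, so a slow build blocks only threads that want that same entry.

```python
import threading

class LazyBuildCache:
    """Finer-grained locking sketch: the key lock is held only for the
    dictionary lookup; the expensive build happens under a per-entry
    lock, so readers of OTHER keys are never blocked behind a build."""

    class _Entry:
        __slots__ = ("lock", "value", "built")
        def __init__(self):
            self.lock = threading.Lock()
            self.value = None
            self.built = False

    def __init__(self, build_fn):
        self._build_fn = build_fn        # expensive structure builder
        self._key_lock = threading.Lock()
        self._entries = {}

    def get(self, metadata_key):
        # Fast path: take the key lock only long enough to find or
        # insert an (empty) entry placeholder.
        with self._key_lock:
            entry = self._entries.setdefault(metadata_key, self._Entry())
        # Slow path: lazily build under the per-entry lock on first access.
        with entry.lock:
            if not entry.built:
                entry.value = self._build_fn(metadata_key)
                entry.built = True
        return entry.value
```

With the original single-mutex design, every `get` would have serialised behind the build; here a thread reading an already built item for a different key proceeds immediately.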
Hanging Playout or Stuttering Frames
The second case was either hanging or stuttering playout. This is a great example because it illustrates a principle that we need to learn when dealing with any streamed playout system.
The measurement in this case was extremely ‘old school’, simply by printing data to a log output file, although of course only a few chars per frame, because at 60fps (a typical modern frame-rate) you only have 16ms per frame.
In this case what was happening was that the streaming at the output end of the pipeline was getting out of order. Depending upon how the implementation was done, it would either lock the whole player or get a stuttered playout. Finding the cause of this took a lot of analysis of the output logs and many changes to what was being logged.
What I found was that there was an extra ‘hidden’ thread added within the output card handling in order to thread off some of the other pre-processing that needed to happen, BUT there was no enforcement of frame streaming order. This meant that the (relatively) small amount of memory in the output card could become fully allocated while there was a gap in the frame ordering. It was then impossible to fill that gap with the correct frame when it eventually came along, because there was no room left in the output card to put that frame, usually resulting in a playout hang.
This is why, with a streaming pipeline where you always have limited resources at some level, allocation of those resources MUST be done in streaming order. This is a dynamic principle that can take a lot of experience to learn.
The usual Journeyman approach to such a problem is just to add more memory, i.e. more resource! This just hides the problem: processing is still being done out of order, but the increased spare capacity means it will not go wrong until you next modify the system to use more resource. At that point the following statement is usually made:
“But this has been working ok for years!”
One of the things I am always saying to less experienced programmers when trying to debug such problems is:
“Do not change any of the existing functionality.
Disturb the system as little as possible.
Keep the bug reproducible so you
can measure what is happening.
Then you will truly know when you have fixed the fault.”
Frame Jitter on Stop
The third case was one of frame jitter when stopping playout. The problem was that although the various buffers would get cleared, there could still be some frames “in flight” in the handover threads. This is a classic multi-threading problem and one that needs careful thought.
In this case, when it came time to show the frame at the current position, an existing playout had to be stopped and the correct frame processed for output. This correct frame for the current position would make its way through to the end of the pipeline, but could get queued behind a remnant frame from the playout. This remnant frame would most likely have been ahead of the stop position because of the pre-buffering that needed to take place. Then when it came time to re-enable the output frame viewing in order to show the correct frame, both frames would get displayed, the playout remnant one first. This manifested as the frame jitter.
One likely fix from an inexperienced programmer would be to make the system sit around waiting for seconds while the buffers were cleared, and possibly cleared again, just in case! (The truly awful “sleep” fix.) This is one of those cases where, again due to a lack of deep analysis, a defensive programming strategy is used to try to force a fix of what initially seems to be the problem. Again, such a change may well SEEM to fix the problem, and this is especially likely if the person is under pressure.
The final solution to this particular problem was to use the concept of unique “command ids”. Thus each command from the controlling thread, whether it was a play request or a show frame request, would get a unique id. This id was then tagged on to each frame as it was passed through the pipeline. Then by using a globally accessible “valid command id set” the various parts of the pipeline could decide if they had a valid frame that could be allowed through, or could be quietly ignored.
When stopping playout, all that had to be done was to clear the buffers and remove the relevant id from the “valid id set”; any pesky remaining “in flight” frames would then be ignored, since they carried an invalid id. This changed the stop behaviour from an occasional, yet persistent, bug into a completely reliable operation.
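The command-id mechanism can be sketched like this (an illustration under my own naming, not the original implementation): the controller mints a unique id per command, frames carry that id through the pipeline, and any stage can cheaply check whether a frame is still wanted.

```python
import threading

class CommandIdFilter:
    """Sketch of the 'valid command id set': each controller command
    (play, show frame, ...) gets a unique id; frames are tagged with
    it, and pipeline stages quietly drop any frame whose id has been
    invalidated, e.g. by a stop."""

    def __init__(self):
        self._lock = threading.Lock()   # shared across pipeline threads
        self._valid = set()
        self._next_id = 0

    def new_command(self):
        with self._lock:
            cmd_id = self._next_id
            self._next_id += 1
            self._valid.add(cmd_id)
            return cmd_id

    def cancel(self, cmd_id):
        # On stop: clear the buffers, then cancel the id so that any
        # 'in flight' frames still in handover threads are ignored.
        with self._lock:
            self._valid.discard(cmd_id)

    def is_valid(self, cmd_id):
        with self._lock:
            return cmd_id in self._valid
```

Each pipeline stage then guards its output with something like `if ids.is_valid(frame.cmd_id): pass_on(frame)`, so no stage ever needs to sleep or guess whether a remnant frame might still arrive.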
In the next post I will recap the above human process of finding and fixing the problems.