Using GCC 2.6.3 under BSDI/OS 2.0 on an AMD 486DX4-100 CPU, the Fano decoder runs at up to 375 kb/s while the Viterbi decoder runs at a constant speed of 75.1 kb/s. On a 90 MHz Intel Pentium, the speeds are 594 kb/s (Fano) and 118 kb/s (Viterbi). This is fast enough to be useful in many amateur packet radio applications.
The code is available over the Internet. See this web page for details.
Each set of tapped data is summed modulo-2 (XORed) and becomes one of the encoded output symbols. For example, in the K=7 rate 1/2 code I use in my Viterbi decoder, user data is clocked into the 7-stage shift register one bit at a time. The first encoded symbol is then formed by modulo-2 summing the outputs of shift register stages 1, 3, 4, 6 and 7, while the second symbol is formed by modulo-2 summing stages 1, 2, 3, 4 and 7. Then the next data bit is clocked in and the cycle repeats.
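To make the tap arrangement concrete, here is a hypothetical encoder sketch in C. The names and the bit-ordering convention (stage 1 held in the low-order bit of the register, so the two tap sets become the masks 0x6d and 0x4f) are mine for illustration, not necessarily those of the actual package:

    static int parity(unsigned long x)  /* modulo-2 sum of the bits of x */
    {
        int p = 0;

        while (x != 0) {
            p ^= (int)(x & 1);
            x >>= 1;
        }
        return p;
    }

    /* Clock one data bit (0 or 1) into the 7-stage register; emit 2 symbols */
    void encode_bit(unsigned long *sr, int bit, int sym[2])
    {
        *sr = ((*sr << 1) | (unsigned long)bit) & 0x7f;
        sym[0] = parity(*sr & 0x6d);    /* taps at stages 1,3,4,6,7 */
        sym[1] = parity(*sr & 0x4f);    /* taps at stages 1,2,3,4,7 */
    }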
Convolutional encoders are extremely simple; that's one reason they're so popular on deep space probes. Convolutional decoding takes more work. Basically, all convolutional decoders work by generating local hypotheses about the originally transmitted data, running those hypotheses through a local copy of the encoder, and comparing the encoded results with the (noisy) encoded data that is actually received. A decoder can pursue one hypothesis at a time as in the sequential decoder, or it can pursue many in parallel as in the Viterbi decoder. Both decoders compute metrics as estimates of the "closeness of fit" between the received and locally regenerated symbols, and they provide cumulative path metrics with the decoded packet as a useful estimate of the decoder's "confidence" in the result.
Unlike block codes, convolutional decoders can easily use soft decision information from the demodulator about the quality of each received symbol. They easily handle variable packet sizes such as those found in computer networks, and they're fairly easy to implement in software (though they don't necessarily run fast!).
This is not to say that convolutional codes are always superior to block codes. Block codes have some complementary advantages, such as the ability to handle larger data blocks and a greater ability to handle burst errors. The two are sometimes concatenated to combine their strengths, as in the Voyager and Galileo spacecraft.
But one has to start somewhere, so I implemented some convolutional decoders to see if they could run fast enough on personal computers to be useful in amateur packet radio. I believe I have been successful.
I use BSDI/OS, a commercial port of Berkeley (BSD) UNIX to the 32-bit Intel processors (386/486/Pentium) and a close cousin of FreeBSD and NetBSD. BSDI supports a pure protected-mode 32-bit environment that is much more efficient for FEC than the 16-bit "real mode" provided for DOS backward compatibility.
The performance gains from a 32-bit environment are simply too great to pass up, so I assume one is available. Given that the 386 is already virtually obsolete and the 486 is now on its way out, this doesn't seem like an unreasonable assumption. My code does not assume BSDI; it should run well in other 32-bit environments, including 32-bit DOS extenders. Since it is written in portable C, it should also execute in 16- and 8-bit environments, but I expect it would be rather slow.
I chose a code with a constraint length K of 32. Sequential decoders can easily handle such large constraint lengths, and 32 was a natural choice given the CPU word size. I originally chose the NASA Standard polynomials used by the Pioneer 10 and 11 probes. I then found two other polynomials in the literature with slightly better performance, so I added these as compile-time options.
Sequential decoders have been heavily analyzed, and their behavior is well understood. As might be expected, decoding time is a random variable that depends strongly on the quality of the received signal. When the Eb/N0 is several dB above the decoder threshold (2.5 dB for this code), decoding quickly proceeds at a nearly constant rate (see Figure 1). Near the threshold, the mean decoding time and its variance increase (see Figure 2). Below the threshold the variance becomes infinite and many packets "time out" the decoder (see Figure 3).
An important figure of merit for any Fano decoder is its "computational rate", the speed at which it can examine a branch in the code tree and make a decision to move forward, backward or sideways in the tree. On a "clean" packet the decoder rapidly moves forward, but a noisy packet causes many backward and sideways moves that slow down the process.
The main part of the Fano decoder is a loop that executes on every forward motion. Embedded in this loop is a smaller loop that does backward motion when needed. Optimizing this code was fairly straightforward: it consisted mainly of "factoring out" operations from the loop whenever possible so they execute only once per packet no matter how many moves the decoder makes. For example, all possible metrics for each branch (four for this code) are computed before entering the loop, even though only two are actually examined on any particular forward decoder motion. Although this represents a little extra work on a clean packet, it helps substantially on a noisy packet by avoiding repeated recomputation of branch metrics.
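A sketch of this precomputation, under the assumption of a soft-decision metric table indexed by hypothesized symbol and received 8-bit sample (all names here are illustrative, not the decoder's own):

    /* Before the decoder starts moving, compute all four branch metrics
     * (one per hypothesized symbol pair 00,01,10,11) for every received
     * symbol pair in the packet. mettab[s][r] is assumed to be a
     * soft-decision metric table indexed by hypothesized symbol s and
     * received 8-bit sample r. */
    void precompute_metrics(const unsigned char *rx, int nbits,
                            long metrics[][4], long mettab[2][256])
    {
        int i, h;

        for (i = 0; i < nbits; i++)
            for (h = 0; h < 4; h++)     /* 4 hypotheses per branch */
                metrics[i][h] = mettab[h >> 1][rx[2*i]]
                              + mettab[h & 1][rx[2*i + 1]];
    }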
Another optimization is to stash quite a bit of information into the data structure that represents a node in the decoding tree. Each node element includes the decoder state and cumulative metric along with the list of all possible branch metrics just mentioned. This lets the decoder move rapidly backward by simply decrementing a pointer, without having to recompute anything. This improves performance on noisy packets at the cost of extra memory and a slight reduction of decoding speed on clean packets.
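In outline, a tree node might look like this (a sketch; the field names are my own):

    struct node {
        unsigned long encstate;  /* encoder replica state: last 32 data bits */
        long gamma;              /* cumulative path metric at this node */
        long metrics[4];         /* all four precomputed branch metrics */
    };

    /* Backward motion is then just "nodep--"; everything needed to
     * resume at the previous node is already sitting there. */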
In principle, a maximum-likelihood decoder would compare the received sequence against every possible transmitted sequence and pick the closest fit. The decoder can't actually do this, of course, since the number of possible data sequences doubles with each additional data bit. But the Viterbi decoder recognizes at each point that certain sequences cannot possibly belong to the maximum likelihood path, and it discards them. This happens because the encoder memory is limited to K bits; a Viterbi decoder in steady-state operation keeps only 2^(K-1) paths.
The complexity of a Viterbi decoder therefore increases exponentially with the constraint length K. Larger K's do perform better against noise, but a Viterbi decoder for a K=32 code is just not practical. For my Viterbi decoder I chose the rate 1/2 K=7 NASA standard convolutional code used on Voyager and in many commercial satellite systems.
This shorter code has less coding gain than the K=32 sequentially decoded code, but Viterbi decoding has the practical advantage of executing at a constant speed regardless of the received signal-to-noise ratio. This makes it attractive for real-time, delay-limited applications that can tolerate some uncorrected errors, e.g., digital speech. Viterbi decoding is also more tolerant of metric table and receiver AGC errors than sequential decoding is. And the inherent parallelism of the Viterbi decoder makes it easy to implement in hardware, an important consideration when contemplating very high speeds. It is not yet clear, though, that these advantages outweigh the slower average execution speed when implemented in software in a packet radio system. Experimentation with both is warranted.
The structure of the Viterbi decoder is very different from the Fano decoder. The Fano decoder consists of several pages of fairly "random" C code. But the Viterbi decoder consists almost entirely of a simple Add/Compare/Select (ACS) operation that is executed repeatedly for each decoded data bit -- 64 times per bit for a K=7 code.
As the name implies, an ACS operation adds a branch metric to each of two cumulative path metrics that converge into a node, compares them and selects the "surviving" (best) path. This implements the "discard paths that can't possibly be right" feature at the heart of Viterbi decoding.
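Here is a hedged sketch of a single ACS in C. The state-numbering convention, under which both transitions into a given new state carry the same data bit, is an assumption of mine, as are all the names:

    struct pstate {
        long metric;                /* cumulative path metric */
        unsigned long path;         /* last 32 decoded bits along this path */
    };

    /* One ACS: extend the two paths converging on a new state, keep the
     * better one. Under the convention assumed here both transitions
     * carry the same new data bit, so the comparison decides only whose
     * 32-bit path history word is copied forward. */
    void acs(struct pstate *ns,                 /* new state to fill in */
             const struct pstate *p0, long m0,  /* predecessor, branch metric */
             const struct pstate *p1, long m1,
             int bit)                           /* data bit of this transition */
    {
        long t0 = p0->metric + m0;              /* add */
        long t1 = p1->metric + m1;
        const struct pstate *surv = (t0 > t1) ? p0 : p1;   /* compare */

        ns->metric = (t0 > t1) ? t0 : t1;       /* select */
        ns->path = (surv->path << 1) | (unsigned long)bit;
    }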
By analogy, consider a highway routing problem. If you determine that the route from Baltimore to New York via Philadelphia is shorter (better) than the route that goes through Pittsburgh, then you don't really have to consider the Pittsburgh route at all when extending your trip beyond New York to Boston. The same thing happens in distance-vector routing algorithms such as RIP and NET/ROM: each node only propagates its best routes to its neighbors.
An important design parameter for a Viterbi decoder is the path memory length. An ideal decoder would keep every possible path in memory and delay a final decision about the first bits of the packet until the very end. But this isn't really necessary. Analysis and simulation have shown that several constraint lengths back, the paths usually merge into a single maximum-likelihood candidate. So a Viterbi decoder that retains 4-5 constraint lengths of decoded data will achieve nearly the same uncorrected error rate as an ideal decoder.
For a K=7 code, a 32-bit path memory is a nice match to a 32-bit word size. The ACS operation can record the surviving path at each node by simply copying the single machine word containing it. This is fast and efficient since 32-bit operations on a 32-bit machine are no more costly than smaller operations.
ACS operations are often paired into "butterflies", so called because of their appearance on a trellis diagram. There are 32 butterflies per data bit for a K=7 code. As you might guess, the butterfly macro was the focus of almost all of our optimization efforts.
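Here is what such a butterfly might look like for this code. This is my illustration, not the actual macro, but the symmetry it exploits is real: since both tap sets include stages 1 and 7, flipping either the oldest register bit or the newest data bit complements both output symbols, so each butterfly touches only two of the four branch metrics, mets[sym] and mets[3^sym]:

    /* cmetric/cpath point at the current metric and path arrays (64
     * entries each), nmetric/npath at the arrays being built for the
     * next bit time; mets[] holds the four branch metrics for the
     * current received symbol pair. Old states i and i+32 both feed
     * new states 2i and 2i+1. */
    #define BUTTERFLY(i, sym) {                                           \
        long m0, m1;                                                      \
                                                                          \
        m0 = cmetric[i] + mets[sym];            /* new state 2i */        \
        m1 = cmetric[(i)+32] + mets[3 ^ (sym)];                           \
        if (m0 > m1) {                                                    \
            nmetric[2*(i)] = m0; npath[2*(i)] = cpath[i] << 1;            \
        } else {                                                          \
            nmetric[2*(i)] = m1; npath[2*(i)] = cpath[(i)+32] << 1;       \
        }                                                                 \
        m0 = cmetric[i] + mets[3 ^ (sym)];      /* new state 2i+1 */      \
        m1 = cmetric[(i)+32] + mets[sym];                                 \
        if (m0 > m1) {                                                    \
            nmetric[2*(i)+1] = m0; npath[2*(i)+1] = (cpath[i] << 1) | 1;  \
        } else {                                                          \
            nmetric[2*(i)+1] = m1; npath[2*(i)+1] = (cpath[(i)+32] << 1) | 1; \
        }                                                                 \
    }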
Optimization focused on two goals: minimizing the number of CPU cycles per butterfly and maximizing the L1 (on-chip) cache hit ratio. This latter goal is especially important on the 486DX2 and DX4 since external bus cycles are proportionately even more expensive in a clock-multiplied CPU.
Originally I executed the butterflies in a tight loop, 32 loops per data bit, giving the expected poor performance. The most dramatic speedup came from unrolling this loop, but not for the reason you might suspect. The usual reason to unroll a loop is to amortize the loop control overhead over more instructions. This contributed a little, but the big win came from something else.
The ACS operations work on an array of node elements. There are two such arrays, one for the decoder's current state and another for the new state after the current symbols have been processed. After each received symbol pair is processed, the two arrays are swapped. (Actually, pointers to these arrays are swapped.) Every 8 bits, the path with the best metric is chosen and its high-order 8 bits are extracted and placed in the output buffer.
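Sketched with the same illustrative names as the butterfly above:

    static long metrics_a[64], metrics_b[64];
    static unsigned long paths_a[64], paths_b[64];
    static long *cmetric = metrics_a, *nmetric = metrics_b;
    static unsigned long *cpath = paths_a, *npath = paths_b;

    /* After each received symbol pair: exchange the roles of the two
     * state arrays by swapping pointers -- nothing is copied. */
    void swap_states(void)
    {
        long *mt;
        unsigned long *pt;

        mt = cmetric; cmetric = nmetric; nmetric = mt;
        pt = cpath; cpath = npath; npath = pt;
    }

    /* Every 8 data bits: find the path with the best metric and return
     * its oldest (high-order) 8 bits for the output buffer. */
    unsigned char emit_byte(void)
    {
        int i, best = 0;

        for (i = 1; i < 64; i++)
            if (cmetric[i] > cmetric[best])
                best = i;
        return (unsigned char)(cpath[best] >> 24);
    }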
Each butterfly works on a fixed set of old and new state array elements and a fixed pair of hypothesized branch symbols. The original code computed the addresses of each array element and the branch symbols as functions of the loop index, which meant doing it repeatedly at execution time. But when the loop was unrolled, constants replaced the loop index in these expressions. This allowed them to be completely evaluated at compile time and replaced with immediate constants in the generated code, giving the dramatic speedup.
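Schematically, using the BUTTERFLY macro sketched earlier (the symbol arguments shown are placeholders; the real values follow from the code polynomials):

    /* Rolled up: offsets and hypotheses depend on the loop index, so
     * they are recomputed at run time on every pass. */
    for (i = 0; i < 32; i++)
        BUTTERFLY(i, symbols[i]);

    /* Unrolled: every argument is a literal, so the compiler folds all
     * the address arithmetic and hypothesis selection into immediate
     * constants. */
    BUTTERFLY(0, 0); BUTTERFLY(1, 2); BUTTERFLY(2, 3);
    /* ... 28 more ... */
    BUTTERFLY(31, 1);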
Around this time I upgraded my CPU from the clock-doubled 486DX2-66 to a clock-tripled 486DX4-100. (The latter really should have been named the 486DX3-99...) On many CPU-intensive operations this new chip gives nearly the 50% improvement you'd expect from the ratio of internal clock speeds. This was true for my DES code and for the Fano decoder, but the Viterbi decoder only sped up by about 15%.
A little analysis revealed why. The 486's on-chip cache operates in write-through mode. For programs that only occasionally write to memory, this isn't a problem because the CPU also has a 4-deep memory write queue. As long as there's room, the CPU can schedule a write and continue execution. But if the queue is full, the CPU blocks until space opens up.
Each Viterbi ACS reads five memory words: two node metrics, two branch metrics and the surviving path. It also writes two words to the new node: the survivor's path and metric. This is an unusually high ratio of memory writes to memory reads. The reads generally hit in the cache, but the writes go through to main memory. If there's a bottleneck in writing to main memory, this would explain why the DX4-100 isn't much faster than the DX2-66: the external memory bus is no faster for the faster CPU.
Unfortunately there is no easy way to disable write-through caching on the 486. Fortunately, it's easy to do on the Pentium; most systems apparently support write-back caching. This is a big win in Viterbi decoding because the data written to cache in one cycle is immediately read in the next and overwritten in the one beyond that. There's no reason to go to main memory at all.
Not having a Pentium, I stopped further optimization efforts. Franklin Antonio, N6NKF, who does have a Pentium, picked up the code and tried a few more tricks. He examined the assembler code generated by his compiler (Watcom) and discovered that reordering the butterflies to group together those operating on the same pair of branch symbols let the compiler do a better job of allocating registers and reusing common subexpressions involving the branch metrics. This produced a noticeable speedup that, to my surprise, carried over somewhat to the 486.
These results assume nonfading additive white gaussian noise (AWGN) channels and ideal BPSK or QPSK modems. I used the classic "rejection method" (see Numerical Recipes in C) to produce normally distributed (gaussian) noise samples from a conventional uniform random number generator.
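For reference, here is a sketch of one such routine: the polar form of the Box-Muller transform, which rejects candidate points falling outside the unit circle (this is the approach of the gasdev routine in Numerical Recipes; whether it matches the simulator's routine exactly is my assumption):

    #include <stdlib.h>
    #include <math.h>

    /* Gaussian deviate, zero mean and unit variance. Uniform points
     * are drawn in the square [-1,1] x [-1,1] and rejected unless they
     * fall inside the unit circle. */
    double normal(void)
    {
        static int have = 0;
        static double saved;
        double u, v, s;

        if (have) {
            have = 0;
            return saved;
        }
        do {
            u = 2.0 * rand() / (double)RAND_MAX - 1.0;
            v = 2.0 * rand() / (double)RAND_MAX - 1.0;
            s = u*u + v*v;
        } while (s >= 1.0 || s == 0.0);     /* the rejection step */
        s = sqrt(-2.0 * log(s) / s);
        saved = v * s;          /* second deviate, saved for next call */
        have = 1;
        return u * s;
    }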
Keep in mind that without coding, ideal BPSK and QPSK both require an Eb/N0 of about 9.6 dB for a bit error rate of 1e-5.
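That figure follows from the standard expression for coherent BPSK, Pb = erfc(sqrt(Eb/N0))/2, which a few lines of C can confirm:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double ebn0 = pow(10.0, 9.6 / 10.0);    /* 9.6 dB as a power ratio */

        /* Pb = erfc(sqrt(Eb/N0))/2 for coherent BPSK (and per bit of QPSK) */
        printf("Pb = %g\n", 0.5 * erfc(sqrt(ebn0)));    /* about 1e-5 */
        return 0;
    }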
The value N represents the number of forward decoder motions per data bit required to decode a particular packet; for each run, both an average and a cumulative histogram are given. For example, in the Eb/N0 = 3 dB run the average number of forward decoder motions per user data bit was 2.46, while 2.2% of the packets took 6 or more motions per bit. To estimate the throughput of the decoder at a given Eb/N0, simply divide the rate for "clean" packets (375 kb/s on the 486DX4-100) by the average N for that Eb/N0; at 3 dB this works out to 375/2.46, or about 152 kb/s. This ignores the cost of backward motions, but they're relatively cheap.
The decoder imposes a limit on N for each packet; if this limit (10000 for these runs) is exceeded, the packet is erased (discarded). In Figure 3, note that even though 33% of the packets were erased, none were decoded in error. This is a nice side benefit of long constraint length codes: separate error detection mechanisms like CRCs are not really required.
Figure 1 - Fano decoder performance 2.5 dB above threshold

Seed 806575651
Amplitude 100 units, Eb/N0 = 5 dB
metric table Eb/N0 = 5 dB
Frame length = 1152 bits, delta = 17, cycle limit = 10000, #frames = 1000
decoding errors: 0
Average N: 1.181747

N >=    count   fraction
1       1000    1
2       0       0

Figure 2 - Fano decoder performance 0.5 dB above threshold

Seed 806575707
Amplitude 100 units, Eb/N0 = 3 dB
metric table Eb/N0 = 3 dB
Frame length = 1152 bits, delta = 17, cycle limit = 10000, #frames = 1000
decoding errors: 0
Average N: 2.464074

N >=    count   fraction
1       1000    1
2       540     0.54
4       71      0.071
6       22      0.022
8       9       0.009
10      6       0.006
20      3       0.003
40      0       0

Figure 3 - Fano decoder performance 1.5 dB below threshold

Seed 806575769
Amplitude 100 units, Eb/N0 = 1 dB
metric table Eb/N0 = 1 dB
Frame length = 1152 bits, delta = 17, cycle limit = 10000, #frames = 1000
decoding errors: 0
Average N: 561.711378

N >=    count   fraction
1       1000    1
2       1000    1
4       1000    1
6       1000    1
8       1000    1
10      997     1
20      977     0.98
40      929     0.93
60      890     0.89
80      859     0.86
100     837     0.84
200     762     0.76
400     677     0.68
600     620     0.62
800     576     0.58
1000    556     0.56
2000    489     0.49
4000    413     0.41
6000    378     0.38
8000    351     0.35
10000   326     0.33
Figure 4 - Viterbi decoder performance, Eb/N0 = 3.5 dB

decoded data:
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000033000000000000
00000000000000000000000000000000

decoded data:
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
002a2400000000000000000000000000

decoded data:
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
00000000000001000000000000000000

Seed 806709882
Amplitude 100 units, Eb/N0 = 3.5 dB
Frame length = 1152 bits, #frames = 100, bits decoded = 115200
frame errors: 3 (0.03)
bit errors: 10 (8.68056e-05)