Maximizing performance with the DES example

vitorbal · September 15, 2008, 9:23pm

Hello there,

As I examined the DESTester application, I decided to try to get the most performance out of it, so I could compare it with software-implemented versions of the DES.

Right off the bat I noticed that performance was being hindered by the small size of the transfer blocks: 2048 bytes. So I swapped the 2 KB RAMs for 64 KB RAMs, and the pipe calls now work with 65536-byte blocks. This greatly reduced the total run time: the program now takes 4 seconds to work through 20MB of data, down from 40 seconds!

Later on, I found out there was a new version of the DES module that used pipelines to process the information. Instead of getting an output every 16 cycles, the new version could get an output every cycle (after the first 16)!
So I modified the code to make it work with the new version. This gave me a 1 second improvement on the 20MB file. Clearly, the chokepoint was related to the transfers, and not the DES module itself. With a frequency of 100MHz, the DES was certainly processing through 20MB very fast. According to my calculations, 20 * 10^6 bytes of data being processed at a frequency of 100 * 10^6 Hz, 8 bytes at a time, gives us:

(20 * 10^6) / (100 * 10^6 * 8) = 1/40 seconds

So the processing itself should be taking approx. 1/40 of a second to complete.
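The estimate above can be checked with a few lines of C++. This is only a minimal sketch: the function name `ideal_processing_seconds` is made up for illustration, and it assumes the pipelined core accepts one 8-byte block on every clock cycle, ignoring the initial 16-cycle fill.

```cpp
#include <cassert>
#include <cstdint>

// Ideal time in seconds for a fully pipelined core that consumes
// `bytes_per_cycle` bytes on every clock cycle.
// Assumption: the 16-cycle pipeline fill is ignored because it is
// negligible next to millions of cycles of steady-state work.
double ideal_processing_seconds(std::uint64_t total_bytes,
                                double clock_hz,
                                std::uint32_t bytes_per_cycle) {
    assert(clock_hz > 0.0 && bytes_per_cycle > 0);
    const double cycles = static_cast<double>(total_bytes) / bytes_per_cycle;
    return cycles / clock_hz;
}
```

Calling `ideal_processing_seconds(20000000, 100e6, 8)` gives about 0.025 seconds, which matches the 1/40 of a second worked out by hand.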

To get more precise measurements of the time spent on the transfers and on the processing itself, I decided to adapt the PipeTest C++ code to my new DESTester code. Using the same principle, the program now calculates the time spent on PipeIn, on PipeOut, and the overall time to completion. The timing part of the code looks roughly like this:

    // Time one 64 KB PipeIn transfer and add it to the accumulated duration (in ms).
    wxDateTime then = wxDateTime::UNow();
    xem->WriteToPipeIn(0x80, 65536, buf);
    wxTimeSpan diff = (wxDateTime::UNow()).Subtract(then);
    duration = duration + (diff.GetMilliseconds()).ToLong();
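The same accumulate-per-call pattern can also be written without wxWidgets. The following is a hedged sketch, not the DESTester or PipeTest code: `CallTimer` is a hypothetical helper, and it uses `std::chrono::steady_clock`, which is monotonic and therefore not affected by system clock adjustments during a run.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper: accumulates the time spent inside repeated calls.
class CallTimer {
public:
    // Runs `fn` once and adds its duration to the running total.
    template <typename Fn>
    void time(Fn&& fn) {
        const auto start = std::chrono::steady_clock::now();
        std::forward<Fn>(fn)();
        total_ += std::chrono::steady_clock::now() - start;
        ++calls_;
    }

    // Total accumulated time, truncated to whole milliseconds.
    long long total_ms() const {
        return std::chrono::duration_cast<std::chrono::milliseconds>(total_).count();
    }

    // Number of timed calls so far.
    std::uint64_t calls() const { return calls_; }

private:
    std::chrono::steady_clock::duration total_ =
        std::chrono::steady_clock::duration::zero();
    std::uint64_t calls_ = 0;
};
```

With one timer per direction, a transfer would be wrapped as `pipeInTimer.time([&] { xem->WriteToPipeIn(0x80, 65536, buf); });`. Accumulating the native clock duration and converting to milliseconds only at the end avoids rounding each short call down individually.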

These are the results for the same 20MB file:
total time taken on pipes: 2084 milliseconds
| pipe in : 1172 ms
| pipe out : 912 ms
--> overall time: 2745 milliseconds

The transfers were taking 2084 milliseconds, and the processing 661 milliseconds? This is very weird, considering that on paper I was getting 1/40 of a second… maybe this includes all the setup times between RAM blocks?
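To make the breakdown explicit, here is a small sketch that derives the unaccounted time and the effective pipe throughput from the measured values. The struct and function names are made up for illustration, and it assumes 20MB means 20 * 10^6 bytes moved in each direction.

```cpp
#include <cassert>
#include <cstdint>

// Measured values for one run (names are illustrative only).
struct TransferTiming {
    std::uint64_t bytes;  // payload moved in each direction
    long pipe_in_ms;      // accumulated time inside PipeIn calls
    long pipe_out_ms;     // accumulated time inside PipeOut calls
    long overall_ms;      // total time to completion
};

// Time not spent inside the pipe calls themselves.
long non_pipe_ms(const TransferTiming& t) {
    return t.overall_ms - (t.pipe_in_ms + t.pipe_out_ms);
}

// Effective throughput in MB/s, with 1 MB taken as 10^6 bytes.
double throughput_mb_per_s(std::uint64_t bytes, long ms) {
    assert(ms > 0);
    return (static_cast<double>(bytes) / 1e6) / (ms / 1000.0);
}
```

For the numbers above (`{20000000, 1172, 912, 2745}`), `non_pipe_ms` returns 661, and the throughput comes out at roughly 17 MB/s for PipeIn and roughly 22 MB/s for PipeOut. The 661 ms covers everything outside the timed pipe calls, so it is not necessarily all processing time.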
Nonetheless, I was wondering if someone from Opal Kelly would have any suggestions on how to minimize this transfer chokepoint, without having to resort to external memory connected to the FPGA? I am currently working on a version of the DESTester that processes the input data directly from the Pipe, instead of first writing it to a RAM. It only uses a RAM for the output data, doing PipeOuts to empty it when it fills up.

Any ideas or workaround suggestions are greatly appreciated.

Thanks,
Vitor

okSupport · September 16, 2008, 2:12am

@Vitor–

Have a look at page 15 of the FrontPanel User’s Manual. There are some measurements taken from PipeTest that will help guide you.

DESTester was certainly not written for speed. It would be much faster to stream 8 MB at a time through a BlockRAM-based FIFO and allow the DES module to fill a similar FIFO on the other end which then stores the result in SDRAM. When the SDRAM is full, use PipeOuts to transfer the results back to the PC.

USB is a fast bus when used for bulk data. But like any bus, a performance penalty is associated with smaller transfers.
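The penalty for small transfers can be illustrated with a simple cost model: a fixed overhead per pipe call plus the time to stream the payload. This is a rough sketch under assumptions, not a description of how FrontPanel behaves internally; the function names are hypothetical, and the overhead and bandwidth values would have to be measured (for example with PipeTest).

```cpp
#include <cassert>
#include <cstdint>

// Number of pipe calls needed to move `total_bytes` in blocks of `block_bytes`.
std::uint64_t transfer_count(std::uint64_t total_bytes, std::uint64_t block_bytes) {
    assert(block_bytes > 0);
    return (total_bytes + block_bytes - 1) / block_bytes;  // round up
}

// Simplified model: each call pays a fixed overhead, then the payload
// streams at a constant bandwidth. Both parameters are assumptions.
double estimated_seconds(std::uint64_t total_bytes,
                         std::uint64_t block_bytes,
                         double per_call_overhead_s,
                         double bandwidth_bytes_per_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    const double calls =
        static_cast<double>(transfer_count(total_bytes, block_bytes));
    return calls * per_call_overhead_s +
           static_cast<double>(total_bytes) / bandwidth_bytes_per_s;
}
```

For 20 * 10^6 bytes, `transfer_count` gives 9766 calls per direction with 2048-byte blocks but only 306 with 65536-byte blocks, so any fixed per-call cost is paid about 32 times less often with the larger blocks.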

The Spartan-3 FPGA has a fair bit of BlockRAM, but the SDRAM is definitely helpful to keep transfer sizes nice and big to get better performance.

vitorbal · September 16, 2008, 1:28pm

Thank you for the suggestion,
I had already taken SDRAM into consideration, and that will be my next step.
Regardless, going back to the 20MB example I used: given that the time measured on the pipes already accounts for the time it takes to set up the pipe connections, what other factors do you think are responsible for the remaining 661 milliseconds? This shouldn’t all be processing time, since according to the calculations I presented, the processing time for a 20MB file should be around 1/40 of a second = 0.025 seconds. Any ideas?

//Vitor