Hello there,
As I examined the DESTester application, I decided to try to get the most performance out of it, so I could compare it with software-implemented versions of the DES.
Right off the bat I noticed that performance was being hindered by the small size of the transfer blocks: 2048 bytes. So I swapped the 2K RAMs for 64K RAMs, and the pipe calls now work with 65536-byte blocks. This greatly reduced the total run time: the program now takes 4 seconds to work through 20MB of data, down from 40 seconds!
Later on, I found out there was a new version of the DES module that uses pipelining to process the data. Instead of producing an output every 16 cycles, the new version can produce an output every cycle (after the first 16)!
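To put the block-size change in numbers, here is a small sketch that counts how many pipe calls are needed per direction for a given block size. It assumes 20MB means 20*10^6 bytes and that the last, partial block still costs one call; transfers_needed is just an illustrative helper name, not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Number of pipe calls needed to move total_bytes in chunks of block_bytes.
// Rounds up, because a final partial block still needs its own call.
std::uint64_t transfers_needed(std::uint64_t total_bytes, std::uint64_t block_bytes) {
    assert(block_bytes > 0);
    return (total_bytes + block_bytes - 1) / block_bytes;
}
```

For 20*10^6 bytes, transfers_needed(20000000, 2048) gives 9766 calls, while transfers_needed(20000000, 65536) gives 306, so any fixed per-call overhead is paid roughly 32 times less often.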
So I modified the code to make it work with the new version. This gave me a 1 second improvement on the 20MB file. Clearly, the chokepoint was related to the transfers, and not the DES module itself. With a frequency of 100MHz, the DES was certainly processing the 20MB very fast. According to my calculations, 20*10^6 bytes of data being processed at a frequency of 100*10^6 Hz, 8 bytes at a time, gives us:
(20*10^6) / (100*10^6 * 8) = 1/40 seconds
So the processing itself should be taking approx. 1/40 of a second (25 ms) to complete.
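The same estimate can be written as a tiny function, which makes it easy to try other clock rates or file sizes. This is only a sketch of the on-paper figure: it assumes one 8-byte block is accepted per clock cycle and ignores the 16-cycle pipeline fill and all transfer overhead. The name ideal_seconds is made up for illustration.

```cpp
#include <cassert>

// Ideal processing time for a fully pipelined core that accepts
// bytes_per_cycle bytes on every clock cycle.
// Ignores pipeline fill latency and any transfer or setup overhead.
double ideal_seconds(double total_bytes, double clock_hz, double bytes_per_cycle) {
    assert(clock_hz > 0.0 && bytes_per_cycle > 0.0);
    return total_bytes / (clock_hz * bytes_per_cycle);
}
```

Calling ideal_seconds(20e6, 100e6, 8.0) gives 0.025, i.e. the 1/40 of a second above.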
To get more precise measurements of the time spent on the transfers and on the processing itself, I decided to adapt the PipeTest C++ code to my new DESTester code. Using the same principle, the program now measures the time spent in PipeIn, in PipeOut, and the overall time to completion. This is roughly what the timing part of the code looks like:
wxDateTime then = wxDateTime::UNow();
xem->WriteToPipeIn(0x80, 65536, buf);
wxTimeSpan diff = (wxDateTime::UNow()).Subtract(then);
duration = duration + (diff.GetMilliseconds()).ToLong();
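For anyone who wants to reproduce this without wxWidgets, the same accumulate-around-each-call pattern can be sketched with std::chrono. This is not what DESTester or PipeTest actually use; Stopwatch is a hypothetical helper, and std::chrono::steady_clock is swapped in for wxDateTime::UNow() because it is monotonic.

```cpp
#include <chrono>
#include <utility>

// Accumulates the elapsed time of every call passed to time(), in milliseconds.
struct Stopwatch {
    double total_ms = 0.0;

    template <typename F>
    void time(F&& call) {
        const auto start = std::chrono::steady_clock::now();
        std::forward<F>(call)();  // run the operation being measured
        const auto stop = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
};
```

With one Stopwatch per direction, the write above would become something like pipe_in.time([&] { xem->WriteToPipeIn(0x80, 65536, buf); }); and pipe_in.total_ms would hold the running total.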
These are the results for the same 20MB file:
total time taken on pipes: 2084 milliseconds
| pipe in : 1172 ms
| pipe out : 912 ms
--> overall time: 2745 milliseconds
So the transfers take 2084 milliseconds, which leaves 661 milliseconds for the processing? This is very weird, considering that on paper I was getting 1/40 of a second. Maybe this includes all the setup time between RAM blocks?
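To make the numbers easier to compare, here is a small sketch that turns the measurements into effective throughput and into the time left over outside the pipe calls. It assumes 1 MB = 10^6 bytes, as in the calculation above; both function names are made up for illustration.

```cpp
#include <cassert>

// Effective throughput in MB/s, with 1 MB = 10^6 bytes.
double throughput_mb_per_s(double bytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return (bytes / 1e6) / (elapsed_ms / 1e3);
}

// Time that was not spent inside the timed pipe calls.
long unaccounted_ms(long overall_ms, long pipe_in_ms, long pipe_out_ms) {
    return overall_ms - (pipe_in_ms + pipe_out_ms);
}
```

With the figures above this works out to roughly 17 MB/s for pipe in (20e6 bytes in 1172 ms), roughly 22 MB/s for pipe out (912 ms), and 661 ms outside the pipe calls. Note that the 661 ms covers everything that is not a timed pipe call, so it is an upper bound on the processing time rather than a direct measurement of it.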
Nonetheless, I was wondering if someone from Opal Kelly would have any suggestions on how to reduce this transfer chokepoint without having to resort to external memory connected to the FPGA? I am currently working on a version of the DESTester that processes the input data directly from the Pipe, instead of first writing it to a RAM. It only uses a RAM for the output data, doing pipe-outs to empty it when it fills up.
Any ideas or workaround suggestions are greatly appreciated.
Thanks,
Vitor