I spent today working on a faster way to get data in and out of my transform. Because the transform is a stateless function, the AXI Stream interface fits it really well, so I plan to convert to that interface. To feed data in and out at high speed, Xilinx provides an AXI DMA block. Additionally, while I can access blocks directly out of DDR, I wanted to reduce the latency of temporary storage, so I plan to use the on-chip memory (OCM) as an intermediate cache: the CPU copies the data from DDR into the proper format in OCM, the DMA streams it through the transform, and the CPU copies the result back to DDR. Eventually I will use DMA for the DDR-to-OCM copy as well, or switch to a different form of caching altogether.
I started playing with the DMA Stream controller by looping the stream output back to the stream input – making a really fast (and complicated) memcpy() implementation. I then benchmarked it copying data between two different areas of OCM, against a plain software memcpy().
[root@alarm ~]# ./ocmtest
Running bandwidth test on OCM: 223.696213 MB/s
Resetting DMA... Reset complete
Enabling DMA... DMA enabled
Doing 100000 transfers of 8192 bytes
369.009009 MB/s
The AXI DMA block has two separate master AXI ports, one for reading and one for writing memory. Rather than connect them to the same port on the Zynq PS, I connected them to separate ports and got 399.6 MB/s. Note that this bandwidth is bidirectional – this amount is both being read from and written to the memory. My clock is 100MHz and my bus width is 64 bits, meaning that I would expect to get 800MB/s – I’m not sure why the actual throughput is exactly half yet.
I then started work on the stream-compatible transform – first by simply testing the effect of pipelining on my memcpy(). Here’s what the block diagram looks like at the moment.