DMA stream on Zynq

I spent today working on a faster way to get data in and out of my transform. Because the transform is a stateless function, the AXI Stream interface fits it well, so I plan to convert it to that interface. To move data in and out at high speed, Xilinx provides an AXI DMA Stream block. While I could access blocks directly in DDR, I wanted to reduce the latency of temporary storage, so I plan to use the on-chip memory (OCM) as an intermediate cache: the CPU copies the data from DDR into the proper format in OCM, the DMA block streams it through the transform, and then the CPU copies the result back to DDR. Eventually I will use DMA for the DDR-to-OCM copy as well, or do a different form of caching altogether.

I started playing with the DMA Stream controller by looping the stream output back to the stream input, which makes a really fast (and complicated) memcpy() implementation. I then benchmarked it against a software memcpy(), with both copying data between two different areas of OCM.

[root@alarm ~]# ./ocmtest 
Running bandwidth test on OCM:
223.696213 MB/s
Resetting DMA...
Reset complete
Enabling DMA...
DMA enabled
Doing 100000 transfers of 8192 bytes
369.009009 MB/s
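
For readers who want to see what the reset, enable, and transfer steps in the log above involve at the register level, here is a minimal sketch. It is not the code behind ocmtest. It assumes the AXI DMA block is configured in direct register mode (no scatter-gather), and the register offsets and bit positions are the ones I recall from the Xilinx AXI DMA product guide, so check them against the documentation for your IP version. All function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets of the AXI DMA registers in direct register mode.
 * Assumption: taken from the Xilinx AXI DMA product guide; verify
 * against the documentation for the IP version you are using. */
enum {
    MM2S_DMACR  = 0x00,  /* memory-to-stream control */
    MM2S_DMASR  = 0x04,  /* memory-to-stream status  */
    MM2S_SA     = 0x18,  /* source address           */
    MM2S_LENGTH = 0x28,  /* bytes to read            */
    S2MM_DMACR  = 0x30,  /* stream-to-memory control */
    S2MM_DMASR  = 0x34,  /* stream-to-memory status  */
    S2MM_DA     = 0x48,  /* destination address      */
    S2MM_LENGTH = 0x58   /* bytes to write           */
};

#define DMACR_RS    (1u << 0)  /* run/stop   */
#define DMACR_RESET (1u << 2)  /* soft reset */
#define DMASR_IDLE  (1u << 1)  /* channel has finished its transfer */

static inline void reg_write(volatile uint32_t *regs, unsigned off, uint32_t v)
{
    regs[off / 4] = v;
}

static inline uint32_t reg_read(volatile uint32_t *regs, unsigned off)
{
    return regs[off / 4];
}

/* Request a soft reset and poll until the hardware clears the reset bit.
 * Returns 0 on success, -1 if the bit is still set after max_polls reads. */
int dma_reset(volatile uint32_t *regs, unsigned max_polls)
{
    reg_write(regs, MM2S_DMACR, DMACR_RESET);
    while (max_polls--) {
        if (!(reg_read(regs, MM2S_DMACR) & DMACR_RESET))
            return 0;
    }
    return -1;
}

/* Set the run bit on both channels. */
void dma_enable(volatile uint32_t *regs)
{
    reg_write(regs, S2MM_DMACR, DMACR_RS);
    reg_write(regs, MM2S_DMACR, DMACR_RS);
}

/* Start one loopback copy of len bytes between two physical addresses.
 * The receive side is armed first so it is ready when data arrives;
 * writing a LENGTH register is what starts that channel. */
void dma_start_copy(volatile uint32_t *regs, uint32_t src, uint32_t dst,
                    uint32_t len)
{
    assert(len > 0);
    reg_write(regs, S2MM_DA, dst);
    reg_write(regs, S2MM_LENGTH, len);
    reg_write(regs, MM2S_SA, src);
    reg_write(regs, MM2S_LENGTH, len);
}

/* Nonzero once the stream-to-memory channel reports idle. */
int dma_copy_done(volatile uint32_t *regs)
{
    return (reg_read(regs, S2MM_DMASR) & DMASR_IDLE) != 0;
}
```

In a Linux userspace test, regs would typically point at an mmap() of /dev/mem covering the AXI-Lite base address assigned to the DMA block in the design, and src and dst would be physical addresses inside OCM. A benchmark loop then calls dma_start_copy() and spins on dma_copy_done() for each transfer.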

The AXI DMA block has two separate master AXI ports, one for reading memory and one for writing it. Rather than connect them to the same port on the Zynq PS, I connected them to separate ports and got 399.6 MB/s. Note that this bandwidth is bidirectional: that amount of data is being read from memory and written back to memory at the same time. My clock is 100 MHz and my bus width is 64 bits, so I would expect to get 800 MB/s. I'm not sure yet why the actual throughput is almost exactly half of that.
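
The expected figure is just one bus-width beat per clock cycle. As a small sketch of that arithmetic (the function names are made up for illustration, and 1 MB is assumed to mean 10^6 bytes):

```c
#include <assert.h>
#include <stdint.h>

/* Peak streaming rate in MB/s (1 MB = 10^6 bytes), assuming the bus
 * moves one full-width beat on every clock cycle. */
double peak_mb_per_s(double clock_hz, unsigned bus_width_bits)
{
    assert(bus_width_bits % 8 == 0);
    return clock_hz * (bus_width_bits / 8) / 1e6;
}

/* Measured rate in MB/s for a run of equally sized transfers. */
double measured_mb_per_s(uint64_t transfers, uint64_t bytes_per_transfer,
                         double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return (double)(transfers * bytes_per_transfer) / elapsed_s / 1e6;
}
```

With the numbers above, peak_mb_per_s(100e6, 64) gives 800, so 399.6 MB/s is about 0.4995 of the theoretical peak. The earlier figure of 369.009009 MB/s for 100000 transfers of 8192 bytes is consistent with an elapsed time of about 2.22 s under the 10^6 bytes per MB assumption. That elapsed time is back-computed here, not something the test printed.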

I then started work on the stream-compatible transform – first by simply testing the effect of pipelining on my memcpy(). Here’s what the block diagram looks like at the moment.

[Image: screenshot of the block diagram, 2014-01-04 19:58:39]

5 thoughts on “DMA stream on Zynq”

  1. Interesting post.

    I am facing a similar problem and would like to avoid the complexity of the DMA hardware. I have a wavelet engine in the PL that needs data previously processed by the PS. Some of this data will still be in the caches, so I cannot DMA from DDR because of coherence problems; I need to go through the caches.

    My simpler and more portable solution is to use a software memcpy and build a data flow so that while one block is being written by the PS to the PL, another is being processed by the PL and a third is being read back from the PL to the PS.

    Do the two results you have, 223.696213 MB/s and 369.009009 MB/s, compare the software memcpy and the hardware DMA? If so, I think the software memcpy does not perform badly.

    Regards,

    1. The two benchmarks are both copying between two regions in OCM – one with memcpy() through /dev/mem (uncached) and the other with the DMA hardware wired in a loopback configuration. Using memory-mapped I/O, I was able to get only about 50MB/s through my hardware with 100% CPU usage.

      Are you using Linux or bare metal? If most of your data is in the caches, one option is cache-coherent DMA: wire the DMA to the ACP port rather than the HP ports. However, if you need your CPU to do other things while your hardware is running, this might slow down your CPU.

      Currently I have the same problem that you do: I am using /dev/mem to work on data in OCM, but it is uncached and too slow for my software routines. The solution for me is going to be a simple kernel module using the Linux DMA API (there is a nice DMA API howto text file in the kernel). The “streaming” DMA mapping type might be what you want. It lets you use DDR and OCM cached, and it automatically flushes the caches when you call the mapping function to get the address to pass to the DMA hardware.

      Thanks for your interest, I might write an article when I get my kernel module finished so that you can use it too.

  2. Hi Thomas,

    Currently we are working on a Zynq 7020 board. The project is storing video onto an SD card using the PS. Could you please suggest how this could be done using DDR (dataflow)? We need to know how data can be fed into DDR and fetched out of it.

    Thanks and Regards,
    Sachin Edwin

    1. Hello,

      I assume the video is coming in on the PL side? You want to put it into DDR, then read it out of DDR with the PS and write the frames from the PS to an SD card?

      I would recommend starting with the OCM – it is smaller so you will need to use small video frames, however it’s easier to debug. You will need to choose a Xilinx DMA component in order to stream your video data into the RAM, and configure it appropriately. Most of the DMA components have an AXI-Lite bus for configuration from the PS.

      You can see the code used to read and write from the “Xilinx AXI-Stream DMA” block here, in Linux userspace: https://github.com/tdaede/daala/blob/hw/src/hw.c

      The main challenge with DDR is that if you are using Linux on the PS, you will need to allocate a large contiguous buffer area, which isn’t too hard but requires some knowledge of Linux and device trees.

  3. Hi Thomas

    Thanks for the post

    There are a number of things that can still be tried, such as enabling scatter-gather; maybe that helps for small blocks.

    I am interested in trying DMA tests. Could you post your test code, please?

    Dave
