Thursday 19 April 2012

Two Way DMA Transfer over PCI Express bus

PCI Express is a bus with a star-like topology that uses point-to-point connections between neighboring agents. The bus uses packet-based serial data transfers between the Root Complex (usually the CPU and system memory) and Endpoints (peripheral devices). Endpoints can also talk to each other. Data is transferred between bus agents through hubs called Switches. Many data transfer transactions can be in progress at the same time. That is the key difference between the PCI Express bus and the older PCI bus, which is almost completely out of use nowadays. The old PCI bus "enjoyed" a shared bus topology and could handle only one transaction at a time.



But there is more to it than that. Each point-to-point connection comprises two channels: transmit and receive. That means there are separate physical pins that transfer data "up" and "down", so both directions can be active simultaneously. On the Physical Layer some data is always transferred between the two points to keep the link stable and alive. On the next layer up, the Data Link Layer, other packets are transferred, including but not limited to ACK/NAK packets and flow control credit information. All of this is done automatically and is completely hidden from the system-level programmer or FPGA designer. What is not hidden, though, is Transaction Layer Packets, which are generated at the will of the programmer. When you access a register inside the FPGA from the CPU over the PCIe bus, or when one Endpoint FPGA passes a command or data to another Endpoint FPGA, a Transaction Layer Packet is generated.

For example, let us consider a plain DMA data transfer between an FPGA device and RAM, in the direction of the device. In this case the device driver running on the CPU allocates a memory buffer in system memory, fills it with data, passes its bus address to the device, and sends the device a command to start the transfer. After that the device starts transactions on the bus as a master: it sends read requests to the system memory controller and receives the data as completion packets. Almost the same thing happens when data is transferred in the opposite direction. The only difference is that in this case the device writes data to a system memory buffer allocated by the driver. In high-performance computing this is very often enough: first we pass data to the device, it performs some calculation on it, and then it writes the results back to system memory.
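The driver-side sequence described above can be sketched in plain C. This is a simulation under stated assumptions, not real driver code: the register layout (struct dma_regs), the DMA_CMD_START and DMA_STATUS_DONE values, and all function names are hypothetical, and an ordinary array stands in for system memory so that the sketch runs on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register block; a real device defines its own layout. */
struct dma_regs {
    uint64_t buf_addr;  /* bus address of the host buffer            */
    uint32_t length;    /* transfer length in bytes                  */
    uint32_t command;   /* written by the driver to start a transfer */
    uint32_t status;    /* set by the device when the transfer ends  */
};

enum { DMA_CMD_START = 1, DMA_STATUS_DONE = 1 };

/* Simulated system memory; offsets into it play the role of bus addresses. */
uint8_t system_memory[4096];

/* Simulated memory inside the device. */
uint8_t device_memory[256];

/* Device side: acts as bus master and reads the host buffer.
 * On real hardware these would be read requests answered by completions. */
void device_run(struct dma_regs *regs)
{
    assert(regs != NULL);
    if (regs->command != DMA_CMD_START)
        return;
    if (regs->length > sizeof device_memory)
        return;
    if (regs->buf_addr > sizeof system_memory - regs->length)
        return;
    memcpy(device_memory, &system_memory[regs->buf_addr], regs->length);
    regs->command = 0;
    regs->status = DMA_STATUS_DONE;
}

/* Driver side: allocate a buffer, fill it, pass its address, start. */
int driver_send(struct dma_regs *regs, const uint8_t *data, uint32_t len)
{
    const uint64_t bus_addr = 128;  /* stands in for a real allocation */

    if (len > sizeof device_memory)
        return -1;
    memcpy(&system_memory[bus_addr], data, len);  /* fill the buffer   */
    regs->buf_addr = bus_addr;                    /* pass its address  */
    regs->length = len;
    regs->status = 0;
    regs->command = DMA_CMD_START;                /* start the transfer */
    device_run(regs);  /* in hardware this happens asynchronously */
    return regs->status == DMA_STATUS_DONE ? 0 : -1;
}
```

Calling driver_send() with a small array copies it into device_memory and returns 0. The opposite direction would look the same, except that the device would copy from its own memory into the host buffer.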

But sometimes we need to perform calculations on a stream of data. In this case it can be very helpful to receive and transmit data simultaneously. At first glance this looks impossible, because we have only one PCIe interface. But each PCIe connection has two channels, transmit and receive. That means we actually can transfer data into and out of the FPGA simultaneously over one PCIe interface. Of course, the FPGA core logic has to support it and the driver has to be written accordingly.

In this case the device driver running on the CPU allocates two memory buffers in system memory: one for data that should be passed to the device and another for data that should be received back. The addresses of these buffers are passed to the device, and then it starts generating read requests to system memory. When it receives the first data it starts a pipelined calculation on it, and as soon as the first results come out of the pipeline the device starts write transactions on the PCIe bus, while read requests for the remaining input data continue.
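To make the overlap of reads and writes concrete, here is a small cycle-by-cycle model in C. It is only a sketch: the pipeline depth, the per-word calculation in process_word(), and the function names are assumptions made for illustration, and one loop iteration stands in for one "cycle" in which at most one word arrives from the input buffer and at most one result leaves for the output buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PIPELINE_DEPTH = 4 };  /* assumed pipeline latency, in cycles */

/* Hypothetical calculation performed by the FPGA pipeline on each word. */
uint32_t process_word(uint32_t x)
{
    return x * 2u + 1u;
}

/* Stream n words from in_buf through the pipeline into out_buf.
 * Returns the number of cycles in which a read and a write both happened,
 * i.e. cycles where both PCIe directions were busy at once. */
size_t stream_two_way(const uint32_t *in_buf, uint32_t *out_buf, size_t n)
{
    uint32_t stage[PIPELINE_DEPTH] = {0};
    int valid[PIPELINE_DEPTH] = {0};
    size_t rd = 0, wr = 0, overlapped = 0;

    assert(in_buf != NULL && out_buf != NULL);
    while (wr < n) {
        int did_read = 0, did_write = 0;

        /* A result leaving the last stage becomes a write to the host. */
        if (valid[PIPELINE_DEPTH - 1]) {
            out_buf[wr++] = stage[PIPELINE_DEPTH - 1];
            did_write = 1;
        }
        /* Advance the pipeline by one stage. */
        for (int i = PIPELINE_DEPTH - 1; i > 0; i--) {
            stage[i] = stage[i - 1];
            valid[i] = valid[i - 1];
        }
        /* A word arriving from the host enters the first stage. */
        if (rd < n) {
            stage[0] = process_word(in_buf[rd++]);
            valid[0] = 1;
            did_read = 1;
        } else {
            valid[0] = 0;
        }
        if (did_read && did_write)
            overlapped++;
    }
    return overlapped;
}
```

In this model the first result appears PIPELINE_DEPTH cycles after the first word is read, and from then on every cycle carries one read and one write until the input runs out. For n words with n larger than the depth, that gives n - PIPELINE_DEPTH overlapped cycles.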

And yes, this stuff works on Rosta PCIe modules. See ya!

Tuesday 18 October 2011

C to RTL Synthesis

Hi again 

I want to post an update on my current work. Right now my colleagues from the Institute of System Programming and I are developing a system for our hardware (an FPGA project plus system-level code) that can run calculations produced by different C to RTL synthesis tools.

For now we are focusing on open source projects like ROCCC and C-to-Verilog (CTV). Both tools generate RTL code that can be inserted into an FPGA project. The hardware interface is not very difficult to integrate and depends on the C function prototype. Imagine you have a C function like
void my_func(int* Ain, int* Bout)
where Ain and Bout are arrays of data you want to process inside the function. When translated to RTL, this function will have a simple memory interface: two ports (one per array) to static-RAM-like memory (BRAM, for example) will be generated. Besides that, a simple control interface (reset, start and done signals) will be generated.

So, as an application programmer, you have to:
1. Load data into memory from the host
2. Connect the memory to the C-to-RTL circuit and send the command to start the calculation
3. After the calculation is finished, fetch the data from memory back to the host
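The three steps above can be modelled in software as follows. This is a sketch under assumptions, not the interface that ROCCC or CTV actually generates: the array length, the body of my_func, struct accel, and the helper names are all hypothetical, and plain C arrays stand in for the two BRAM blocks.

```c
#include <assert.h>
#include <string.h>

enum { ARRAY_LEN = 8 };  /* assumed length of Ain and Bout */

/* The kind of function a C-to-RTL tool would translate.
 * The body (squaring each element) is only an example. */
void my_func(int *Ain, int *Bout)
{
    for (int i = 0; i < ARRAY_LEN; i++)
        Bout[i] = Ain[i] * Ain[i];
}

/* Software model of the generated circuit: one memory per array
 * plus the start/done control signals. */
struct accel {
    int bram_a[ARRAY_LEN];
    int bram_b[ARRAY_LEN];
    int start;
    int done;
};

/* One evaluation of the circuit: run when start is asserted. */
void accel_step(struct accel *a)
{
    if (a->start) {
        my_func(a->bram_a, a->bram_b);
        a->start = 0;
        a->done = 1;
    }
}

/* Host side: the three steps from the list above. */
void host_run(struct accel *a, const int *in, int *out)
{
    assert(a != NULL && in != NULL && out != NULL);
    memcpy(a->bram_a, in, sizeof a->bram_a);   /* 1. load data        */
    a->done = 0;
    a->start = 1;                              /* 2. start the circuit */
    while (!a->done)
        accel_step(a);                         /*    wait for done     */
    memcpy(out, a->bram_b, sizeof a->bram_b);  /* 3. fetch results     */
}
```

On real hardware, steps 1 and 3 would be transfers over the host interface and the wait loop would poll a status register or wait for an interrupt.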

Comparing the ROCCC and CTV tools, I can say that ROCCC generates more optimised RTL but imposes more restrictions on your C code.

The work has just started and there is a lot yet to be done, so I had better get back to work :))


Tuesday 2 August 2011

FPGA Partial Reconfiguration through PCIe Interface

I have just finished a PCIe partial reconfiguration (PR) design on a Xilinx Virtex 6 FPGA. Now, right after computer power-up, only a small part of the FPGA is configured from flash: a small static part that contains the PCIe core and the interface logic to the internal configuration access port (ICAP). During the BIOS PCIe bus scan our FPGA is discovered and memory resources are assigned to it. After the OS has finished loading, a user application can access the FPGA over the PCIe interface and transfer the partial bit file for the unconfigured region to it.
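The last step, pushing the partial bit file through the static part into the ICAP, can be sketched as a toy model in C. Everything named here is an assumption made for illustration (struct static_region, icap_write(), load_partial_bitfile()); the real design writes words to a device register over PCIe, and details such as word ordering required by the ICAP are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the static part: it counts the words it forwards to ICAP. */
struct static_region {
    size_t expected_words;  /* partial bit file size in 32-bit words */
    size_t received_words;  /* words forwarded to ICAP so far        */
    uint32_t xor_check;     /* running XOR, only to observe the data */
};

/* Stands in for one PCIe write to the ICAP data register. */
void icap_write(struct static_region *s, uint32_t word)
{
    s->received_words++;
    s->xor_check ^= word;
}

/* The region counts as configured once the whole file has arrived. */
int region_configured(const struct static_region *s)
{
    return s->received_words == s->expected_words;
}

/* User application side: stream the partial bit file word by word. */
int load_partial_bitfile(struct static_region *s,
                         const uint32_t *words, size_t n)
{
    assert(s != NULL && (words != NULL || n == 0));
    s->expected_words = n;
    s->received_words = 0;
    s->xor_check = 0;
    for (size_t i = 0; i < n; i++)
        icap_write(s, words[i]);
    return region_configured(s) ? 0 : -1;
}
```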

IMHO there are at least 3 reasons to use this approach to FPGA configuration when the FPGA is connected to a host computer by a PCIe bus:
  • FPGAs become bigger every year, and the size of bit files grows with them. To use these modern FPGAs you need ever bigger flash memory devices, which may be a problem: for example, there are no SPI flash memory chips that dense on the market. PR can solve this problem, because only a small static configuration file has to be stored in flash.
  • As the configuration file grows, so does the configuration time. And as stated by the PCIe specification, all PCIe devices should be up and running and able to answer BIOS configuration requests within 100 ms after computer power-up. Very often this is a very tight limit for large FPGAs, and the PR approach can help here too, because the initial configuration takes much less time to complete.
  • This approach also helps a lot when there is a need to reconfigure the FPGA while the computer is running, for example in the high-performance computing (HPC) field.
So what should you do to use PR over PCIe in your next design?

First, check out the Xilinx application note on this topic, xapp883_Fast_Config_PCIe. It is a good starting point.

You will also need to learn how to use the Xilinx PlanAhead software, which supports the PR design flow. I suggest you master the Partial Reconfiguration User Guide UG702, which will make you familiar with the general partial reconfiguration process and give you some experience in using PlanAhead.

At the design entry level you will have to follow several rules. For example, you will definitely want to insert additional registers on all signals that cross a partition boundary, on both sides of it. This is needed to preserve timing, which is always an issue in a design with partitions. Check the Hierarchical Design Methodology Guide UG748 for more details on the design flow with partitions.




Thursday 14 July 2011

First Entry

Hello

Let me introduce my blog. My name is Yuri Rumyantsev, and I am an FPGA designer and software programmer from Moscow, Russia. I work for the Russian company Rosta Ltd. I want to make this blog professional: here I will discuss whatever concerns me in the computer and electronics industry. I will try to share my skills and hands-on experience in FPGA design, embedded programming and system software development.

I am also an instructor of the "FPGA Design Methodology" training course at Moscow State University, so I plan to post topics here on FPGA use cases in education and research.

Although I am from Russia, I plan to write this blog mainly, though not exclusively, in English, for several reasons:

  • To make this blog available to a broader audience
  • To get a little more practice in English
But language should not be an issue here, so I also welcome comments in Russian.