PCI Express is a bus with a star-like topology that uses point-to-point connections between adjacent agents. The bus uses packet-based serial data transfers between the Root Complex (usually the CPU and system memory) and Endpoints (peripheral devices). Endpoints can also talk to each other. Data travels between bus agents through hubs called Switches, and many transfer transactions can be in flight at the same time. That is the key difference between PCI Express and the older PCI bus, which is almost completely out of use nowadays: the old PCI bus "enjoyed" a shared-bus topology and could handle only one transaction at a time.
But there is more to it than that. Each point-to-point connection comprises two channels, transmit and receive, which means there are separate physical pins that carry data "up" and "down". On the physical layer some data is always being exchanged to keep the link stable and alive, and that happens in both directions simultaneously. On the second, higher layer, the Data Link Layer, other packets are transferred, including but not limited to ACK/NAK packets and flow-control credit information. All of this happens automatically and is completely hidden from the system-level programmer or FPGA designer. What is not hidden are Transaction Layer Packets (TLPs), which are generated at the will of the programmer. When you access a register inside an FPGA from the CPU over the PCIe bus, or when one Endpoint FPGA passes a command or data to another Endpoint FPGA, a Transaction Layer Packet is generated.
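To make the Transaction Layer a bit less abstract, here is a sketch of the fields carried by a memory read request TLP header (3-DW format, 32-bit addressing). The fields themselves come from the PCI Express specification, but the struct packing is purely illustrative: on real hardware the PCIe core in the FPGA builds and parses these headers for you.

```c
#include <stdint.h>

/* Illustrative view of a 3-DW Memory Read Request TLP header.
 * Field layout is for reading convenience only, not wire format. */
struct tlp_mem_rd32_hdr {
    /* DW0 */
    uint8_t  fmt;          /* 0b000: 3-DW header, no data (a read request) */
    uint8_t  type;         /* 0b00000: memory request */
    uint8_t  tc;           /* traffic class */
    uint16_t length;       /* amount of data to read, in DWs (1..1024) */
    /* DW1 */
    uint16_t requester_id; /* bus/device/function of the requester */
    uint8_t  tag;          /* matches the completion back to this request */
    uint8_t  last_dw_be;   /* byte enables for the last DW */
    uint8_t  first_dw_be;  /* byte enables for the first DW */
    /* DW2 */
    uint32_t address;      /* DW-aligned target address (bits 31:2) */
};
```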
For example, let us consider a plain DMA transfer between an FPGA device and RAM in the to-device direction. In this case the device driver running on the CPU allocates a memory buffer in system memory, fills it with data, passes its bus address to the device, and sends a command to the device to start the transfer. After that the device masters transactions on the bus: it sends read requests to the system memory controller and receives the data as completion packets. Almost the same thing happens when data is transferred in the opposite direction; the only difference is that the device writes data into a system memory buffer allocated by the driver. In high-performance computing this is very often enough: first we pass data to the device, it performs some calculation over it, and then it writes the results back to system memory.
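A minimal sketch of the driver side of this to-device transfer might look like the following Linux kernel snippet. The register offsets (REG_DMA_ADDR, REG_DMA_LEN, REG_DMA_CTRL) and the register layout are hypothetical placeholders, and "bar" stands for an already ioremapped BAR of the device; substitute whatever your device actually implements.

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/string.h>

#define REG_DMA_ADDR 0x00  /* hypothetical: bus address of the buffer */
#define REG_DMA_LEN  0x08  /* hypothetical: transfer length in bytes  */
#define REG_DMA_CTRL 0x10  /* hypothetical: bit 0 starts the transfer */

static int start_dma_to_device(struct device *dev, void __iomem *bar,
                               const void *data, size_t len)
{
    dma_addr_t bus_addr;
    void *buf;

    /* 1. Allocate a DMA-able buffer in system memory. */
    buf = dma_alloc_coherent(dev, len, &bus_addr, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* 2. Fill it with the data the device should fetch. */
    memcpy(buf, data, len);

    /* 3. Tell the device where the buffer is and how big it is. */
    iowrite32(lower_32_bits(bus_addr), bar + REG_DMA_ADDR);
    iowrite32(upper_32_bits(bus_addr), bar + REG_DMA_ADDR + 4);
    iowrite32(len, bar + REG_DMA_LEN);

    /* 4. Kick off the transfer: from here the device masters the bus,
     *    issuing read requests and consuming the completion packets. */
    iowrite32(1, bar + REG_DMA_CTRL);
    return 0;
}
```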
But sometimes we need to perform calculations on a stream of data, and then it is very helpful to receive and transmit data simultaneously. At first glance that looks impossible, because we have only one PCIe interface. But remember that each PCIe connection has two channels, transmit and receive, which means we actually can move data into and out of the FPGA at the same time over a single PCIe interface. Of course, the FPGA core logic has to support it and the driver has to be written accordingly.
In this case the device driver running on the CPU allocates two memory buffers in system memory: one for the data that should be passed to the device and another for the data that should be received back. The addresses of both buffers are passed to the device, which then starts generating read requests to system memory. When the first data arrives, the device starts a pipelined calculation over it, and as soon as the first results come out of the pipeline the device starts a write transaction on the PCIe bus.
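A sketch of the streaming setup, under the same assumptions as before (hypothetical register layout, ioremapped "bar", and, for brevity, a device that takes 32-bit bus addresses):

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define REG_SRC_ADDR 0x00  /* hypothetical: buffer the device reads from */
#define REG_DST_ADDR 0x08  /* hypothetical: buffer the device writes to  */
#define REG_LEN      0x10  /* hypothetical: transfer length in bytes     */
#define REG_CTRL     0x18  /* hypothetical: bit 0 starts streaming       */

static int start_streaming(struct device *dev, void __iomem *bar, size_t len)
{
    dma_addr_t src_bus, dst_bus;
    void *src, *dst;

    /* One buffer for the data going to the device... */
    src = dma_alloc_coherent(dev, len, &src_bus, GFP_KERNEL);
    if (!src)
        return -ENOMEM;

    /* ...and one for the results coming back. */
    dst = dma_alloc_coherent(dev, len, &dst_bus, GFP_KERNEL);
    if (!dst) {
        dma_free_coherent(dev, len, src, src_bus);
        return -ENOMEM;
    }

    /* Hand both bus addresses to the device and start it.  From now on
     * completions carrying input data arrive on the device's receive
     * lanes while memory writes carrying results leave on its transmit
     * lanes, so both directions of the link are busy at once. */
    iowrite32(lower_32_bits(src_bus), bar + REG_SRC_ADDR);
    iowrite32(lower_32_bits(dst_bus), bar + REG_DST_ADDR);
    iowrite32(len, bar + REG_LEN);
    iowrite32(1, bar + REG_CTRL);
    return 0;
}
```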
And yes, this stuff works on Rosta PCIe modules. See ya!