Flexible data-flow buffer management
Thu, Sep 18 2014
Referencing the computational offload discussion from the GrConn working group: The GNU Radio scheduler is in charge of allocating and managing stream buffers, which is a big issue for anyone trying to integrate GNU Radio with computational offload devices. In most of these cases the device determines the size and location of streaming memory, so the offload block usually ends up copying between scheduler memory and device memory. I've been thinking about (and implementing) APIs to deal with diverse memory situations for the last 2-3 years now.
First attempt
The GRAS project supported a different way to deal with memory. GRAS gave processing blocks the ability to use custom input and output buffer pools for every IO port, with hooks for committing and releasing the memory from the device. A companion project, GREX, has an OpenCL demo that creates custom buffer pools for the input and output, and uses the buffer hooks to trigger DMA reads and writes -- aka enqueuing CL map events.
GRAS had a problem though: how to deal with interacting memory domains? In other words, what happens when two blocks, each of which has a custom buffer implementation, get connected? Back-to-back OpenCL kernels? How about an OpenCL kernel feeding a filter with a circular buffer? I never implemented an answer to this problem. It just fails. But the user could manually insert a copy block into the stream to make the design work.
What's wrong with this implementation:
- The user needs to determine the incompatibility and manually insert copy blocks
- When the domains are the same, we want to avoid not just copies, but DMAs as well
Second attempt
With the Pothos project, we took this effort a step further. Pothos reimplements the same basic concept of custom buffer management with commit and release hooks. In addition, the Topology::commit() implementation traverses the graph, querying every block's ports for the domain (a string that is unique to the memory domain). If multiple blocks operate in the same domain and can share buffers or memory, they should report the same domain string.
Users can specify a custom domain as an argument when setting up the port. Also, users can provide custom buffer managers through the getInputBufferManager() and getOutputBufferManager() overloads. For API specifics, see the blocks coding guide.
The buffer manager is a queue-like interface class. Users can inherit from this interface and provide hooks for buffer allocation, buffer freeing, and commit/release. Every output port has an associated buffer manager, even in cases when the destination block provides the custom manager; it's the output port which actually holds the manager and makes calls on it.
A managed buffer is a reference counted object with a pointer, length, and associated buffer manager. This buffer object is passed as a message to downstream consumers which hold a reference. The buffer is returned to its manager when all consumers release their references to it. The fact that managed buffers are a finite resource is what provides backpressure in the system.
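The reference counting and backpressure described above can be illustrated with a toy model. This is a minimal sketch in Python, not the Pothos C++ API: the class names mirror the concepts, but the pop(), push(), retain() and release() methods and the explicit counter are assumptions made for illustration.

```python
class BufferManager:
    """Toy fixed-size pool. Hypothetical; not the Pothos BufferManager API."""

    def __init__(self, num_buffers, buffer_size):
        self._free = [bytearray(buffer_size) for _ in range(num_buffers)]

    def pop(self):
        """Hand out a buffer, or None when the pool is exhausted."""
        if not self._free:
            return None  # producer must wait: this is the backpressure
        return ManagedBuffer(self, self._free.pop())

    def push(self, storage):
        """Take storage back once nobody references it."""
        self._free.append(storage)


class ManagedBuffer:
    """Storage plus length, owning manager, and an explicit reference count."""

    def __init__(self, manager, storage):
        self.manager = manager
        self.storage = storage
        self.length = len(storage)
        self._refs = 1  # the producer's own reference

    def retain(self):
        """Called for each downstream consumer that receives the buffer."""
        self._refs += 1
        return self

    def release(self):
        """Drop one reference; the last release returns storage to the pool."""
        self._refs -= 1
        if self._refs == 0:
            self.manager.push(self.storage)
```

With a pool of two buffers, a third pop() returns None until every holder of one of the outstanding buffers has called release(). That is the point at which a producer would resume.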
The buffer accumulator is another queue-like interface class. Every input port has a buffer accumulator. The accumulator is responsible for holding references to incoming buffers, releasing buffers when consumed, and amalgamating buffers when contiguous. It will even memory copy when buffers are discontiguous and the consumer requires a specific amount.
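To make the accumulator's behaviour concrete, here is a small Python model. It is a sketch under stated assumptions rather than the Pothos implementation: View, the front(required) signature, and the copies counter are all hypothetical names. Contiguity is modelled as "same backing storage and adjacent offsets".

```python
from collections import deque


class View:
    """A window (offset, length) into some backing bytearray."""

    def __init__(self, backing, offset, length):
        self.backing, self.offset, self.length = backing, offset, length

    def tobytes(self):
        return bytes(self.backing[self.offset:self.offset + self.length])


class BufferAccumulator:
    def __init__(self):
        self._queue = deque()
        self.copies = 0  # how many times we had to fall back to a copy

    def push(self, view):
        """Queue an incoming buffer, merging it when it extends the last one."""
        last = self._queue[-1] if self._queue else None
        if (last is not None and last.backing is view.backing
                and last.offset + last.length == view.offset):
            last.length += view.length  # amalgamate without copying
        else:
            self._queue.append(View(view.backing, view.offset, view.length))

    def total(self):
        return sum(v.length for v in self._queue)

    def front(self, required=0):
        """Return the front view, copying only if it is shorter than required."""
        if not self._queue or self.total() < required:
            return None
        if self._queue[0].length < required:
            scratch = bytearray()
            while len(scratch) < required:
                scratch += self._queue.popleft().tobytes()
            self._queue.appendleft(View(scratch, 0, len(scratch)))
            self.copies += 1
        return self._queue[0]

    def pop(self, num_bytes):
        """Consume bytes from the front view, dropping it when exhausted."""
        front = self._queue[0]
        if num_bytes > front.length:
            raise ValueError("cannot consume past the front view")
        front.offset += num_bytes
        front.length -= num_bytes
        if front.length == 0:
            self._queue.popleft()
```

Two adjacent views of the same storage merge into one without a copy. A view from different storage is queued separately, and is copied together with the leftover bytes only when the consumer asks for more than the front view holds.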
Example cases
Handling the general case
In the general case, blocks do not specify a custom memory domain, and do not overload getInputBufferManager() and getOutputBufferManager(). In this case, the Topology::commit() implementation sees that two general-purpose source and destination ports have no custom buffer manager, and therefore decides to allocate a generic slab-based buffer manager for the source port.
Handling DMA style devices
DMA style blocks need to specify a custom domain for input and output ports, and provide custom overloads for getInputBufferManager() and getOutputBufferManager(). The pop operation of a managed buffer should perform the DMA action (read or write). The push operation of a managed buffer should make the buffer available again. And the block's work() method should handle waiting on the DMA events to complete.
- The Pothos OpenCL project has a decent example of this in action.
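The pop/push/work() division of labour can be sketched with a simulated device. Everything here is an assumption for illustration: FakeDevice, dma_read(), and the use of a thread-pool Future as the "DMA event" stand in for a real device API such as OpenCL, and the manager is a plain Python class rather than a Pothos buffer manager.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeDevice:
    """Stand-in for an offload device; a 'DMA' is a background copy."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.memory = {}  # device-side contents, keyed by buffer index

    def dma_read(self, index, host_buffer):
        """Start a device-to-host transfer; the Future acts as the event."""
        def transfer():
            data = self.memory.get(index, b"")
            count = min(len(data), len(host_buffer))
            host_buffer[:count] = data[:count]
            return count
        return self._pool.submit(transfer)

    def close(self):
        self._pool.shutdown(wait=True)


class DmaBufferManager:
    def __init__(self, device, num_buffers, buffer_size):
        self._device = device
        self._host = [bytearray(buffer_size) for _ in range(num_buffers)]
        self._free = list(range(num_buffers))

    def pop(self):
        """Take a free buffer and kick off its DMA; None if none are free."""
        if not self._free:
            return None
        index = self._free.pop(0)
        event = self._device.dma_read(index, self._host[index])
        return index, self._host[index], event

    def push(self, index):
        """Make the buffer available for the next transfer."""
        self._free.append(index)


def work(manager):
    """Block-side step: wait on the DMA event, use the data, recycle."""
    popped = manager.pop()
    if popped is None:
        return None
    index, host_buffer, event = popped
    count = event.result()  # blocks until the transfer completes
    payload = bytes(host_buffer[:count])
    manager.push(index)
    return payload
```

The transfer is started in pop() and only waited on in work(), so a block holding several buffers could have several transfers in flight before it blocks on the first one.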
Handling back to back domains
Suppose that two blocks have the same custom domain. When these blocks are connected, they should keep the buffers on the device, and avoid any DMA-type actions for the flow between the two blocks.
Much like the above case, the block overloads getInputBufferManager() and getOutputBufferManager(). However, the overloaded function sees that the domain passed into the call is the same as its own. In this case, the block can choose to abdicate, allocating no buffer manager of its own and using the one from the other block, or it can set custom parameters on the manager to avoid the DMA.
- See the OpenCL kernel for an example.
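A rough Python model of the two options follows. The method names echo getInputBufferManager() and getOutputBufferManager(), but the signatures, the None return used to signal abdication, and the dma_enabled flag are assumptions for this sketch, not the documented Pothos behaviour.

```python
class DeviceBufferManager:
    """Placeholder manager; dma_enabled says whether pop/push move data."""

    def __init__(self, dma_enabled):
        self.dma_enabled = dma_enabled


class OffloadBlock:
    def __init__(self, domain):
        self.domain = domain  # same string for blocks that can share memory

    def get_output_buffer_manager(self, downstream_domain):
        # Same domain downstream: data stays on the device, so no readback.
        same = downstream_domain == self.domain
        return DeviceBufferManager(dma_enabled=not same)

    def get_input_buffer_manager(self, upstream_domain):
        if upstream_domain == self.domain:
            return None  # abdicate: reuse the upstream block's manager
        return DeviceBufferManager(dma_enabled=True)
```

Here the output side takes the "custom parameters" route by switching DMA off, while the input side takes the "abdicate" route by returning nothing when the upstream domain matches its own.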
Back to back handled externally
Suppose that two blocks have completely external processing between them, and we don't want to manage any overhead in between the blocks at runtime. To handle this, we don't even need to pass around buffers; the buffer managers can be complete dummies. The producer block in this case shouldn't really bother producing. The consumer block in this case should just use work() to wait on operations to complete. Also, the domain information passed into the getInputBufferManager() and getOutputBufferManager() overloads can be used to directly determine the topology of the processing within the offload board.
Handling incompatible domains
Suppose two blocks have different memory domains, or more specifically, one of the overloaded calls throws because it can't handle the domain, or requires a custom buffer manager. In this case, the topology must insert a copy block into the stream.
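The per-connection decision can be summarised in a few lines. This is a simplified Python sketch, not the actual Topology::commit() logic: Block, PortDomainError, resolve(), and the string stand-ins for managers are all hypothetical, and the empty string is assumed to mean "no custom domain".

```python
class PortDomainError(Exception):
    """Raised by a block that cannot work with the other side's domain."""


class Block:
    def __init__(self, name, domain=""):
        self.name = name
        self.domain = domain  # "" means no custom memory domain

    def get_buffer_manager(self, other_domain):
        if self.domain == "":
            return None  # no preference: let the topology decide
        if other_domain not in ("", self.domain):
            raise PortDomainError(
                f"{self.name} cannot handle domain {other_domain!r}")
        return f"{self.domain} manager"


def resolve(src, dst):
    """Return (stream path, manager for the source port) for one connection."""
    try:
        src_mgr = src.get_buffer_manager(dst.domain)
        dst_mgr = dst.get_buffer_manager(src.domain)
    except PortDomainError:
        # Incompatible domains: bridge them with a copy block. Each half of
        # the split connection would then be resolved on its own.
        return [src.name, "copy", dst.name], None
    # Prefer a custom manager from either side, else fall back to generic.
    return [src.name, dst.name], src_mgr or dst_mgr or "generic manager"
```

Two plain blocks end up with the generic manager, a plain block next to a device block uses the device's manager, and two different custom domains get a copy block between them.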
Handling shared memory devices
The shared memory case isn't really any different than the DMA case. Although the memory doesn't need to move, we can use the same hooks to commit and release the memory, and to wait on the device to complete processing.
Handling doubly mapped buffers
Certain blocks, like filters that don't consume all of the input they have read, work more efficiently using doubly (or circularly) mapped buffers. That's because circular buffers avoid discontinuities, and therefore avoid requiring memory copies. In this case, the block should overload the get buffer manager call and return a circular buffer. A factory for circular buffers is provided by the API.
The buffer accumulator tracks the buffers coming in and amalgamates the managed buffers when they are contiguous. The logic also supports shifting the buffer's alias point to avoid discontinuities near the buffer's edge.
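The aliasing idea can be modelled without real virtual-memory tricks. In the Python sketch below, AliasedRing is a hypothetical class that emulates a doubly mapped buffer with modulo indexing: address a and address a + size refer to the same byte. A real implementation would map the same physical pages twice so that the window is an ordinary contiguous pointer.

```python
class AliasedRing:
    """Model of a doubly mapped buffer: addresses a and a + size alias."""

    def __init__(self, size):
        self.size = size
        self._store = bytearray(size)

    def normalize(self, address):
        """Shift an address back to its alias in the first mapping."""
        return address % self.size

    def write(self, address, data):
        if len(data) > self.size:
            raise ValueError("window larger than the ring")
        for i, byte in enumerate(data):
            self._store[(address + i) % self.size] = byte

    def read(self, address, length):
        """Read a window that may run past the end of the first mapping."""
        if length > self.size:
            raise ValueError("window larger than the ring")
        return bytes(self._store[(address + i) % self.size]
                     for i in range(length))
```

A 4-byte window starting 2 bytes before the end of an 8-byte ring reads back as one unbroken run, which is what lets a filter keep its unconsumed history in place instead of copying it to the front. The normalize() step corresponds loosely to shifting the alias point.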
Arbitrary topology cases
You can easily imagine more complicated topologies that combine all of the above with multiple producers to a consumer, and multiple consumers to a producer. In these cases, the goal of the Topology implementation is to determine which cases are acceptable by inspecting the blocks, and which cases need to be addressed by inserting copy blocks into the stream.
There is definitely more work to be done with this logic, because there are currently cases that get copy blocks but could, given the right information, be handled in a better way.