Divide-and-Conquer Ray Tracing on the GPU

Due to the highly parallel nature of DACRT, I wanted to move my previous implementation of the algorithm to the GPU. The architecture change provides a speedup for the DAC ray tracer, but it also highlights some of the problems with performing DACRT in a completely data-parallel fashion, such as the nondeterministic storage size of the ray and geometry sets.
To focus on testing DACRT itself and avoid writing primitive wrappers and kernels by hand, the implementation is based on the Thrust library.

A data-parallel DACRT algorithm

I based my data-parallel DACRT algorithm on the KD-tree construction of Zhou et al. [2008], for two reasons:

  1. I’m familiar with the idea, so it would be faster to get something up and running.
  2. My current GPU is an ‘old’ 330M, which isn’t that fast when it comes to atomic operations, so modern work-queue approaches and similar ideas are sadly not efficient on my hardware setup.

A list of nodes in DACRT represents a cut of a hierarchical acceleration structure over both the rays and the geometric primitives. As such, a node can be defined as

struct Node {
    Bound b;       // A hypercube bounding the current spatial subscene.
    Rays rs;       // The set of active rays within the bound.
    Primitives ps; // The set of active geometric primitives.
};

In the actual implementation, the sets of rays, rs, and primitives, ps, were simply begin and end indices into lists of their objects, but for simplicity's sake, and to save on pseudocode below, I shall use them as if they were the actual lists.
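To make the index-based layout concrete, here is a minimal sketch of what such a node might look like; the member names and the plain-array Bound are illustrative assumptions, not the post's actual code.

```cpp
#include <cstddef>

// Illustrative sketch: a node stores [begin, end) ranges into two global
// lists instead of owning its rays and primitives.
struct Bound { float min[3], max[3]; };

struct Node {
    Bound  bound;              // hypercube bounding the current subscene
    size_t rayBegin,  rayEnd;  // active rays: indices into a global ray list
    size_t primBegin, primEnd; // active primitives: indices into a global primitive list

    size_t rayCount()  const { return rayEnd  - rayBegin;  }
    size_t primCount() const { return primEnd - primBegin; }
};
```

Keeping rays and primitives in flat global lists, with nodes holding only ranges, is what makes the later scan-and-scatter partitioning possible on the GPU.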

DACRT(Node[] ns, Node[] nextNs, Node[] leafs) {
    // Move nodes where the ray tracing problem has been reduced
    // appropriately to the leaf list.
    foreach (Node n in ns) in parallel
        if (IsLeaf(n))
            leafs.add(n);

    // Divide the remaining nodes into smaller ones.
    foreach (Node n in ns where !IsLeaf(n)) in parallel {
        Bound[] bs' = Subdivide(n.b);
        foreach (Bound b' in bs') in parallel {
            Rays rs' = Intersecting(b', n.rs);
            Primitives ps' = Intersecting(b', n.ps);
            nextNs.add(Node(b', rs', ps'));
        }
    }
}

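The pseudocode leaves IsLeaf unspecified. A common DACRT termination criterion, sketched below as an assumption rather than taken from my implementation, is to stop subdividing once a node's workload is small enough that brute-force ray/primitive intersection testing is cheaper than further recursion; the threshold values are illustrative.

```cpp
#include <cstddef>

// Illustrative thresholds: below these, brute force beats further subdivision.
constexpr size_t kMaxRays  = 256;
constexpr size_t kMaxPrims = 16;

// A node becomes a leaf once either of its active sets is small enough.
bool IsLeaf(size_t rayCount, size_t primCount) {
    return rayCount <= kMaxRays || primCount <= kMaxPrims;
}
```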
While this seems incredibly easy, there is a lot going on behind the scenes in terms of bookkeeping. For example, since the GPU is a massively data-parallel device, the intersection and partitioning of rays is not left to a single thread as the code above would suggest, but is in fact parallelized across the entire device. Explaining how this is done is beyond the scope of this small post, but I may write about it at a later date, once I've experimented with different schemes. Suffice it to say that each ray must know which node it belongs to in order to determine which of the next nodes it should be associated with; the partitioning is then performed with a scan followed by a scatter, as described in Sengupta et al. [2007].
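The scan-and-scatter step can be illustrated with a small sequential sketch. On the GPU each stage would map to a Thrust primitive (an exclusive scan and a scatter); the function and parameter names here are my own illustrative choices, not the project's API. Each ray carries a flag saying whether it falls into the left child bound; an exclusive scan over the flags yields each left-going ray's output position, and a scatter compacts the rays into a left-then-right layout.

```cpp
#include <vector>
#include <numeric>
#include <cstddef>

// Sequential sketch of data-parallel ray partitioning via scan + scatter.
// rayIds:   the rays of one node, goesLeft[i] == 1 if ray i enters the left child.
// Returns the rays reordered so all left-child rays precede right-child rays.
std::vector<int> partitionRays(const std::vector<int>& rayIds,
                               const std::vector<int>& goesLeft) {
    size_t n = rayIds.size();

    // Exclusive scan: offsets[i] = number of left-going rays before index i.
    std::vector<int> offsets(n);
    std::exclusive_scan(goesLeft.begin(), goesLeft.end(), offsets.begin(), 0);
    int leftCount = offsets.empty() ? 0 : offsets.back() + goesLeft.back();

    // Scatter: left rays go to their scanned offset, right rays fill the tail.
    std::vector<int> out(n);
    int rightSeen = 0;
    for (size_t i = 0; i < n; ++i) {
        if (goesLeft[i]) out[offsets[i]] = rayIds[i];
        else             out[leftCount + rightSeen++] = rayIds[i];
    }
    return out;
}
```

The same pattern, run once per child bound, also yields the begin/end indices the next generation of nodes needs, since the scan's total is exactly the size of each partition.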

Spatial Partition vs. Object Partition


Source Code

The project’s source can be found on GitHub.


References

Naive Ray-Tracing: A Divide-and-Conquer Approach by Mora [2011]
Fast Ray Tracing by Ray Classification by Arvo and Kirk [1987]
Real-Time KD-Tree Construction on Graphics Hardware by Zhou et al. [2008]
Scan Primitives for GPU Computing by Sengupta et al. [2007]