smallpt in Rust

With Rust alpha 2, Rust has finally reached a level of stability where I feel comfortable exploring it and not worrying about spending more time updating my code due to Rust changes than I will spend time actually coding and learning Rust.

For my first Rust experiment I ported Kevin Beason’s smallpt to Rust. The result can be found here. The code hasn’t been squished into 99 lines, but is instead slightly more readable than the original. The reason for picking smallpt as a first project was to get a feel for Rust’s math library and try some simple threading with dataparallel loops. And of course render a nice image in the most inefficient way possible! :)

Cornell Spheres

So far I’m quite happy with Rust: The functional inspired syntax is nice and simple. Pattern matching and struct destructuring is quite powerful and makes for some very safe code patterns in combination with Option and Result. The scoped thread I used for dataparallelism was quite simple to use and it would be easy to wrap the whole thing in a dataparallel for_each method, which might be my next little Rust project. However, the borrow checker, which is Rust best and most unique selling point, is also the hardest part about Rust. Even in my tiny project with only a tiny amount of parallelism, I had to fight the borrow checker and try to interpret compiler errors. I am sure that if I spend enough time with Rust, me and the borrow checker could become best of friends, but there is a steep learning curve to it. Hopefully it will be worth it and I can transfer the ownership idioms and patterns that I learn in Rust back to my other projects in other languages.

Random generator on the GPU

Random number generators are a very important tool in the programmer toolbelt. It’s useful for producing arbitrary random behaviour in computer games, creating cool procedural textures/environments and sampling in Monte Carlo simulations.

A true random number generator will always produce a perfectly random and unpredictable number, but pseudo-random number generators are usually enough. A pseudo-random number generator will, given a seed, produce a sequence of numbers with good random properties. In this blog post I will outline how to implement such a random generator first on a single threaded CPU and then how to move it to the multithreaded GPU.

A barebone Pseudo-Random Number Generator, PNRG, implementation might look something like this

class PRNG {
private:
    unsigned int seed;
public:
    /**
     * Seeds the pseudo-random number generator.
     */
    PRNG(unsigned int s) {
        seed = s;
    }

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        ...
    }
};

Pretty basic so far. Seed the generator and of course store that seed. But I won’t give up the implementation of Next just yet. First we need to

In order for a PRNG to produce a sequence of numbers longer than one it needs to update its seed after each new random number. A simple update would be to increment the seed by 1, but to produce good random numbers, the seed itself should preferably be random. Lets think that over for a moment; after having produced a random number using our seed, we need to update the seed itself with a new random number. Lucky for us that we just created one, so we can use the random number generated as our new seed.

class PRNG {
private:
    unsigned int seed;
public:

    ...

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        seed = RANDOM_GENERATION(seed);
        return seed;
    }
};

All that remains is to select some algorithm to produce a random number from a seed. For my experiments I selected a linear congruential generator with the Microsoft Visual/Quick C/C++ parameters. This allowed me to remove the modulo operator, which can be costly, and it produced reasonably good results. The full implementation then becomes

class PRNG {
private:
    unsigned int seed;
public:
    /**
     * Seeds the pseudo-random number generator.
     */
    PRNG(unsigned int s) {
        seed = s;
    }

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        seed = 214013 * seed + 2531011;
        return seed;
    }
};

This PRNG works well on single threaded hardware, unfortunately it doesn’t work well on hardware with millions of independent threads, such as the GPU. The reason is obvious. The PRNG produces a list of random numbers and to produce a number for the million’th thread, we would have to call Next() a million times on that thread alone. In general the n‘th thread would have to call Next() n times, leading to a total of n(n+1) / 2 invocations to produce n numbers. Of course we could create a list of a million random values on the CPU and then send those to the GPU, but that solution both wastes memory and doesn’t scale well for threads requiring more than a few random numbers.

Fortunately we can do better. All we need to initialise the PRNG is a reasonably random seed, so if we can generate such a seed pr. thread, then each thread can create its own local PRNG and start happily creating random number sequences. To create the seed for each thread I send one random number, s, from the CPU to the GPU. I then add that number to a hash of the thread’s id, to create the seed. The reason that I hash the thread’s id is to avoid similarities between random numbers produced by neighbouring threads.

In code this amounts to

void GPUKernel(unsigned int CPUseed) {
    PRNG rand = PRNG(CPUseed + Hash(threadId))
    ... Do stuff with the random numbers ...
}

where I used the hash found in thrust’s Monte Carlo example.

unsigned int Hash(unsigned int a) {
   a = (a+0x7ed55d16) + (a<<12);
   a = (a^0xc761c23c) ^ (a>>19);
   a = (a+0x165667b1) + (a<<5);
   a = (a+0xd3a2646c) ^ (a<<9);
   a = (a+0xfd7046c5) + (a<<3);
   a = (a^0xb55a4f09) ^ (a>>16);
   return a;
}

Some last notes.

I used 32 bit unsigned ints throughout this post with a maximum value of 2^32. A lot of random generators won’t generate 32 random bits, but may fx. only generate 16 bits and leave the 16 most significant bits as 0. To be sure that we utilise the entire spectrum of random number sequences across all threads it is important that the seed contains at least as many significant bits as the PRNG is expected to produce. This is easily achieved by concatenating random numbers of fewer significant bits into a 32bit bitstring.

The linear congruential generator does not produce the best sequence of random numbers, which is required in monte carlo simulations. I selected it for my tests because it was fast and had a low memory overhead. For better random results try using another generator from the list found at wikipedia.