Random generator on the GPU

Random number generators are a very important tool in the programmer toolbelt. It’s useful for producing arbitrary random behaviour in computer games, creating cool procedural textures/environments and sampling in Monte Carlo simulations.

A true random number generator will always produce a perfectly random and unpredictable number, but pseudo-random number generators are usually enough. A pseudo-random number generator will, given a seed, produce a sequence of numbers with good random properties. In this blog post I will outline how to implement such a random generator first on a single threaded CPU and then how to move it to the multithreaded GPU.

A barebone Pseudo-Random Number Generator, PNRG, implementation might look something like this

class PRNG {
private:
    unsigned int seed;
public:
    /**
     * Seeds the pseudo-random number generator.
     */
    PRNG(unsigned int s) {
        seed = s;
    }

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        ...
    }
};

Pretty basic so far. Seed the generator and of course store that seed. But I won’t give up the implementation of Next just yet. First we need to

In order for a PRNG to produce a sequence of numbers longer than one it needs to update its seed after each new random number. A simple update would be to increment the seed by 1, but to produce good random numbers, the seed itself should preferably be random. Lets think that over for a moment; after having produced a random number using our seed, we need to update the seed itself with a new random number. Lucky for us that we just created one, so we can use the random number generated as our new seed.

class PRNG {
private:
    unsigned int seed;
public:

    ...

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        seed = RANDOM_GENERATION(seed);
        return seed;
    }
};

All that remains is to select some algorithm to produce a random number from a seed. For my experiments I selected a linear congruential generator with the Microsoft Visual/Quick C/C++ parameters. This allowed me to remove the modulo operator, which can be costly, and it produced reasonably good results. The full implementation then becomes

class PRNG {
private:
    unsigned int seed;
public:
    /**
     * Seeds the pseudo-random number generator.
     */
    PRNG(unsigned int s) {
        seed = s;
    }

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        seed = 214013 * seed + 2531011;
        return seed;
    }
};

This PRNG works well on single threaded hardware, unfortunately it doesn’t work well on hardware with millions of independent threads, such as the GPU. The reason is obvious. The PRNG produces a list of random numbers and to produce a number for the million’th thread, we would have to call Next() a million times on that thread alone. In general the n‘th thread would have to call Next() n times, leading to a total of n(n+1) / 2 invocations to produce n numbers. Of course we could create a list of a million random values on the CPU and then send those to the GPU, but that solution both wastes memory and doesn’t scale well for threads requiring more than a few random numbers.

Fortunately we can do better. All we need to initialise the PRNG is a reasonably random seed, so if we can generate such a seed pr. thread, then each thread can create its own local PRNG and start happily creating random number sequences. To create the seed for each thread I send one random number, s, from the CPU to the GPU. I then add that number to a hash of the thread’s id, to create the seed. The reason that I hash the thread’s id is to avoid similarities between random numbers produced by neighbouring threads.

In code this amounts to

void GPUKernel(unsigned int CPUseed) {
    PRNG rand = PRNG(CPUseed + Hash(threadId))
    ... Do stuff with the random numbers ...
}

where I used the hash found in thrust’s Monte Carlo example.

unsigned int Hash(unsigned int a) {
   a = (a+0x7ed55d16) + (a<<12);
   a = (a^0xc761c23c) ^ (a>>19);
   a = (a+0x165667b1) + (a<<5);
   a = (a+0xd3a2646c) ^ (a<<9);
   a = (a+0xfd7046c5) + (a<<3);
   a = (a^0xb55a4f09) ^ (a>>16);
   return a;
}

Some last notes.

I used 32 bit unsigned ints throughout this post with a maximum value of 2^32. A lot of random generators won’t generate 32 random bits, but may fx. only generate 16 bits and leave the 16 most significant bits as 0. To be sure that we utilise the entire spectrum of random number sequences across all threads it is important that the seed contains at least as many significant bits as the PRNG is expected to produce. This is easily achieved by concatenating random numbers of fewer significant bits into a 32bit bitstring.

The linear congruential generator does not produce the best sequence of random numbers, which is required in monte carlo simulations. I selected it for my tests because it was fast and had a low memory overhead. For better random results try using another generator from the list found at wikipedia.