# smallpt in Rust

With Rust alpha 2, Rust has finally reached a level of stability where I feel comfortable exploring it without worrying that I will spend more time updating my code to track Rust's changes than actually coding and learning the language.

For my first Rust experiment I ported Kevin Beason’s smallpt to Rust. The result can be found here. The code hasn’t been squished into 99 lines, but is instead slightly more readable than the original. The reason for picking smallpt as a first project was to get a feel for Rust’s math library and to try some simple threading with data-parallel loops. And of course to render a nice image in the most inefficient way possible!

So far I’m quite happy with Rust: the functionally inspired syntax is nice and simple. Pattern matching and struct destructuring are quite powerful and make for some very safe code patterns in combination with Option and Result. The scoped thread I used for data-parallelism was quite simple to use, and it would be easy to wrap the whole thing in a data-parallel for_each method, which might be my next little Rust project. However, the borrow checker, which is Rust's best and most unique selling point, is also the hardest part about Rust. Even in my tiny project with only a tiny amount of parallelism, I had to fight the borrow checker and try to interpret compiler errors. I am sure that if I spend enough time with Rust, the borrow checker and I could become the best of friends, but there is a steep learning curve to it. Hopefully it will be worth it and I can transfer the ownership idioms and patterns that I learn in Rust back to my other projects in other languages.

# Image Based Lighting

There are various ways of approximating global illumination in today's real-time 3D visualisations.

• Diffuse global illumination from static objects can be baked into the scene and light probes can be used for applying diffuse GI to dynamic actors.
• NVIDIA recently released their GIWorks solution which relies on voxel/cone tracing to generate real-time GI from both static and dynamic actors.
• There are also certain screen space techniques, as can be seen in the Unreal Engine 4.

However, these techniques either only support diffuse lighting, require a lot of time, effort and know-how to implement or aren’t supported by low-end hardware, such as phones or laptops. Another alternative that I will describe here is a simple, yet powerful technique, called Image Based Lighting.

Image based lighting, or IBL, is an extension to lighting using environment maps. But where environment maps are usually used to simulate purely specular reflections in the scene, IBL uses irradiance environment maps to simulate everything from purely specular reflections, to glossy and diffuse reflections. Fig. 1 shows such an environment map used for specular lighting and some of the irradiance environment maps created from it. Fig. 2 shows the application of the irradiance environment maps. Notice the diffuse reflection on the T-Rex, the glossy reflections in the chessboard and the Fresnel effect on the snow speeder, which has diffuse surfaces when looking directly at them, but glossy surfaces at grazing angles.

Fig 1. Multiple irradiance environment maps created with different glossiness factors. The left one is completely specular and the right one is very rough.

Fig 2. A scene containing a LEGO Star Wars Snow Speeder getting attacked by a T-Rex with a teapot on a chessboard at the beach. (Sounds ridiculous when you spell it out like that.)

Image Based Lighting basics

So how do we go about creating an effect like this? First we need to take a look at our surface description. Our surface is described by the BRDF, f_r(\omega_o, \omega_i), where \omega_o is the direction of the view vector and \omega_i is the direction towards some incoming light. In layman’s terms the BRDF describes the ratio of incoming radiance that is reflected towards the viewer. The total amount of light arriving from \omega_i, L(\omega_i), that is reflected in the view direction is computed by

L_o(\omega_o, \omega_i) = f_r(\omega_o, \omega_i) L(\omega_i) cos\theta_i

In most non-physically based rendering systems, a surface is described by a Lambert BRDF and a non-normalized Blinn or Phong BRDF for creating specular highlights.
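For reference, the corresponding normalized (energy-conserving) forms, which physically based systems use instead, are

```latex
f_{Lambert}(\omega_o, \omega_i) = \frac{\rho_d}{\pi}, \qquad
f_{Phong}(\omega_o, \omega_i) = \rho_s \, \frac{n + 2}{2\pi} \cos^n \alpha_r
```

where \rho_d and \rho_s are the diffuse and specular albedos, n is the Phong exponent and \alpha_r is the angle between the perfect specular reflection direction and \omega_i.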

With that in mind we have a basic description of light reflecting off a surface for one incoming and one outgoing direction. However, our environment doesn’t represent light arriving at the surface from a single direction, as a point light does; instead it illuminates the surface from every possible incoming direction, so it represents incoming light over the entire visible hemisphere. So to compute the amount of light reflected from the environment along \omega_o, we have to integrate over the visible hemisphere, \Omega, that is

L_o(\omega_o) = \int_\Omega \! f_r(\omega_o, \omega_i) L_{env}(\omega_i) cos\theta_i\, \mathrm{d}\omega_i

Preconvoluting the environment map

Solving this integral for arbitrary environment maps or complex BRDFs is not feasible, and approximating it using Monte Carlo integration cannot be done in real-time on low-end hardware, so we need to come up with an approximation for real-time use. An in-depth description of how this can be done and what the tradeoffs are can be found in Real Shading in Unreal Engine 4. In essence we need to split the integral and separately integrate over the environment, which convolutes the environment, and integrate the BRDF over the hemisphere, computing the scale that the BRDF applies to the environment light. Convoluting the environment can then be done in a preprocess. In my implementation I have chosen to convolute using the power cosine distribution instead of the actual BRDF. This means that the irradiance map won’t match the BRDF completely, but as I wanted multiple BRDFs to use the same map, it isn't possible to match all of them at the same time. The IBL will be precomputed from the environment map using the expression

L_{IBL}(\omega_o, \sigma) = \int_\Omega \! f_{powercos}(\omega_o, \omega_i, \sigma) L_{env}(\omega_i) \, \mathrm{d}\omega_i

which can be done using Monte Carlo integration, where \sigma is the relative ‘shininess’ or glossiness of the surface and f_{powercos} is the power cosine distribution. The whole precomputation is stored in a single texture, with the low-frequency irradiance maps stored in the low-resolution mip levels.
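To make the precomputation concrete, here is a minimal sketch of the Monte Carlo integration, assuming a constant all-white environment (L_{env} = 1) so the estimate can be checked against the analytic answer, 2\pi / (\sigma + 1). The function name and sample counts are illustrative, not part of the actual implementation.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Monte Carlo estimate of the preconvolution integral for a constant
// all-white environment. Directions are sampled uniformly over the
// hemisphere, so the pdf is 1 / (2*pi).
double ConvoluteConstantEnvironment(double sigma, int sampleCount) {
    const double kPi = 3.14159265358979323846;
    std::mt19937 rng(17); // fixed seed so the estimate is reproducible
    std::uniform_real_distribution<double> uniform01(0.0, 1.0);

    double sum = 0.0;
    for (int i = 0; i < sampleCount; ++i) {
        // For uniform hemisphere sampling, cos(theta) is uniform in [0, 1].
        double cosTheta = uniform01(rng);
        sum += std::pow(cosTheta, sigma); // power cosine lobe times L_env = 1
    }
    // Average the samples and divide by the pdf.
    return 2.0 * kPi * sum / sampleCount;
}
```

For \sigma = 1 the analytic value is \pi, which the estimate converges to as the sample count grows; the actual preprocess does the same thing per texel with the environment map lookup in place of the constant.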

Apply to BRDF

With our convoluted environment map we can now efficiently fetch convoluted incident radiance from the environment map. All we then have to do to apply the light to the material is to scale it by the BRDF. To do that we have to estimate the amount of light coming from the environment that is reflected along \omega_o, i.e. we have to solve

\rho(\omega_o) = \int_\Omega \! f_r(\omega_o, \omega_i) cos\theta_i\, \mathrm{d}\omega_i

which is known as the directional-hemispherical reflectance function. How to do this depends on f_r. For simple BRDFs, like Lambert or a completely specular BRDF, solving the integral is doable. More complex BRDFs cannot be solved analytically in real-time, but the solution can be baked into a texture and looked up at runtime, as discussed in Real Shading in Unreal Engine 4. Encoding \rho in a spherical harmonic is also a possibility. Choose the right tool for the right BRDF. However \rho is resolved, we now have all the components needed for applying the IBL to the BRDF and computing the radiance reflected along \omega_o
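As a concrete example, for the Lambert BRDF, f_r = \rho_d / \pi, the integral collapses to the diffuse albedo itself:

```latex
\rho(\omega_o) = \int_\Omega \frac{\rho_d}{\pi} \cos\theta_i \, \mathrm{d}\omega_i
             = \frac{\rho_d}{\pi} \int_0^{2\pi} \!\! \int_0^{\pi/2} \cos\theta \sin\theta \, \mathrm{d}\theta \, \mathrm{d}\phi
             = \rho_d
```

which is why diffuse IBL can simply tint the convoluted environment lookup by the surface albedo.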

L_o(\omega_o) = \rho(\omega_o) L_{IBL}(\omega_o, \sigma_r(\omega_o))

Demo Time

The result can be seen in the demo below or a demo with more IBLs can be downloaded for Windows and Mac.
Note that due to texture resolutions, mirror reflections are not supported in the webplayer.

Future Work

A detail that was conveniently left out was how to compute \sigma_r for a given BRDF. \sigma_r represents an approximate mapping from an arbitrary BRDF to the power cosine distribution, but it is an approximation and as such estimating it depends on the BRDF used. Estimating \sigma_r becomes even harder for anisotropic BRDFs, where a single L_{IBL}(\omega_o, \sigma_r) sample isn’t enough to capture the anisotropic properties of the surface. Interesting future work would be developing a general framework for estimating \sigma_r for any isotropic BRDF. Additionally it would be interesting to investigate if glossiness could be broken down into N cones inside the hemisphere for anisotropic BRDFs and used to create a believable estimation of anisotropic reflections.

# Blob-O-Matic

So, shadows. The eternal challenge with rasterised rendering.
There are many use cases for shadows, which account for the many ways in which to approximate them; several shadow mapping techniques exist for real time shadows from dynamic geometry, there’s baked lighting for static geometry, there’s screen space ambient occlusion for ambient occlusion between both dynamic and static geometry and there’s blob shadows for highly ambient lit environments, where the geometry needs to look grounded. The last case is the one we’ll concern ourselves with in this post.
When creating 3D applications with a lot of (20 to 10,000) dynamically placed static objects, targeted at low-end devices, e.g. indoor architectural visualisation in a browser, you can’t depend on having resources for fancy screen space effects or lots of rendering passes to create shadow maps, first-bounce GI or SSAO. What is usually done instead is to create blob shadows for the static assets located on even surfaces and use those blob shadows to ground the objects in the scene. These blob shadows are created by artists by rendering the silhouette of the object and then blurring the silhouette by hand in e.g. Photoshop. The blurring can either be done by simply applying a uniform blur filter to the entire silhouette, or by blurring it by feel to achieve a hard shadow where the geometry is close to the floor and a soft shadow where the geometry is further away. An example of uniform blurring vs context sensitive blurring can be seen in the images below.

Fig 2a: Hard bike shadow with depth dependent falloff

Fig 2b: Soft bike shadow with depth dependent falloff

In figure 1a we see the hard silhouette of the bike projected onto the floor below it, and in 1b the silhouette has been blurred six times using a fixed-size blur radius.
In figure 2a the silhouette is still hard, but it has been created with an exponential intensity falloff based on the depth of the shadow. Figure 2b then shows the silhouette from 2a, blurred six times using a blur radius that depends logarithmically on the intensity of the currently processed pixel.
The difference is clearly noticeable; in figure 1b the naive blob shadow of the bike looks like the blob shadow of a shark or a rocket, and while it does help to ground the bike in the scene, it doesn’t give the viewer any impression of the vertical shape of the bike. In figure 2b we get a much better blob shadow: here the shadow is strong in places where the bike’s geometry is close to the floor and drops off as parts of the bike get further away. This is especially obvious around the wheels, where the shadow’s decrease in intensity can be seen to follow the curvature of the wheel.
Having seen artists spend days on creating these blob shadows, with varying success, and with the prospect of them spending/wasting even more time on it in the future, I decided to try to create a blob shadow creator extension for Unity, a Blob-O-Matic if you will.
The process is relatively easy. First the model is rendered to a rendertexture using an orthographic camera located right below the model. The distance to the far plane is set to the distance that we want the shadows to be cast. The model gets rendered using a replacement shader that outputs the geometry’s distance to the near plane instead of its color. This gives us the silhouette seen above in figure 2a. Alternatively the depth buffer could have been used to grab the distance of the vertex from the near plane. This would allow the plugin to support arbitrary static vertex shaders, but the current method is faster, the codebase simpler, and if you’re performing vertex deformation of a static model in the vertex shader, you should probably consider preprocessing the model anyway.
Once the depth-sensitive silhouette has been rendered, multiple blur passes are performed on it. These all use a varying blur kernel size, where dark pixels are assigned a low blur radius and brighter pixels are given a higher one. The result is that dark shadows, created where the model is close to the plane, stay dark, while brighter shadows get blurrier and receive a larger penumbra. To reduce banding artefacts from the blur passes I opted not to use Unity’s cone tap technique and instead blurred by simply sampling at an offset into the silhouette's mipmap levels. The downside to using mipmaps for blurring is that I can’t specify my own blur kernels (Gaussian, spline-based), which is why multiple blur passes can be specified to approximate a Gaussian blur.
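A sketch of the intensity-to-mip mapping described above. The function name and the exact curve are my assumptions; the post only states that the blur radius depends logarithmically on the intensity of the currently processed pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical mapping from silhouette intensity to the mip level used as the
// blur radius. Dark pixels (geometry close to the floor) stay at mip 0 and
// remain sharp, while bright pixels sample increasingly low-resolution, i.e.
// blurrier, mips.
float BlurMipLevel(float intensity, float maxMipLevel) {
    intensity = std::min(std::max(intensity, 0.0f), 1.0f);
    // log2 maps intensities in [0, 1] to [0, 1] before scaling by the mip count.
    return maxMipLevel * std::log2(1.0f + intensity);
}
```

In the actual shader this would drive the explicit-LOD texture lookup for each blur pass.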
The end result can be seen and experimented with in this Unity webplayer.

I am currently putting together an editor extension, where models and prefabs can be selected in the Inspector and allow the user to bake a drop shadow into them. A picture of the extension can be seen below with shadows generated for the famous Utah teapot and Mitsuba’s material preview model.

Blob-O-Matic GUI

As future work on the extension I would like to add support for depth independent transparent geometry. This would require two passes, the first pass would render opaque geometry as is done now, the second pass would collect the opacity of the transparent geometry and combine it as is done in standard Whitted ray tracers.
The current version of the plugin also requires a Unity layer to be set aside for the extension, as all models are placed and rendered from the scene. As far as I can see this is the only solution that allows me to render combined models into an arbitrarily sized rendertexture and let them be rotated by the user. AssetPreview.GetAssetPreview is simply too limited right now. If anyone knows a better solution I would love to hear it, as the current solution is far from perfect and can easily result in hidden memory leaks, although they only exist at editor time.
The GUI itself could use a major overhaul, but for now it does what it is supposed to.

# GLSL shader sites

While browsing the programming subreddit today I learned about two really cool GLSL shader sites with live demos, both of which deserve to be shared and remembered for future browsing.

Both sites host the live demos along with the GLSL source code that renders them using just two triangles. There’s a ton of Mandelbrots, raymarchers and cool small demoscene examples. My personal favorites are the cloud demo, which I hope to implement a lighter version of in my own terrain demo at some point, and a raymarched version of the 2009 demo Elevated by RGBA and TBC.

# Random generator on the GPU

Random number generators are a very important tool in the programmer's toolbelt. They're useful for producing arbitrary random behaviour in computer games, creating cool procedural textures/environments and sampling in Monte Carlo simulations.

A true random number generator will always produce a perfectly random and unpredictable number, but pseudo-random number generators are usually enough. A pseudo-random number generator will, given a seed, produce a sequence of numbers with good random properties. In this blog post I will outline how to implement such a random generator, first on a single-threaded CPU and then show how to move it to the multithreaded GPU.

A barebone Pseudo-Random Number Generator, PRNG, implementation might look something like this

class PRNG {
private:
    unsigned int seed;
public:
    /**
     * Seeds the pseudo-random number generator.
     */
    PRNG(unsigned int s) {
        seed = s;
    }

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        ...
    }
};


Pretty basic so far. Seed the generator and of course store that seed. But I won’t give up the implementation of Next just yet. First we need to look at how the seed evolves.

In order for a PRNG to produce a sequence of numbers longer than one, it needs to update its seed after each new random number. A simple update would be to increment the seed by one, but to produce good random numbers, the seed itself should preferably be random. Let's think that over for a moment: after having produced a random number using our seed, we need to update the seed itself with a new random number. Luckily, we just created one, so we can use the random number we generated as our new seed.

class PRNG {
private:
    unsigned int seed;
public:

    ...

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        seed = RANDOM_GENERATION(seed);
        return seed;
    }
};


All that remains is to select some algorithm to produce a random number from a seed. For my experiments I selected a linear congruential generator with the Microsoft Visual/Quick C/C++ parameters. This allowed me to remove the modulo operator, which can be costly, and it produced reasonably good results. The full implementation then becomes

class PRNG {
private:
    unsigned int seed;
public:
    /**
     * Seeds the pseudo-random number generator.
     */
    PRNG(unsigned int s) {
        seed = s;
    }

    /**
     * Produces the next pseudo-random number in the sequence.
     */
    unsigned int Next() {
        seed = 214013 * seed + 2531011;
        return seed;
    }
};
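As a quick usage sketch (the class is repeated so the snippet stands alone): two generators created with the same seed produce the same sequence, and seed 1 yields 214013 * 1 + 2531011 = 2745024 as its first number. FirstNumber is a hypothetical helper for illustration.

```cpp
#include <cassert>

// The PRNG from above, reproduced so this usage example is self-contained.
class PRNG {
private:
    unsigned int seed;
public:
    PRNG(unsigned int s) { seed = s; }

    unsigned int Next() {
        // Linear congruential generator; unsigned overflow gives the mod 2^32.
        seed = 214013 * seed + 2531011;
        return seed;
    }
};

// Convenience helper: the first number of the sequence for a given seed.
unsigned int FirstNumber(unsigned int seed) {
    return PRNG(seed).Next();
}
```

Determinism is exactly what we exploit on the GPU below: the same seed always reproduces the same sequence.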


This PRNG works well on single-threaded hardware; unfortunately it doesn’t work well on hardware with millions of independent threads, such as the GPU. The reason is obvious: the PRNG produces a sequential list of random numbers, and to produce a number for the millionth thread, we would have to call Next() a million times on that thread alone. In general the n-th thread would have to call Next() n times, leading to a total of n(n+1)/2 invocations to produce n numbers. Of course we could create a list of a million random values on the CPU and then send those to the GPU, but that solution both wastes memory and doesn’t scale well for threads requiring more than a few random numbers.

Fortunately we can do better. All we need to initialise the PRNG is a reasonably random seed, so if we can generate such a seed per thread, then each thread can create its own local PRNG and happily start creating random number sequences. To create the seed for each thread I send one random number, s, from the CPU to the GPU. I then add that number to a hash of the thread's id to create the seed. The reason that I hash the thread's id is to avoid similarities between random numbers produced by neighbouring threads.

In code this amounts to

void GPUKernel(unsigned int CPUseed) {
    PRNG rand = PRNG(CPUseed + Hash(threadId));
    ... Do stuff with the random numbers ...
}


where I used the hash found in thrust’s Monte Carlo example.

unsigned int Hash(unsigned int a) {
    a = (a+0x7ed55d16) + (a<<12);
    a = (a^0xc761c23c) ^ (a>>19);
    a = (a+0x165667b1) + (a<<5);
    a = (a+0xd3a2646c) ^ (a<<9);
    a = (a+0xfd7046c5) + (a<<3);
    a = (a^0xb55a4f09) ^ (a>>16);
    return a;
}


Some last notes.

I used 32-bit unsigned ints throughout this post, with a maximum value of 2^32 - 1. A lot of random generators won't generate 32 random bits, but may e.g. only generate 16 bits and leave the 16 most significant bits as 0. To be sure that we utilise the entire spectrum of random number sequences across all threads, it is important that the seed contains at least as many significant bits as the PRNG is expected to produce. This is easily achieved by concatenating random numbers with fewer significant bits into a 32-bit bitstring.
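A minimal sketch of that concatenation; Concatenate16BitDraws is a hypothetical helper, and its two parameters stand in for two consecutive outputs of a generator that only produces 16 significant bits per call.

```cpp
#include <cassert>

// Build a full 32-bit seed from two 16-bit draws by placing the second draw
// in the upper half of the word and the first draw in the lower half.
unsigned int Concatenate16BitDraws(unsigned int draw0, unsigned int draw1) {
    return ((draw1 & 0xFFFFu) << 16) | (draw0 & 0xFFFFu);
}
```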

The linear congruential generator does not produce the best sequence of random numbers, which matters in Monte Carlo simulations. I selected it for my tests because it is fast and has a low memory overhead. For better random results, try another generator from the list found on Wikipedia.

# Morton Codes

I’ve recently had to use 3D and 5D Morton codes for spatial datastructure construction. Unfortunately I found that the material on the net actually describing how to encode, decode and interpret Morton codes ranges from non-existent to shallow at best. The best sources I’ve found are Wikipedia and an excellent blog post by Fabian “ryg” Giesen. This spurred me to write this blog post about my own experience with Morton codes and link to resources with more information.

Introduction

Morton codes provide an ordering along a space-filling curve while preserving data locality. Let's break that down a bit, shall we?

A space-filling curve is a curve whose range contains the entire N-dimensional hypercube enclosing your data domain, mapped to discrete cells. When working with 3-dimensional data the hypercube is usually just an axis-aligned bounding box. What this all means is that the data is enclosed by the hypercube and that hypercube is then subdivided along each dimension, creating an N-dimensional grid; let's call it a hypergrid. A space-filling curve that contains the entire hypercube is guaranteed to pass through each cell of the grid.

A space-filling curve preserves data locality if similar data gets mapped to the same range on the curve. This means that a cell in the hypergrid gets mapped to a range on the curve, and therefore all the data inside that cell gets mapped to a position inside that range. A good space-filling curve will also map neighbouring cells close to each other.

Morton codes provide such a spatially local mapping. Below are three iterations of the Morton order in two and three dimensions, where each iteration adds a new subdivision to the cells of the previous iteration.

MortonCurve 2×2

MortonCurve 4×4

MortonCurve 8×8

MortonCurve 2x2x2

MortonCurve 4x4x4

MortonCurve 8x8x8

Encoding Morton Codes

Creating the Morton code for some data amounts to determining the coordinates, (c_x, c_y), of the data inside the hypergrid and interleaving those coordinates. The figure below shows the Morton codes and curve in the 2-dimensional case with an 8×8 grid. Interleaving the binary representation of a cell’s coordinates yields the Morton code for the respective cell, and connecting the cells in order of increasing Morton code traces out the entire Morton curve.

Binary Morton curve creation

Implementing a function that can create Morton codes boils down to two problems: first separating the least significant bits of c_x and c_y by N-1 positions, and secondly combining them into a Morton code.

The implementation that separates the bits via a classical Divide-and-Conquer approach looks like this. In the comments after each line you can see the position of the significant bits.

unsigned int SeparateBy1(unsigned int x) {
    x &= 0x0000ffff;                  // x = ---- ---- ---- ---- fedc ba98 7654 3210
    x = (x ^ (x <<  8)) & 0x00ff00ff; // x = ---- ---- fedc ba98 ---- ---- 7654 3210
    x = (x ^ (x <<  4)) & 0x0f0f0f0f; // x = ---- fedc ---- ba98 ---- 7654 ---- 3210
    x = (x ^ (x <<  2)) & 0x33333333; // x = --fe --dc --ba --98 --76 --54 --32 --10
    x = (x ^ (x <<  1)) & 0x55555555; // x = -f-e -d-c -b-a -9-8 -7-6 -5-4 -3-2 -1-0
    return x;
}


Adapting the code to separate by two, three or four is pretty straightforward. Since I needed separation by 4 for a 5D ray representation I'll include that here.

unsigned int SeparateBy4(unsigned int x) {
    x &= 0x0000007f;                  // x = ---- ---- ---- ---- ---- ---- -654 3210
    x = (x ^ (x << 16)) & 0x0070000F; // x = ---- ---- -654 ---- ---- ---- ---- 3210
    x = (x ^ (x <<  8)) & 0x40300C03; // x = -6-- ---- --54 ---- ---- 32-- ---- --10
    x = (x ^ (x <<  4)) & 0x42108421; // x = -6-- --5- ---4 ---- 3--- -2-- --1- ---0
    return x;
}


Once we are able to separate the bits in the coordinate's binary representation, interleaving the result to obtain the Morton code is a simple matter of bit-shifting and or'ing.

MortonCode MortonCode2(unsigned int x, unsigned int y) {
    return SeparateBy1(x) | (SeparateBy1(y) << 1);
}

MortonCode MortonCode5(unsigned int x, unsigned int y, unsigned int z, unsigned int u, unsigned int v) {
    return SeparateBy4(x) | (SeparateBy4(y) << 1) | (SeparateBy4(z) << 2) | (SeparateBy4(u) << 3) | (SeparateBy4(v) << 4);
}
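As a sanity check against the binary figure above, cell (2, 3) in the 8×8 grid should encode to 001110, i.e. 14. The functions are repeated here so the snippet compiles on its own:

```cpp
#include <cassert>

typedef unsigned int MortonCode;

// SeparateBy1 and MortonCode2 from above, repeated for self-containment.
unsigned int SeparateBy1(unsigned int x) {
    x &= 0x0000ffff;
    x = (x ^ (x << 8)) & 0x00ff00ff;
    x = (x ^ (x << 4)) & 0x0f0f0f0f;
    x = (x ^ (x << 2)) & 0x33333333;
    x = (x ^ (x << 1)) & 0x55555555;
    return x;
}

MortonCode MortonCode2(unsigned int x, unsigned int y) {
    return SeparateBy1(x) | (SeparateBy1(y) << 1);
}
// MortonCode2(2, 3) == 14, i.e. binary 001110.
```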

Decoding Morton Codes

Given a Morton code it can be useful to know which coordinates were used to generate it by deconstructing or decoding it. Doing this amounts to reversing the encoding, which is done by inverting the shifts and reversing the order of the masks, yielding the following functions.

unsigned int CompactBy1(unsigned int x) {
    x &= 0x55555555;                  // x = -f-e -d-c -b-a -9-8 -7-6 -5-4 -3-2 -1-0
    x = (x ^ (x >>  1)) & 0x33333333; // x = --fe --dc --ba --98 --76 --54 --32 --10
    x = (x ^ (x >>  2)) & 0x0f0f0f0f; // x = ---- fedc ---- ba98 ---- 7654 ---- 3210
    x = (x ^ (x >>  4)) & 0x00ff00ff; // x = ---- ---- fedc ba98 ---- ---- 7654 3210
    x = (x ^ (x >>  8)) & 0x0000ffff; // x = ---- ---- ---- ---- fedc ba98 7654 3210
    return x;
}

unsigned int CompactBy4(unsigned int x) {
    x &= 0x42108421;                  // x = -6-- --5- ---4 ---- 3--- -2-- --1- ---0
    x = (x ^ (x >>  4)) & 0x40300C03; // x = -6-- ---- --54 ---- ---- 32-- ---- --10
    x = (x ^ (x >>  8)) & 0x0070000F; // x = ---- ---- -654 ---- ---- ---- ---- 3210
    x = (x ^ (x >> 16)) & 0x0000007F; // x = ---- ---- ---- ---- ---- ---- -654 3210
    return x;
}


Using the CompactByN functions, we can construct a MortonDecodeN function similar to MortonCodeN seen above.

void MortonDecode2(MortonCode c, unsigned int& x, unsigned int& y) {
    x = CompactBy1(c);
    y = CompactBy1(c >> 1);
}

void MortonDecode5(MortonCode c, unsigned int& x, unsigned int& y, unsigned int& z, unsigned int& u, unsigned int& v) {
    x = CompactBy4(c);
    y = CompactBy4(c >> 1);
    z = CompactBy4(c >> 2);
    u = CompactBy4(c >> 3);
    v = CompactBy4(c >> 4);
}
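Note that the outputs must be passed by reference (or pointer) for the decoded coordinates to reach the caller. A self-contained round trip in the 2D case, assuming the by-reference signature; RoundTrips is a hypothetical helper for the check:

```cpp
#include <cassert>

typedef unsigned int MortonCode;

// Encode/decode pair from above, repeated so the round trip compiles alone.
unsigned int SeparateBy1(unsigned int x) {
    x &= 0x0000ffff;
    x = (x ^ (x << 8)) & 0x00ff00ff;
    x = (x ^ (x << 4)) & 0x0f0f0f0f;
    x = (x ^ (x << 2)) & 0x33333333;
    x = (x ^ (x << 1)) & 0x55555555;
    return x;
}

unsigned int CompactBy1(unsigned int x) {
    x &= 0x55555555;
    x = (x ^ (x >> 1)) & 0x33333333;
    x = (x ^ (x >> 2)) & 0x0f0f0f0f;
    x = (x ^ (x >> 4)) & 0x00ff00ff;
    x = (x ^ (x >> 8)) & 0x0000ffff;
    return x;
}

MortonCode MortonCode2(unsigned int x, unsigned int y) {
    return SeparateBy1(x) | (SeparateBy1(y) << 1);
}

void MortonDecode2(MortonCode c, unsigned int& x, unsigned int& y) {
    x = CompactBy1(c);
    y = CompactBy1(c >> 1);
}

// Any 16-bit cell coordinate survives encode followed by decode.
bool RoundTrips(unsigned int x, unsigned int y) {
    unsigned int dx, dy;
    MortonDecode2(MortonCode2(x, y), dx, dy);
    return dx == x && dy == y;
}
```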

Constructing Acceleration Datastructures

Recently a lot of methods for creating acceleration datastructures for ray tracing have focused on using space-filling curves to speed up construction time, e.g. LBVH and HLBVH. The idea is simple: map your multi-dimensional data onto the one-dimensional space-filling curve, sort the data according to its position on the curve and then create an acceleration structure on top of the cells. Due to the recursive nature of Morton codes, creating the acceleration structure is very simple.

An interesting observation about Morton codes and acceleration structures is that the binary representation of a Morton code, m, can be interpreted as a traversal of an implicit binary tree to reach m's cell. The traversal is performed by looking at each bit in turn, starting with the most significant bit. Each bit represents a splitting of the domain's data along the axis that the bit derived its value from. The value of the bit will then tell us which side of that splitting plane our traversal should proceed to, with 0 meaning we traverse the 'lower' child and 1 meaning the 'upper' child.

This is best clarified by example so again we will look at the figure with binary Morton codes.

Binary Morton code representation

Using cell (2, 3) as our example cell we are working with the Morton code 001110. Starting from the tree's root node, which spans the entire domain, (0,0) -> (7,7), the first bit, **0**01110, tells us that our bounds should be split along the spatial y-median, between y=3 and y=4, and that the cell we are looking for is found among the lower-numbered cells. Splitting (0,0) -> (7,7) along y's spatial median we get the new bounds (0,0) -> (7,3).

Next the second bit, 0**0**1110, tells us that we should split the bounds along the spatial x-median, between x=3 and x=4, and again visit the lower child. This yields the new bounds (0,0) -> (3,3).

Repeating this process for all bits in the Morton code we get the bounds
(0,0) -> (3,3) => (0,2) -> (3,3) => (2,2) -> (3,3) => (2,3) -> (3,3) => (2,3) -> (2,3)
where the last bounds enclose our starting cell (2,3) perfectly.
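The traversal described above can be sketched directly in code. Bounds and DecodeBounds are hypothetical names, and even/odd bit positions are assumed to hold x and y bits respectively, matching the encoding used throughout this post:

```cpp
#include <cassert>

struct Bounds { unsigned int minX, minY, maxX, maxY; };

// Interpret a 2D Morton code as a traversal of the implicit binary tree:
// consume bits from most to least significant, halving the bounds along the
// axis each bit belongs to (odd bit positions are y, even positions are x).
Bounds DecodeBounds(unsigned int mortonCode, int bitCount) {
    unsigned int sideLength = 1u << (bitCount / 2);
    Bounds bounds = { 0, 0, sideLength - 1, sideLength - 1 };
    for (int i = bitCount - 1; i >= 0; --i) {
        bool upperChild = (mortonCode >> i) & 1;
        if (i % 2 == 1) { // y bit: split along the y-median
            unsigned int median = (bounds.minY + bounds.maxY) / 2;
            if (upperChild) bounds.minY = median + 1; else bounds.maxY = median;
        } else {          // x bit: split along the x-median
            unsigned int median = (bounds.minX + bounds.maxX) / 2;
            if (upperChild) bounds.minX = median + 1; else bounds.maxX = median;
        }
    }
    return bounds;
}
```

For the example above, DecodeBounds(14, 6), i.e. Morton code 001110 in an 8×8 grid, ends at the bounds (2,3) -> (2,3), exactly the example cell.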

Usually, knowing how to encode and decode Morton codes is enough to sort data and create acceleration structures around that data. However, recently I've had to recursively assign geometry to a sorted list of rays, starting with assigning the geometry seen by all the rays to the root node and then recursively refine that for each child. In order to do that I found it helpful to do several things.

• Given two Morton codes, m0 and m1, determine the Morton code that encodes the minimum coordinates of both m0 and m1.
• Given two Morton codes determine the encoding of their maximum coordinates.

Minimum and maximum

Finding the Morton code that encodes the minimum coordinates of two Morton codes is trivial.

MortonCode Minimum2(MortonCode lhs, MortonCode rhs) {
    unsigned int lhsX, lhsY, rhsX, rhsY;

    // Decode the left hand side coordinates.
    MortonDecode2(lhs, lhsX, lhsY);

    // Decode the right hand side coordinates.
    MortonDecode2(rhs, rhsX, rhsY);

    // Find the minimum of the coordinates and re-encode.
    return MortonCode2(Min(lhsX, rhsX), Min(lhsY, rhsY));
}


But we can do better! Decoding and re-encoding means doing a lot of bit-shifting instructions that we'd like to avoid. First, realising that we're not interested in the actual values of x and y, but rather in the minimum of their encoded values, is a tremendous help. Secondly, SeparateBy1 preserves ordering, so if X < Y then SeparateBy1(X) < SeparateBy1(Y), as long as X and Y are valid coordinates. Using this we can construct a Minimum function that avoids a lot of instructions.

MortonCode Minimum2(MortonCode lhs, MortonCode rhs) {
    // Isolate the encoded coordinates.
    unsigned int lhsX = lhs & 0x55555555; // lhsX = -f-e -d-c -b-a -9-8 -7-6 -5-4 -3-2 -1-0
    unsigned int rhsX = rhs & 0x55555555; // rhsX = -f-e -d-c -b-a -9-8 -7-6 -5-4 -3-2 -1-0
    unsigned int lhsY = lhs & 0xAAAAAAAA; // lhsY = f-e- d-c- b-a- 9-8- 7-6- 5-4- 3-2- 1-0-
    unsigned int rhsY = rhs & 0xAAAAAAAA; // rhsY = f-e- d-c- b-a- 9-8- 7-6- 5-4- 3-2- 1-0-

    // Find the minimum of the encoded coordinates and combine them into a single word.
    return Min(lhsX, rhsX) + Min(lhsY, rhsY);
}


Maximum can be constructed in a similar manner.
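For completeness, here is a sketch of that maximum variant built on the same observation, with std::max standing in for the Min/Max helpers used above:

```cpp
#include <algorithm>
#include <cassert>

typedef unsigned int MortonCode;

// Maximum analogue of Minimum2: isolate the interleaved x and y bits, take
// the per-axis maximum of the encoded values and recombine them. This works
// because SeparateBy1 preserves ordering of valid coordinates.
MortonCode Maximum2(MortonCode lhs, MortonCode rhs) {
    unsigned int lhsX = lhs & 0x55555555; // encoded x bits
    unsigned int rhsX = rhs & 0x55555555;
    unsigned int lhsY = lhs & 0xAAAAAAAA; // encoded y bits
    unsigned int rhsY = rhs & 0xAAAAAAAA;
    return std::max(lhsX, rhsX) + std::max(lhsY, rhsY);
}
```

For example, the codes for cells (2,5) and (4,1) are 38 and 18, and Maximum2 of those yields 50, the code for cell (4,5).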