Nucleus is a real-time rendering system I’m creating for Boxen. It’s also the subject of my MSc thesis and an evolution of my two older rendering engines that turned out to be exercises in over-engineering, over-generalization and cache thrashing. This iteration attempts to be more pragmatic.

At the top level, Nucleus is split into two layers: a graphics base, which sits at a level of abstraction similar to XNA's, and a high-level rendering interface.

Graphics base

This layer completely isolates the underlying rendering API (currently OpenGL 3.3 + CgFX) and provides an easy-to-use interface on top of it. The main building blocks are:

  • Effect – wraps vertex, geometry and fragment shaders together, allows shader specialization (e.g. setting the implementations of Cg interfaces, providing sizes for unsized Cg arrays). Once compiled, allows the programmer to provide values for uniform parameters shared by all instances of the effect. Finally, it allows instantiation into EffectInstances.
  • EffectInstance – provides a namespace and storage for uniform and varying parameters associated with a single instance of an Effect. Additionally, it uses a VertexArray to cache current varying parameter bindings, and it automatically performs reference counting of resources assigned as uniform (Textures) and varying (VertexBuffers) parameters. The memory for EffectInstances is allocated from pools in contiguous slabs, for both the structure data and the effect-dependent payload. By adding an indirection to parameter storage, multiple Effects and EffectInstances may source their parameters from the same variable in memory, possibly managed entirely by the client.
  • Buffer – allows the specification and modification of graphics API-allocated memory (VRAM, AGP, system).
    • VertexBuffer – used for generic untyped vertex attributes. The actual types are specified when binding varying parameters to EffectInstances and specifying the offset, stride and type of the underlying data.
    • IndexBuffer – specialized for u16 and u32 mesh indices.
    • UniformBuffer – bindable uniform / constant buffers for shader programs.
  • VertexArray – a special kind of a driver-level object that caches varying parameter bindings, allowing for reduction of API calls.
  • Framebuffer – wrapper over all classes of framebuffers with automatic usage of Multiple Render Targets, rendering to textures, floating point support, etc.
  • Texture – 1D, 2D, 3D, rectangle, cube, ...

The aforementioned resources are accessed via opaque handles created and managed by a Renderer, eliminating the potential for fatal user mistakes such as using a resource after manually disposing of it, or corrupting its memory.

Additionally, the Renderer provides functionality to create, dispose and render the contents of a RenderList according to the current RenderState.
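To make the handle-based flow concrete, here is a minimal sketch of how client code might drive this layer. Every identifier below (Renderer, EffectHandle, setUniform and so on) is hypothetical and only illustrates the shape of such an interface, not the actual Nucleus API:

// Hypothetical client-side view of the graphics base; all names here are
// illustrative rather than the real Nucleus interfaces.
#include <cstddef>
#include <cstdint>

struct EffectHandle         { std::uint32_t id; };
struct EffectInstanceHandle { std::uint32_t id; };
struct VertexBufferHandle   { std::uint32_t id; };
struct RenderList;
struct RenderState;

class Renderer {
public:
    EffectHandle         createEffect(const char* cgfxSource);
    EffectInstanceHandle instantiate(EffectHandle effect);
    VertexBufferHandle   createVertexBuffer(const void* data, std::size_t bytes);

    // Parameters are set through opaque handles only, so client code never
    // holds raw GPU resources and cannot use one after it has been disposed.
    void setUniform(EffectInstanceHandle instance, const char* name,
                    const float* values, int count);
    void bindVarying(EffectInstanceHandle instance, const char* name,
                     VertexBufferHandle buffer, int offset, int stride);

    void render(const RenderList& list, const RenderState& state);
};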

The RenderList is a collection of indices (ordinals) of EffectInstances and basic associated data required to render them, including:

  • Model→world and world→model transformation 4×3 matrices
  • An IndexBuffer along with the count of indices to use, offset, min and max therein
  • The number of instances of the object to be rendered via hardware instancing
  • Mesh topology

An important point to note is that the RenderList does not plainly contain EffectInstances, but rather their rendering ordinals. The ordinals are u32 numbers assigned to each effect instance and managed internally by the renderer. Their order is determined by a heuristic attempting to minimize the number of state changes required when rendering the instances (currently: sorting by texture handles). This means that once the render list is constructed, the algorithm to minimize the required state changes boils down to sorting the ordinals, which is a very cheap operation. More ordering options will be made available in the future, so that objects may be sorted by distance or using per-Effect routines built with knowledge of their performance characteristics.
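As a minimal sketch of that idea (with hypothetical names, since the real RenderList carries more per-item data than shown here):

#include <algorithm>
#include <cstdint>
#include <vector>

// The renderer assigns each EffectInstance a u32 ordinal whose numeric order
// already encodes its state-change heuristic (currently: grouping by texture
// handles). Minimizing state changes for a frame then reduces to sorting.
struct RenderItem {
    std::uint32_t ordinal;   // renderer-assigned rendering ordinal
    // ... index buffer, transforms, instance count, topology, etc.
};

void sortForMinimalStateChanges(std::vector<RenderItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const RenderItem& a, const RenderItem& b) {
                  return a.ordinal < b.ordinal;
              });
}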

As mentioned before, the Renderer also gives access to the RenderState, a set of common rendering settings, such as the mode of Z-testing, blending, back-face culling, etc.

The Graphics base layer can therefore be used instead of the underlying graphics API, completely hiding its complexity and error-prone setup. It has been tested (in isolation from the high-level part of Nucleus) by implementing chunked terrain and standard mesh rendering. Meshes are loaded from a custom format exported from 3DS Max by a plugin created specifically for Nucleus.

High level rendering model

The most fundamental idea behind Nucleus is that of providing a more intuitive programming level than hardware shaders deliver. While they tightly correspond to the way GPUs work, they are not a very good mental model for designing rendering algorithms, nor a comfortable basis for artists to work with. Nucleus leans more towards the model employed by RenderMan.

The basic building block in Nucleus is called a kernel. From a conceptual point of view, it's just a function which may be implemented in NVIDIA's Cg language. Alternatively, however, a kernel may be composed of other kernels connected into a directed acyclic graph.

Each object to be rendered must have three basic kernels provided for it. They must be implementations of the following abstract kernels:

  • Structure – Defines the macro-structure and mostly contains what a vertex or geometry shader might. It’s responsible for providing the primitives to be rendered, including data such as positions, normals, partial derivatives of position with respect to texture coordinates, etc.
  • Reflectance – Enables the object to interact with lights by implementing a BRDF. Examples include the Lambertian model, Phong, Cook-Torrance and Ashikhmin-Shirley.
  • Material – This is the level on which most artists will work, specifying the albedo of the rendered object, its bumpiness, specular tint, emissive lighting, roughness, etc. Kernels extending Material will be created for types of objects in a scene, as well as for particular instances thereof.

A scene will usually also contain lights. Each must have a kernel specified for it, in this case an implementation of the abstract Light kernel. At program runtime, Light kernels are connected to the per-object kernel types. This mechanism allows the specification of custom attenuation, sampling and shadowing algorithms, which automatically become applicable to any Reflectance kernel.

The big picture therefore looks like this:

Nucleus uses a Domain Specific Language for the definition of graphs and plain kernels. The DSL may embed Cg code, so that a sample kernel might look like:

HalfLambert = kernel Reflectance {
    float NdotL = dot(normal, toLight).x;
    diffuse = NdotL * 0.5f + 0.5f;
    diffuse *= diffuse * intensity;
    specular = 0;
}

The effect of such a declaration is a kernel conforming to the interface of Reflectance, hence it can be used as a Reflectance kernel for an object.


Naturally, kernels, which live entirely on the GPU, need a way to have data provided for them. Each kernel type sources its inputs from a separate location, so as to enable arbitrary composition.

Structure kernels have a CPU counterpart which unpacks data, creates vertex buffers and finally provides them to the GPU. Additionally, the CPU side will be able to answer queries about the geometry of the object, such as ray-surface intersections and bounding volume computations. This part is necessarily strongly coupled with the type of Asset being rendered.

Structure kernels and Assets together define a way for the renderer to rasterize primitives from various sources of data, the most basic of which is a triangle mesh. Another combination of the above might allow the rendering of height-mapped terrains, subdivision surfaces or geometry shader-based generation of primitives.

A Reflectance kernel plus its inputs is called a surface within Nucleus. Unlike Structure kernels, surfaces don't come with any CPU logic attached; they are just collections of parameter values. A Reflectance kernel might implement, say, the Phong lighting model, and a surface using this kernel can then be specialized for various types of plastic or other dielectrics.

Finally, Material kernel instances are known simply as materials and, similarly to surfaces, carry only specialized sets of data provided for the kernel, with no logic to be evaluated on the CPU side.

Lights are currently pretty heavy-weight entities in Nucleus, each being represented by an instance of the Light class and treated specially by renderers. All lights affecting the rendered scene are given a chance to prepare data for the kernels associated with them. This includes rendering of shadow maps, computing projection matrices from look-at constraints and determining bounding volumes based on an attenuation model and luminous intensity.

The result of this separation is the ability to render any Asset with any material applied to it, under any reflectance model, with any number of affecting lights. The next sub-section discusses just how this rendering may happen.

Rendering algorithms

The most straightforward way of rendering a collection of objects under the influence of lights is via the classical Forward rendering algorithm.

The way kernels are composed into GPU shaders in the Forward algorithm is relatively simple: take all the lights affecting an object, connect them to the Reflectance kernel, sum the outputs and multiply the sums by the albedo and specular tint outputs of the Material kernel:

Such a graph is then split into vertex, geometry and fragment shaders using a specially designated node within the Structure kernel to serve as the bridge between the geometry and fragment pipelines.
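Conceptually, the fragment-level result of that composition amounts to something like the following (plain C++ standing in for the generated Cg; all types and names here are illustrative):

#include <vector>

// Sum the Reflectance outputs of every affecting light, then modulate the
// sums by the Material's albedo and specular tint.
struct Vec3 { float x, y, z; };

static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator*(Vec3 a, Vec3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }

struct LightContribution { Vec3 diffuse, specular; };

Vec3 shadeForward(const std::vector<LightContribution>& lights,
                  Vec3 albedo, Vec3 specularTint)
{
    Vec3 diffuseSum{0, 0, 0}, specularSum{0, 0, 0};
    for (const LightContribution& l : lights) {   // one Reflectance evaluation per light
        diffuseSum  = diffuseSum  + l.diffuse;
        specularSum = specularSum + l.specular;
    }
    return albedo * diffuseSum + specularTint * specularSum;
}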

Deferred rendering

Due to the particular split of the basic kernels which Nucleus enforces, it’s possible to use the same kernel implementations for various rendering algorithms, not just the traditional forward approach.

One downside of the forward renderer is that it requires large numbers of shaders to be compiled: basically O(materials * surfaces * light combinations). It's not uncommon to have hundreds or thousands of shaders generated like this in a large game project. This issue can be partially mitigated by using multi-pass rendering instead of specializing the shader for all possible light combinations. Such an approach has the downside of requiring the geometry to be re-rendered for each pass, and hence can quickly become a bottleneck, particularly with large meshes affected by many lights. The computational complexity then becomes O(objects * lights).

Recently, a class of algorithms designed to resolve these very problems has been gaining popularity. Collectively known as deferred rendering, they shift the costs around the rendering problem and break it down into separate stages, turning the aforementioned multiplicative computational complexity into a sum: O(objects + lights), albeit with additional memory overhead.
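To illustrate with made-up numbers: a scene with 1,000 objects, each affected by 50 lights, costs on the order of 1,000 × 50 = 50,000 geometry passes per frame under the multi-pass forward scheme, whereas a deferred approach renders roughly 1,000 geometry passes plus 50 light volumes (plus one more geometry pass in the case of light pre-pass, described below).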

Nucleus contains an implementation of one such flavor of deferred rendering, known as light pre-pass. The computation is divided into three stages (a rough driver sketch follows the list):

  • Render geometric attributes of objects in the scene into a set of textures (collectively called the G-buffer), containing:
    • surface depth
    • normal
    • surface ID (Nucleus-specific)
  • Draw bounding volumes around lights, computing lighting per-pixel with geometric attributes of the scene reconstructed from the G-buffer. Use additive blending to sum contributions.
  • Render scene objects again, this time fully evaluating their materials and combining them with the illumination computed in the previous step.
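The sequencing of the three stages could look roughly like this; the helper functions and buffer names below are hypothetical, standing in for the renderer's actual interfaces:

// Hypothetical helpers standing in for the renderer's actual interface.
struct Framebuffer {};
extern Framebuffer gBuffer, illuminationBuffer, backBuffer;
void bindFramebuffer(Framebuffer& target);
void enableAdditiveBlending();
void drawSceneGeometryAttributes();                       // Structure (+ Material tweaks)
void drawLightVolumes(const Framebuffer& gBuffer);        // Light + Reflectance kernels
void drawSceneMaterials(const Framebuffer& illumination); // full Material evaluation

void renderFrameLightPrePass()
{
    // Stage 1: fill the G-buffer with depth, normal and surface ID.
    bindFramebuffer(gBuffer);
    drawSceneGeometryAttributes();

    // Stage 2: rasterize light bounding volumes, reconstruct per-pixel geometry
    // from the G-buffer and accumulate lighting additively.
    bindFramebuffer(illuminationBuffer);
    enableAdditiveBlending();
    drawLightVolumes(gBuffer);

    // Stage 3: re-render the scene, evaluating materials and combining them
    // with the accumulated illumination.
    bindFramebuffer(backBuffer);
    drawSceneMaterials(illuminationBuffer);
}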

Light pre-pass – Stage 1

The first stage is simple. Most of the attributes required for the G-buffer are computed by the Structure kernel. Still, a Material kernel may want to alter e.g. normals in order to perform normal mapping. Hence, the first stage is:

Light pre-pass – Stage 2

The second stage is the tricky one. Deferred renderers are notorious for being quite rigid with respect to illumination models achievable with them. This is because once the G-buffer is rendered, the light stage doesn’t have access to per-object shaders, which might perform custom shading. Light volumes must then be rendered with the same shader for each pixel, regardless of which scene object a fragment in the G-buffer comes from.

A common approach is to use just one reflectance model (usually the half-angle version of Phong's) for all pixels and only control roughness. This may work just fine for games which don't require drastically varying surface styles.

Another solution is to render the majority of the scene using deferred rendering, but fall back to forward rendering for surfaces requiring a different reflection model. Such an approach means that objects using the second-class reflectance models have to be used sparingly, which becomes a constraint imposed on artists.

The approach which Nucleus takes instead is not a novel one, but hasn’t been widely adopted due to being fairly questionable on past generations of graphics hardware.

The G-buffer stores the surface ID to be used at a given pixel, and the fragment shader of the light stage branches on it, choosing the appropriate Reflectance kernel implementation. The kernel index and the parameters for it are stored in a texture, into which the surface ID from the G-buffer serves as a coordinate. The current approach allows up to 256 different surfaces, each of which might potentially use a different Reflectance kernel. Obviously, this is not free either; however, with the real branching available on recent GPU generations, in practice the cost does not seem very high.

// surfaceTex holds the Reflectance kernel index and parameters for each surface;
// the surface ID fetched from the G-buffer indexes into it.
float surfaceId = tex2D(attribTex, uv).x;
float reflectanceId = tex2D(surfaceTex, float2(surfaceId, 0)).x;
if (reflectanceId < A) {
    result = BRDF1(
        tex2D(surfaceTex, float2(surfaceId, ...)),
        tex2D(surfaceTex, float2(surfaceId, ...)),
        tex2D(surfaceTex, float2(surfaceId, ...)));
} else if (reflectanceId < B) {
    result = BRDF2(...);
} else if (reflectanceId < C) {
    ...
} else {
    ...
}
Please note that the BRDF kernel is composed with a light kernel, whose independent execution may hide some of the latency of the nasty dependent fetches above.

As a result, the second stage of light pre-pass in Nucleus presents itself like this:

What the above graph doesn't show is how the light volume geometry is constructed. The current implementation uses a geometry shader to instantiate a cube for each light, taking a single point as input. It also performs approximate culling of light volumes with the help of the EXT_depth_bounds_test OpenGL extension where it's available.
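As a rough illustration of the depth-bounds part (using GLEW here; the zMin/zMax values would come from projecting a light's bounding volume into window-space depth):

#include <GL/glew.h>

// Enable depth-bounds culling for a light where the extension is present;
// fragments whose stored depth falls outside [zMin, zMax] are discarded early.
void setDepthBoundsForLight(double zMin, double zMax)
{
    if (GLEW_EXT_depth_bounds_test) {
        glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
        glDepthBoundsEXT(zMin, zMax);
    }
}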

The output of the light stage will typically look like this (100 lights on Marko Dabrovic’s Sponza model):

Light pre-pass – Stage 3

Once the Illumination buffer has been computed, the final stage of light pre-pass is simple:

Light pre-pass – Results

Here’s Marko Dabrovic’s / Crytek’s Sponza model being lit by 50 point lights using the ABg BRDF:

Semantic type system

One of the issues plaguing graph-based shader generation approaches is that they don’t remove much of the tedious work one has to go through in order to author a new effect. Data still needs to be converted between coordinate and representation spaces and each field must be manually connected between two consecutive nodes.

Nucleus tries to improve on the state of the art by implementing a rich semantic type system, which carries more concepts than just the data type. This semantic information may then be used to automate certain operations. The approach is hugely inspired by Abstract Shade Trees by McGuire et al.

This rich type system allows special traits to be associated with input and output parameters of kernels, for example:

  • The basis to which a vector belongs – world, view, clip, ...
  • The color space – RGB, sRGB, logLUV, ...
  • Linearity of a value – linear, logarithmic, exponential, ...
  • Whether a vector is normalized or not
  • The very meaning of the parameter, e.g. whether it’s a color or a normal vector

Consider a reflectance model which requires that its inputs be in world coordinates. It will simply declare its parameters with the basis trait set to world. If it requires the normal vector to be normalized, it will set the unit trait to true, etc. The kernel compiler will infer all appropriate conversions statically.

Just like in the work by McGuire et al., Nucleus is able to automatically find connections between particular parameters of kernels. It’s not necessary to wire each output to each input manually – just as the semantic type system allows conversions to be performed, it may also perform an implicit search in the graph of all possible conversions from a set of inputs, hence choosing the optimal source for each parameter. In the case of an ambiguity, parameters may still be connected manually.

Given a simple kernel graph such as:

Nucleus might resolve the connections in the following way:

Semantic expressions

Nucleus also extends the technique proposed in Abstract Shade Trees by allowing the use of simple expressions which operate on semantic traits in the specification of output parameter semantics.

When working with a prototype of Nucleus which utilized a simpler version of the type system, it became apparent that it's useful not only to compute values using kernels, but also to let the output semantics of kernel functions depend on the input semantics of those functions and the kernels they're connected to. For instance, one could have a kernel which samples a texture. Textures may be tagged with traits specifying what sort of data they contain. If such a texture is connected to the sampling kernel, it's crucial to be able to express that the type of the sample should retain the traits of the texture. This is enabled by semantic expressions.

The signature of the sampling kernel could be:

Tex2D = kernel (
    in texture <type sampler2D>,
    in uv <type float2 + use uv>,
    out sample <in.texture.actual + type float4>
)

In this case, the traits of the output parameter "sample" depend on the actual parameter which is connected to the input "texture" parameter in a kernel graph. Hence, the result of connecting an input with a semantic <type sampler2D + use color> will be a sample with the semantic <type float4 + use color>.

Semantic expressions are most commonly used in semantic converters, which may then perform arbitrarily complex additive and subtractive manipulation of traits. Converters are used automatically by Nucleus when resolving connections between parameters whose types don’t match exactly. Each converter has an integral cost value associated with it. This value is an approximation of the computational complexity of performing the given conversion. This lets Nucleus optimize automatic conversion paths.
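One way such a cost-driven search could be organized is sketched below in C++; the string-keyed semantics stand in for real trait sets, and none of these names come from Nucleus itself:

#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Treat semantic types as graph nodes and converters as weighted edges; the
// cheapest conversion chain from a source semantic to the required one can
// then be found with Dijkstra's algorithm over the converter costs.
struct Converter { std::string from, to; int cost; };

int cheapestConversionCost(const std::vector<Converter>& converters,
                           const std::string& source, const std::string& target)
{
    std::map<std::string, std::vector<std::pair<std::string, int>>> edges;
    for (const Converter& c : converters)
        edges[c.from].push_back({c.to, c.cost});

    using Entry = std::pair<int, std::string>;   // (accumulated cost, semantic)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> frontier;
    std::map<std::string, int> settled;
    frontier.push({0, source});

    while (!frontier.empty()) {
        auto [cost, semantic] = frontier.top();
        frontier.pop();
        if (settled.count(semantic))
            continue;                            // already reached more cheaply
        settled[semantic] = cost;
        if (semantic == target)
            return cost;
        for (const auto& [next, edgeCost] : edges[semantic])
            if (!settled.count(next))
                frontier.push({cost + edgeCost, next});
    }
    return -1;                                   // no conversion chain exists
}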

Converters, just as regular kernels, may be tagged as linear, so as to allow Nucleus to move an operation from the fragment to the vertex stage, even when it's used in a Reflectance or Material kernel and not just in Structure.


Post-processing

In addition to regular scene rendering, Nucleus contains special support for post-processing. It is performed by specifying a kernel and feeding it data. In this case, the data is just a texture (multiple inputs into the post-processing pipeline are planned for a later stage).

A simple post-processing pipeline might look like this:

Nodes using the special Blit kernel are used to break the graph to be rendered into multiple passes. This enables the implementation of algorithms such as the separable Gaussian blur. Blit nodes may also rescale the input as well as change its internal format.

In this case, the graph would be broken down into two passes:

In more complex cases, Nucleus will also find which passes may be performed at the same time and automatically use Multiple Render Targets.

Functional composition

Implementing post-processing using the framework of a regular graph-based editor is a tricky business. Consider the following graph:

The Blur kernel expects to get an image which it may sample multiple times, with various offsets. Connecting the Input directly to Blur would have done the trick; however, the user has decided to filter the input through a power function first. That function operates not on Images, but on individual samples. Normally, the system would give up completely, but Nucleus has one more trick up its sleeve.

In this case, the Image type is nothing but a kernel. Specifically, one with the following signature:

Image = kernel(
    in uv <type float2 + use uv>,
    out sample <type float4>
)

Hence, the Blur kernel above expects another kernel as its input, but the parameter connected to this input is not a kernel. Its type, however, is the type of the kernel’s output parameter. This special case causes the type system to consider the whole incoming graph for functional composition. As a result, the graph might be turned into:

At code generation time, this graph-based function gets turned into a Cg interface, which the implementation of Blur may use just like any other Image.
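As a rough analogy in C++ terms (none of this is Nucleus code; std::function stands in for the generated Cg interface), the composition behaves as if the per-sample power kernel wrapped the Input image into a new Image that Blur can sample freely:

#include <array>
#include <cmath>
#include <functional>

// An "Image" is any callable from uv to a sample; wrapping an image in a
// per-sample kernel yields another Image. All names here are illustrative.
struct Float2 { float u, v; };
struct Float4 { float r, g, b, a; };

using Image = std::function<Float4(Float2)>;

Image applyPower(Image input, float exponent) {
    return [=](Float2 uv) {
        Float4 s = input(uv);
        auto p = [&](float c) { return std::pow(c, exponent); };
        return Float4{p(s.r), p(s.g), p(s.b), s.a};
    };
}

Float4 blur(const Image& image, Float2 uv) {
    // Sample the composed image several times with small offsets and average.
    const std::array<Float2, 3> offsets{{{-0.01f, 0.f}, {0.f, 0.f}, {0.01f, 0.f}}};
    Float4 sum{0, 0, 0, 0};
    for (Float2 o : offsets) {
        Float4 s = image({uv.u + o.u, uv.v + o.v});
        sum = {sum.r + s.r, sum.g + s.g, sum.b + s.b, sum.a + s.a};
    }
    return {sum.r / 3, sum.g / 3, sum.b / 3, sum.a / 3};
}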

Naturally, this very mechanism is not restricted to the post-processing component of Nucleus and may be used e.g. in order to implement procedural texturing.


This part is currently under development.

A graph-based authoring tool, Nucled (more screenshots from the previous prototype can be found here), will be provided in order to aid the shader artist. Besides editing the connections between kernels, their implementations may be modified and the results seen immediately. Despite some criticism that graph-based tools sometimes receive, they can still be invaluable for fast shader prototyping and authoring, given some discipline and expertise. In the case of Nucled, the implementations of more complex kernels can be specified entirely by programmers, whereas artists would then only provide inputs to them, so the poor-performance argument may hopefully be dodged.
