
17 December, 2009

Lighting Compendium - part 1

Lighting is, still, one of the most challenging issues in realtime rendering. There is a lot of research around it, from how to represent lights to material models to global illumination effects.

Even shadows can't be considered solved for any but the simplest kind of lightsource (directional or sunlight, where using Cascaded Shadow Maps seems to be a de facto standard nowadays).

It looks like we have a plethora of techniques, and choosing the best can be daunting. But if you look a little bit closer, you'll realize that really, all those different lighting systems are just permutations of a few basic choices. And that by understanding those, you can end up with novel ideas as well. Let's see.

Nowadays you'll hear a lot of discussion around "deferred" versus forward rendering, with the former starting to be the dominant choice, most probably because the open-world action-adventure-FPS genre is so dominant.

The common wisdom is that if you need a lot of lights, deferred is the solution. While there is some truth in that statement, a lot of people accept it blindly, without much thinking... and this is obviously bad.

Can't forward rendering handle an arbitrary number of lights? It can't handle an arbitrary number of analytic lights, true, but there are other ways to abstract and merge lights, that are not in screen space. What about spherical harmonics, irradiance voxels, lighting cubemaps?

Another example could be the light-prepass deferred technique. It's said to require less bandwidth than the standard deferred geometry buffer one, and allow more material variation. Is that true? Try to compute the total bandwidth of the three passes of this method compared to the two of the standard one. And try to reason about how many material models you could really express with the information light-prepass stores...

It's all about tradeoffs, really. And to understand those, you have first to understand your choices.

Choice 1: Where/When to compute lighting.

Object-space. The standard forward rendering scenario. Lighting and material's BRDF are computed (integrated) into a single pass, the normal shading one. This allows of course a lot of flexibility, as you get all information you could possibly want to perform local lighting computation.
It can lead to some pretty complicated shaders and shader permutations as you keep adding lights and materials to the system, and it's often criticized for that.
As I already said, that's fairly wrong, as there's nothing in the world that forces you to use analytic lights, that require ad-hoc shader code for each of them. That is not a fault of forward rendering, but of a given lighting representation.
It's also wrong to see it as the most flexible system. It knows everything about local lighting, but it does not know anything about global lighting. Do you need subsurface scattering? A common approach is to "blur" diffuse lighting, scatter it on the object surface. This is impossible for a forward renderer, it does not have that information. You have to start thinking about multiple passes... that is, deferring some of your computation, isn't it?
Another pretty big flaw, that can seriously affect some games, is that it depends on the geometric complexity of your model. If you have too many, and too small, triangles, you can incur serious overdraw and partial-quad overheads. Those will hurt you pretty badly, and you might want to consider offloading some or all of your lighting computations to other passes for performance reasons. On the other hand, you get for free some sort of multiresolution ability, because you can easily split your lighting between the vertex and pixel shaders.
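
As a concrete reference, this is roughly what the object-space path boils down to. A minimal sketch only: the light count is fixed at compile time (one shader permutation per count), and all the uniform names (CameraPos, LightPos, LightColour, PhongExponent, FresnelColour) are made up for illustration, not any particular engine's.

// Object-space (forward) sketch: lights and material integrated in a single pass.
// All uniforms below are made-up names, just for the example.
#define NUM_LIGHTS 4 // fixed at compile time: one permutation per light count
float3 CameraPos;
float3 LightPos[NUM_LIGHTS];
float3 LightColour[NUM_LIGHTS];
float PhongExponent;
float3 FresnelColour;

float4 ForwardPS(
    float3 WorldPos : TEXCOORD0,
    float3 WorldNormal : TEXCOORD1,
    float2 UV : TEXCOORD2,
    uniform sampler2D AlbedoMap) : COLOR
{
    float3 N = normalize(WorldNormal);
    float3 V = normalize(CameraPos - WorldPos);
    float3 albedo = tex2D(AlbedoMap, UV).rgb;

    float3 result = 0;
    // analytic lights: one more iteration (or permutation) every time you add one...
    for(int i = 0; i < NUM_LIGHTS; i++)
    {
        float3 L = normalize(LightPos[i] - WorldPos);
        float NdotL = saturate(dot(N, L));
        float spec = pow(saturate(dot(N, normalize(L + V))), PhongExponent);
        result += LightColour[i] * (albedo * NdotL + FresnelColour * spec);
    }
    // ...or drop the loop and evaluate a merged representation instead
    // (an SH ambient term, an irradiance volume sample, a lighting cubemap lookup).
    return float4(result, 1);
}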

Screen-space. Deferred, light-prepass, inferred lighting and so on. All based on the premise of storing some information on your scene in a screen-space buffer, and using that baked information to perform some or all of your lighting computations. It is a very interesting solution, and once you fully understand it, it might lead to some pretty nice and novel implementations.
As filling the screen-space buffers is usually fast, with the only bottleneck being the blending ("raster operations") bandwidth, it can speed up your shading quite a bit if you have too small triangles leading to a bad quad efficiency (recap: current GPUs rasterize triangles into 2x2 pixel sample blocks; quads on the edges have only some samples inside the triangle, all samples get shaded, but only the ones inside contribute to the image).
The crucial thing is to understand what to store in those buffers, how to store it, and which parts of your lighting to compute out of the buffers.
Deferred rendering chooses to store material parameters and compute local lighting out of them. For example, if your materials are phong-lambert, then what does your BRDF need? The normal vector, the phong exponent, the diffuse albedo and fresnel colour, the view vector and the light vector.
All but the last are "material" properties; the light vector (surprisingly) depends on the lighting. So we store the material properties in the "geometry buffer", in screen space, and then run a series of passes, one for each light, that provide that last bit of information and compute the shading.
Light-prepass? Well, you might imagine, even without knowing much about it, that it chooses to store lighting information and execute passes that "inject" the material information and compute the final shading. The tricky bit, that made this technique not so obvious, is that you can't store stuff like the light vector, as in that case you would need a structure capable of storing, in general, a large and variable number of vectors. Instead, light-prepass exploits the fact that some bits of light-dependent information are added together in the rendering equation for each light, and thus the more lights you have the more you keep adding, without needing to store extra information. For phong-lambert, those would be the normal dot view and normal dot light products.
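
To make the phong-lambert example concrete, here is a minimal sketch of the two decompositions. Everything below (buffer layouts, uniform names, the directional light used to keep the lighting pass short) is made up for illustration, not any specific engine's implementation.

// Deferred: bake the material parameters into the geometry buffer...
float PhongExponent;
float FresnelWeight;

struct GBuffer_Out
{
    float4 NormalExp : COLOR0;     // xyz = normal, w = phong exponent
    float4 AlbedoFresnel : COLOR1; // rgb = diffuse albedo, a = fresnel weight
};

GBuffer_Out GBufferPS(float3 Normal : TEXCOORD0, float2 UV : TEXCOORD1,
    uniform sampler2D AlbedoMap)
{
    GBuffer_Out Out;
    Out.NormalExp = float4(normalize(Normal) * 0.5 + 0.5, PhongExponent);
    Out.AlbedoFresnel = float4(tex2D(AlbedoMap, UV).rgb, FresnelWeight);
    return Out;
}

// ...light-prepass: bake only the per-light terms that add up. One pass per light
// (directional here, to avoid reconstructing the position), additively blended into
// a lighting buffer; a later material pass applies albedo and friends.
float3 LightDir;
float3 ViewDir; // assumed constant here; a real implementation derives it per pixel
float3 LightColour;
float BakedExponent;

float4 LightPassPS(float2 UV : TEXCOORD0,
    uniform sampler2D NormalBuffer) : COLOR
{
    float3 N = normalize(tex2D(NormalBuffer, UV).xyz * 2 - 1);
    float NdotL = saturate(dot(N, -LightDir));
    float spec = pow(saturate(dot(N, normalize(-LightDir + ViewDir))), BakedExponent);
    return float4(LightColour * NdotL, NdotL * spec);
}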
Is this the only possible choice to bake in screenspace lighting without needing an arbitrary number of components? Surely not. Another way could be using spherical harmonics per pixel for example... Not a smart choice, in my opinion, but if you think about deferred in this way, you can start thinking about other decompositions. Deferring diffuse shading, that is the one where lighting defines shapes, and computing specular in object space? Be my guest. The possibilities are endless...
But where deferring lighting into multiple passes really shows its power, over forward rendering, is when you need to access non-local information. I've already made the example of subsurface scattering, and also on this blog I've talked (badly, as it's obvious and not worth a paper) about image-space gathering, that is another application of the idea. Screen-space ambient occlusion? Screen-space diffuse occlusion/global illumination? Same idea. Go ahead, make your own!

Other spaces. Why should we restrict ourselves to screen space baking of information? Other spaces could prove more useful, especially when you need to access global information. Do you need to access the neighbors on a surface? Do you want your shading complexity be independent of camera movements? Bake the information in texture space. Virtual texture mapping (also known as clipmaps or megatextures) plus lighting in texture space equals surface caching...
Light space is another choice, and shadow mapping is only one possible application. Bake lighting and you get the so called reflective shadow maps.
What about world-space? You could bake the lighting passing through a given number of locations and shade your object by interpolating that information appropriately. Spherical harmonic probes, cubemaps, dual paraboloid maps, irradiance volumes are some names...
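
For the world-space case, a tiny sketch of what "interpolating that information appropriately" can boil down to: evaluating a 2-band spherical harmonic probe. The shR/shG/shB coefficients are assumed to be interpolated from the nearest baked locations and already convolved with the cosine lobe; the names are made up.

// Evaluate 2-band (4 coefficient) real SH irradiance for a given normal.
float3 EvaluateSHProbe(float3 N, float4 shR, float4 shG, float4 shB)
{
    float4 basis = float4(0.282095, 0.488603 * N.y, 0.488603 * N.z, 0.488603 * N.x);
    return max(float3(dot(shR, basis), dot(shG, basis), dot(shB, basis)), 0);
}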

Note about sampling. Each space has different advantages. Think about how you can leverage them. Some spaces for example have some components that remain constant, while they would vary in others. Normalmaps are a constant in texture space, but they need to be baked every frame in screenspace. Some spaces enable baking at a lower frequency than others, some are more suitable for temporal coherency (i.e. in screenspace you can leverage camera reprojection, while in other spaces you could avoid updating everything every frame). Hi-Z culling and multi-resolution techniques can be the key to achieving your performance criteria.

Ok, that's enough for now.

Next post I'll talk about the second choice, that is how to represent your lighting components (analytic versus table based, frequency versus spatial domain etc...) and how to take all those decisions, some guidelines to untangle this mess of possibilities...
Meanwhile, if you want to see a game that actually mixed many different spaces and techniques to achieve its lighting, I'd suggest you read about Halo 3...

19 November, 2009

Coding tactics versus design strategies

Today, while I was coming back home from work, I had a discussion with a colleague about one of our most important game tools: our animation system.

Said system is very big and has many features; it's probably one of our greatest efforts and I doubt there is anything more advanced out there. Now it's even becoming a sort of game rapid prototyping thing, and it supports a few scripting languages, plus an editor, written in another language.

To make all those components communicate properly takes quite a bit of code, and so we needed to create another component that somewhat facilitates connecting the others. While we were discussing the merits of techniques such as code generation versus code parsing to link together different languages, it was clear that what was needed was indeed some sort of reflection, and that having said reflection would also remove the need for other parts of code (i.e. serialization).

So I went back home, and started thinking about why we didn't have that. Well, surely the problem had to be historical. Right now, looking at our design, the "right" solution was obvious, but I knew that the system started really, really small and evolved over the years. I realized, actually, that we didn't have a standard reflection system in general...

That's rather odd, as in many companies, when you start creating your code infrastructure, reflection ends up being one of the core components, and everyone uses it; it's more like a language extension, one of the many things you have to code to make C++ look less broken. We didn't have anything like that. We really don't have a core, we don't have an infrastructure at all!


Lack of strategy. We do have a lot of code. A lot. Many different tools, you won't believe how many, and I think that no one can really even view them all. We keep all those modules and systems in different repositories, really in different organizational structures with different policies and owners... It's huge and it can look messy.

To overcome the lack of a real infrastructure, some studios have their own standards, maybe a common subset of this huge amount of code that has been approved and tested as the base of all the products that studio makes. Some other studios do not do that, some other again do it partially.


Are we stupid? It looks crazy. I started thinking about how we could do better. Maybe, instead of choosing a subset of technologies that make our core and gluing them together with some bridge code and some test code, we could make our own copies of what we needed, and actually modify the code to live together more nicely. Build our infrastructure by copying and pasting code, modifying it, and not caring about diverging from the original modules.

But then what? It would mean that everything we modify will live in its own world, we can't take updates made by others, and we can't take other modules that depend on one of the pieces that we modified. And every game, to leverage this new core, would basically have to be rewritten! Even cleaning up the namespaces is impossible! No, it's not a way that could be practical, even if we had the resources to create a team working on that task for a couple of years.


What went wrong? Nothing really. As bad as it might look, we know that it's the product of years of decisions, all of which I'm sure (or most of them) were sane and done by the best experts in our fields. We are smart! But... in the end it doesn't look like it! I mean, if you start looking at the code it's obvious that there was no strategy, the different pieces of code were not made to live together.


Is it possible to do better? Not really, no. We know that in software development, designing is a joke. You can't gather requirements, or better, requirements are something that you have to continuously monitor. They change, even during the lifetime of a single product. How could we design technology to be shared... it's impossible!

Your only hope is to do the best you can do in a product, and then in another, and start to observe. Maybe there is some functionality that is common across them, that can be ripped out and abstracted into something shareable. Instead of trying to solve a general problem, solve a specific one, and abstract when needed. Gather. That's sane, it's the only way to work.


But then you get to a point where something started in a project, got ripped out because it was a good idea to do so, and evolves on its own across projects. Then another studio on the opposite side of the world sees that component, thinks it's cool and integrates it. Integrates it together with its own stuff, that followed a similar path. The paths of those two technologies were not made to work together, so for sure they won't be orthogonal, they won't play nice. There will be some bloat. And the more you make code, promote it to a shareable module, and integrate other modules, the more bloat you get. It's unavoidable, but it's the only thing that you could do.


So what? We're looking at a typical problem. Strong tactics, good local decisions, that do not add up over time to a strong strategy. It's like a weak computer chess player (or go, chess is too easy nowadays). What's the way out of this? Well... do as strong computer chess programs do! They evaluate tactics over time. They go very deep, and if they find that the results are crap, they prune that tree, they trash some of their tactical decisions and take others. Of course computer chess can go forward in time and then back, wasting only CPU time.

We can't go back, but we can still change our pieces on the chessboard. We can still see that a part of the picture is going wrong and delete it... at least if we took the only important design decision out there: making your code optional. That's the only thing you have to do, you have to be sure to work in an environment where decisions can be changed, code can be destroyed and replaced. Two paths, two different technologies after ten years intersect. Good. They intersect too much? They become bloated? You have to be able to start a new one, that leverages on the experience, on the exploration done. But that is possible only if everything else does not deeply depend on those two.


Tactics are good. Tactics are your only option, in general. If you're small, have little code, have a few programmers, then you might live in the illusion that you can have a strategy. You can't; it's only that strong tactics, at that size, look like a strategy. It's like playing chess on a smaller board: the same computer player that seemed weak becomes stronger (even more clearly, again, with go) *. And of course that's not bad.

Some design is not bad, drawing the overall idea of where you could be going... Like implementing some smarter heuristics for chess. It's useful, but you don't live with the idea that it's going to be the solution. It can improve your situation by a small factor, but overall you will still need to bruteforce, to have iterations, to let things evolve. Eventually, over the years, relying on smart design decisions is not what is going to make a difference. They will turn bad. You have to rely on the idea that tactics can become strategy. And to do that, you have to be prepared to replace them, without feeling guilty. You've explored the space, you've gathered information (Metropolis sampling is smart).

---

* Note: that's also why a lot of people, smart people, do not believe me when I say that stuff like iteration, fast iteration, refactoring, dependency elimination, languages and infrastructures that support those concepts, are better than a-priori design, UML and such. They have experience of worlds (or timescales) that are too small. I really used to think in the same way, and even now it's very hard for me to just go and prototype, to ignore the (useless and not achievable) beauty of a perfect design drawn on a small piece of paper. We go to a company, or get involved in a project, or have experience of a piece of code. We see that there is a lot of crap. And that we could easily have done better! Bad decisions everywhere, those people must be stupid (well... sometimes they are, I mean, some bad decisions were just bad of course). Then we make our new system, we trash the old shit, and live happily. If the system is small enough, and the period of time we worked on it is small enough, we will actually feel we won... We didn't, we maybe took the right next move, a smart tactical decision. I hope it didn't take too long to make it... because anyway, that's far from winning the match! But it's enough to make us care way too much about how to take that decision, how to make that next move, and not see that the real match does not care much about it, that we are not even fighting the big problem. It's really hard to understand all that; I've been lucky in my career, as I got the opportunity to see the problems at many different scales.

14 September, 2009

Fix for FXComposer 2.5 clear bug

The new Nvidia FXComposer is still mostly made of dog poo. Sorry, but it's an application that adds zero useful features, and tons of bugs. Well, not really: shader debugging would be useful to me, but I tried to debug my posteffect, made with SAS, and it failed miserably, so...

Unfortunately, FXComposer 1.8 is getting really old nowadays and sometimes it crashes on newer cards... so I'm forced to juggle between the two to find the one that has fewer bugs...

One incredibly annoying thing is that 2.5 on XP does not clear the screen, if you're using SAS at least. It doesn't work on either my MacBook 17'' or my Dell PC at work (nothing weird, two of the most popular products in their categories, both with NVidia GPUs), so I had to find a workaround. Ironically, they seem to have so many bugs in SAS just now that, for the first time since the beginning of FX Composer, they released a little documentation about it... So now you kind of know how to use it, but you can't, because it's bugged...

If you're having the same problem, here's my fix. I hope that NVidia engineers will make this post obsolete soon by showing some love for their product and fix this, as it's such a huge bug.

#define USEFAKECLEAR

[...]

struct FSQuadVS_InOut // fullscreen quad
{
float4 Pos : POSITION;
float2 UV : TEXCOORD0;
};

FSQuadVS_InOut FSQuadVS(FSQuadVS_InOut In)
{
return In;
}

#ifdef USEFAKECLEAR
struct FakeClear_Out
{
float4 c0 : COLOR0;
float4 c1 : COLOR1;
float4 c2 : COLOR2;
float4 c3 : COLOR3;

float d : DEPTH;
};

FakeClear_Out FakeClearPS(FSQuadVS_InOut In)
{
FakeClear_Out Out = (FakeClear_Out)0;
Out.d = 1.f;

return Out;
}
#endif

[...]

#ifdef USEFAKECLEAR
pass FakeClear
<
string Script =
"RenderColorTarget0=ColorBuffer1;"
"Draw=Buffer;";
>
{
ZEnable = true;
ZWriteEnable = true;
ZFunc = Always;

VertexShader = compile vs_3_0 FSQuadVS();
PixelShader = compile ps_3_0 FakeClearPS();
}
#endif

09 September, 2009

Calling for a brainstorm

Problem: Depth of Field. I want bokeh and correct blurring of both the front and back planes.

Now I have some ideas on that. I think at least one good one. But I'd like to see, in the comments (by the way, if you like this blog, read the comments, usually I write more there than in the main post) your ideas for possible approaches.

Some inspiration: rthdribl (correct bokeh, but bad front blur, slow), lost planet dx10 (correct everything, but slow), dofpro (photoshop, non-realtime).

Words (and publications): gathering, scattering, summed area tables, mipmaps, separable filters, push-pull

I have developed something out of my idea. It currently has four "bugs", but only one that I don't know yet how to solve... The following image is really bad, but I'm happy with what it does... I'll tell you more, if you tell me your idea :)


Update: The following is another proof of concept, just to show that achieving correct depth blur is way harder than achieving good bokeh. In fact you could have a decent looking bokeh even just using separable filters: the first image to the left takes just twice the cost of a normal separable gaussian, and looks almost as good as the pentagonal bokeh from the non-realtime Photoshop lens blur (while the gaussian filter, same size, looks awful).


Second Update: I've managed to fix many bugs, now my prototype has far fewer artifacts... I don't know if you can see it, but the DOF correctly bleeds out, even when the object overlaps an in-focus one. Look at the torus: the far side clearly shows the out-bleeding (you can still see an artifact along the edges...). Now you should be able to see that the very same happens for the near part (less evident as it doesn't have problems, I dunno why, yet), and it doesn't shrink where the torus overlaps the sphere.


08 September, 2009

You should already know...

...that singletons are a bad, baaad, bad, idea: The Clean Code Talks - "Global State and Singletons"

It's interesting how people think that globals are bad, but singletons are not. They are the same thing, but still, internet noise can really destroy your brain.

You're told that globals are evil, and you're told that design patterns are good. There is so much noise about that, they just become facts, with no need to reason about them.

Singletons are a design pattern (probably the only one you really know or have seen actually used), so they're good.

Now try to remove those brainwashed ideas from your programmers' minds... They won't believe you so easily, everyone loves design patterns (in my opinion, they're mostly crap), but you can use the internet to your advantage. The video I've posted is on YouTube, and done by Google. That should be a very powerful weapon. Enjoy.

04 September, 2009

Code quality

Sooner or later, in every company there will be a discussion about code quality: how to monitor it, and how to make engineers care more, design and produce better code. How to massage and hone the existing code to improve its quality...

Well, it's all very useful, but sometimes we go too much into the details and lose focus on the only thing that really matters. What we always have to remember is that code rots, and it's inevitable.

It's up to your company to come up with a balance between how much effort you have to spend on design and maintenance, you can make quick and dirty hacks that will rot fast, or well designed software that rots slower but won't hit the target (because it moved somewhere else while you were designing for it)...

Or you can be smart and do iterative development with many short cycles and a lot of refactoring. In the end, the only thing you really want to avoid is to stick with only one pair of shoes for the rest of your life.

Make your code perishable, plan for its death, do not rely on its existence. Modules, interfaces, dependency control, that's your only goal when you're thinking about long-term. You have to be able to throw away everything.

The added bonuses of designing code this way are everywhere. You can throw less experienced programmers at a feature just because you know that it won't affect anything other than itself, and it can be replaced later on. Good code won't be built on top of bad code, bad code does not spread. You can create DLLs and achieve faster iteration. Your compile/link times will be faster and you can create testbeds with only the features a given area needs. You can do automatic/unit tests more easily. You can multithread more easily, and so on.

31 August, 2009

This year's buzzword: Image-Space

It shouldn't be news that the quality of publications at Siggraph has been getting progressively worse in the last few years, while other conferences are becoming more and more important: Eurographics, but also Siggraph Asia (see, for example, those really neat publications).

A lot of politics, sponsors, pressure from universities, younger reviewers... Nowadays Siggraph is more an occasion to meet people and see what's going on in the industry, than a showcase of the best graphic research on the planet.

I didn't see anything groundbreaking, and a lot of publications were addressing problems that are not, in my view at least, so crucial. Still, I don't think this year's Siggraph was bad, and you'll find plenty of coverage of the event online, so I won't write about that.

I have the impression that the non-realtime rendering, and GI in particular, has seen a slowdown recently, but it may also be that my interest shifted away from those subjects, so I don't have a good picture.

To me what's more interesting, at least now, is realtime graphics, and I'm probably more sensitive to publications in that field. At the main conference, unsurprisingly, the most exciting realtime 3d presentation was done by Crytek (see this), but I was really looking forward to the papers of HPG, one of Siggraph's side conferences.

Generally, there is some pretty good stuff there, like the Morphological Antialiasing paper, and many others... But you have to filter out the buzz, and I was really bothered by some papers that, in my opinion, simply should not have been there. I don't really know why they bother me; probably it's also because in the past I've seen published some ideas that I didn't bother to publish myself, thinking they would have been rejected anyway, or maybe it's just that I have too many friends in the research community with good ideas and little luck.

Hardware-accelerated Global Illumination by Image Space Photon Mapping. Wow! Let's read...
And what's that? Well, if you've followed any GPU GI research in the last 1-2 years, it's really easy. They're using an RSM for the first hit of the lights, they read it back on the CPU and use that data for normal photon tracing (claiming that the slowest part is the first hit, so they care about doing only that on the GPU), then they splat the photons using... photon splatting.

It's mostly a tech-demo, it would be cool if they published it as such, maybe with better assets it could be a worthy addition to the other demos NVidia has. Maybe they could have published this applied research in NVidia's GPU Gems. But Siggraph?

Why "image space" anyway? And the worst part, why they don't say "RSM" or "splatting"? They cite those works as "related research", and that's it. They don't use any of those terms, they replaced everything with something else that makes it sound better and new... Photon splatting sounds slow, let's use "Image Space", is way more cool (doesn't matter if there's nothing happening in image space there). RSM are well-known... let's call them... bounce maps (genius)!

Image Space Gathering. Even worse! And it came just after the previous one in the HPG conference! It's something really minor, the only application seems to be blurry reflections, and from the images it doesn't look so nice for that either.

The algorithm? Render your image, and then blur it. But hey, preserve the edges using the Z-buffer, and make your kernel proportional to that too. Wow! Don't blow my mind with such advanced shit man!
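
For reference, the whole trick fits in a few lines of shader code. A rough sketch, not the paper's actual implementation: one direction of a separable pass, with all names and the DepthRejectScale constant made up.

// Depth-aware ("cross bilateral") 1D blur, kernel size driven by a per-pixel radius.
float DepthRejectScale; // made up: how quickly a depth difference rejects a sample

float4 DepthAwareBlurPS(float2 UV : TEXCOORD0,
    uniform sampler2D ColourAndRadius, // rgb = colour, a = blur radius
    uniform sampler2D LinearDepth,
    uniform float2 TexelSize) : COLOR
{
    float4 centre = tex2D(ColourAndRadius, UV);
    float centreZ = tex2D(LinearDepth, UV).x;

    float3 sum = 0;
    float weightSum = 0;
    for(int i = -4; i <= 4; i++)
    {
        float2 offset = float2(i, 0) * TexelSize * centre.a; // scale by the blur radius
        float4 tap = tex2D(ColourAndRadius, UV + offset);
        float tapZ = tex2D(LinearDepth, UV + offset).x;
        float w = saturate(1 - abs(tapZ - centreZ) * DepthRejectScale); // edge preserving
        sum += tap.rgb * w;
        weightSum += w;
    }
    return float4(sum / weightSum, centre.a);
}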

They say "image space" and "gathering" and in the abstract,they also use "cross bilateral filtering". The idea is simple and little more than a not-so-neat trick, a curiosity with limited applications. But there's the buzz!

I think it would be easy to write a buzz-meter, check for the frequency of some keywords in the abstracts and build a filtering system that intelligently filters all the noise...

25 August, 2009

Experiment: DOF with Pyramidal Filters

I'm back from a short trip to Montreal, a lot of people have exhaustively blogged about Siggraph and such things so I won't.

Instead as promised, here is the source code from my DOF experiment. Works in FX Composer 2.5, I couldn't use 1.x this time because it crashed on my macbook... 2.5 has other bugs I had to work around, but I managed to solve those (see the code).

As with some other previous snippets I published, those are small tests I did at home, I hope they can be inspiring for someone, maybe even in other domains, more than something I'll use right now on a game (the catch is, if they were, I couldn't publish them anyways ;)

A little background:

Pyramid filtering is a very useful technique for implementing a wide class of continuous convolution kernels. It can be used for a wide range of applications, from image upsampling to inpainting (see: Strengert, 2007. Pyramid Methods in GPU-Based Image Processing. Conference on Computer Graphics Theory and Applications (GRAPP'07), volume GM-R, pp. 21-28).

As with separable filtering, a pyramid filter can be computed in linear time, but unlike a separable filter, it can easily compute the convolution for different kernel sizes simultaneously.

A pyramid filter works by applying a convolution, with a small kernel, on the source image, and downsampling the result into a smaller texture. This step is called analysis, and it's repeated multiple times. Each new analysis level allows us to compute our kernel over a wider area. After a given number of analysis steps, we perform the same number of synthesis steps, where we start on the smallest level of our image pyramid, and go up by convolving and upsampling the image.

This pyramid lets us vary the size of our filtering kernel in the synthesis step. As the kernel size depends on the depth of the pyramid, deciding on which level a given pixel starts its synthesis process affects the size of the filter applied to that region of the image.

There are a few problems to be solved if you want to implement DOF with this.

The first one is that you need to mask some samples during your blurring pass, and it's not so obvious to choose how to do that, as the filtering is a two-pass process now. Ideally you'd want to mask different things during the first and the second pass, but it's not really possible.

The second problem is how to deal with foreground blur, that is, bleeding out the blur outside foreground object borders (see Starcraft II Effects & Techniques. Advances in Real-Time Rendering in 3D Graphics and Games Course - SIGGRAPH 2008).

I found a solution that does not look too bad, it's still improvable (a lot) and tweakable, but I didn't work further on it because it's still unable to achieve a good bokeh, and I think that's really something we have to improve in our DOF effects now. Probably it's better suited for other effects, where you don't have so many discontinuities in your blur radius. Smoke and fog scattering for example could work very well.

float Script : STANDARDSGLOBAL <
string UIWidget = "none"; // suppress UI for this variable
string ScriptClass = "scene"; // this fx will render then scene AND the postprocessing
string ScriptOrder = "standard";
string ScriptOutput = "color";
string Script = "Technique=Main;";
> = 0.8; // FX Composer supports SAS .86 (directX 1.0 version does not support scripting)

#define COLORFORMAT "A16B16G16R16F"
//#define COLORFORMAT "A8B8G8R8"

// FX Composer 2.5, on Vista, on my MacBook, does ignore "Clear=Color;" and similar... So I have to FAKE IT!
#define USEFAKECLEAR

// -- Untweakables, without UI

float4x4 WorldViewProj : WorldViewProjection < string UIWidget = "None"; >;

float2 ViewportPixelSize : VIEWPORTPIXELSIZE
<
string UIName="Screen Size";
string UIWidget="None";
>;

// -- Tweakables

float ZSpillingTolerance
<
string UIWidget = "slider";
float UIMin = 0;
float UIMax = 2;
float UIStep = 0.01;
string UIName = "Blending tollerance for background to foreground";
> = 0.1;

float DOF_ParamA
<
string UIWidget = "slider";
float UIMin = 0;
float UIMax = 20;
float UIStep = 0.01;
string UIName = "Depth of field";
> = 1;

float DOF_ParamB
<
string UIWidget = "slider";
float UIMin = -10;
float UIMax = 10;
float UIStep = 0.1;
string UIName = "Depth distance";
> = -5;

// -- Buffers and samplers

texture DepthStencilBuffer : RENDERDEPTHSTENCILTARGET
<
float2 ViewportRatio = {1,1};
string Format = "D24X8";
string UIWidget = "None";
>;

// No version of FX Composer lets me bind a mipmap surface as a rendertarget, that's why I need all this:
#define DECLAREBUFFER(n) \
texture ColorBuffer##n : RENDERCOLORTARGET \
< \
float2 ViewportRatio = {1./n,1./n}; \
string Format = COLORFORMAT; \
string UIWidget = "None"; \
int MipLevels = 1; \
>; \
sampler2D ColorBuffer##n##Sampler = sampler_state \
{ \
texture = <ColorBuffer##n>; \
MagFilter = Linear; \
MinFilter = Linear; \
AddressU = Clamp; \
AddressV = Clamp; \
}; \

DECLAREBUFFER(1)
DECLAREBUFFER(2)
DECLAREBUFFER(3)
DECLAREBUFFER(4)
DECLAREBUFFER(5)
DECLAREBUFFER(6)

// -- Data structures

struct GeomVS_In
{
float4 Pos : POSITION;
float2 UV : TEXCOORD0;
};

struct GeomVS_Out
{
float4 Pos : POSITION;
float4 PosCopy : TEXCOORD1;
float2 UV : TEXCOORD0;
};

struct FSQuadVS_InOut // fullscreen quad
{
float4 Pos : POSITION;
float2 UV : TEXCOORD0;
};

#ifdef USEFAKECLEAR
struct FakeClear_Out
{
float4 Buffer0 : COLOR0;
float Depth : DEPTH;
};
#endif

// -- Vertex shaders

GeomVS_Out GeomVS(GeomVS_In In)
{
GeomVS_Out Out;

Out.Pos = mul( In.Pos, WorldViewProj );
Out.PosCopy = Out.Pos;
Out.UV = In.UV;

return Out;
}

FSQuadVS_InOut FSQuadVS(FSQuadVS_InOut In)
{
return In;
}

float4 tex2DOffset(sampler2D tex, float2 UV, float2 texTexelSize, float2 pixelOffsets)
{
// DirectX requires a half pixel shift to fetch the center of a texel! (fx composer 2.5)
return tex2D(tex, UV + (texTexelSize * (0.5f + pixelOffsets)));
}

float4 SceneBakePS(GeomVS_Out In) : COLOR
{
float linZ = 1/In.PosCopy.z;

float3 color = frac(In.UV.xyx * 3); // just a test color...

return saturate(float4(color, linZ));
}

#ifdef USEFAKECLEAR
FakeClear_Out FakeClearPS(FSQuadVS_InOut In)
{
FakeClear_Out Out = (FakeClear_Out)0;
Out.Depth = 1.f;

return Out;
}
#endif

float ComputeDofCoc(float linZ)
{
return abs(DOF_ParamA * ((1/linZ) + DOF_ParamB));
}

float ComputeAntiZSpillWeight(float linZcenter, float linZ)
{
return max(ZSpillingTolerance - abs(linZ - linZcenter),0) / ZSpillingTolerance;
}

float4 AnalysisPS(
FSQuadVS_InOut In,
uniform float level,
uniform sampler2D InColor,
uniform float2 SrcUVTexelSize
) : COLOR
{
const float AnalysisRadius = 1;

// Reduction step / gathering
float4 col = tex2DOffset( InColor, In.UV, SrcUVTexelSize, 0.f.xx );
float centerLinZ = col.w;
float weight = 1;

float4 res = tex2DOffset(InColor, In.UV, SrcUVTexelSize, float2(-1,-1) * AnalysisRadius);
float resweight = ComputeAntiZSpillWeight(centerLinZ, res.w);
weight += resweight;
col += res * resweight;

res = tex2DOffset(InColor, In.UV, SrcUVTexelSize, float2(1,-1) * AnalysisRadius);
resweight = ComputeAntiZSpillWeight(centerLinZ, res.w);
weight += resweight;
col += res * resweight;

res = tex2DOffset(InColor, In.UV, SrcUVTexelSize, float2(-1,1) * AnalysisRadius);
resweight = ComputeAntiZSpillWeight(centerLinZ, res.w);
weight += resweight;
col += res * resweight;

res = tex2DOffset(InColor, In.UV, SrcUVTexelSize, float2(1,1) * AnalysisRadius);
resweight = ComputeAntiZSpillWeight(centerLinZ, res.w);
weight += resweight;
col += res * resweight;

col /= weight; // normalize
//col.w = centerLinZ; // leave Z as untouched as possible

return col;
}

float4 SynthesisPS(
FSQuadVS_InOut In,
uniform float level,
uniform sampler2D InColor,
uniform float2 SrcUVTexelSize,
uniform sampler2D PrevInColor,
uniform float2 PrevSrcUVTexelSize
) : COLOR
{
// Always sampling the first level to obtain linZ
float linZ = tex2DOffset(ColorBuffer1Sampler, In.UV, 1.f/ViewportPixelSize, 0.f.xx).w;
float dof_coc = ComputeDofCoc(linZ);

// Expansion step / scattering
float4 col = tex2DOffset(InColor, In.UV, SrcUVTexelSize, 0.f.xx);
float4 prevcol = tex2DOffset(PrevInColor, In.UV, PrevSrcUVTexelSize, 0.f.xx);

// Blur out
if( ComputeDofCoc(col.w) < ComputeDofCoc(prevcol.w) ) col.xyz = prevcol.xyz; // let the coarser (more blurred) level bleed out over this one
//dof_coc = max(dof_coc, max(ComputeDofCoc(col.w), ComputeDofCoc(prevcol.w)));

float useThisLevel = saturate(dof_coc - (level /* *2 */));

//return float4(useThisLevel.xxx, 1.f); // DEBUG
return float4(col.xyz, useThisLevel);
}

float4 FSQuadBlitPS(FSQuadVS_InOut In, uniform sampler2D InColor, uniform float2 InTexelSize) : COLOR
{
return tex2DOffset(InColor, In.UV, InTexelSize, 0.f.xx).xyzw;
}

// -- Technique

#define DECLAREANALYSIS(s,d) \
pass Analysis##s \
< \
string Script = \
"RenderColorTarget0=ColorBuffer"#d";" \
"Draw=Buffer;"; \
> \
{ \
ZEnable = false; \
ZWriteEnable = false; \
AlphaBlendEnable = false; \
VertexShader = compile vs_3_0 FSQuadVS(); \
PixelShader = compile ps_3_0 AnalysisPS( s, ColorBuffer##s##Sampler, 1.f/(ViewportPixelSize/s) ); \
}

#define DECLARESYNTHESIS(prevs,s,d) \
pass Synthesis##d \
< \
string Script = \
"RenderColorTarget0=ColorBuffer"#d";" \
"Draw=Buffer;"; \
> \
{ \
ZEnable = false; \
ZWriteEnable = false; \
AlphaBlendEnable = true; \
SrcBlend = SrcAlpha; \
DestBlend = InvSrcAlpha; \
ColorWriteEnable = 7; \
VertexShader = compile vs_3_0 FSQuadVS(); \
PixelShader = compile ps_3_0 SynthesisPS( d, ColorBuffer##s##Sampler, 1.f/(ViewportPixelSize/s), ColorBuffer##prevs##Sampler, 1.f/(ViewportPixelSize/prevs) ); \
}

// debug-test stuff:
#define ENABLE_EFFECT
#define ENABLE_SYNTH

technique Main
<
string Script =
#ifdef USEFAKECLEAR
"Pass=FakeClear;"
#endif
"Pass=Bake;"
#ifdef ENABLE_EFFECT
"Pass=Analysis1;"
"Pass=Analysis2;"
"Pass=Analysis3;"
"Pass=Analysis4;"
"Pass=Analysis5;"
#ifdef ENABLE_SYNTH
"Pass=Synthesis5;"
"Pass=Synthesis4;"
"Pass=Synthesis3;"
"Pass=Synthesis2;"
"Pass=Synthesis1;"
#endif
#endif
"Pass=Blit;"
;
>
{
#ifdef USEFAKECLEAR
pass FakeClear
<
string Script =
"RenderColorTarget0=ColorBuffer1;"
"RenderDepthStencilTarget=DepthStencilBuffer;"
"Draw=Buffer;";
>
{
ZEnable = true;
ZWriteEnable = true;
ZFunc = Always;

VertexShader = compile vs_3_0 FSQuadVS();
PixelShader = compile ps_3_0 FakeClearPS();
}
#endif

pass Bake
<
string Script =
"RenderColorTarget0=ColorBuffer1;"
"RenderDepthStencilTarget=DepthStencilBuffer;"
"Clear=Color;"
"Clear=Depth;"
"Draw=Geometry;";
>
{
ZEnable = true;
ZWriteEnable = true;

VertexShader = compile vs_3_0 GeomVS();
PixelShader = compile ps_3_0 SceneBakePS();
}

#ifdef ENABLE_EFFECT
DECLAREANALYSIS(1,2)
DECLAREANALYSIS(2,3)
DECLAREANALYSIS(3,4)
DECLAREANALYSIS(4,5)
DECLAREANALYSIS(5,6)
DECLARESYNTHESIS(6,6,5)
DECLARESYNTHESIS(6,5,4)
DECLARESYNTHESIS(5,4,3)
DECLARESYNTHESIS(4,3,2)
DECLARESYNTHESIS(3,2,1)
#endif

pass Blit
<
string Script =
"RenderColorTarget0=;"
"RenderDepthStencilTarget=;"
"Clear=Color;"
"Draw=Buffer;";
>
{
ZEnable = false;
ZWriteEnable = false;
VertexShader = compile vs_3_0 FSQuadVS();
PixelShader = compile ps_3_0 FSQuadBlitPS( ColorBuffer1Sampler, 1.f/(ViewportPixelSize/1) );
}
};

23 July, 2009

Analytic diffuse shading

No big news yet. I've been studying analytic methods for shading and occlusion a little bit; I can't really report anything now, also because I'm not yet satisfied with what I've done.

But I'd like to share this link: http://www.me.utexas.edu/~howell/tablecon.html, differential element to finite area is what you’ll need.

Also, you might find this useful, if you're starting to play around with spherical harmonics; it's a small snippet of what I've been doing, with Mathematica:
(* analytic solution for real spherical harmonics test *)

shIndices[level_] := (Range[-#1, #1] & ) /@ Range[0, level]
shGetNormFn[l_, m_] := Sqrt[((2*l + 1)*(l - m)!)/(4*Pi*(l + m)!)]
shGetFn[l_, m_] :=
Piecewise[{{shGetNormFn[l, 0]*LegendreP[l, 0, Cos[\[Theta]]],
m == 0}, {Sqrt[2]*shGetNormFn[l, m]*Cos[m*\[Phi]]*
LegendreP[l, m, Cos[\[Theta]]],
m > 0}, {Sqrt[2]*shGetNormFn[l, -m]*Sin[(-m)*\[Phi]]*
LegendreP[l, -m, Cos[\[Theta]]], m < 0}}]
shFunctions[level_] :=
MapIndexed[
Function[{list, currlevel}, (shGetFn[currlevel - 1, #1] & ) /@
list], shIndices[level]]
shGenCoeffs[shfns_, fn_] :=
Map[Integrate[#1*fn[\[Theta], \[Phi]]*Sin[\[Theta]], {\[Theta], 0,
Pi}, {\[Phi], 0, 2*Pi}] & , shfns, {2}]
shReconstruct[shfns_, shcoeffs_] :=
Simplify[Plus @@ (Flatten[shcoeffs]*Flatten[shfns]),
Assumptions -> {Element[\[Theta], Reals],
Element[\[Phi], Reals], \[Theta] >= 0, \[Phi] >= 0, \[Theta] <=
Pi, \[Phi] <= 2*Pi}]

shIsZonal[shcoeffs_, level_] :=
Plus @@ (Flatten[shIndices[level]] Flatten[shcoeffs]) == 0
shGetSymConvolveNorm[level_] :=
MapIndexed[
Function[{list, currlevel},
Table[Sqrt[(4 \[Pi])/(2 currlevel + 1)], {Length[list]}]],
shIndices[level]]
shGetSymCoeffs[shcoeffs_] :=
Table[#1[[Ceiling[Length[#1]/2]]], {Length[#1]}] & /@ shcoeffs
shSymConvolve[shcoeffs_, shsymkerncoeffs_,
level_] := (Check[shIsZonal[shsymkerncoeffs], err];
shGetSymConvolveNorm[level] shcoeffs shGetSymCoeffs[
shsymkerncoeffs])

(* tests.... *)

testnumlevels = 2
testfn[a_, b_] :=
Cos[a]^10*UnitStep[Cos[a]] (*symmetric on the z axis*)
(*testfn[a_,b_]:= (a/Pi)^4*)
shfns = shFunctions[testnumlevels]
testfncoeffs = shGenCoeffs[shfns, testfn]
shIsZonal[testfncoeffs, testnumlevels]
testfnrec = {\[Theta], \[Phi]} \[Function]
Evaluate[shReconstruct[shfns, testfncoeffs]]
SphericalPlot3D[{testfn[\[Theta], \[Phi]],
testfnrec[\[Theta], \[Phi]]}, {\[Theta], 0, Pi}, {\[Phi], 0, 2 Pi},
Mesh -> False, PlotRange -> Full]

testfn2[a_, b_] := UnitStep[Cos[a] Sin[b]](*asymmetric*)
testfn2coeffs = shGenCoeffs[shfns, testfn2]
testfn3coeffs =
shSymConvolve[testfn2coeffs, testfncoeffs, testnumlevels]
testfn2rec = {\[Theta], \[Phi]} \[Function]
Evaluate[shReconstruct[shfns, testfn2coeffs]]
testfn3rec = {\[Theta], \[Phi]} \[Function]
Evaluate[shReconstruct[shfns, testfn3coeffs]]
SphericalPlot3D[{testfn2[\[Theta], \[Phi]],(*testfn2rec[\[Theta],\
\[Phi]],*)testfn3rec[\[Theta], \[Phi]]}, {\[Theta], 0, Pi}, {\[Phi],
0, 2 Pi}, Mesh -> False, PlotRange -> Full]

13 July, 2009

Going to Vegas...

I'm going to Vegas tomorrow, unfortunately the blog has been very slow lately, that's not because I'm not experimenting, it's because I'm investigating a lot of things at work, and I can't blog about them...

I'll post the shader for the DOF effect soon tho, as it was not accepted for ShaderX8 (sadly, as I think it's way better than the article I wrote last time, and that was in ShaderX7)

16 May, 2009

How the GPU works - appendix A

This is the first follow up article to the "how the GPU works" series, and this one is also related to GPU versus CPU

Note: if you already know how and why a GPU works and how it does compare to a modern CPU, and why the two are converging, you might want to skip to the last paragraph directly...

NVidia Cuda Architecture
Cuda is the NVidia API for General Purpose GPU computation. What does that mean? It is an API that allows us to access the GPU hardware, same as OpenGL or DirectX, but it was designed not to render shaded triangles, but to execute programs, generic programs. Well, generic... there are some restrictions. It's stream computing: you specify a smallish kernel, that executes on a huge array of data, and can access memory resources in a very specific way (compared to what we can do on a CPU).
The API itself is not too interesting, there are a lot of similar ones (most notably OpenCL and DirectX11 compute shaders, but also Rapidmind, ATI CTM/FireStream and others). What is interesting is that a new API also sheds some new light on the underlying architecture, and that architecture is what really matters today (in that respect the recent NVAPI looks really promising too, and ATI's open GPU documents are a good read as well).

Why does that matter?
For us of course, knowing the GPU architecture is fundamental. The more information we have, the better. Plus the compute shader model allows for some nifty new effects, especially in postprocessing, and someone is even starting to map REYES to it (Pixar!).

But GPGPU has a relevance that is outside the realm of the graphics (so much that most Cuda documentation is not aimed at rendering engineers). It sheds some light on the future of CPUs in general.

The future of CPUs... Multicore! Multicore is the hype! But what comes with it? How to fit all those cores in a single chip? What about power consumption?

Well it turns out, and it's no news, that you have to simplify... and if you've only experienced the Intel desktop line of processors, that's not so clear yet, even if the new Core architectures went back from the extremely power-hungry Pentium 4 design to look at the Pentium M design again (i.e. some of the first Centrino chips).
For the Larrabee chip, which will have many more cores than current desktop CPUs, they used for each core something derived from the design of the original Pentium.

When it comes to computing power per watt, our current CPUs are faring pretty badly. It has been said that GPU computational power is outpacing Moore's law, while CPUs are merely respecting that trend. That's false: Moore's law is not about gigaflops, but about the number of transistors. The fact that desktop CPUs have a Moore's law-like behaviour in terms of gigaflops means they're wasting a lot of them, because gigaflops should grow not only with the number of transistors, but also with better use of those, and with increased clock rates as well! That's why the smart guys at Google preferred to stick to older Pentium processors in their farms...

Where did all those transistors go, if not directly into improving operations per second? Well, they went into increasing performance, which is a different thing. They increased the execution speed of the applications they were designed to execute, spending most of their effort on complex logic to handle caching, to predict branches, to decode and reschedule instructions... That's why now "the free lunch is over": chips are going simpler, more gigaflops are out there, but we have to do work to use them, we have to change our code.

That's really evident if you work on the current generation of gaming consoles. The processors powering the Xbox 360 and the Playstation 3, Xenon and Cell respectively, are very similar, in fact the former was derived from the design of the latter.

Multicore, in-order execution (no rescheduling of your assembly instructions!), no or minimal branch prediction, SIMD instructions with long pipelines, lots of registers (but you can't move data from one set to the other without passing through memory), a few cores, and a punishing penalty for cache misses.

The idea is to minimize the power needed to achieve the result. That is also a rule in general we should follow when designing software!

How to squeeze performance out of those CPUs?
Well, it turns out that's not so hard. Actually it can be incredibly fast, if you have a small-ish function that iterates over a lot of data, and that data is arranged in a good way. We use SIMD for sure, but that has a long instruction pipeline, and no reordering, so we need to push a lot of instructions that have no dependencies with previous ones in order to keep it happy.

How? Well we go even wider, by unrolling (and inlining) our loop! We try to avoid branches, using conditional moves for example. We exploit data-parallelism, splitting our loop into threads, and prefetching memory into the cache. That's why data is the king!

If, unluckily, your memory accesses are not linear, and thus prefetching does not work well, you go even wider! The idea is to have something similar to fibers: different execution contexts running in your thread, switching between them before a memory access. Of course, you first fire a prefetch, and then switch, otherwise when you switch back you won't have gained anything...

Usually you don't need explicit fibers, you organize your algorithm in an external loop, that prefetches the random memory accesses, and internal unrolled ones that provide enough ALU work to make the prefetch successful.

Note that in some way, this is something that many CPUs do implement internally too. Both Xenon and Cell do something similar to Intel Hyperthreading, they have more hardware threads than cores. That is done to ensure they have more independent code paths to execute per core, so if a core is waiting on a memory request, or on a pipeline stall, it can still execute something by switching to the other thread, and keep the pipelines happy.

How does a GPU map to a standard CPU?
Well, if you look at a GPU from a perspective of an older, non multicore CPU, it could seem that the two chips are totally unrelated. But nowadays, with many, simpler cores starting to appear in CPUs, and GPUs being more and more general purpose, the gap is becoming way smaller. Just take this CPU trend, and push it to the extreme, and you'll have something close to a GPU (as a sidenote, that's why some of us, rendering engineers, can seem to already have seen the future, in terms of how to do threading...).

Imagine a CPU core, and think about packing a LOT of those into a single chip. You'll pretty soon run out of resources, space and power. So you have to remove stuff, simplify.
We have already seen that we can give up many technologies that were made to squeeze performance out of sequential code, like fancy memory and branch prediction units, out of order instruction schedulers, long pipelines and so on. On the other hand, we want to maximize our computing power, so it makes sense to go wide, add ALUs and process a lot of data in parallel (in a SIMD fashion).

The picture we have so far is of a computational component (usually in a GPU there are many of those) that has a simple instruction decoding unit, some kind of memory access (usually in a GPU there are separate units for texture and vertex memory, with different caching policies and functionalities), and a lot of ALUs. This looks like a pain to program! As we saw, the problem with having such raw power is that it works really well only if we don't have stalls, and not having stalls requires no instruction dependencies, and no memory accesses. That's impossible, so we have to find a way around it.
What did we say about modern CPU programming? The solution is to identify and unroll loops, to have fewer dependencies between instructions, and when things go really wrong, in case of memory stalls, we can switch execution context altogether, with fibers, in order to find new instructions that are not waiting on the stall.
In other words, we can always find instructions to process if our loops are way wider than our wide processing elements, so we not only fill their SIMD instructions, but "overflow" them. If our CPU has 4-way SIMD vectors, and our computation is unrolled so that it could use 16-way ones, it means that we have 4 independent 4-way instruction paths the CPU can execute!
Well GPUs were made to process vertices and pixels, usually we have a lot of them, a lot of data to process with the same code, the same kernel. That's a big loop! There is hope to implement the same solution in hardware!

Here things become really dependent on the specific GPU, but anyway, in general we can draw a picture. Think about a single instruction decoder that feeds the same instructions to a lot of ALUs. Those ALUs can be SIMD or not, it doesn't really matter, because anyway the same instruction is shared among the ALUs, thus achieving parallel execution (interestingly, even if shader code is capable of using vectors of four floats, and packing data correctly can make a big difference in performance, especially when it comes to register usage, not all GPU ALUs work natively on those vectors).

NVIDIA calls that SIMT execution, but the same concept applies to ATI GPUs as well (recent ATI units can process two instructions, one four-way vector and one scalar, in a single cycle; NVIDIA ones can do two scalars, and there are some other limitations on the operations that can be performed too). Intel Larrabee differs from those GPUs in that it will have only explicit SIMD, 16-wide: there is no control unit that dispatches the same instruction to different ALUs, on different registers, it's all encoded into instructions that access 16-float-wide registers.
To keep the data for all those ALUs, there is a big shared register space. This register space not only provides enough registers to unroll our computation, enabling the repetition of the same instruction for more than one clock cycle on different data (usually an instruction is repeated two or four times, thus masking dependencies between instructions!), but also enables us to keep data for different execution contexts of the same code, different fibers! As we're executing the same instruction on a lot of ALUs for different cycles, our fibers are not independent: they are scheduled in small blocks that follow the same destiny (the same instruction flow).
The instruction decoder will automatically, in hardware, take care of switching context on stalls. A context has to record the state of all the (temporary) registers needed to execute the shader (or loop kernel) code, exactly as a fiber does (with its own function stack and copy of the CPU registers). So depending on the complexity of the code, the GPU can have more or fewer contexts in flight, and thus do a better or worse job at hiding big (memory) latencies.
Again, we need a lot of registers because we want our data to be way wider than our capability of executing it! Multiply that by a lot of units made of the same components (instruction decode, memory interface, a lot of ALUs and even more registers), and you'll have a GPU (read this)!

How does Cuda map to our view of the GPU as a rendering device?
With all that in mind, we have now a pretty good picture of what a GPU is. And all this is really well explained in many Cuda tutorials and presentations as well, focusing more and specifically on the computational architecture of recent Nvidia GPUs.
The problem that I've found reading all that is that it's not only not general (i.e. it's Nvidia specific), but it also leaves open the question of how all that architecture maps to the common concepts we are used to. All the terminology they use is totally different from that of the graphics APIs, and that's understandable, as the concepts of vertices, quads and texels do not make much sense in a general programming environment... But those are concepts that we're used to, we know how those entities work in a GPU, so knowing which is which in the Cuda terminology is useful.

What I'll do now is to try to explain the basic Cuda concepts mapping them to what I wrote before, and to common graphic programming terminology. Disclaimer: I've found no document that confirms my intuitions, so the following may be very wrong!

You might already know what Cuda looks like. It's a C-like programming language (and an API): you can write functions that look very similar to shaders and that execute on the GPU, using floats and SIMD vectors, but you also write the code for the CPU in the same language. The CPU code will be mostly about allocating memory on the GPU and filling it with data, to then launch GPU functions to process that data, get the result back, rinse and repeat.

That's very similar to what we do, usually in C/C++, in our graphics engines: we create and set GPU resources, streams, textures and shader constants, and then we execute a shader that will use those resources and give us something back (in our case, pixels). What Cuda hides from you are the vertex and pixel shader stages; you can declare only a generic GPU function, which makes sense because modern GPUs are unified, they don't really have different units for the two kinds of shaders. That also means that you lose control over some fixed-function GPU stages: you don't control the rasterizer for example, nor the blending operator, and you can't use the vertex attribute interpolators either.

The GPU functions are also called Cuda kernels; a kernel is executed in parallel, over many threads. Here, when you read "thread" you really have to think about a fiber, a lightweight execution context. Actually, threads are scheduled in "warps", units of 32 threads, and all the threads in a warp execute the same instruction (thus, if one thread branches, all the ones in the warp have to follow the same branch too, and discard it later if it was not what they wanted). This is the very same thing that happens to vertices and quads when they get shaded: they are always processed in blocks, usually each block is a multiple of the number of ALUs a GPU core has, as a single instruction gets fetched and executed over all the ALUs for more than a single clock cycle (changing the register context the ALUs are operating with at each cycle).

A first interesting thing is that the threads get their own thread Id, from which they can decide which data to fetch. Compared to rendering, that means that you don't get linear data streams and an indexing of those, but you can do arbitrary fetches. That's somewhat similar to the xbox 360 custom vfetching functionality, where in a vertex shader you can receive only the vertex index, instead of the stream data, and get the data from the streams explicitly (i.e. for hardware instancing).
The thread Id can be organized in a linear fashion, or can be two or three-dimensional, that comes handy if you're executing a kernel over some data that has a grid topology.

The second, and probably the most important thing, is memory. In Cuda you have different kinds of GPU (device) memory.
Let's see them starting from the ones "closest" to the threads: the registers. Registers are the space used in a function to hold temporary results and do arithmetic. Unsurprisingly, knowing how shaders work, you don't actually get a fixed number of them, they are allocated depending on the needs of the function. Minimizing the number of registers used is fundamental because it controls how many threads a GPU core (what Cuda calls a "streaming multiprocessor") can hold (and as usual, the more threads we have, the more opportunities we have to find one that is not stalled on a memory access).
Registers are bound to the unit ("streaming processor") that is executing a given thread, they can't be seen across threads.

A level above the registers we have the shared memory. Shared memory can be seen across threads, but only if they are executing on the same core, and is especially useful to implement data caching (that's also where having 2d and 3d thread identifiers comes in handy, otherwise it could be impossible to have Cuda execute threads in a way where shared memory can be shared meaningfully). I'm not sure, but I do suspect that shared memory is what in the normal shader world would hold shader constants; if so, the interesting addition that Cuda makes is that you can now write from a shader to that memory as well.

In order to control the thread execution on a given core, threads are organized in thread blocks, of a fixed size, and a kernel executes over them. Blocks also get their own Id (they are part of what Cuda calls a compute "grid"), and as for thread Id it can be one to three dimensional.
Now that is where things get complicated. In order to run a kernel, you have to specify the number and dimension of the blocks you want to split it into, and the number and dimension of the threads inside the blocks. The problem is that those numbers can't be arbitrary: blocks are dispatched to cores and they can't be split, so a core has to have enough resources to hold all the threads in the block. That means that it has to have enough registers to create contexts to execute the kernel function on all the threads, and it has to have enough shared memory to hold the shared data of the function as well.

Last, but not least, there is the global device memory. This is split into constant memory, read-only and cached; texture memory, read-only again but with filtering and swizzling for fast 2d cached access; and global memory, which is read and write, but uncached.
Constant memory is most probably what the rendering APIs use for vertex stream data, texture memory is obviously associated with texture samplers, so we're left with global memory. Well, the only resource shaders can generally write to are the render targets; also, looking at the Cuda documentation we see that global memory writes are coalesced into bigger block writes, if a series of very restrictive conditions are met.
This also smells of render target/stream out memory, so we can guess that's the GPU unit used to implement it; again, the good news is that now this memory is accessible in read and write (render targets are only writable, even if you have the fixed blend operation stage that does read from them). Also, we can finally implement scattering, writing to multiple, arbitrary positions. Slowly, as it disables write coalescing, but it's possible!

As a last note, Cuda also has the concept of local memory. This is really just global memory, read/write and uncached, that Cuda allocates per thread if the kernel function has local arrays or explicitly tagged local variables. Even if it's per thread, it does not live on a GPU core, but outside, in the graphics card RAM.
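
To close the terminology loop, the same dictionary written as a DirectX 11 compute shader (one of the similar APIs mentioned at the beginning), since HLSL is what we're used to on this blog. Just a sketch with made-up buffer names and sizes: a Cuda thread block maps to a thread group, threadIdx/blockIdx to the SV_Group* system values, shared memory to groupshared, global memory to (RW) buffers.

RWStructuredBuffer<float> Output; // "global memory", read/write
StructuredBuffer<float> Input;

groupshared float Cache[64]; // "shared memory": visible only within one group (block)

[numthreads(64, 1, 1)] // block (group) size
void ReduceCS(uint3 GroupId : SV_GroupID,          // ~ blockIdx
              uint3 ThreadId : SV_GroupThreadID,   // ~ threadIdx
              uint3 DispatchId : SV_DispatchThreadID)
{
    Cache[ThreadId.x] = Input[DispatchId.x]; // load from "global" into "shared"
    GroupMemoryBarrierWithGroupSync();       // ~ __syncthreads()

    // naive per-group reduction, just to touch the shared data
    if(ThreadId.x == 0)
    {
        float sum = 0;
        for(int i = 0; i < 64; i++)
            sum += Cache[i];
        Output[GroupId.x] = sum;
    }
}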

06 May, 2009

New DOF technique test

2. almost there...


1. first working version, still needs refinement!

0. obviously it still does not work...