
06 December, 2008

Which future? (plus, a shader optimization tip)

Update: poll closed. Results:

Cpu and Gpu convergence - 36
One to one vertex-pixel-texel - 7
Realtime raytracing - 14
More of the same - 9
We don't need any of that - 14

We have a clear winner. To my surprise, the realtime raytracing hype did not have a big influence on this, nor did Carmack with his id Tech 5 and "megatextures" (clipmaps), or Jon Olick with his voxel raycaster... IBM's Cell CPU and Intel's Larrabee seem to lead the way... We'll see.
---
Which future? Well, in the long term, I don't know. I don't know anything about it at all. In general our world is way too chaotic to be predictable. Luckily.


I'm happy to leave all the guesswork to the hardware companies (this article by AnandTech is very interesting).

But we can see the near future clearly: there are a lot of technologies, ideas, trends, promising research... Of course we don't know which ones will really be successful, but we can see the options, and with enough experience even make some predictions. Educated guesses.

But what is even more interesting to me is not which ones will succeed; that's not too exciting, I'll know anyway, it's just a matter of time. I'd like to know which ones should succeed, which ones we are really waiting for. That's why I'm adding a poll to this blog, and asking you.

I'm not writing a lot lately, the game I'm working on is taking most of my time, so this blog is not as followed as it used to be, and running a poll right now might not be the brightest idea ever. But I've never claimed to be bright...

Before enumerating the "possible futures" I've chosen for the poll, a quick shader tip, so I don't feel too guilty about asking without giving anything back. Ready? Here it comes:

Shader tip: struggling to optimize your shader? Just shuffle your code around! Most shader compilers can't find the optimal register allocation. That's why, for complicated shaders, sometimes even removing code leads to worse performance. You should always try to pack your data together, give the compiler all the possible hints, not use more than you need (e.g. don't push a vector3 of grayscale data) and follow all the usual optimization best practices. But after you've done everything right, also try just shuffling the order of computations. It makes so much difference that you'd better structure your code as many separate blocks with clear dependencies between them...
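
To make the "separate blocks" idea concrete, here is a tiny, made-up sketch (none of these names come from a real shader): blocks A and B don't depend on each other, so their order is free, and swapping them is exactly the kind of shuffle that can change the register allocation the compiler picks.

sampler2D GlossSampler;
float3 CameraPos;

float4 ExamplePS(float2 uv : TEXCOORD0,
                 float3 worldPos : TEXCOORD1,
                 float3 worldNormal : TEXCOORD2) : COLOR
{
    // Block A: fetch the gloss factor (depends only on uv)
    float gloss = tex2D(GlossSampler, uv).r;

    // Block B: view vector and a Fresnel-ish term (depends only on worldPos/worldNormal).
    // Independent from A, so A and B can be reordered without changing the result.
    float3 v = normalize(CameraPos - worldPos);
    float fresnel = pow(1 - saturate(dot(normalize(worldNormal), v)), 5);

    // Dependent block: combines A and B, so it has to come after both.
    return float4(fresnel.xxx * gloss, 1);
}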

Ok, let's see the nominations now:

Realtime raytracing.
Raytracing is cool. So realtime raytracing is cool squared! It looks good in a Ph.D. thesis, it's an easy way to use multithreading and an interesting way to use SIMD and CUDA. You can implement it in a few lines of code, it looks good in any programming language, it lets you render all kinds of primitives, and it's capable of incredibly realistic imagery.
It's so good that some people started to think that they need it, even if they don't really know why... There's a lot of hype around it, and a lot of ignorance (e.g. the logarithmic-complexity myth: raytracing is not faster than rasterization, not even in the asymptotic limit; there are data structures, applicable to both, that can make visibility queries faster under some conditions, even if building such structures is usually complicated).

CPU and GPU convergence, no fixed pipeline. Larrabee, the DirectX 11 compute shaders, or even the Playstation 3 Cell processor... GPUs are CPU wannabes and vice versa. Stream programming, massive multithreading, functional programming and side effects. Are we going towards longer pipelines, programmable but fixed in their roles, or are we going towards a new, unified computing platform that can be programmed to do graphics, among other things? Software rendering again? Exciting, but I'm also scared by its complexity; I can hardly imagine people optimizing their trifillers again.

One-to-one vertex-pixel (and texel) ratio.
Infinite detail. Jon Olick (id Software, SIGGRAPH 2008) and Jules Urbach (ATI Cinema 2.0 and the OTOY engine) generated a lot of hype by using raytracing to achieve that goal. But the same can be done right now with realtime tessellation, which is also the route DirectX 11 took. REYES could be interesting too; for sure the current quad-based rasterization pipelines are a bottleneck at high geometrical densities.
Another problem is how to generate and load the data for all that detail. Various options are available: acquisition from real-world data plus streaming, procedural generation, compression, each with its own set of problems and tradeoffs.

Just more of the same. More processing power, but nothing revolutionary. We don't need such a revolution: there are still a lot of things we can't do due to lack of processing power and the rising amount of pixels to be pushed (HD at 60Hz is not easy at all). Also, we still have to solve the uncanny valley problem, and that should shift our attention, even as rendering engineers, outside the graphics realm (see this neat presentation from NVISION 08).

And the last one, which I had to add so I'd have something to vote for myself:

Don't care, we don't need more hardware.
The next revolution won't be about algorithms or hardware, but about tools and methodologies. We are just now beginning to discover what colors are (linear color rendering, gamma correction etc.), we still don't know what normals are (filtering, normalmapping), and we're far from having a good understanding of BRDFs, of their parameters, of physically based rendering. We make too many errors, and add hacks and complexity in order to correct them visually. Complexity is going out of control, we have too much of everything and too-slow iteration times. There's much to improve before we really enter a "next" generation.


26 October, 2008

off topic 2

Today I was walking on West Broadway to buy some camera goodies (this is a view from the Cambie bridge, my apartment is in the tallest building on the right)... It's winter, but over the weekend it was sunny and nice. Turns out that I'm happy. Not that that's strange, I'm usually happy; even if our minds are ruled by the Y combinator and our lives are meaningless, we're human after all, and my meaningless life is quite good. It's just that we don't often stop to think about that, go for a walk, sing along, like a fool in a swimming pool.


20 October, 2008

Just blur

Note: this is an addendum to the previous post, even if it should be self-contained, I felt that the post was already too long to add this, and that the topic was too important to be written as an appendix...

How big is a point? Infinitesimal? Well, for sure you can pack two of them as close as you want, up to your floating point precision... But where does dimension come to play a role in our CG model?

Let's take a simplified version of the scenario of my last post:
  • We want to simulate a rough, planar mirror.
  • We render-to-texture a mirrored scene, as usual.
  • We take a normalmap for the roughness.
  • We fetch texels from our mirrored scene texture using a screen-space UV but...
  • ...we distort that UV by an amount proportional to the tangent-space projection of the normalmap.
Simple... and easily too noisy: the surface roughness was too high frequency in my scenario, as easily happens when your mirror is nearly perpendicular to the image plane... So we blur... In my post I suggested using a mipmap chain, for various reasons, but anyway we blur, and everything looks better.

But let's look at that "blurring" operation a little bit closer... What are we doing? We blur, or pre-filter, because it's cheaper than supersampling and post-filtering... So, is it anti-aliasing? Yes, but not really... What we are doing is integrating a BRDF: the blur we apply is similar (even if way more incorrect) to the convolution we do on a cubemap (or equivalent) encoding the lights surrounding an object, to build a lookup table for diffuse or specular illumination.

It's the same operation! In my previous post I said that I considered the surface to be a perfect mirror, with a Dirac delta, perfectly specular BRDF. Now the reflected texture exactly represents the light sources in our scene, or some of them: the first-bounce indirect ones (all the objects in the scene that directly reflect light from energy-emitting surfaces). If we convolve it with a perfectly specular BRDF we get back the same image, indexed by the normals of the surface we're shading. But if we blur it, that's a way of convolving the same scene with a non perfectly specular (glossy) BRDF!

In my implementation, for various reasons, I used only mipmaps, which are not really a blur... The nice thing, for higher quality, would be to use a real blur with a BRDF-shaped kernel that sits on the reflection plane (so it ends up as an ellipsoid when projected in image space)...
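
For the record, a minimal sketch of the mipmap approximation (all names here are made up, and this ignores the anisotropy of the proper, plane-aligned kernel): read the mirrored-scene texture at a mip level driven by the roughness, so a rougher surface effectively convolves the reflected scene with a wider lobe.

sampler2D ReflectionMap;      // mirrored scene, rendered to texture, with mipmaps
float ReflectionMipLevels;    // number of mip levels in ReflectionMap
float Roughness;              // 0 = perfect mirror, 1 = very rough

float3 GlossyReflection(float2 reflUV)
{
    // Higher roughness -> higher mip -> wider (isotropic) blur of the reflected scene
    float lod = Roughness * ReflectionMipLevels;
    return tex2Dlod(ReflectionMap, float4(reflUV, 0, lod)).rgb;
}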

In this context, we need all those complications because we don't know another way of convolving the first-bounce indirect lighting with our surfaces: we don't have a closed-form solution of the rendering equation with that BRDF, which means we can't express that shading as a simple lighting model (as is, for example, a Phong BRDF with a point light source).

What does that show? It shows us a dimension that we have in our computer graphics framework, implied by the statistical model of the BRDF. We take our real-world, physical surfaces, which are rough and imperfect if we look at them closely enough (but not too closely, otherwise the whole geometrical optics theory does not apply), we choose a minimum dimension, see how that roughness is oriented, take a mean over that dimension, and capture it in a BRDF.
Note how this low-pass filtering, or blurring, over the world's dimensions is very common; in fact, it's the base of every mathematical model in physics (and remember that calculus does not like discontinuities...). Models always have an implied dimension: if you look at your phenomena at a scale where that dimension becomes relevant, the model "breaks".
The problem is that in our case we push that dimension to be quite large, not because we want to avoid entering the quantum physics regime, but because we don't want to deal with explicitly integrating high frequencies. So we assume our surfaces to be flat, and capture the high frequencies only as an illumination issue, in the BRDF. In that way, the dimension we choose depends on the distance we're looking at a surface from, and it can easily become on the order of millimeters.

We always blur... We prefer to pre-blur instead of post-blurring, as the latter is way more expensive, but in the end what we want to do is to reduce all our frequencies (geometry, illumination etc) to the sampling one.

What does that imply, in practice? How is that relevant to our day to day work?

What if our surfaces have details of that dimension? Well, things generally don't go well... That's why our artists at Milestone, when we were doing the road shading, found it impossible to create a roughness normal map for the tracks: it looked bad...
We ended up using normalmaps only for big cracks and for the wet track effect, as I explained.
It also means that even for the wet track, it's wise to use the normalmap only for the reflection, and not for the local lighting model... The water surface and the underlying asphalt layer have a much better chance of looking good using the geometric normals, maybe modulating the specular highlights with a noisy specular map, e.g. using the length of the Z component (the axis aligned with the geometric normal) of the (tangent space) roughness normalmap...

Note: if I remember correctly, in my early tests I was using the normalmap only for the water reflection, not using a separate specular for the water (so that layer was made only of the indirect reflection), and using a specular without the normalmap for the asphalt layer. Those are all details, but the interesting thing that I hope I showed here is why this kind of combination worked better than others for shading that surface...

18 October, 2008

Impossible is approximatively possible

I'm forcing myself to drive this blog more towards practical matters and less towards anti-C++ rants and how cool other languages are (hint, hint)... The problem with that is my to-do list, which nowadays, after work, is full of non-programming tasks... But anyway, let's move on with today's topic: simulating realistic reflections for wet asphalt.

MotoGP'08, wet track, (C) Capcom

I was really happy to see the preview of MotoGP'08. It's in some ways the sequel of the last game I did in Italy, SuperBike'07; it's based on the same base technology that my fellow colleagues in Milestone's R&D group and I developed. It was a huge amount of work: five people working on five platforms, writing almost everything from scratch (while the game itself was still based on the solid grounds of our oldgen ones, the 3D engine and the tools started from zero).

One of the effects I took care of was the wet road shading. I don't know about the technology of the actual shipped games, I can guess it's an improved version of my original work, but that's not really important for this post; what I want to describe is the creative process of approximating a physical effect...

Everything starts from the requirements. Unfortunately at that time we didn't have any formal process for that, we were not "agile", we were just putting in our best effort without much strategy. So all I got was a bunch of reference pictures, the games in our library to look at for other implementations of the same idea, and a lot of talking with our art director. Searching the web I found a couple of papers, one a little bit old but geared specifically towards driving simulations.

The basics of the shader were easy:
  • A wet road is a two layer material, the dry asphalt with a layer of water on top. We will simply alpha blend (lerp) between the two.
  • We want to have a variable water level, or wetness, on the track surface.
  • The water layer is mostly about specular reflection.
  • As we don't have ponds on race tracks, we could ignore the bending of light caused by the refraction (so we consider the IOR of the water to be the same as the air's one).
  • Water will reflect the scene lights using a Blinn BRDF.
  • Water will have the same normals as the underlying asphalt if the water layer is thin, but it will "fill" asphalt discontinuities if it's thick enough. That's easy if the asphalt has a normalmap: we simply interpolate it with the original geometry normal proportionally to the water level.
  • We need the reflection of the scene objects into the water.
I ended up using the "skids" texture map (and UV layout) to encode the wetness amount in one of its channels (skids are monochrome, they require only one channel). Actually our shader system was based on a "shader generator" tool where artists could flip various layers and options on and off in 3dsMax and generate a shader out of it, so the wetness map could be linked to any channel of any texture we were using...
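
A minimal sketch of the basic layering described above (this is not the shipped shader, and all names here are hypothetical; the planar reflection term developed in the rest of the post would be added to the water layer):

sampler2D DiffuseSampler;   // dry asphalt albedo
sampler2D NormalSampler;    // asphalt roughness normalmap, tangent space
sampler2D SkidsSampler;     // one channel reused to store the wetness amount
float3 LightDirTGS;         // light direction in tangent space
float3 ViewDirTGS;          // view direction in tangent space
float3 LightColor;
float SpecularPower;

float4 WetRoadPS(float2 uv : TEXCOORD0, float2 skidsUV : TEXCOORD1) : COLOR
{
    float wetness = tex2D(SkidsSampler, skidsUV).g;

    // Thin water follows the asphalt normals, thicker water "fills" them in,
    // tending towards the geometric normal (0,0,1 in tangent space).
    float3 asphaltN = normalize(tex2D(NormalSampler, uv).xyz * 2 - 1);
    float3 waterN = normalize(lerp(asphaltN, float3(0, 0, 1), wetness));

    // Dry layer: plain diffuse, for brevity.
    float3 albedo = tex2D(DiffuseSampler, uv).rgb;
    float3 dry = albedo * saturate(dot(asphaltN, LightDirTGS)) * LightColor;

    // Water layer: darkened asphalt plus a Blinn specular (plus the planar
    // reflection of the scene, omitted here).
    float3 h = normalize(LightDirTGS + ViewDirTGS);
    float3 wet = dry * 0.5 + LightColor * pow(saturate(dot(waterN, h)), SpecularPower);

    // Two-layer material: simple lerp between dry asphalt and the water on top.
    return float4(lerp(dry, wet, wetness), 1);
}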

Everything here seems straightforward and can be done with various levels of sophistication. For example, an idea that we discarded, as it was complicated to handle on the gameplay side, was to have the bikes dynamically interact with the water, drying the areas they passed over.

The problem comes when you try to implement the last point, the water reflections. Reflections from planar mirrors are very easy: you only have to render the scene transformed by the mirror's plane in a separate pass and you're done. A race track itself is not flat, but this is not a huge problem; it's almost impossible to notice the error if you handle the bikes correctly (mirroring them with a "local" plane located just under them; if you use the same plane for all of them, some reflections will appear to be detached from the contact point between the tires and the ground).

Easy, you can code that in no time, and it will look like a marble plane... The problem is that the asphalt, even when wet, still has a pretty rough surface, and thus it won't behave like a perfect mirror, it will be more like a broken one. Art direction asked for realistic reflections, so... surely not like that.

Let's stop thinking about hacks and think about what happens in the real world... Let's follow a ray of light that went from a light to an object, then to the asphalt and then to the eye/camera... backwards (under the framework of geometrical optics, which is what we use for computer graphics, you can always go backwards; for more details see the famous Ph.D. thesis by Eric Veach)!

So we start at the camera and go towards the track point we're considering; from there the ray went towards a point on a bike. In which direction? In general we can't know: any possible direction could make the connection, as long as it does not have a BRDF value of zero, otherwise that connection will have no effect on the shading of the track and thus we won't be able to see it. After bouncing in that direction, the ray travels for an unknown distance, reaches the bike, and from there it goes towards a light, for which we know the location.

Now, simulating all this is impossible: we have two things that we don't know, the reflection direction and the distance the light ray travelled between the track and the bike, and those can be computed only using raytracing...
Let's try now to fill the holes using some approximations that we can easily compute on a GPU.

First of all we need the direction. That's easy: if we consider our reflections to be perfectly specular, the BRDF will be a Dirac impulse, it will have only one direction for which it's non-zero, and that is the view ray (camera to track) reflected around the (track) normal.

The second thing that we don't know is the distance the ray travelled; we can't compute that, it would require raytracing. In general reflections would require that, so why are the planar mirror ones an exception? Because in that case the reflection rays are coherent, visibility can be computed for each point on the mirror using a projection matrix, and that's exactly what rasterization is able to do!
If we can render planar mirrors, we can also compute the distance of each reflected object from the reflection plane. In fact it's really easy! So we do have a measure of the distance, but not the one that we want: not the distance our reflected ray travels according to the rough asphalt normals, but the one it travels according to a smooth, marble-like surface. It's still something!

How to go from smooth and flat to rough? Well, the reflected vectors are not so distant: if we have the reflected point on a smooth mirror, we can reasonably think that the point the rough mirror would hit is more or less around the point the smooth mirror reflected. So the idea is simple: we just take the perfect reflection we have in the render-to-texture image, and instead of reading the "right" pixel we read a pixel around it, offset in a direction that is the same as the difference vector between the smooth reflection vector and the rough one. But that difference is the same as the one between the geometric normal and the normalmap one! Everything is going smoothly... We only need to know how far to go in that direction, but that's not a huge problem either: we can approximate it with the distance between the point we would have hit with a perfectly smooth mirror and the mirror itself. That distance is straightforward to compute when rendering the perfect reflection texture, or in a second pass, by resolving the z-buffer of the reflection render.

Let's code this:

// Store a copy of the POSITION register in another register (POSITION is not
// readable in the pixel shader, S.M. <>)
float2 perfectReflUV = (IN.CopyPos.xy / IN.CopyPos.w)*float2(0.5f,-0.5f) + 0.5f;

// Fetch from the screenspace reflection map, the approximation of the track to
// reflected object distance... It has to be normalized between zero and one.
float reflectionDistance = tex2D(REFLECTIONMAP, perfectReflUV).a;

// Compute a distortion approximaton by scaling by a constant factor the normalmap
// normal (expressed in tangent space)
float2 distortionApprox = normalMapNormalTGS.xy * DISTORTIONFACTOR;

// Fetch the final reflected object color...
float2 reflUV = perfectReflUV + distortionApprox * reflectionDistance;
float3 reflection = tex2D(REFLECTIONMAP, reflUV).rgb;

That actually works, but it will be very noisy, especially when animated. Why? Because the frequency of our UV distortion can be very high, as it depends on the track normalmap, and the track is nearly parallel to the view direction, so its texture mapping frequencies are easily very high (that's why anisotropic filtering is a must for racing games).

How do we fight high frequencies? Well, with supersampling! But that's expensive... Other ideas? Who said prefiltering? We could blur our distorted image... well, that's pretty much like blurring the reflection image... which is quite possible by generating some mipmaps for it! We know how much we are distorting the reads from that image, so we can choose our mipmap level based on that...
Ok, we're ready for the final version of the code now... I've also implemented another slight improvement: I read the distance from a pre-distorted UV... That will cause some reflections of near objects to leak into far ones (i.e. the sky), but the previous version had the opposite problem, which was in my opinion more noticeable... Enjoy!

// Store a copy of the POSITION register in another register (POSITION is not
// readable in the pixel shader, S.M. <>)
float2 perfectReflUV = (IN.CopyPos.xy / IN.CopyPos.w) * float2(0.5f,-0.5f) + 0.5f;

// Compute a distortion approximaton by scaling by a constant factor the normalmap
// normal (expressed in tangent space)... 0.5f is an estimate of the "right"
// reflectionDistance that we don't know (we should raymarch to find it...)
float2 distortionApprox = normalMapNormalTGS.xy * DISTORTIONFACTOR * 0.5f;

// Fetch from the screenspace reflection map, the approximation of the track to
// reflected object distance... It has to be normalized between zero and one.
float reflectionDistance = tex2D(REFLECTIONMAP, perfectReflUV + distortionApprox).a;
distortionApprox = normalMapNormalTGS.xy * DISTORTIONFACTOR * reflectionDistance;
// we could continue iterating to find an intersection, but we don't...

// Fetch the final reflected object color:

float2 reflUV = perfectReflUV + distortionApprox;
float4 reflUV_LOD = float4(reflUV, 0, REFLECTIONMAP_MIPMAP_LEVELS * reflectionDistance);
float3 reflection = tex2Dlod(REFLECTIONMAP, reflUV_LOD).rgb;

Last but not least, you'll notice that I haven't talked much about programmer-artist iteration, even if I'm kind of an "evangelist" of that. Why? It's simple: if you're asked to reproduce reality, then you know what you want; if you do that by approximating the real thing, you know which errors you're making, and there's hardly much to iterate on. Of course the final validation has to be given by the art direction, and of course they can say it looks like crap and that they prefer a hack over your nicely crafted, physically inspired code... But that did not happen, and in any case, a physically based effect usually requires far fewer parameters, and thus less tuning and iteration, than a hack-based one...

Update: continues here...
Update: some slight changes to the "final code"
Update: I didn't provide many details about my use of texture mipmaps as an approximation of various blur levels... That's of course wrong, and it may be very wrong if you have small emitting objects (i.e. headlights or traffic lights) in your reflection map. In that case you might want to cheat and render those objects with a halo (particles...) around them, to "blur" more without extra rendering costs, or do the right thing: use a 3D texture map instead of mipmap levels, blur each z-slice with a different kernel width, and maybe consider some form of HDR color encoding...

14 October, 2008

Normals without Normals

Long time no write! Just a small post, I'll publish some source code snippets for the Normals without Normals hack... More to come!

The main idea is that we can compute normals easily in a pixel shader using the ddx/ddy instructions... The problem with that technique is that we end up with the real face normals, not the interpolated ones that we need for Gouraud shading... To solve this problem we render the geometry in two passes: in the first pass we render the geometry normals to a texture, then we blur that texture, and access it in the standard forward rendering pass as a normalmap...

Note that the same ddx/ddy technique can be used to compute a tangent basis, which is especially useful if you don't have one, or don't have the vertex bandwidth for one... You can find the details of that technique in ShaderX 5 (Normal Mapping without Pre-Computed Tangents by Christian Schueler; the only catch is that there the tangent space is not reorthonormalized around the Gouraud-interpolated normal, but that's easy to do).

NormalBakeVS_Out NormalBakeVS(GeomVS_In In)
{
    NormalBakeVS_Out Out;
    Out.Pos = float4(In.UV * float2(2,-2) + float2(-1,1), 0, 1);
    Out.NormPos = mul(In.Pos, WorldViewM);
    return Out;
}

float4 NormalBakePS(NormalBakeVS_Out In) : COLOR
{
    float3 d1 = ddx(In.NormPos);
    float3 d2 = ddy(In.NormPos);
    float3 normal = normalize(cross(d1,d2)); // this normal is dp/du X dp/dv
    // NOTE: normal.z is always positive as we bake normals in view-space
    return float4(normal.xy * 0.5 + 0.5, 0, 1);
}


The model should have a suitable UV mapping. In order for this technique to work well, that mapping should respect the following properties (in order of importance...):

  • Two different points on the mesh should map to two different points in UV space (COMPULSORY!)
  • No discontinuities: the UV mapping should not be discontinuous on the mesh (note that if UVs are accessed with wrapping, the UV space is toroidal...)
  • No distortion: the shortest path between two points on the mesh should be the same as the distance in UV space, up to a multiplicative constant
  • Any point in UV space should map to a point on the mesh

Discontinuities are hard to avoid; if present, they can be made less obvious by passing to the normal baking a mesh that is extended across the discontinuities. For each edge in UV space, you can extrude that edge outwards (creating a polygon band around it, rendered only for baking) overlapping the existing mesh geometry, but with a mapping adjacent to the edge in UV space...

The "non full" UV space problem (last point) is addressed by discarding samples, during the blur phase, in areas that were not written by the mesh polygons. Another approach could be the use of pyramidal filters and "inpaiting" (see the work of Kraus and Strengert).

As ATI demonstrated with their subsurface scattering technique, it's possible to save some computation by discarding non-visible triangles in the render-to-texture pass using early-Z (see Applications of Explicit Early-Z Culling).

In the second rendering pass, we simply recover the normal stored in the render to texture surface, and that's it:

float4 GeomPS(GeomVS_Out In) : COLOR
{
    float2 samp = tex2D(BakeSampler, In.UV.xy).xy * 2 - 1;
    float3 normal_sharp = float3(samp, sqrt(1 - dot(samp,samp)));
    ...
}


Note: the main point is that there are a lot of different spaces we can express our computations in; often choosing the right one is the key to solving a problem, especially on the GPU, where we are limited by its computational model. Don't take my implementation too seriously, it's just an experiment around an idea. It's probably simpler to do the same thing in screen space, for example, devising a smart way to compute the blur kernel size, e.g. as a function of the projected triangle size (which can be estimated with the derivatives)...

14 August, 2008

Test-Driven-Development

  1. Test
  2. If it didn't compile, add some keywords
  3. Goto 1
This is what test driven development usually is in games. It's not that bad: we do (or should) prefer iteration and experimentation over any form of up-front design. Yes, I know that unit tests are very useful for refactoring, and thus simplify some sorts of iteration, but still, it's not enough.
This doesn't mean that automated testing is not important, quite the contrary: you should have plenty of scripts to automate the game and gather statistics. But unit tests are good only for some shared libraries; I don't think they will ever be successful in this field.

I'm going to leave for Italy, dunno if I'll have time to post other articles; there's stuff from SIGGRAPH that is worth posting, I have a nice code optimization tutorial to post, the "normals without normals" technique, plus a few other code snippets. Those things will probably have to wait until mid-September, when I'll be back from my holidays...

12 August, 2008

Ribbons are the new cubes!

Are you making a demo? Don't forget your splines, they are the cool thing now...
Lifeforce
Nematomorpha
The Seeker
Atrium
Scarecrow
Invoke

The "progressively appearing" geometry trick is also commonly used to draw them:
Route 1066
Falling down
Media error
Tactical battle loop

Cubes seems to be cool only if you instance a crazy number of them now:
Debris
Momentum

Another cool trend: 2d metaballs
Nucleophile
Incognito (near the end, this one features ribbons too)

But plain old spheres are not forgotten either
Kindercrasher

Plain old particle systems are out...

10 August, 2008

Small update

I've finished reading the Larrabee paper, linked on the Real-Time Rendering blog. Very nice, interesting in general even if you're not doing rendering... And it has a few very nice references too.

It seems that my old Pentium U/V-pipe cycle-counting abilities will be useful again... yeah!

I'm wondering how it can succeed commercially... It's so different from a GPU that it will require a custom rendering path in your application to be used properly; I wonder how many will do that, as nothing that you do on Larrabee is replicable on other GPUs... Maybe, if its price is in the range of standard GPUs and its speed with DirectX (or a similar API) is comparable... or if they manage to include it in a console. Anyway, it's exciting, and a little bit scary too. We'll see.

I've also found a nice, old article about Xenos (the 360 GPU) that could be an interesting read if you don't have access to the 360 SDK.

Warning: another anti-C++ rant follows (I've warned you, don't complain if you don't like what you read or if you find it boring...)

Last but not least, I've been watching a nice presentation by Stroustrup that he gave at the University of Waterloo, on C++0x. It's not new, but it's very interesting. It shows again how C++ is at an evolutionary end.

Key things you'll learn from it: the C++ design process is incredibly slow and constrained; C++ won't ever deprecate features, so it can only grow (even if Bjarne would like to do so, he says he was unable to convince the compiler vendors...), not change. That means that all the problems and restrictions imposed by C compatibility, and by straight errors in the first version of the language, won't be addressed. It also means that C++ is almost at its end, as it's already enormous and it can't shrink, and there is a limit to the number of things a programmer can know about any language. C++ is already so complicated that some university professors use its function overload resolution rules as "tricky" questions during exams...

You will also hear the word "performance" every minute or so. We can't do that because we care about performance, we are not stupid Windows programmers! Well, Bjarne, if going "low level" means caring about performance, then why aren't we all using assembly? Maybe because writing programs in assembly was so painful that it not only became impractical, but was also hampering performance, as it was hard enough to write a working program, let alone profile and optimize it... Try today to write a complete program in assembly that's faster than the same thing written in C (on a modern out-of-order processor I mean; of course on the C64 assembly is still a nice choice)... So the equation "higher level language == less performance" is very simple and, in my opinion, very wrong, and we have historical proof of that. C++ is dead, it's only the funeral that's long and painful (especially when incredilink takes five minutes to link our pretty-optimized-for-build-times solution).

I can give C++ a point for supporting all the design-wise optimizations pretty well (i.e. the mature optimizations, the ones you have to do early on, which are really the only ones that matter; for function-level optimizations you could well use assembly in a few places, if you have the time, which is more likely to happen in a language that does not waste all of it in the compile/link cycle), while other languages still don't allow some of them (e.g. it's hard to predict memory locality in C#, and thus to optimize a design to be cache efficient, and there's no easy way to write custom memory managers to overcome that either).

Still, C++ does not support them all, and that's why, when performance really matters, we use compiler specific extensions to C++, e.g. alignment/packing and vector data types... The Wikipedia C++0x page does not include the C99 restrict keyword as a feature of the language; I did not do any further research on that, I hope it's only a mistake in the article... Even the multithreading support they want to add seems to be pretty basic (even compared to existing and well supported extensions like OpenMP), which is quite disappointing for a language that's performance driven, even more so considering that you'll probably get a stable and widespread implementation of it ten years from now...

P.S. It's also nice to know that the standards committee prefers library functions to language extensions, and prefers building an extensible language over giving a specific functionality natively. Very nice! It would be an even nicer idea if C++ weren't one of the messiest languages to extend... Anyone who has had the privilege of seeing an error message from a std container should agree with me. And that is only the standard library, which was made together with the language; it's not even an effort by a third party to extend it... Boost is, and it's nice, and it's also clear proof that you have to be incredibly expert to make even a trivial extension, and kind of an expert to use and understand them after someone more expert than you has made one! Well, I'll stop there, otherwise I'll turn this "small update" post into another "C++ is bad" one...

07 August, 2008

Commenting on graphical shader systems

This is a comment on this neat post by Christer Ericson (so you're supposed to follow that link before reading this). I've posted that comment on my blog because it lets me elaborate more on that, and also because I think the subject is important enough...

So basically what Christer says is that graphical (i.e. graph/node based) shader authoring systems are bad. Shaders are performance critical and should be authored by programmers. Also, graphs make global shader changes way more difficult (i.e. "remove feature X from all the shaders"... now it's impossible, because each shader is a completely unrelated piece of code made with a graph).

He proposes an "ubershader" solution: a shader that has a lot of capabilities built in, which then gets automagically specialized into a number of trimmed-down ones by tools (that remove any unused stuff from a given material instance).
I think he is very right, and I will push it further…

It is true that shaders are performance critical: they are basically a tiny kernel in a huuuge loop; tiny optimizations make a big difference, especially if you manage to save registers!

The ubershader approach is nice; in my former company we pushed it further. I made a parser that generated a 3dsMax material plugin (script) for each (annotated) .fx file; some components in the UI were true parameters, others were changing #defines, and when the latter changed the shader had to be rebuilt. Everything was done directly in 3dsMax, and it worked really well.

To deal with incompatible switches, in my system I had shader annotations that could disable switches based on the status of other ones in the UI (and a lot of #error directives to be extra sure that the shader was not generated with mutually incompatible features). And it was really, really easy; it's not a huge tool to make and maintain. I supported #defines of "bool", "enum" and "float" type. The whole annotated .fx parser -> 3dsMax material GUI was something like 500 lines of MaxScript code.
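
As an illustration only (the annotation syntax and every name below are invented, not the actual system), an annotated ubershader fragment could look something like this, with the GUI generator turning the annotations into checkboxes, dropdowns and sliders:

// @ui bool "Use normal map"
#define USE_NORMALMAP 1
// @ui enum "Specular" { NONE = 0, BLINN = 1, ANISOTROPIC = 2 }
#define SPECULAR_MODEL 1
// @ui float "Specular power" (a true parameter: changing it needs no shader rebuild)
float SpecularPower;

// Guard against mutually incompatible switches at generation time
#if SPECULAR_MODEL == 2 && !USE_NORMALMAP
    #error "Anisotropic specular requires a tangent-space normal map"
#endif

sampler2D NormalSampler;

float3 GetNormal(float2 uv)
{
#if USE_NORMALMAP
    return normalize(tex2D(NormalSampler, uv).xyz * 2 - 1);
#else
    return float3(0, 0, 1); // geometric normal in tangent space
#endif
}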

We didn't have just one ubershader made in this way, but a few, because it doesn't make sense to add too many features to a single shader when you're trying to simulate two completely different material categories... But this is not enough! First of all, optimizing every path is still too hard. Moreover, you don't have control over the number of possible shaders in a scene.

Worse yet, you lose some information. Let's say that the artists are authoring everything well, caring about performance measures etc. In fact our internal artists were extremely good at this. But what if you wanted to change all the grass materials in your whole game to use another technique?

You could not, because the materials are generic selections of switches, with no semantics! You could remove something from all the shaders, but it's difficult to replace some materials with another implementation. You could add some semantic information to your materials, but you still have no guarantees on the selection of features the artists chose to express a given instance of the grass, so it becomes problematic.

That's why we intended to use that system only as a prototype, to let artists find the stuff they needed easily and then coalesce everything into a fixed set of shaders!
In my new company we are using a fixed set of shaders, generated by programmers, usually by including a few implementation files and setting some #defines; that is basically the very same idea, minus the early-on rapid-prototyping capabilities.

I want to remark that the coders-do-the-shaders approach is not good only because performance matters. IT IS GOOD EVEN FROM AN ART STANDPOINT. Artists and coders should COLLABORATE. They both have different views and different ideas; only together can they find really great solutions to rendering problems.

Last but not least, having black boxes to be connected encourages the use of a BRDF called "the-very-ignorant-pile-of-bad-hacks": an empirical BRDF made of a number of phong-ish lobes modulated by a number of fresnel-ish parameters that in the end produce a lot of computation and a huge number of parameters that drive artists crazy, and still can't be tuned to look really right...

The idea of having the coders write the code, wrap it in nice tools, and hand those tools to the artists is not only bad performance-wise; it's bad engineering-wise (most of the time you spend more resources making and maintaining those uber-tools than you would spend by having a dedicated S.E. working closely with artists on shaders), and it's bad art-wise (as connecting boxes has a very limited expressive power).

31 July, 2008

GPU versus CPU

Some days ago, a friend of mine at work asked me what was the big difference in the way GPUs and CPUs operate. Even if I went into a fairly deep description of the inner workings of GPUs in some older posts, I want to elaborate specifically on that question.

Let's start with a fundamental concept: latency, which is the time we have to wait, after submitting an instruction, for its results to be computed. If we have only one computational stage, then effectively the reciprocal of the latency is the number of instructions we can process per unit of time.

So we want latencies to be small, right? Well, it turns out that in recent years they have been growing instead! But our processors still seem to run faster than before. Why? Because they are good at hiding those latencies!
How? Simple: instead of having a single computational stage, you have more stages, a pipeline of workers. Then you can move an instruction being processed from one stage to the other (conceptually) like on a conveyor belt, and while one stage is processing it the other stages can accept more instructions. Any given instruction still has to go through the whole pipeline, but the rate of instruction processing can be much higher than the latency suggests; that rate is called throughput.

Why did we like those kinds of designs? Well, in the era of the gigahertz wars (which has now largely scaled back), it was an easy way of reaching higher frequencies. If a single instruction is split into a number of tiny steps, then each of them can be simpler, requiring less work, thus enabling designers to use higher frequencies, as each small step takes less time.

Unfortunately, if something stalls this pipeline, if we can't fetch more instructions to keep it always full, then our theoretical performance can't be reached, and our code will run slower than on less deeply pipelined architectures.
The causes of those stalls are various. We could have a "branch misprediction": we thought some work was needed, but we were wrong, and we started processing instructions that are not useful. Or we might not be able to find instructions to process that do not depend on the results of the ones currently being processed. The worst example of this latter kind of stall is on memory accesses. Memory is slow, and it's evolving at a slower pace than processors too, so the gap is becoming bigger and bigger (there wasn't any twenty years ago; on the Commodore 64, for example, the processor did not need caches either).

If one instruction is a memory fetch, and we can't find any instruction to process after it that does not depend on that memory fetch, we are stalled. Badly. That's why hyper-threading and similar architectures exist. That's why memory does matter, and why cache-friendly code is important.

CPUs became better and better at this job of optimizing their pipelines. Their architectures and decoding stages (taking instructions, decomposing them into steps, scheduling them in the pipeline and rearranging them; that's called out-of-order instruction execution) are so complicated that it's virtually impossible to predict the behaviour of our code at a cycle level. Strangely, transistor counts did evolve according to Moore's law, but we did not use those transistors to get more raw power; we mostly used them to build more refined iterations of those pipelines and of the logic that controls them.


Most people say that GPU computational power is evolving at a faster pace than Moore's law predicted. That is not true, as that law did not account for frequency improvements (i.e. thinner chip dies), so it's not about computational power at all! The fact that CPU computational power did respect that law means that we were wasting those extra transistors; in other words, those transistors did not linearly increase the power.


Why are GPUs different? Well, let me give a little code example. Let's say we want to compute this:


for i=0 to intArray.length do boolArray[i] = (intArray[i] * 10 + 10) > 0


GPUs will actually refactor the computation to be more like the following (plus a lot of unrolling...):


for i=0 to intArray.length do tempArray[i] = intArray[i]
for i=0 to intArray.length do tempArray[i] = tempArray[i] * 10
for i=0 to intArray.length do tempArray[i] = tempArray[i] + 10
for i=0 to intArray.length do boolArray[i] = tempArray[i] > 0


(this example would be much easier in functional pseudocode than in imperative one, but anyway...)

Odd! Why are we doing this? Basically, what we want to do is to hide latency in width, instead of in depth! Having to perform the same operation on a huge number of items, we are sure that we always have enough to do to hide latencies, without much effort. And it's quite straightforward to turn transistors into computational power too: we simply have more width, more computational units working in parallel on the tempArray! In fact, that kind of operation, a "parallel for", is a very useful primitive to have in your multithreading library... :)

Many GPUs work exactly like that. The only big difference is that the "tempArray" is implemented in GPU registers, so it has a fixed size, and thus work has to be subdivided in smaller pieces.

There are some caveats.
The first one is that if we need more than one temp register to execute our operation (because our computation is not as simple as the one in my example!) then our register array will contain fewer independently operating threads (because each one requires a given amount of space), and so we will have less latency hiding. That's why the number of registers we use in a shader is more important than the number of instructions (which we can now clearly see as passes!) that our shader needs to perform!
Second, this kind of computation is inherently SIMD; even if GPUs do support different execution paths on the same data (i.e. branches), those are still limited in a number of ways.
Another caveat is that our computations have to be independent; there's no communication between processing threads, so we can't compute operations like:

for i=0 to boolArray.length do result = result LOGICAL_OR boolArray[i]

That one is called, in the stream processing lingo, a gather operation (or, if you're familiar with functional programming, a reduce or fold), the inverse of which is called a scatter operation. Luckily for the GPGPU community, a workaround to do those kinds of computations on the GPU exists: map the data to be processed into a texture/rendertarget, use the register threads to process multiple pixels in parallel, and use texture reads, which can be arbitrary, to gather data. Scatter is still very hard, and there are limitations on the number of texture reads too; for example, that code will usually be processed by doing multiple reductions, from a boolArray of size N to one of size N/2 (N/4 really, as textures are two-dimensional) until reaching the final result... but that's too far away from the original question...
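
For example, one step of such a reduction could look like the following sketch (hypothetical names; each pass renders into a target with half the resolution of the previous one, using point sampling):

sampler2D PrevLevel;     // previous reduction level, storing 0/1 values
float2 PrevTexelSize;    // 1.0 / resolution of the previous level

float4 ReduceOrPS(float2 uv : TEXCOORD0) : COLOR
{
    // Each output texel covers a 2x2 block of the previous level: fetch its four texels
    float4 a = tex2D(PrevLevel, uv + PrevTexelSize * float2(-0.5, -0.5));
    float4 b = tex2D(PrevLevel, uv + PrevTexelSize * float2( 0.5, -0.5));
    float4 c = tex2D(PrevLevel, uv + PrevTexelSize * float2(-0.5,  0.5));
    float4 d = tex2D(PrevLevel, uv + PrevTexelSize * float2( 0.5,  0.5));
    // Logical OR on 0/1 data: any non-zero sample makes the result 1
    return saturate(a + b + c + d);
}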

Are those two worlds going to meet? Probably. CPUs already do not have a single pipeline, so they're not all about depth. Plus, both CPUs and GPUs have SIMD data types and operations. And now multicore is the current trend, and we will see more and more cores, which will be simpler and simpler (i.e. the IBM Cell or the Intel Larrabee). On the other hand, GPUs are becoming more refined in their scheduling abilities; i.e. the Xbox 360 one does not only hide latency in depth, it can also choose which instructions from which shader to schedule in order to further hide memory latencies across multiple passes (basically implementing fibers)... NVidia's G80 has computational units with independent memory storage...

Still, I think that GPU processing is inherently more parallel than CPU processing, so a specialized unit will always be nice to have: we are solving a very specific problem, we have a small computational kernel to apply to huge amounts of data... On the other hand, pushing the stream computing paradigm too hard onto the CPU is not too useful, as there are problems that do not map well onto it, because they don't work on huge amounts of data, nor do they perform uniform operations...

30 July, 2008

Celebration of light

It's summer, and as always in Vancouver there's a firework competition that lasts a week. It's called celebration of light.

I was there, with my friends, looking at the show. After a while I said to Fabio: "I wonder if they have some kind of software for prototyping that or if they spend a lot of money...", "I was thinking about writing that from the very beginning of the show" he interrupted me.

Geeks.

28 July, 2008

Commando

We are near a milestone of our project, I don't have any serious issue to fix, everything is going fine...

But one day I still had to enter "commando" mode (see resign patterns if you don't know about them yet, they are important). There was a bunch of code, done by someone else, and never tested, as we were lacking the art assets to do so. The assets were finally delivered, the game programmers tried to enable that rendering feature, it failed, and the problem fell back on me.

Now as, specifically, the game programmer that assigned the bug to me is a friend of mine too, I wanted to solve it as fast as possible, so he could stop slacking and return to coding asap. The problem was that I did not know the code, well, the entire subsystem really; it's a new technology that we're just integrating. I will design and probably implement the "correct" version of that functionality, but for now we just wanted to hack something together, to evaluate GPU performance and let artists have something to work on.

Luckily I had the privilege of working, in my previous company, with a guy, Paolo Milani, who could handle those kinds of situations perfectly. He is a powerful weapon (in the wrong hands), he can do almost anything (being good all round, at coding, hacking, designing, maths etc.), but he was mostly used, due to lack of money, time, and too much ignorance, to do in a couple of hours the work of weeks. That of course resulted in code that no other human could ever understand, but still, sometimes those skills are helpful.

How could you tell he was entering commando mode? Simple:
  • The mouse wheel accelerated up to 10,000 rpm.
  • The GPU fans started spinning because of the heat generated by Visual Studio text rendering.
  • With the other hand, while scrolling furiously, code was being added "on the fly" all over the solution.
  • You could see the number of compile errors decrease in realtime, until reaching zero, which marks the end of an iteration.
  • Looking at the Xbox 360 monitor, you could see over minutes the game taking shape... First flat shaded bounding boxes, then the bikes, the track, diffuse textures, animations, sound...

I'm not that good. I've never seen anyone else that good. Still, this morning, half asleep in bed, I was thinking about our (overly complex) math library, SIMD intrinsics, the wrapper we have for portability, the vector classes... then I turned over, hugged my girlfriend's head, and for a split second I caught myself wondering whether that head inherited from our base vector class, where the data was, if it was properly aligned...

Vertex shader LOD for Pixel shaders

I already blogged a couple of times about LODs for pixel shaders, so this is a quick update on the subject. Very quick, I'll say everything in one sentence, so be prepared and don't miss it:

Having geometrical LODs (fewer vertices/polygons) also has a (not small) impact on pixel shader performance, as GPUs always process pixels in 2x2 quads, and so partial quads of a rasterized polygon waste pixel shader resources (as "unused" pixels in the quad will be processed and then discarded).

24 July, 2008

Quick shader tip

Don't use the const keyword. It's broken in some compilers (i.e. it leads to bad code) and it's not helping at all in optimizing the shader. Const is only helpful for the programmer, not for the compiler anyway (this is also true for C++). The compiler is smart enough to find what is really const and what is not, as it has access to the whole source code (no linking). The only exception of course is the global variables, which, being uniform parameters by default, are always assumed non-const even if the shader does not change them.

22 July, 2008

Kill the hype

Since the infamous Carmack interview on PC Perspective, (some of) the realtime rendering world has been rediscovering voxels (as point based rendering is something that we weren't doing yet anyway).

No one tells us why. Why should having less information (about topology) be better than having more? Well, if you have so much data that it can't fit in memory, I can easily see the advantage, but that doesn't seem to be our problem in most cases as of now.

And weren't we all excited about DX10 geometry shaders exactly because we could have access to that kind of data?

I simply hate the hype. I hope that soon someone (more influential than me) says in an interview how cool Nurbs are, so we will be the two opposite ends of the hype, fully parametric surfaces versus raw 3d points.

The other (and related) hype is about raytracing techniques. I consider most of the realtime raytracing research to be dangerous for raytracing itself. Why do we love raytracing? Because it allows us to answer random visibility queries. Why do we love being able to do that? Because it enables us to use more refined methods of integrating the rendering equation. Faster ones, more adaptive, if you want. Methods that became popular in the non-realtime world only a few moments ago...

Realtime raytracing research is mostly focused in the opposite direction: restricting the queries to coherent ones, thus also restricting the effects we can simulate to the ones that rasterization already does so well.

It seems that the only thing you gain is the ability to render more accurate specular reflections, very, very, very slowly. Very useful, indeed, it's exactly the thing that artists ask me to bring them every day...

P.S. That was unfair; in fact, just the ability to compute per-pixel shadows in a robust way, without having to mess with shadow map parametrizations etc., is a very nice feature. But it's not enough.

18 July, 2008

ShaderX 6

Just finished reading it (very lazily, I'm also reading Geometric Algebra for Computer Science, which looks promising).
My picks from it:
  • Rendering Filtered Shadow with Exponential Shadow Maps (that you should already know...)
  • Interactive Global Illumination with Precomputed Radiance Maps (very nice extension to lightmaps...)
Also very intresting:
  • Stable Rendering of Cascaded Shadow Maps (a lot of nasty, useful details)
  • Practical Methods for a PRT-Based Shader Using Spherical Harmonics (even more nasty details)
  • Care and Feeding of Normal Vectors (that, again, you should already know...)
  • Computing Per-Pixel Object Thickness in a Single Render Pass (easy and nice)
  • Deferred Rendering Using a Stencil Routed K-Buffer
  • A Flexible Material System in Design

17 July, 2008

Normals without normals preview


It's 0:55, which is kinda late for me, as tomorrow I have to work and I never sleep less than 8-9 hours per night. That's also why you won't see me at work before 10:30.

Anyway, I've finished the very first sketch in FX Composer of a nice-ish idea I had to compute smooth normals on a surface when you don't have them (HINT: that might happen because you're displacing the surface with a non-differentiable function, for example... which might happen if you're computing that function using numerical methods...)

You probably won't be able to see anything interesting in the attached screenshot, but as I said, it's late, so you have to believe me: it's kinda cool, and I think it could have various uses... If it turns out to be a good solution, I'll publish the details :)

14 July, 2008

Hue shifting code snippet (with trivia)

Recently, to add variety to the instances of a crowd system, I experimented with cheap methods to do hue shifting (as the pixel shader is very cheap at the moment and has a few ALU cycles to spare, hidden by the color texture access latency)... After 3 mostly-failed attempts I ended up with the following (actually, it's a test I did over the weekend; I'm not 100% sure it's error-free as I didn't test it much yet... LOL!):

float2 sc;
sincos(IN.random_recolor, sc.x, sc.y);
sc.y = 1.f - sc.y;
sc /= float2(sqrt(3.f), 3.f);
float3 xVec = float3(1.f, sc.xx * float2(-1,1)) + (sc.yyy * float3(-2,1,1));
float3x3 recolorMatrix = float3x3(xVec.xyz, xVec.zxy, xVec.yzx);
float3 recolored = mul(tex2D(colorTexture, IN.UV).rgb, recolorMatrix);

Have you figured out what it does? Try: even if the code is kinda cryptic, you should be able to understand the underlying idea... I've changed the names of the variables to make it less obvious and protect the innocent (in my real code, I don't waste an interpolator only for the recoloring random float, for example)... (hint: start from the bottom...)

Done? Well, if your guess was along the lines of "a rotation around the positive 45° diagonal axis in RGB space" then you're right! Mhm, if it was not, then either you're wrong, or I made a mistake in the code :)

Bonus question: what kind of errors does it make (compared to a real hue-shift, i.e. the one that Photoshop implements)? Hints follow... What kind of errors could it make? Being a hue shift, it's wrong if it changes the other two components of the HSL space instead of only the hue (we could argue that it's an error even if it's non-linear in the hue, but as we want to encode a random shift we don't care much, even if a strong non-linearity fed with a uniform random variable leads to a preference towards some colors). So we have to see if it changes the saturation or the luminosity... Which of the two is more likely to be a problem? Which of the two does that code get more wrong?

Second bonus question: how many ALU cycles does that technique take?

13 July, 2008

Slowdown

The blog will slow down a little bit this month, I'm rather busy with photography, two small code projects (shaders), and other personal matters. But there's plenty to read: I posted some links to interesting RSS feeds in the past, and if you haven't read all the old posts, this is the perfect occasion to do that...

09 July, 2008

Note to myself

I've spent 1.5 days finding out that this line:
return &GetPrevReadBuffer(mWritePose)[mWritePose];
should have been this one instead:
return &GetPrevReadBuffer(mWritePose)[mWritePose * NUMVECTOR4PERPOSEMATRIX];

And I previously had other problems with that too, basically I'm using a double-buffered (and interleaved... in a complicated way) three-dimensional vector4 array to store animation data...

Note to myself: DON'T ever do that again, DON'T use pointer arithmetic, ever. Wrap arrays in classes. I didn't do that because (to make things more complicated) I have three different representations of that data: the "simulation" side sees them as scale-quaternion-translation classes, the replay sees them as compressed versions of the same, and the rendering expands the compressed versions into affine matrices...

Now I have to go back to debugging, because there's still a problem in the interleaving and interpolation code that, even though I've added debug asserts and test cases everywhere, is still hiding somewhere. AAAAAAAAAAAAAAAAAAAAA!!!!

p.s. Direct access into arrays is bad from a performance standpoint too. If you wrap your arrays with setters and getters, then it's easier to change the in-memory layout of your elements later to optimize for cache misses... There are many cases where good code design also helps performance, not directly but by making changes after profiling easier!
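To make the point concrete, here's a minimal sketch of the kind of wrapper I mean (names and layout are made up for illustration, not my actual animation code): the index arithmetic lives in exactly one place, asserted once, and the interleaving can change later without touching any caller.

#include <cassert>
#include <cstddef>
#include <vector>

struct Vector4 { float x, y, z, w; };

// PoseBuffer owns the "poseCount x vec4PerPose" layout, so the stride math
// (and its asserts) live in a single accessor instead of at every call site.
class PoseBuffer
{
public:
    PoseBuffer(size_t poseCount, size_t vec4PerPose)
        : mVec4PerPose(vec4PerPose), mData(poseCount * vec4PerPose) {}

    Vector4& At(size_t pose, size_t element)
    {
        assert(element < mVec4PerPose);
        return mData[pose * mVec4PerPose + element];
    }

private:
    size_t mVec4PerPose;
    std::vector<Vector4> mData;
};

// If I later decide to swap the interleaving (say, element-major instead of
// pose-major) to reduce cache misses, only At() changes, not the callers.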

29 June, 2008

Fermat's principle

Searching for topics
I've noticed lately that four major topics have found their way through my blog posts. That wasn't something I planned, it just happened, like a person going through different periods in their music-listening habits. Those are:
  • C++ is premature optimization
  • Shader programming is physical modelling (and artists are fitting algorithms)
  • Iteration time is the single most important coding metric
  • Huge, data-driven frameworks are bad
We explore things by moving our interests more or less randomly, until we find something that deserves a stop, a local minimum in the space of things, where we spend some time and then eventually escape, starting another random walk in search of another minimum.

And while that's a good, smart, simple way to search (see Metropolis random walks, which lead naturally to simulated annealing, which in turn is a very simple implementation of the tabu search ideas... a nice application of Metropolis-Hastings Monte Carlo methods is this one...), it's probably not a good way to write articles, as the result is non-uniform, and I've found that the information I think is important ends up scattered among different posts.

As I'm not happy with the way I explained some concepts, I tend to increase the redundancy of the posts: they become longer, some ideas get repeated several times from slightly different perspectives, in the hope that the real point I wanted to make eventually gets through.
That also helps me get a clearer view of those ideas. I don't pretend to write this blog as a collection of articles for others to read; I write about the things I'm interested in because writing helps me first and foremost, and if someone else finds it interesting, that's just an added bonus.

Be water my friend
One of the things I still don't feel I've clearly expressed is my view of data-driven designs.
Let's look at the last two items of my recurring-topics list: "Iteration time is the single most important coding metric" and "Huge, data-driven frameworks are bad". The problem here is that most of the time, huge, parametric frameworks are made exactly to cut iteration times. You have probably seen them: big code bases, with complex GUI tools that let you create your game AI / Rendering / Animation / Shading / Sounds / whatever by connecting components in a graph or tree, mhm, usually involving quite a bit of XML, usually with some finite state machine and/or some badly written, minimal scripting language too (because no matter what, connecting components turns out not to be enough).

How can they be bad, if they are fulfilling my most important coding metric? There's a contradiction, isn't there?

Yes, and no. The key point lies in observing how those frameworks are made. They usually don't grow out of generalizations made over an existing codebase. They are not providing common services that your code will use; they are driving the way you code instead. They fix a data format, and force your code to be built around it. To fix a data format, you have to make assumptions about what you will need. Assumptions about the future. Those always fail, so sooner or later someone will need to do something that is not easily represented by the model you imposed.
And it's at that point that things go insane. Coders do their work, no matter what, using something close to Fermat's principle (the basic principle our rendering engineers' interpretation of light is built on). They try to solve problems following paths of minimal design change (pain minimization, as Yegge would call it). And not because they are lazy (or because the deadlines are too tight, or not only because of that anyway), but most of the time because we (questionably) prefer uniformity to optimal solutions (that's also why a given programming language usually leads to a given programming style...).
So things evolve in the shape that the system imposes on them; requirements change, we change our solutions to still fit that shape, until our solutions are so different from the initial design that the code looks like a twisted, bloated, slow pile of crap. At that point a new framework is built, in the best case. In the worst one, more engineers are employed to manage all that crap, usually producing more crap (because they can't do any better: it's the design that is rotten, not only the code!).

A common way to inject flexibility into a rotten, overdesigned framework is to employ callbacks (i.e. prerendercallback, postrendercallback, preupdate, postendframe, etc., etc...) to let users add their own code into the inner workings of the system itself. That creates monsters with sub-shapes, built on and hanging from a main shape, something that even Spore is not able to manage.

What is the bottom line? The more general a library is, the more shapeless it has to be. It should provide common services, not shape future development. That's why, for example, when I talk about data-driven design and fast iteration, most of the time I also talk about the virtues of reflection and serialization. Those are common services that can be abstracted and that should find a place in our framework, as it's a very safe assumption that our solutions will always have parameters to be managed...
Rigid shapes should be given to our code as late as possible.

Simple example
I still see many rendering engines built on scenegraphs, and worse, using the same graph for rendering and for coordinate-frame updates (hierarchical animations). Why? Probably because a lot of books show that way of building a 3D engine, or because it maps so easily to a (bloated) OOP design that could easily be an exercise in a C++ textbook.
Hierarchical animations are not that common anyway; they should not be a first-class item in our framework, that is an unreasonable assumption. They should be one of the possible coordinate-frame updating subsystems, and they should live in the code space that is as near as possible to user rendering code, not rooted in the structure of the system. Heck, who says that we need a single coordinate frame per object anyway? Instancing in such designs is done with custom objects in the graph that hide their instancing coordinates, making everything a big black box, which becomes even worse if you made the assumption that each object has to have a bounding volume. Then the instancing object will have a single volume encompassing all the instances, but that's suboptimal for the culler, so you have to customize it to handle that case, or write a second culler inside the instanced object, or split the instances into groups... And you can easily start to see how things are starting to go wrong...
Hierarchical animations are not so common anyway, they should not be a first-class item in our framework, that is an unreasonable assumption. They should be one of the possible coordinate frame updating subsystems, they should live in the code space that is as near as possible to user rendering code, not rooted in the structure of the system. Heck, who says that we need a single coordinate frame per object anyway? Instacing in such designs is made with custom objects in the graph that hide their instancing coordinates, making everything a big black box, that becomes even worse if you made the assumption that each object has to have a bounding volume. Then instancing objects will have a single volume encompassing all the instances, but that's suboptimal for the culler so you have to customize it to handle that case, or write a second culler in the instaced object, or split the instances in groups... And you can easily start to see how easily things are starting to go wrong...