
07 August, 2017

Tiled hardware (speculations)

Over at Siggraph I had a discussion with some mobile GPU engineers about the pros and cons of tiled deferred rasterization. I prompted some discussion over Twitter (and privately) as well, and this is how I understand the matter of tiled versus immediate/"forward" hardware rasterization so far...

Thanks for all who participated in said discussions!

Disclaimer! I know (almost) nothing of hardware and electronics, so everything that follows is likely to be wrong. I write this just so that people who really know hardware can laugh at the naivety of mere mortals...



Tile-Based Deferred Rendering.

My understanding of TBDR is that it works by dividing all the incoming dispatches aimed at a given rendertarget into screen-space tiles.

For this to happen, you'll at the very least have to invoke, for all dispatches, the part of the vertex shader that computes the output screen position of the triangles, figure out which tile(s) a given triangle belongs to, and memorize the vertex positions and indices in a per-tile storage.
Note: Considering the number of triangles nowadays in games, the per-tile storage has to be in main memory, and not on an on-chip cache. In fact, as you can't predict up-front how much memory you'll need, you will have to allocate generously and then have some way to generate interrupts in case you end up needing even more memory...

Indices would be rather coherent, so I imagine that they are stored with some sort of compression (probably patented) and also I imagine that you would want to try to already start rejecting invisible triangles as they enter the various tiles (e.g. backface culling).

Then visibility of all triangles per tile can be figured out (by sorting and testing against the z-buffer), the remaining portion of the vertex shader can be executed and pixel shaders can be invoked.
Note: while this separation of vertex shading in two passes makes sense, I am not sure that the current architectures do not just emit all vertex outputs in a per-tile buffer in a single pass instead.

From here on you have the same pipeline as a forward renderer, but with perfect culling (no overdraw, other than, possibly, helper threads in a quad - and we really should have quad-merging rasters everywhere, don't care about ddx/ddy rules!).
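To make this flow a bit more concrete, here is a toy software model of the two passes described above, written as plain Python. Everything in it (the 32-pixel tile size, the data layout, sorting as a stand-in for hidden-surface removal) is a made-up simplification for illustration, not how any real GPU organizes its parameter buffer:

```python
# Toy model of TBDR-style binning: pass 1 runs only the "position part" of the
# vertex work and bins triangles into the tiles they touch; pass 2 works tile
# by tile, resolving visibility before shading.

TILE = 32  # hypothetical tile size, in pixels

def bin_triangles(triangles, width, height):
    """Pass 1: bin each triangle into every tile its screen-space bounding box touches."""
    tiles = {}  # (tile_x, tile_y) -> list of triangle indices
    for idx, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        x0 = max(0, int(min(xs))) // TILE
        x1 = min(width - 1, int(max(xs))) // TILE
        y0 = max(0, int(min(ys))) // TILE
        y1 = min(height - 1, int(max(ys))) // TILE
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                tiles.setdefault((tx, ty), []).append(idx)
    return tiles

def process_tile(tri_indices, triangles):
    """Pass 2, per tile: stand-in for hidden-surface removal; real hardware would
    z-test against an on-chip depth buffer and shade only the visible fragments."""
    return sorted(tri_indices, key=lambda i: min(v[2] for v in triangles[i]))

# Each triangle is three (x, y, z) post-transform, screen-space vertices.
tris = [[(5, 5, 1.0), (100, 5, 1.0), (5, 100, 1.0)],
        [(10, 10, 0.5), (40, 10, 0.5), (10, 40, 0.5)]]
for tile, tri_list in sorted(bin_triangles(tris, 128, 128).items()):
    print("tile", tile, "front-to-back:", process_tile(tri_list, tris))
```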

Vertex and pixel work overlap does not happen on single dispatches, but across different output buffers, so balancing is different than an immediate-mode renderer.
Fabian Giesen also noted that wave sizes and scheduling might differ, because it can be hard to fill large waves with fragments in a tile: you might have only a few pixels that touch a given tile with a given shader/state, and thus more partial waves wasting time (not energy).

Pros.

Let's start with the benefits. Clearly the idea behind all this is to have perfect culling in hardware, avoiding wasting (texture and target) bandwidth on invisible samples. As accessing memory takes a lot of power (moving things around is costly), by culling so aggressively you save energy.

The other benefit is that all your rendertargets can be stored in a small per-tile on-chip memory, which can be made to be extremely fast and low-latency.
This is extremely interesting, because you can see this memory as effectively a scratch buffer for multi-pass rendering techniques, allowing you for example to implement deferred shading without feeling too guilty about the bandwidth costs.

Also, as the hardware always splits things in tiles, you have strong guarantees about which areas of the screen a pixel-shader wave could access, thus allowing certain vector (wave-wide) operations to be turned into scalar ones if things are constant in a given tile (which would be very useful, for example, for "forward+" methods).

As the tile memory is quite fast, programmable blending becomes feasible as well.

Lastly, once the tile memory that holds triangle data is primed, in theory one could execute multiple shaders recycling the same vertex data, allowing further ways to split computation between passes.

Cons.

So why do we still have immediate-mode hardware out there? Well, the (probably wrong) way I see this is that TBDR is really "just" a hardware solution to zero overdraw, so it's amenable to the same trade-offs one always has when thinking of what should be done in hardware and what should be programmable.

You have to dedicate a bunch of hardware, and thus area, for this functionality. Area that could be used for something else, more raw computational units.
Note though that even if immediate renderers do not need the sophistication of tiling and sorting, they still need space for rendertarget compression, which is less needed on deferred hardware.

Immediate-mode rasterizers do not necessarily have to overdraw. If we do a full depth prepass, for example, then the early-z test should cull away all invisible geometry exactly like TBDR.
We could even predicate the geometry pass after the prepass using the visibility data obtained with it, for example using hardware visibility queries or a compute shader. We could even go down to per-triangle culling granularity!

Also, if one looks at the bandwidth needed for the two solutions, it's not clear where the tipping point is. In both cases one has to go through all the vertex data, but in one case we emit triangle data per tile, in the other we write a compressed z-buffer/early-z-buffer.
Clearly as triangles get denser and denser, there is a point where using the z-buffer will result in less bandwidth use!
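Just to illustrate the idea of a tipping point, here is a back-of-the-envelope comparison with entirely made-up numbers (bytes per binned triangle, average tiles touched per triangle, depth compression ratio are all assumptions, not measurements):

```python
# Rough bandwidth comparison: emitting binned triangle data (TBDR parameter
# buffer) versus writing a compressed depth buffer in a z-prepass.

def tbdr_bytes(num_tris, bytes_per_binned_tri=16, avg_tiles_per_tri=1.5):
    # each triangle is written to the per-tile storage once per tile it touches
    return num_tris * bytes_per_binned_tri * avg_tiles_per_tri

def zprepass_bytes(width, height, bytes_per_sample=4, compression=0.5):
    # depth is written once per pixel, with some lossless compression ratio
    return width * height * bytes_per_sample * compression

for tris in (100_000, 1_000_000, 10_000_000):
    ratio = tbdr_bytes(tris) / zprepass_bytes(1920, 1080)
    print(f"{tris:>10} triangles -> TBDR/z-prepass bandwidth ratio: {ratio:.2f}")
```

With these (arbitrary) constants the binning approach wins below a few hundred thousand triangles and loses badly at ten million, which is the intuition above in number form.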

Moreover, as this is a software implementation, we could always decide on different trade-offs and avoid doing a full depth pass, instead heuristically selecting a few occluders, reprojecting previous-frame Z, and so on.

Lastly I imagine that there are some trade-offs between area, power and wall-time.
If you care about optimizing for power and are not limited much by the chip area, then building in the chip some smarts to avoid accessing memory looks very interesting.
If you only care about doing things as fast as possible then you might want to dedicate all the area to processing power and even if you waste some bandwidth that might be ok if you are good at latency hiding...
Of course that wasted bandwidth will cost power (and heat), but you might not see the performance implications if you had other work for your compute units to do while waiting for memory.

Conclusions.

I don't quite know enough about this to say anything too intelligent. I guess that as we're seeing tiled hardware in mobile but not in the high-end, and vice versa, tiled might excel at saving power but not at pure wall-clock performance versus simpler architectures that use all the area for computational units and latency hiding.

Round-tripping geometry to main RAM seems to be outrageously wasteful, but if you want perfect culling you have to compare with a full-z prepass which reads geometry data twice, and things start looking a bit more even. 

Moreover, even with immediate rendering, it's not that you can really pump a lot of vertex attributes and not suffer, these want to stay on chip (and are sometimes even redistributed in tile-like patterns) so practically you are quite limited before you start stalling your pixel-shaders because you're running out of parameter space...

Amplification via tessellation or instancing, though, can save lots of data for an immediate renderer, and the second pass, as noted before, can be quite aggressively culled, letting an immediate renderer balance in software how much it wants to pay for culling quality, so doing the math is not easy at all.

The truth is that for almost any rendering algorithm and rendering hardware, there are ways to reach great utilization, and I doubt that if one looked at that the two architectures were very far apart when fed appropriate workloads. 
Often it's not a matter of what can be done, as there are always ways to make things work, but of how easy it is to achieve a given result.

And in the end it might even be that things are the way they are because of the expertise and legacy designs of the companies involved, rather than objective data. Or that things are hard to change due to myriads of patents, or likely a bit of all these reasons...

But it's interesting to think of how TBDR could change the software side of the equation. Perfect culling and per-tile fast memory would allow some cute tricks, especially in a console where we could have full exposure of the underlying hardware... Could be fun.

What do you think?

Post Scriptum.

Many are mentioning NVidia's tiled solution, and AMD has something similar as well now. I didn't talk about these because they seem to be in the end "just" another way to save rendertarget bandwidth.
I don't know if they even help with culling (I think not for NVidia, while AMD mentions they can do pixel-shading after a whole tile batch has been processed), but they certainly don't allow splitting rendering passes more efficiently via an on-chip scratch, which to me (on the software side of things...) is the most interesting delta of TBDR.

Of course you could argue that tiles-as-a-cache instead of tiles-as-a-scratch might still save enough bandwidth, and latency-hide the rest, that in practice it allows doing deferred for "free". Hard to say, and in general blending units always had some degree of caching...

Lastly, with these hybrid rasters, if they clip triangles/waves at tile boundaries (if), one could still in theory get some improvements in e.g. F+ methods, but it's questionable because the tile sizes used seem too big to allow for the light/attribute screenspace structures of a F+ renderer to match the hardware tile size.

21 July, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (slides)

Made some slides for a lunch'n'learn following the content of my previous blog posts on machine learning and deep neural networks.


Unfortunately, I had to censor practically all of the last part, which was talking about some applications we haven't yet published or that might have details we didn't publish (and yes, I could have just taken them out, but it was fun to paint black splotches instead).

Also, when I do these things I tend to spice things up by trying new tools and styles (often the same for one-man programming projects). This time I made the slides (in one day) using Paper on an iPad Pro, was quite fun!

Enjoy!

20 May, 2017

The technical interview.

I like interviewing people. I like going to interview. This is how I do it.

1 - Have your objectives clear.

There is no consensus on what an interview is for. Companies work in different ways, teams have different cultures, we have to accept this, there is no "one true way" of making software, of managing creative teams and of building them.

Some companies hire very slowly, very senior staff. Others hire more junior positions more often. Some companies are highly technical, others care more about creativity and the ability for everyone to contribute to the product design. Some prefer hyper-specialized experts, others want to have only generalists and so on.

You could craft an interview process that is aimed at absolutely having zero false positives, probing your candidates until you know them more than they know themselves, and accept a high rate of false negatives (and thus piss off some candidates) because the company is anyhow very slow in creating new openings.
Or you could argue that a strength of the company is to give a chance to candidates that would otherwise be overlooked, and thus "scout" gems.

Regardless, this means that you have to understand what your objectives are, what the company is looking for, and what specifically your role is in the overall interview process.
And that has to be tailored with the specific opening, the goals in an interview for a junior role are obviously not the same as the goals you have when interviewing a senior one.

Personally, I tend to focus on the technical part of the interview process: that's both my strongest suit, and I also often interview on behalf of other teams to provide an extra opinion. So, I don't focus on cultural fit and other aspects of the team dynamics.

Having your objectives clear won't just inform you on what to ask and how, but also how to evaluate the resulting conversations. Ideally one should always have an interview rubric, a chart that clearly defines what is that we are looking for and what traits correspond to what levels of demonstrated competence.

It's hard to make good standardized interviews as the easiest way to standardize a test is to dumb it down to something that has very rigid, predefined answers, but that doesn't mean that we should not strive to engineer and structure the interview in a way that makes it as standard, repeatable and unbiased as possible.

2 - What (usually) I care about.

Almost all my technical interviews have three objectives:
  • Making sure that the candidate fulfills the basic requirements of the position.
  • Understanding the qualities of the candidate: strengths and weaknesses. What kind of engineer I'm dealing with.
  • Not being boring.
This is true for basically any opening, what changes are the specific things I look for.

For a junior role, my objective is usually to hire someone who, once we account for the time spent by the senior staff on mentoring, still manages to create positive value, and doesn't end up being a net loss for the team.

In this case, I tend not to care much about previous experience. If you have something on your resume that looks interesting it might be a good starting point for discussion, but for the most part, the "requirement" part is aimed at making sure the candidate knows enough of the basics to be able to learn the specifics of the job effectively.

What university you were at (if any), what year, what scores, what projects you did: none of these things matter (much), because I want to directly assess the candidate's abilities.

When it comes to the qualities a junior should demonstrate, I look for two characteristics. First: their ability to unlearn, to be flexible.
Some junior engineers tend (and I certainly did back in my day) to misjudge their expertise: coming from higher education you get the impression that you actually know things, not that you were merely given the starting points from which to really learn. That disconnect is dangerous because inordinate amounts of time can be spent trying to change someone's mindset.

The second characteristic is, of course, to be eager and passionate about the field, trying to find someone who really wants to be there, wants to learn, experiment, grow.

For a senior candidate, I still will (almost) always have a "due diligence" part, unless it would truly be ridiculous because I do truly -know- already that the candidate fulfills it. 
I find this to be important because in practice I've seen how just looking at the resume and discussing past experience does not paint the whole picture. I think I can say that with good certainty. In particular, I've found that:
  • Some people are better than others at communicating their achievements and selling their work. Some people might be brilliant but "shy" while others might be almost useless but managed to stay among the right people and absorb enough to present a compelling case.
  • With senior candidates, it's hard to understand how "hands on" one still is. Some people can talk in great detail about stuff they didn't directly build.
  • Some people are more afraid of delving into details and breaking confidentiality than others.
  • Companies have wildly different expectations in given roles. In some, even seniors do not daily concern themselves with "lower level" programming, in others, even juniors are able to e.g. write GPU assembly.
When it comes to the qualities of a senior engineer, things are much more varied than with juniors. My aim is not necessarily to fit a given stereotype, but to gain a clear understanding, to then see if the candidate can fit somewhere.
What are the main interests? Research? Management? Workflows? Low-level coding? The tools and infrastructure, or the end-product itself?

Lastly, there is the third point: "not be boring", and for me, it's one of the most interesting aspects. To me, that means two things:
  • Respect the candidate's time.
  • Be aware that an interview is always a bi-directional communication. The interviewer is under exam too!
This is entirely an outcome that is guided by the interviewer's behavior. Nonetheless, it's a fundamental objective of the interview process: you have to demonstrate that you, and your company, are smart and thoughtful.

You have to assume that any good candidate will have many opportunities for employment. Why would they choose you, if you wasted their time, made the process annoying, and didn't demonstrate that you put effort into its design? What would that telegraph about the quality of the other engineers that were hired through it?

You have to be constantly aware of what you're doing, and what you're projecting.

3 - The technical questions.

I won't delve into details of what I do ask, both because I don't want to, and because I don't think it would be of too much value, being questions catered to the specific roles I interview for. But I do want to talk about the process of finding good technical questions, and what I consider to be the main qualities of good questions.

First of all, always, ALWAYS use your own questions and make them relevant to the job and position. This is the only principle I dare say is universal and, regardless of your company and objectives, should be adhered to.

There is no worse thing than being asked the same lazy, idiotic question taken from the list of the "ten most common questions in programming interviews" off the internet.

It's an absolute sin. 
It both signals that you care so little about the process that you just recycled some textbook bullshit, and it's ineffective because some people will just know the answer and be able to recite it with zero thought.
That will be a total waste of time, it's not even good for "due diligence" because people that just prepared and learned the answer to that specific question are not necessarily really knowledgeable in the field you're trying to assess.

The "trick" I use to find such questions is to just be aware during my job. You will face time to time interesting, general, small problems that are good candidates to become interview questions. Just be aware not to bias things too much to reflect your specific area of expertise.

Second, always ask questions that are simple to answer but hard to master. The best question is a conversation starter, something that allows many follow-ups in many different directions.

This serves two purposes. First, it allows people to be more "at ease", relaxed, as they don't immediately face an impossibly convoluted problem. Second, it allows you not to waste time: once you have established a problem setting, asking a couple of follow-ups is faster than asking two separate questions that need to be set up from scratch.

Absolutely terrible are questions that are either too easy, or too hard and left there. On one extreme you'll get no information and communicate that your interview is not aimed at selecting good candidates; on the other you will frustrate the candidate, who might start closing up and missing the answers even when you scale back the difficulty.

Also, absolutely avoid trick questions: problems that seem like they should have some very smart solution but really do not. Some candidates will be stuck thinking there ought to be more to the question and be unable to just dare try the "obvious" answer.

Third, avoid tailoring too much to your knowledge and expectations. There are undoubtedly things that are to be known for a given role, that are necessary to be productive and that are to be expected from someone that actually did certain things in the past. But there are also lots of areas where we have holes and others are experts.

Unfortunately, we can't really ask questions about things we don't know, so we have to choose among our areas of expertise. But we should avoid specializing too much, thinking that what we know is universal.

A good trick to sidestep this problem (other than knowing everything) is to have enough options that it is possible to discuss and find an area where both you and your candidate have experience. Instead of having a single, fixed checklist, have a good pool of options so you can steer things around.
This also allows you to make the interview more interactive, like a discussion where questions seem a natural product of a chat about the candidate's experience, instead of being a rigid quiz or checklist.

Lastly, make sure that each question has a purpose. Optimize for signal to noise. There are lots of questions that could be asked and little time to ask them. A good question leaves you reasonably informed that a given area has been covered. You might want to ask one or two more, but you should not need tens of questions just to probe a single area.

This can also be achieved, again, by making sure that your questions are steerable, so if you're satisfied e.g. with the insight you gained about the candidate's mastery of low-level programming and optimization, you could steer the next question towards more algorithmic concerns instead.

Conclusions.

I do not think there is a single good way of performing interviews. Some people will say that technical questions are altogether bad and to be avoided, some will have strong feelings about the presence of whiteboards in the room or computers, of leaving candidates alone solving problems or being always there chatting.

I have no particular feeling about any of these subjects. I do ask technical questions because of the issues discussed above, and for the record, I will draw things on a whiteboard and occasionally have people collaborating on it, if it's the easier way to explain a concept or a solution.

I do not use computers because I do not ask anything that requires real coding. The only exception is for internships: due to the large number of candidates, I do pre-screen with an at-home test. That's always real coding; candidates are tasked with making some changes to a small, and hopefully fun and interesting, codebase (which I either make ad-hoc from scratch or extract from suitable real-world programs).

I do not think any of this matters that much. What really matters is to take time, study, and design a process deliberately. In the real world, the worst interviews I've seen were never bad because of choosing one tool or another, but because no real thinking was behind them.

11 May, 2017

Where do GPUs come from.

A slide deck for an introduction to CG class.



PPTX - PDF (smaller)

Note: this is not really a tutorial in this form, there are no presenter's notes. But if you want to use this scheme to teach something similar, feel free. The CPU->GPU trajectory is heavily inspired by the brilliant work Kayvon Fatahalian did.

07 May, 2017

Privacy, bubbles, and being an expert.

Privacy is not the issue.

Much has been said about the risk of losing our privacy in this era of microwaves that can turn into cameras and other awful "internet of things" things. It seems that today there is nothing you can buy that does not both intentionally spy on your behaviors and is also insecure enough to allow third parties to spy on your behavior.

It's the wrong problem.

Especially when dealing with big, reputable companies, privacy is taken really quite seriously, there is virtually zero chance of anyone spying on you, as an individual. Even when it comes to anonymized data, care is taken to avoid singling out individuals, it might be unflattering, but big companies do not really care about you.

What they care about is targeting. Is being able to statistically know what various groups of people prefer, in order to serve them better and to sell them stuff.


CV Dazzle
The dangers of targeting.

Algorithmic targeting has two faces. There is a positive side to it, certainly. Why would a company not want to make its customers happier? If I know what you like, I can help you find more things that you will like, and yes, that will drive sales, but it's driving sales by effectively providing a better service, a sort of digital concierge. Isn't that wonderful? Why would anyone not opt in to such amazing technology...

But there is a dark side of this mechanism too, the ease with which algorithms can tune into the easiest ways to keep us engaged, to provide happiness, rewards. We're running giant optimizers attached to human minds, and these optimizers have access to tons of data and can do a lot of experiments on a huge (and quite real) population of samples, no wonder we can quickly converge towards a maximum of the gratification landscape.

Is it right to do so? Is it ethical? Are we really improving the quality of life, or are we just giving out quick jolts of pleasure and engagement? Who can say?
Where is the line, for example, between keeping a player in a game because it's great, for some definition of great, and doing so because it provides some compulsion loops that tap into basic brain chemistry the same way slot machines do?

Will we all end up living senseless lives attached to machines that provide to our needs, farmed like the humans in the Matrix?


Ellen Porteus

Pragmatically.

I don't know, and to be honest there are good reasons not to be a pessimist. Even with just a look at the history behind us, we had similar fears for many different technologies, and so far we have always come out on top.

We're smarter, more literate, more creative, more productive, happier, healthier, more peaceful and richer than we ever were, globally. It is true that technology is quickly making leaps and opening options that were unthinkable even just a decade ago, but it's also true that there is not too much of a reason to think we can not adapt.

And I think if we look at newer generations, we can already see how this adaptation is taking place. Even observing product trends, it seems to be becoming harder and harder to engage people with the most basic compulsion loops and cheap content; acquiring users is increasingly hard, and the products that in practice make it onto the market do so by truly offering some positive innovation.

Struggling with bubbles.

Even if I'm not a pessimist though, there is something I still struggle with: the apparent emergence of radicalization, echo chambers, bubbles. I have to admit, this is something hard to quantify on a global scale, especially when it comes to placing the phenomenon in a historical perspective, but it just bothers me personally, and I think it's something we have to be aware of.

I think we are at a peculiar intersection today. 

On one hand, we have increasingly risen out of ignorance and started to be concerned with the matters of the world more. This might not seem to be the case looking at Trump and so on, but it's certainly true if we look at the trajectory of humanity with a bit more long-term historical perspective.

On the other hand, the kind of problems and concerns we are presented with has increased in complexity exponentially. We are exposed to the matters of the world, and the world we live in is this enormous, interconnected beast where cause and effect get lost in the chaotic nature of interactions.

Even experts don't have easy answers, and I think we know that because we might be experts in a field or two, and most big questions I believe would be answered with "it depends".
There are a myriad of local optima in the kind of problems we deal with today, and which way to go is more about what can work in a given environment, with given people, than what can be demonstrably proven to be the best direction.


Echo Chambers

The issue.

And this is where a big monster rears its head. In a world with lots of content and information, with systems that allow us to quickly connect to huge groups of similar-minded people, algorithms that feed contents that agree with our views seeking instant satisfaction over exploration, true knowledge and serendipity, when faced with increasingly complex issues, how attractive the dark side of the confirmation bias becomes?

We have mechanisms built into all of us, regardless of how smart we are, that were designed not to seek the truth but to be effective when navigating the world and its social interactions. Cognitive biases are there because they serve us; they are tools that stemmed from evolution. But is our world changing faster than our brain's ability to evolve?

Pragmatically again though, I don't intend to look too much at the far future (which I believe is generally futile, as you're trying to peek into a chaotic horizon). What annoys me is that even when you are aware of all this, and all these risks today, it's becoming hard to fight the system.
There is simply too much content out there, and so many algorithms around you (even if you isolate yourself from them) tuned to spread it among different groups, that finding good information is becoming hard.

Then again, I am not sure of the scale of this issue, because if, again, we look at things historically, we are probably still on average better informed today and less likely to be deceived than even just a few decades ago, when most people were not informed at all and it was much easier to control the few means of mass communication available.

Yet, it unavoidably irks me to look around and be surrounded by untrustworthy content and even worse, content that is made to myopically serve a small world view instead of trying to capture a phenomenon in all its complexity (either with malice, or just because it's simpler and gets clicks).
Getting accurate, global data is incredibly hard, as it's increasingly valuable and thus, kept hidden for competitive advantage.

John W. Tomac

Being an expert, or just decent.

I find that similar mechanisms and balances affect our professional lives, unsurprisingly. I often say that experience is a variance reduction technique: we become less likely to make mistakes, more effective, more knowledgeable and able to quickly dismiss dangerous directions, but we also risk becoming less flexible, rooted in beliefs and principles that might not be relevant anymore.

I find no better example of these risks than in the trajectory of certain big corporations and how they managed to become irrelevant, not due to a lack of smart, talented people, but because at a given size one risks having a gravity of one's own, and truly believing in a snapshot of a world that has meanwhile moved on. How so many smart people can manage to be blinded.

Experience is a trade-off. We can be more effective even if we might be more wrong. Maybe, more importantly, we risk losing the ability to discover more revolutionary ideas.
How much should we be open to exploration and how much should we be focused on what we do best? How much should we seek diversity in a team, and how much should we value cohesion and unity of vision? I find these to be quite hard questions.

I don't have answers, but I do have some principles I believe might be useful. The first has to do with ego, and here it helps not to have a well-developed one, to begin with, because my suggestion is to go out and seek critique, "kill your darlings".
This was taught to me at an early age by an artist friend of mine who was always critical of my work and who, when I protested, made me notice that the only way to get better is to find people willing to trash what you do.

In practice, I think that we should be more severe, critical, doubtful of what we love and believe than anything else. We should set for our own ideas and social groups a higher standard of scrutiny than what we do for things that are alien to us.

The second principle that I believe can help is to encourage exploration, discovery, experimentation and failure. Going outside our comfort zones is always hard but even harder is to face failure, we don't like to fail, obviously and for good reasons.
So one cannot achieve these goals without setting some small, safe spaces where exploration is easier and not burdened (I would say unconstrained, but certain other constraints actually do help) by too much early judgment.

Lastly, beware that for how much you know about all this, and are willing to act, many times you will not. I don't always follow my own principles and I think that's normal. I try to be aware of these mechanisms though. And even there, the keyword is "try".

Epilog: affecting change.

I believe that polarizing, blinding, myopic forces are at work everywhere, in our personal and professional lives, in the society at large, and being aware of them is important even just to try to navigate our world.

But if instead of just navigating the world, one wants to actually affect change, then it's imperative to understand the fight that lies ahead.

The worst thing that can be ever done is to feed the polarization forces, cater to our own and scare away people who might have been willing to consider our ideals. It does not help, it damages.

Catering to our own enclaves, rallying our people, is easy and tempting and fulfilling. It's not useless, certainly, there is value in reaffirming people who are already inclined to be on our side, but there is much more to be gained in even just instilling a doubt by reaching out to someone who is on the opposite side, or undecided in the middle, than in solidifying the beliefs of people who share ours already.

You can even look at current events, elections and the way they are won.

Understanding people who think differently than us, applying empathy, extending reach, is so much harder. But it's also the only smart choice.



06 May, 2017

Shadow mystery

Can shadows cause the texture UVs to shift?

This was a bug assigned to one of our engineers. Puzzling. Instead of being really useful, I started to investigate with some offline rendering.


Look at the shadows of the disappearing pillar

Wow! Mental Ray is broken :)

So apparently yes, you can easily create this optical illusion. It's pretty easy to understand why: with a constant albedo, especially, the apparent texture of the surface comes entirely from shading. If the light moves, the highlights will traverse the surface, and that creates an illusion similar to a texture shift of just a few pixels.

The effect is much stronger when the ambient light creates a highlight opposite to the main light, so when the main light is shadowed the ambient highlight dominates and the shift becomes apparent.

On a render where the ambient and highlight are on the same side, the effect is much less pronounced.



The nail in the coffin, though, was when I managed to reproduce the same effect in real life. So, it's definitely an optical illusion that can happen, but it's probably made worse by realtime rendering of unshadowed ambient/GI on normalmaps, and by the fact that shadows abruptly cancel the sun/sky light instead of gradually blocking only some rays and rolling the highlight direction off in the penumbra region.

01 April, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (part 2)

TL;DR: You probably don't need DNNs, but you ABSOLUTELY SHOULD know and practice data analysis!

Part 1 here!

- Curse of dimensionality

Classical machine learning works on datasets of small or moderate number of dimensions. If you ever tried to do any function approximation or to work with optimization in general, you should have noticed that all these problems suffer from the "curse of dimensionality".

Say for example that we have a set of outdoor images of a fixed size: 256x128. And let's say that this set is labeled with a value that tells at what time of the day the image was taken. We can see this dataset as samples of an underlying function that takes as input 98304 (WxHxRGB) values and outputs one.
In theory, we could use standard function fitting to find an expression that in general can tell from any image the time it was taken, but in practice, this approach goes nowhere: it's very hard to find functional expressions in so many dimensions!


More dimensions, more directions, more saddles!
The classic machine learning approach to these problems is to do some manual feature selection. We can observe our data set and come up with some statistics that describe our images compactly, and that we know are useful for the problem at hand.
We could, for example, compute the average color, thinking that a different time of day strongly changes the overall tinting of the photos, or we could compute how much contrast we have, how many isolated bright pixels, all sorts of measures, then use our machine learning models on these.
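As a tiny sketch of what this looks like in practice (with random arrays standing in for a real dataset, and features chosen purely for illustration):

```python
import numpy as np

# Hand-crafted features + a simple linear model, instead of learning directly
# from the ~98k raw pixel values.

rng = np.random.default_rng(0)
images = rng.random((200, 128, 256, 3))       # 200 fake HxWxRGB "photos"
time_of_day = rng.random(200) * 24.0          # fake labels, hours in [0, 24)

def features(img):
    mean_rgb = img.mean(axis=(0, 1))          # average colour (3 values)
    contrast = img.std()                      # crude global contrast measure
    bright = (img.max(axis=2) > 0.9).mean()   # fraction of very bright pixels
    return np.concatenate([mean_rgb, [contrast, bright]])

X = np.stack([features(im) for im in images]) # 200 x 5 feature matrix
X = np.column_stack([X, np.ones(len(X))])     # add a bias column
w, *_ = np.linalg.lstsq(X, time_of_day, rcond=None)
pred = X @ w                                  # linear regression on 5 features
print("RMSE (hours):", np.sqrt(np.mean((pred - time_of_day) ** 2)))
```

On random data this obviously learns nothing; the point is only the shape of the pipeline: 98304 dimensions reduced to a handful of hand-picked ones, then a very simple fit.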


A scatterplot matrix can show which dimensions of a dataset have interesting correlations.
The idea here is that there is a feedback loop between the data, the practitioner, and the algorithms. We look at the data set, make hypotheses on what could help reduce dimensionality, project, learn; if it didn't work well we rinse and repeat. 
This process is extremely useful, and it can help us understand the data better, discover relationships, invariants.

Multidimensional scaling, reducing a 3d data set to 1D
The learning process can sometimes instruct the exploration as well: we can train on a given set of features we picked, then notice that the training didn't really use some of them much (e.g. they have very small weights in our functional expression), and in that case we know these features are not very useful for the task at hand.
Or we could use a variety of automated dimensionality reduction algorithms to create the features we want. Or we could use interactive data visualization to try to understand which "axes" have more interesting information...

- Deep learning

Deep learning goes a step forward and tries to eliminate the manual feature engineering part of machine learning. In general, it refers to any machine learning technique that learns hierarchical features.

In the previous example we said it's very hard to learn directly from image pixels a given high-level feature, we have to engineer features first, do dimensionality reduction. 
Deep learning comes and says, ok, we can't easily go from 98304 dimensions to time-of-day, but could we automatically go from 98304 dimensions to maybe 10000, which represent the data set well? It's a form of compression, can we do a bit of dimensionality reduction, so that the reduced data still retains most of the information of the original?

Well, of course, sure we can! Indeed we know how to do compression, we know how to do some dimensionality reduction automatically, no problem. But if we can do a bit of that, can we then keep going? Go from 10000 to 1000, from 1000 to 100, and from 100 to 10? Always nudging each layer so it keeps in mind that we want features that are good for a specific final objective? In other words, can we learn good features, recursively, thus eliminating the laborious process of data exploration and manual projection?

Turns out that with some trickery, we can.

DNN for face recognition, each layer learns higher-order features
- Deep learning is hard

Why does learning huge models fail? Try to picture the process: you have a very large vector of parameters to optimize, a point in a space of thousands, tens of thousands of dimensions. Each optimization step you want to choose a direction in which to move, you want to explore this space towards a minimum of the error function.
There are just so many possible choices, if you were to explore randomly, just try different directions, it would take forever. If before doing a step we were to try any possible direction, we would need to evaluate the error function at least once per dimension, thousands of times.

Your most reliable guide through these choices is the gradient of the error function, a big arrow telling in which direction the error is going to be smaller if a small step is taken. But computing the gradient of a big model itself is really hard, numerical errors can lead to the so-called gradient diffusion.



Think of a neural network with many, many layers. A choice (change, gradient) of a weight in the first layer will change the output in a very indirect way, it will alter a bit the output of the layer but that output will be fed into many others before reaching the destination, the final value of the neural network.
The relationship between the layers near the output and the output itself is more clear, we can observe the layer inputs and we know what the weights will do, but layers very far from the output contribute in such an indirect way!


Imagine wanting to change the output of a pachinko machine: altering the pegs at the bottom has a direct result on how the balls will fall into the baskets, but changing the top peg will have less predictable results.
Another problem is that the more complex the model is, the more data we need to train it. If we have a model with tens of thousands of parameters, but only a few data points, we can easily "overfit": learn an expression that perfectly approximates the data points we have, but that is not related to the underlying process that generated these data points, the function that we really wanted to learn and that we know only by examples!


Overfitting example.
In general, the more powerful we want our model to be, the more data we need to have. But obtaining the data, especially if we need labeled data, is often non-trivial!
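A quick numerical illustration of overfitting, on synthetic data: fit a polynomial with as many parameters as we have (noisy) samples and it will nail the training points while being far off in between them.

```python
import numpy as np

# Degree-9 polynomial fit to 10 noisy samples of a smooth function:
# near-zero training error, much larger error away from the training points.

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=10)

p = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

x_test = np.linspace(0.05, 0.95, 200)
y_test = np.sin(2 * np.pi * x_test)           # the "true" underlying function

print("max train error:", np.abs(p(x_train) - y_train).max())
print("max test  error:", np.abs(p(x_test) - y_test).max())
```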

- The deep autoencoder

One idea that worked in the context of neural networks is to try to learn stuff layer-by-layer, instead of trying to train the whole network at once: enter the autoencoder.

An autoencoder is a neural network that instead of approximating a function that connects given inputs to some outputs, it connects inputs to themselves. In other words, the output of the neural network has to be the same as the inputs (or alternatively we can use a sparsity constraint on the weights). We can't just use an identity neural network though, the kink in this is that the autoencoder has to have a hidden layer with fewer neurons than the number of inputs!


Stacked autoencoder.
In other words, we are training the NN to do dimensionality reduction internally, and then expand the reduced representation back to the original number of dimensions, an autoencoder trains a coding layer and a decoding layer, all at once. 
This is perhaps surprisingly not hard, if the hidden layer bottleneck is not too small, we can just use backpropagation, follow the gradient, find all the weights we need.

The idea of the stacked autoencoder is to not stop at just one layer: once we train a first dimensionality-reduction NN we keep going by stripping the decoding layer (output) and connecting a second autoencoder to the hidden layer of the first one. And we can keep going from there, by the end, we'll have a deep network with many layers, each smaller than the preceding, each that has learned a bit of compression. Features!

The stacked autoencoder is unsupervised learning, we trained it without ever looking at our labels, but once trained nothing prevents us from doing a final training pass in a supervised way. We might have gone from our thousands of dimensions to just a few, and there at the end we can attach a regular neural network and train the entire thing in a supervised way.

As the "difficult" layers, the ones very far from the outputs, have already learned some good features, the training will be much easier: the weights of these layers can still be affected by the optimization process, specializing to the dataset we have, but we know they already start in a good position, they have learned already features that in general are very expressive.

- Contemporary deep learning

More commonly today deep learning is not done via an unsupervised pre-training of the network, instead, we often are able to directly optimize bigger models. This has been possible via a better understanding of several components:

- The role of initialization: how to set the initial, randomized, set of weights.
- The role of the activation function shapes.
- Learning algorithms.
- Regularization.

We still use gradient descent type of algorithms, first-order local optimizers that use derivatives, but typically the optimizer uses approximate derivatives (stochastic gradient descent: the error is computed only with random subsets of the data) and tries to be smart (adagrad, adam, adadelta, rmsprop...) about how fast it should descend (the step size, also called learning rate).
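A bare-bones illustration of the stochastic part, on a synthetic linear problem (model, data, batch size and learning rate are all toy choices):

```python
import numpy as np

# Stochastic gradient descent: each step uses the gradient computed on a random
# minibatch rather than on the whole dataset.

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(scale=0.1, size=10_000)

w, lr, batch = np.zeros(20), 0.05, 64            # lr is the learning rate / step size
for step in range(2_000):
    idx = rng.integers(0, len(y), size=batch)    # random minibatch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= lr * grad
print("parameter error:", np.linalg.norm(w - w_true))
```

Adagrad, Adam and friends differ from this mostly in how they adapt that `lr` per parameter over time.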

In DNNs we don't really care about reaching a global optimum, it's expected for the model to have myriads of local minima (because of symmetries, of how weights can be permuted in ways that yield the same final solution), but reaching any of them can be hard. Saddle points are more common than local minima, and ill conditioning can make gradient descent not converge.

Regularization: how to reduce the generalization error.

By far the most important principle though is regularization. The idea is to try to learn general models, that do not just perform well on the data we have, but that truly embody the underlying hidden function that generated it. 

A basic regularization method that is almost always used with NNs is early stopping. We split the data into a training set (used to compute the error and the gradients) and a validation set (used to check that the solution is able to generalize): after some training iterations we might notice that the error on the training set keeps going down while the one on the validation set starts rising; that's when overfitting is beginning to take place (and we should stop the training "early").
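A self-contained toy of that recipe, with plain gradient descent on synthetic data (the split sizes and learning rate are arbitrary): keep the weights from the iteration where the validation error was lowest, rather than the last ones.

```python
import numpy as np

# Early stopping by hand: train on one split, monitor the error on the other,
# and remember the best-so-far weights.

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 50))                    # few points, many parameters
true_w = np.zeros(50); true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=0.3, size=60)

X_tr, y_tr = X[:40], y[:40]                      # training set
X_va, y_va = X[40:], y[40:]                      # validation set

w = np.zeros(50)
best_w, best_err, lr = w.copy(), np.inf, 0.01
for step in range(2_000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad
    val_err = np.mean((X_va @ w - y_va) ** 2)
    if val_err < best_err:                       # validation error still improving?
        best_err, best_w = val_err, w.copy()

print("validation error at the end:    ", np.mean((X_va @ w - y_va) ** 2))
print("validation error, early stopped:", best_err)
```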

In general, regularization can be done by imposing constraints (which can often be added to the error function as extra penalties) and biasing towards simpler models (think Occam's razor) that explain the data; we accept performing a bit worse in terms of error on the training data set if in exchange we get a simpler, more general solution.

This is really the key to deep neural networks: we don't use small networks, we construct huge ones with a large number of parameters because we know that the underlying problem is complex, it's probably exceeding what a computer can solve exactly.
But at the same time, we steer our training so that our parameters try to be as sparse as possible; it has to have a cost to use a weight, to activate a circuit, this, in turn, ensures that when the network learns something, it's something important. 
We know that we don't have lots of data compared to how big the problem is, we have only a few examples of a very general domain.

Other ways to make sure that the weights are "robust" are to inject noise, or to not always train with all of them, cutting different parts of the network out as we train (dropout).


Google's famous "cat" DNN
Think for example of a deep neural network that has to learn how to recognize cats. We might have thousands of photos of cats, but still, there is really an infinite variety; we can't even enumerate all the possible cats in all possible poses, environments and so on. And we know that in general we need a large model to learn about cats: recognizing complex shapes is something that requires quite some brain-power.
What we want, though, is to avoid having our model just learn to recognize exactly the handful of cats we showed it; we want it to extract some higher-level knowledge of what cats look like.

Lastly, data augmentation can be used as well: we can always try to generate more data from a smaller set of examples. We can add noise and other "distractions" to make sure that we're not learning the specific examples provided too closely.
Maybe we know that certain transforms yield still-valid examples of the same data: a rotated cat is still a cat, and a cat behind a pillar, or on a different background, is still a cat. Or maybe we can't generate more data of a given kind, but we can generate "adversarial" data: examples of things that are not what we are looking for.
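A tiny sketch of label-preserving augmentation (with a random array standing in for a real photo, and transforms picked just for illustration):

```python
import numpy as np

# Each original image yields several transformed copies that are "still a cat".

def augment(img, rng):
    out = [img,
           img[:, ::-1],                                   # horizontal flip
           np.rot90(img, k=int(rng.integers(1, 4)))]       # 90-degree rotation
    noisy = img + rng.normal(scale=0.05, size=img.shape)   # mild noise
    out.append(np.clip(noisy, 0.0, 1.0))
    return out

rng = np.random.default_rng(0)
cat = rng.random((64, 64, 3))                              # stand-in for a real photo
print(len(augment(cat, rng)), "training examples from one image")
```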

- Do you need all this, in your daily job?

Today there is a ton of hype around deep learning and deep neural networks, with huge investments on all fronts. DNNs are groundbreaking in terms of their representation power, and deep learning is even guiding the design of new GPUs and ad-hoc hardware! But, for all the fantastic accomplishments and great results we see, I'd say that most of the time we don't need it...

One key compromise we have with deep learning is that it replaces feature engineering with architecture engineering. True, we don't need to hand-craft features anymore, but this doesn't mean that things just work!
Finding the right architecture for a deep learning problem is hard, and it's still mostly an art, done with experience and trial and error.

This might very well be a bad tradeoff. When we explore data and try to find good features we effectively learn (ourselves) some properties of the data. We make hypotheses about what might be significant and test them. 
In contrast, deep neural networks are much more opaque and impenetrable (even if some progress has been made). And this is important, because it turns out that DNNs can be easily fooled (even if this is being "solved" via adversarial learning).

Architecture engineering also has slower iteration times: each time we have to train our architecture to see how it works, and we need to tune the training algorithms themselves... In general building deep models is expensive, both in terms of human and machine time.


Decision trees, forests and gradient boosting are much more explainable classifiers than DNNs
Deep models are in general much more complex and expensive than hand-crafted feature-based ones when it's possible to find good features for a given problem. In fact nowadays, together with solid and groundbreaking new research, we also see lots of publications of little value, that simply take a problem and apply a questionable DNN spin to it, with results that are not really better than the state of the art solution made with traditional handcrafted algorithms...

And lastly, deep models will always require more data to train. The reason is simple: when we create statistical features ourselves, we're effectively giving the machine learning process some a-priori model. We know that some things make sense, correlate with our problem and that some others do not. 
This a-priori information acts like a constraint: we restrict what we are looking for, but in exchange we get fewer degrees of freedom, and thus fewer data points are needed to fit our model.

In deep learning, we want to discover features by using very powerful models. We want to extract very general knowledge from the data, and in order to do so, we need to show the machine learning algorithm lots of examples...

In the end, the real crux of the issue is that most likely, especially if you're not already working with data and machine learning on a problem, you don't need to use the most complex, state of the art weapon in the machine learning arsenal in order to have great results!

- Conclusions

The way I see all this is that we have a continuum of choices. On one end we have expert-driven solutions: we have a problem, we apply our intellect and we come up with a formula or an algorithm that solves it. This is the conventional approach to (computer) science and when it works, it works great.

Sometimes the problems we need to solve are intractable: we might not have enough information, we might not have an underlying theoretical framework to work with, or we might simply not have in practice enough computational resources.

In these cases, we can find approximations: we make assumptions, effectively chipping away at the problem, constraining it into a simpler one that we know how to solve directly. Often it's easy to make somewhat reasonable assumptions leading to good solutions. It's very hard though to know whether the assumptions we made are the best possible.

On the other end, we have "ignorant", black-box solutions: we use computers to automatically discover, learn, how to deal with a problem, and our intelligence is applied to catering to the learning process, not to the underlying problem we're trying to solve.
If there are assumptions to be made, we hope that the black-box learning process will discover them automatically from the data, we don't provide any of our own reasoning.

This methodology can be very powerful and yield interesting results, as we didn't pose any limits, it might discover solutions we could have never imagined. On the other hand, it's also a huge waste: we are smart! Using our brain to chip away at a problem can definitely be better than hoping that an algorithm somehow will do something reasonable!

In between, we have an ocean of shades of data-driven solutions... It's like the old adage that one should not try to optimize a program without profiling, in general, we should say that we should not try to solve a problem without having observed it a lot, without having generated data and explored the data.

We can avoid making early assumptions: we just observe the data and try to generate as much data as possible, capturing everything we can. Then, from the data, we can find solutions. Maybe sometimes it will be obvious that a given conventional algorithm will work, we can discover through the data new facts about our problem, build a theoretical framework. Or maybe, other times, we will just be able to observe that for no clear reason our data has a given shape, and we can just approximate that and get a solution, even if we don't know exactly why...

Deep learning is truly great, but I think an even bigger benefit of the current DNN "hype", other than the novel solutions it's bringing, is that more generalist programmers are exposed to the idea of not making assumptions and writing algorithms out of thin air, but instead trying to generate data-sets and observe them.
That, to me, is the real lesson for everybody: we have to look at data more, now that it's increasingly easy to generate and explore it.

Deep learning, then, is just one tool that sometimes is exactly what we need, but most of the time is not. Most of the time we do know where to look. We do not have huge problems in many dimensions. And in these cases very simple techniques can work wonders! Chances are that we don't even need neural networks; they are not in general that special.

Maybe we needed just a couple of parameters and a linear regressor. Maybe a polynomial will work, or a piecewise curve, or a small set of Gaussians. Maybe we need k-means or PCA.
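For instance, something as small as this is often a perfectly respectable "model" (synthetic data, cubic chosen arbitrarily):

```python
import numpy as np

# Fit a low-degree polynomial to noisy samples of some measured quantity.
x = np.linspace(0.0, 1.0, 100)
y = np.sin(3 * x) + np.random.default_rng(0).normal(scale=0.05, size=100)

coeffs = np.polyfit(x, y, deg=3)       # a cubic: four parameters
approx = np.polyval(coeffs, x)
print("max abs error:", np.abs(approx - y).max())
```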

Or we can use data just to prove certain relationships exist to then exploit them with entirely hand-crafted algorithms, using machine learning just to validate an assumption that a given problem is solvable from a small number of inputs... Who knows! Explore!

Links for further reading.

- This is a good DNN tutorial for beginners.
- NN playground is fun.
- Keras is a great python DNN framework.
- Alan Wolfe at Blizzard made some very nice blog posts about NNs.
- You should in general know about data visualization, dimensionality reduction, machine learning, optimization & fitting, symbolic regression...
- A good tutorial on deep reinforcement learning, and one on generative adversarial networks.
- History of DL
- DL reading list. Most cited DNN papers. Another good one.
- Differentiable programming, applies gradient descent to general programming.


Thanks to Dave Neubelt, Fabio Zinno and Bart Wronski 
for providing early feedback on this article series!