Implementing Needlets

Download the Presentation

At GDC 2012 as part of the Math for Game Programmers course, I presented some research done in tandem with Mannie Ko called “Frames, Quadratures and Global Illumination: New Math for Games“. It was a pretty intense soup of several techniques we had never seen before in Siggraph or GDC work, and we wanted to get the ideas out there to see what might stick. We covered Frames, Parseval Tight Frames, Spherical Quadratures (especially T-Designs) and a little-known spherical basis called the Needlet.

Needlets

Needlets were discovered and used by the cosmology community to process the Cosmic Microwave Background (CMB) data sets arriving from COBE, WMAP, Planck and many other experiments. Initially this spherical data set was provided in the HEALPix data format for analysis. The question the cosmologists wanted to answer was whether the data was non-Gaussian in certain directions, and to do that required breaking the data into directional, oriented, frequency-specific packets of information. SH analysis of the CMB data started to get unwieldy once high-res images began to come in (50M pixels each, in multiple frequencies). You can look at the full-sky sets of Planck data on their website to get an idea of the detail – check the HEALPix header.txt and look for the pixel counts and number of channels in each file.

[Image: Planck full-sky map]

The other problem with whole sky images in cosmology is that there’s a significant part of the sky occluded by the Milky Way and some major stars, and cosmologists routinely subtract these areas from full-sky data using a set of standard masks f_{sky} called the galactic cut region. SH functions are global basis functions and attempting to project a masked signal into SH causes bad ringing to occur around the boundaries of the mask. What is needed for PRT is a spherical basis that:

  1. is localized, i.e. produces no anomalous signals on the far side of the sphere.
  2. has a natural embedding on the sphere – e.g. not a wavelet generated on a plane and projected.
  3. is rotationally invariant – i.e. the reconstruction works equally well in all orientations, and the signal does not “throb” on rotation.
  4. preserves the norm, so no energy is lost in projection.

Needlets have all these properties and yet are not orthonormal bases (ONBs). Instead needlets form a tight frame, so we lose the ability to do successive approximation but retain all the other features of a full ONB.

Implementing Needlets

The paper D. Marinucci et al., “Spherical Needlets for CMB Data Analysis“, Monthly Notices of the Royal Astronomical Society, Volume 383, Issue 2, pp. 539-545, January 2008, has a good step-by-step recipe for generating families of needlet bases in practice.

First we form the Littlewood-Paley (LP) weighting function. Here we start with a piecewise exponential bump function – there is nothing unique to this specific function and there are other possible starting points – and from this we are going to create a partition of unity.

[Plot: the bump function f(x)]

\displaystyle f(x) = \left \{ \begin{array}{ll} \exp( \frac{-1}{1-x^2} ) & \quad -1 \leq x \leq 1 \\ 0 & \quad otherwise \end{array} \right.

We can construct from the bump a smooth, C^\infty, non-decreasing function \psi(x) by integrating the bump along the number line. Note how this function is normalized to the range (0,1) over the interval (-1,1):

[Plot: \psi(x)]

\displaystyle \psi(x) = \frac{\int_{-1}^{x} f(t)\,dt}{\int_{-1}^{1} f(t)\,dt}

Calculating this integral in code is not as scary as it seems: the denominator evaluates to the constant value 0.4439938162, and because f(x) is smooth and continuous, numerical integration of the numerator using Simpson’s quadratic rule works well.

// The base function used for Littlewood-Paley decomposition.
double Needlet::f(const double t) const
{
  if (t <= -1.0 || t >= 1.0) return 0.0;
  return exp(-1.0 / (1.0 - t*t));
}

// Integrate the base function f() over the range [a..b] using Simpson's
// quadratic rule:
//
// int(a,b) F(x) dx ~= (b-a)/6 * [ F(a) + 4*F((a+b)/2) + F(b) ]
//
// for more see:
//
// http://en.wikipedia.org/wiki/Simpson%27s_rule
//
// a,b = begin and end bounds for the numerical integration
// num_steps = the number of three-point sections to evaluate (num_steps > 0)
//
// There are no error conditions.
//
// Tests (run with the integrand swapped for fn(x) = exp(x)):
// integrate_simpson(fn, 1.0, 2.0, 20) == 4.6707742806070796
// integrate_simpson(fn, 1.0, 4.0, 20) == 51.879877318090060
// integrate_simpson(fn, 1.0, 4.0, 90) == 51.879868226923747
//
double Needlet::integrate_g_simpson(const double a,
  const double b,
  const uint32_t num_steps) const
{
  uint32_t nn = num_steps * 2;
  double halfstep = (b - a) / nn;
  double sum = f(a);
  for (uint32_t i=1; i < nn; i += 2) {
    double x = a + halfstep * i;
    sum += 4.0 * f(x);
  }
  for (uint32_t i=2; i < nn-1; i += 2) {
    double x = a + halfstep * i;
    sum += 2.0 * f(x);
  }
  sum += f(b);
  double final = halfstep * sum / 3;
  return final;
}

// Calculate :
//
//          int(t=-1..x) f(t) dt
// psi(x) = --------------------
//          int(t=-1..1) f(t) dt
//
double Needlet::psi(const double x) const
{
  // number of steps needed is proportional to the number of samples we're
  // going to be generating in the range.
  uint32_t num_steps = std::max(40u, m_maxWeight - m_minWeight);
  const double int_g = 0.4439938162; // = int(x=-1..1) f(x) dx
  return integrate_g_simpson(-1.0, x, num_steps) / int_g;
}

Construct the piecewise function \varphi(x) to make a function that’s constant over (0, B^{-1}) and decreasing to zero over (B^{-1}, 1). In the original paper the variable B represents “bandwidth” and was set to the constant value 2, but in Baldi et al., “Asymptotics for Spherical Needlets“, Annals of Statistics, Vol. 37, No. 3, 2009, it was generalized to any B>1 to allow for more tweaking. The plot below shows curves for B=\{1.5, 2.0, 3.0\}, where the higher values have shallower falloff:

[Plot: \varphi(x) for B = 1.5, 2.0, 3.0]

\displaystyle \varphi(x) = \left\{ \begin{array}{ll} 1 & \quad 0 \leq x \leq \frac{1}{B} \\ \psi(1-\frac{2B}{B-1}(x-\frac{1}{B})) & \quad \frac{1}{B} \leq x \leq 1 \\ 0 & \quad otherwise \end{array} \right.

The final part is to create the Littlewood-Paley weighting function b(x) by taking the positive square root of the difference between two scaled copies of \varphi. The plot below shows the b(x) function for B=\{1.5, 2.0, 3.0\}; note how the B=3.0 curve is non-zero from \frac{1}{3} to 3.0:

[Plot: b(x) for B = 1.5, 2.0, 3.0]

\displaystyle b(x) = \sqrt{\varphi(\frac{x}{B})-\varphi(x)}

Needlets, like most other wavelets, have detail levels similar to SH “bands”, in our case specified by the parameter j. We generate them by shifting and scaling the b(x) function using b(\frac{\ell}{B^j}). The needlet equation evaluates this continuous function at integer offsets to use as discrete weights when summing spherical functions, so the end of this involved process is a simple table of scalar weights. For example, the plot below where B=2, j=4 has 16 entries of which only 12 are non-zero. In general the index of the first non-zero weight is floor(B^{(j-1)}) and the last is floor(B^{(j+1)}):

[Plot: discrete needlet weights for B=2, j=4]

// The Littlewood-Paley decomposition function.
//
//          { 1                             if 0 <= t < 1/B
// phi(t) = { psi(1 - 2B/(B-1) * (t - 1/B)) if 1/B <= t <= 1
//          { 0                             if t > 1
//
double Needlet::phi(const double t) const
{
  if (t <= m_invB) return 1.0;
  if (t <= 1.0) return psi(1.0 - (2.0 * m_B * (t - m_invB)) / (m_B - 1.0));
  return 0.0;
}

// The needlet weighting function, evaluated at integer positions on t.
//
// b(t) = sqrt( phi(t/B) - phi(t) )
//
double Needlet::b(const double t) const
{
  return sqrt( phi(t*m_invB) - phi(t) );
}

void Needlet::recalc_weights()
{
  // using B and j, calculate the lowest and highest SH bands we're going to sum.
  // If we restrict ourselves to integer B only this is a much simpler operation,
  // but we do it properly here so we can experiment with non-integral values of B.
  assert(m_j >= 0);
  assert(m_B > 1.0);

  // precalculate the constants.
  m_invB = 1.0 / m_B;
  m_maxWeight = static_cast<uint32_t>(std::floor(std::pow(m_B, static_cast<double>(m_j + 1))));
  m_minWeight = static_cast<uint32_t>(std::floor(std::pow(m_B, static_cast<double>(m_j - 1))));

  // precalculate the needlet weight table b(j). The non-zero values of b(j) lie
  // inside the range B^(j-1) to B^(j+1), which is why we retain the start and
  // size of the range.
  delete[] m_weight;
  m_weight = new double[m_maxWeight - m_minWeight];
  double Btoj = pow(m_B, static_cast<double>(m_j));
  for (uint32_t i=0; i < m_maxWeight - m_minWeight; ++i) {
    m_weight[i] = b( (i + m_minWeight) / Btoj );
  }
}

Legendre Polynomials

So what are we making weighted sums of? In the derivation, needlets are expressed as the weighted sum of many spherical harmonics Y_{\ell m}, the complex-valued SH basis:

\displaystyle \sum_{m=-\ell}^{\ell}Y_{\ell m}^*(\xi)Y_{\ell m}(\xi')

By the SH addition theorem, summing all the SH bases across a single band boils down to a simple Legendre-P polynomial P_{n}(x) evaluated over the surface of the sphere. We evaluate needlets relative to a specific direction \mathbf{v} by finding the angular distance across the sphere to the point we are evaluating – in 3D that’s a simple dot product, P_{n}(\mathbf{v}' \cdot \mathbf{v}).

[Plot: Legendre polynomials]

To make the basis easier to use we fold the \frac{2n+1}{4\pi} factor from the addition theorem into the Legendre polynomials, giving the normalized form:

\displaystyle L_n(\mathbf{v}' \cdot \mathbf{v}) = \frac{2n+1}{4 \pi} P_n(\mathbf{v}' \cdot \mathbf{v})

Generating Legendre Polynomials of level n in code is an iterative algorithm that makes use of Bonnet’s Recursion:

\displaystyle (n+1)P_{n+1}(t) = (2n+1)tP_n(t)-nP_{n-1}(t)

By starting the sequence with P_0(t)=1 and P_1(t)=t the resulting code is pretty fast and lets you evaluate samples as you iterate up through the levels; nothing is wasted (apart from the precharge where you skip past the zero-valued Littlewood-Paley weights).

// Generate a Legendre polynomial at point X of order N.
double Needlet::legendre(double x, uint32_t order)
{
  double n = 2.0; // polynomial order
  double p = 0.0; // p(n)
  double pm1 = x; // p(n-1)
  double pm2 = 1.0; // p(n-2)

  for (uint32_t count=0; count <= order; ++count) {
    if (count == 0) p = 1.0;
    else if (count == 1) p = x;
    else {
      p = ( (2.0 * n - 1.0) * x * pm1 - (n - 1.0) * pm2 ) / n;
      pm2 = pm1;
      pm1 = p;
      n += 1.0;
    }
  }
  return p;
}
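
A couple of spot checks against the closed forms P_2(x) = (3x^2-1)/2 and P_3(x) = (5x^3-3x)/2 catch off-by-one errors in that loop. This is a test snippet of mine, assuming a Needlet instance:

// Spot-check the recursion against exact closed-form values (test snippet only).
#include <cassert>
#include <cmath>

void test_legendre(Needlet &needlet)
{
  assert(std::fabs(needlet.legendre(0.5, 2) - (-0.125))  < 1e-12);  // P2(0.5) = (3*0.25 - 1)/2
  assert(std::fabs(needlet.legendre(0.5, 3) - (-0.4375)) < 1e-12);  // P3(0.5) = (5*0.125 - 1.5)/2
}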

The Needlet Equation

The final Needlet equation is the weighted sum of many Legendre polynomials, evaluated as a 1D function that is keyed from the dot-product between two vectors.

\displaystyle N(j,B) = \sqrt{\lambda_k}\sum_{i=0}^{d} b\left(\frac{i}{B^j}\right) L_i(\mathbf{v} \cdot \mathbf{v}')

Here \lambda_k is the quadrature weight associated with the needlet’s center direction (see the code comment below). Plotting the function, we find a highly directional wavelet with all the energy focused into the main direction, cutting out almost all of the “ghost lights” that can occur with SH. This diagram shows the same N(1,3) function as linear, polar and spherical plots.

[Figure: the N(1,3) needlet as linear, polar and spherical plots]

Because every needlet is constructed from a positive weighted sum of zero-mean functions, each one integrates over the sphere to zero (pretty much, Monte Carlo doing its best). Diagrams for N(1,2), N(2,2), N(3,2), N(4,2):

[Figures: needlets for B=2 at increasing j]

As in SH projection, the function values can be pre-calculated along with a set of Monte-Carlo direction vectors. Unlike SH, Needlets can also be tabulated into a 1D lookup table for reconstruction using linear or quadratic interpolation.

// calculate a single needlet function on band (j)
//
//             d(j)
//  (j)   _(j) ----  (j)        (j)
// w   = |y    \    b   L (e . e   )
//  k   \| k   /     i   i      k
//             ----
//             i = 0
// where
//   d(j) is the maximum index i where b(i) is non-zero
//   b(i) is the needlet weighting for SH band (i)
//   L(i) is the Legendre Polynomial i
//   x = e.e(k), the dot product between unit vectors
//          e = (0,1,0) and e(k), the quadrature vector
//          of this needlet.
//   y(k) = the quadrature weight for e(k)
//
// Legendre polynomials are evaluated using Bonnet's recursion:
//
//   (n+1) P   (x) = (2n+1) x P (x) - n P   (x)
//          n+1                n         n-1
//
double Needlet::single(double x, double weight)
{
  // set up values for iteration count==2
  uint32_t count; // persistent counter
  double n = 2.0; // polynomial order
  double p = x; // p(n)
  double pm1 = x; // p(n-1)
  double pm2 = 1.0; // p(n-2)

  // precharge the Legendre polynomial to n = minWeight
  for (count = 0; count < m_minWeight; ++count) {
    if (count == 0) p = 1.0;
    else if (count == 1) p = x;
    else {
      p = ( (2.0 * n - 1.0) * x * pm1 - (n - 1.0) * pm2 ) / n;
      pm2 = pm1;
      pm1 = p;
      n += 1.0;
    }
  }

  // sum the weighted Legendre polynomials from minWeight to maxWeight
  // NOTE: the values "count" and "n" roll over from the previous loop.
  double sum = 0.0;
  for (; count < m_maxWeight; ++count) {
    if (count == 0) p = 1.0;
    else if (count == 1) p = x;
    else {
      p = ( (2.0 * n - 1.0) * x * pm1 - (n - 1.0) * pm2 ) / n;
      pm2 = pm1;
      pm1 = p;
      n += 1.0;
    }
    sum += p * m_weight[count - m_minWeight];
  }

  // return the result
  return sqrt(weight) * sum;
}
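
As mentioned above, because the needlet is purely a function of the dot product it can be baked into a 1D table. A minimal sketch, assuming only the Needlet class from the listings above; the helper names, table size and choice of linear interpolation are mine:

// Bake a needlet into a 1D lookup table over x = dot(v, v_k) in [-1, 1]
// and reconstruct with linear interpolation. Illustration only.
#include <cstdint>
#include <vector>

std::vector<double> build_needlet_lut(Needlet &needlet, double quadWeight,
  uint32_t tableSize) // tableSize must be >= 2
{
  std::vector<double> lut(tableSize);
  for (uint32_t i = 0; i < tableSize; ++i) {
    double x = -1.0 + 2.0 * i / (tableSize - 1);  // table index -> [-1, 1]
    lut[i] = needlet.single(x, quadWeight);
  }
  return lut;
}

double lookup_needlet(const std::vector<double> &lut, double x)
{
  double t = (x + 1.0) * 0.5 * (lut.size() - 1);  // [-1, 1] -> table space
  if (t <= 0.0) return lut.front();
  if (t >= lut.size() - 1.0) return lut.back();
  uint32_t i = static_cast<uint32_t>(t);
  double frac = t - i;
  return lut[i] + frac * (lut[i + 1] - lut[i]);
}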

 


Faster Math Functions


Download Presentation part 1


Download Presentation part 2

I was trying to interpolate a scaling function on an SPU on PS2 and realized that the device has no math library. How do you write powf() without access to libm? The IEEE754 standard for floating point numbers only guarantees correct results for a handful of fundamental operations – addition, subtraction, multiplication, division and square root. It explicitly punts on offering any accuracy or correctness guarantees for transcendental functions, leaving accuracy up to the library implementation. It’s amazing that floating point beyond simple expressions ever works.

This began a deep dive into how the transcendental functions are derived and coded. Source code for an implementation of these functions is available in the Cephes math library, and the source comments have a wealth of pragmatic experience embedded in them:

/*							asinf.c
 *	Inverse circular sine
 *
 * SYNOPSIS:
 *
 * float x, y, asinf();
 *
 * y = asinf( x );
 *
 * DESCRIPTION:
 *
 * Returns radian angle between -pi/2 and +pi/2 whose sine is x.
 *
 * A polynomial of the form x + x**3 P(x**2)
 * is used for |x| in the interval [0, 0.5].  If |x| > 0.5 it is
 * transformed by the identity
 *
 *    asin(x) = pi/2 - 2 asin( sqrt( (1-x)/2 ) ).
 *

The code mentions “Cody & Waite” in many places, referring to the book Cody, W.J. & W. Waite, “Software Manual for the Elementary Functions”, Prentice Hall, 1980, ISBN 0138220646 – a title so out of print that many people work from their own photocopied version, copies of copies handed down between researchers. (To get a feel for what hand-typeset technical documents from the 1980s read like, take a look at C.G. van der Laan, “Calculation of Special Functions”, 1986.)

A more recent book, Muller, Jean-Michel, “Elementary Functions: Algorithms and Implementation (2nd Edition)”, Birkhauser, 2005, ISBN-13: 978-0817643720, covers range reduction, minimax polynomials and especially CORDIC implementations in intense detail.

The Saga of IEEE754

Cody & Waite was written in the days before IEEE754 and had to take into consideration the many accuracy and representation issues of the competing hardware architectures of the time. Those were the days of the numerical wild west where anything went and everyone played fast and loose; every mainframe and minicomputer maker – DEC, IBM, Cray, etc. – had their own format with varying levels of efficiency and accuracy.

The dramatic story of how IEEE754 was created is a story of David vs. Goliath. Representatives of the nascent microprocessor industry decided to get together to create one true floating point specification that they could all use to interchange data. The big iron makers were invited but none of them decided microchips were worth bothering with. Intel, under Dr. John Palmer, had ideas for an IC that would be fast, but for patent reasons he couldn’t share with the committee how he was able to implement floating point exceptions as a pipeline bubble; the other manufacturers didn’t believe the standard could be implemented in 40,000 gates and so proposed a different standard without the gradual underflow (a.k.a. denormalized floats) method. There was a standoff between Digital Equipment’s VAX format, which had worked well in the minicomputer world for a decade, and the Kahan-Coonen-Stone (K-C-S) proposal from Intel. It came to a head:

“DEC tried to break the impasse by commissioning Univ. of Maryland Prof. G.W. (Pete) Stewart III, a highly respected error-analyst, to assess the value of Gradual Underflow. They must have expected him to corroborate their claim that it was a bad idea. At a p754 meeting in 1981 in Boston Pete delivered his report verbally: on balance, he thought Gradual Underflow was the right thing to do. ( DEC has not yet released his written report so far as I know.) This substantial setback on their home turf discouraged DEC from continuing to fight K-C-S in meetings of p754.”

Table Based Methods

I included a section on table-based approximation of sine and cosine because from the earliest days of 8-bit machines I always saw people squeeze in 256- or 512-entry sine and cosine tables (back when you had 32KB of memory, you cared about every byte). This became a non-issue as memories expanded, but SPU, GPU and embedded processors brought space concerns back to the forefront. If you analyze the maximal error of an interpolated sine table you can show that a 33-entry table is sufficient to reconstruct a 32-bit floating point sine and cosine at the same time. If only we’d known back in the day.
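
As a concrete sketch of the quarter-wave idea (my own illustration, not the code from the talk): a single 33-entry table over [0, pi/2] serves both sine and cosine, since cos(x) = sin(pi/2 - x). Plain linear interpolation is used here for simplicity; the error analysis is what tells you how far that actually gets you.

// One quarter-wave table over [0, pi/2] serving both sine and cosine.
// Names, table layout and linear interpolation are illustrative only.
#include <cmath>

static const int   kTableSize = 33;   // 32 segments plus the end point
static const float kHalfPi    = 1.5707963267948966f;
static float s_sinTable[kTableSize];

void init_sin_table()
{
  for (int i = 0; i < kTableSize; ++i)
    s_sinTable[i] = std::sin(i * kHalfPi / (kTableSize - 1));
}

// sine for x in [0, pi/2] by linear interpolation into the table
float table_sin(float x)
{
  float t = x * (kTableSize - 1) / kHalfPi;
  if (t <= 0.0f) return s_sinTable[0];
  if (t >= kTableSize - 1) return s_sinTable[kTableSize - 1];
  int i = static_cast<int>(t);
  float frac = t - i;
  return s_sinTable[i] + frac * (s_sinTable[i + 1] - s_sinTable[i]);
}

// cosine falls out of the same table
float table_cos(float x)
{
  return table_sin(kHalfPi - x);
}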

Polynomial Approximation

On the web there are a lot of presentations on “advanced” computer math that end up using Taylor series to generate polynomial approximations for famous math functions. You’ve seen them in your calculus textbook, they’re the definitions for Exp, Tan and Sin that you learned at school, so clearly that must be how they are calculated on computers. Truncate the series, generate a remainder and voila, you’re golden. Wrong.

Reading through the Cephes library, the polynomials you encounter look similar to the ones a Taylor series generates, but the coefficients are subtly different – and the truncated Taylor versions turn out to be much less accurate. Why the difference, how do they make the magic happen? Tutorials on Function Approximation usually introduce the idea of using Chebyshev or Padé polynomials to approximate functions (yes, that paper steals my diagrams from SH Lighting Gritty Details), which get you closer to the values in Cephes but not quite. The secret sauce that is covered almost nowhere on the web is minimax polynomials.

The fundamental transcendental functions are mostly written using the same pattern:

  1. Range Reduction
  2. Polynomial Approximation
  3. Reconstruction
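
To make the pattern concrete, here is a toy sinf() sketch of my own: reduce the argument to [-pi/4, pi/4], evaluate a short polynomial, then pick the sign and the sin/cos form from the quadrant. The coefficients below are plain truncated-Taylor terms – exactly the thing a real implementation replaces with minimax coefficients – and the single-multiply range reduction ignores the precision problems of large arguments.

// Toy sine following the three steps above. Illustration only: real
// implementations use minimax coefficients and far more careful
// (Cody & Waite style) range reduction.
#include <cmath>

float sin_sketch(float x)
{
  // 1. range reduction: x = k*(pi/2) + r, with r in [-pi/4, pi/4]
  const float halfPi = 1.5707963267948966f;
  int   k = static_cast<int>(std::nearbyint(x / halfPi));
  float r = x - k * halfPi;

  // 2. polynomial approximation on the reduced range
  //    (truncated Taylor terms standing in for minimax coefficients)
  float r2 = r * r;
  float sin_r = r * (1.0f + r2 * (-1.0f/6.0f + r2 * (1.0f/120.0f)));
  float cos_r = 1.0f + r2 * (-0.5f + r2 * (1.0f/24.0f));

  // 3. reconstruction: pick sin/cos and a sign from the quadrant
  switch (k & 3) {
    case 0:  return  sin_r;
    case 1:  return  cos_r;
    case 2:  return -sin_r;
    default: return -cos_r;
  }
}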

Others, like acosf(), are implemented in terms of other functions using trig identities, so getting accurate versions of short sections of a function is paramount. Once you learn how to use Mathematica or Maple to generate low-order polynomials you can produce faster approximations of many functions used in graphics – and that was the goal of the GDC tutorial.

Tutorial slides and paper from GDC

The big bundle of research was turned into an all-day GDC tutorial with about 5 hours of slides and an accompanying paper. The paper was extended a little and appears as a chapter in DeLoura, M., “Best of Game Programming Gems”, Charles River Media, ISBN 1584505710.


Download the Paper

Researching the slides I uncovered an error in the PS2 firmware where an engineer had transcribed and truncated the coefficients for the ESIN and ECOS instruction polynomials, introducing a demonstrable shift in the results. Not only can you fix this, but it’s also possible to write a short piece of SPU code that calculates a lower-accuracy result faster than the ESIN instruction. It’s not every day you can beat a hardware instruction with code.

Errata: BitLog Improved

Looking to update the presentation for its second showing at GDC, I added some extra sections as an afterthought. One section dealt with whether the power function in specular reflection BRDFs could be separated into several simple powers – it turned out the worked example in the paper was the only value that worked as advertised, as it lay in a global error minimum. The second addition was a quick hacky logarithm function called BitLog. Charles Bloom made a post looking into it, eviscerating my write-up, tracking down the original author Jack Crenshaw and tearing him a new one. He then wrote up an improved version. Yup, couldn’t have been owned more thoroughly.

Footnote: Printing and reading floating point numbers

The subject of printing and parsing floating point values has a surprisingly convoluted history and is far from a solved problem even today. The seminal paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic” does cover that printing 9 significant digits of an FP number is sufficient to reconstruct the exact binary mantissa of a single-precision float (17 for double-precision), but it’s more difficult to prove what the shortest expression is that can correctly reconstruct a float from its ASCII representation.
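
A quick way to convince yourself of the 9-digit claim is to round-trip a float through text and compare bit patterns; a small sketch of mine:

// Print a float with 9 significant digits, parse it back and check that
// the bit pattern survives the round trip (ignoring NaN payloads).
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

bool roundtrips(float f)
{
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.9g", f);
  float g = std::strtof(buf, nullptr);
  uint32_t a, b;
  std::memcpy(&a, &f, sizeof(a));
  std::memcpy(&b, &g, sizeof(b));
  return a == b;
}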

The most used algorithm for printing values is Grisu3, which is fast but bails on a small number of values and has to drop back to the slower, complete Dragon4 algorithm. Grisu3 comes from the paper Florian Loitsch presented at the 2010 PLDI conference, “Printing Floating Point Numbers Quickly and Accurately With Integers”, which uses an integer-only internal representation to speed up the process but bails out on about 0.6% of all FP values. In 2016, a new algorithm called Errol was proposed in the paper “Printing Floating Point Numbers: A Faster, Always Correct Method” that is faster than Dragon4 – only 2.5x slower than Grisu3 – but has the advantage of being complete.

Reading FP numbers back in is another problem: doing it correctly requires several of the IEEE754 rounding modes. Its use of round-to-half-even is one of the strongest arguments for correctly supporting the rounding modes in any language. You can test the accuracy of your scanf and printf implementations using code from Stephen Moshier.


Steering Behaviors – Siggraph 2000

Based on the work I did at Bullfrog Productions for Dungeon Keeper 2 and Theme Park World on flocking crowd movement in user-editable worlds, I was asked by Craig Reynolds to join a group that would be presenting a Siggraph tutorial course on game development tech. The concept of flocking scored Craig a Scientific and Engineering Award at the 70th Academy Awards in 1998, and I’m pretty sure he used this as leverage to score a whole-day tutorial.

I had used his page on steering behaviors and had implemented much of it, so I wrote up my results, made a live Win32 demo, added a few new behaviors and documented the gotchas for flocking in non-static worlds. At the time there was practically no gamedev representation at the academic and production-heavy Siggraph, so we were pretty much the advance guard.

The ideas seem a little quaint now, especially the lack of any coherent method of blending between steering behaviors over time. Much of the actual production code for pathfinding was in 16-bit fixed-point using incremental Voronoi triangulation, written by Ian Shaw and created for raw speed on the original Playstation (a version which Bullfrog never shipped).

Shortly after presenting at Siggraph 2000 I was hired by SCEA to work in the SF Bay Area with Craig. A nicer bloke you could not hope to meet, and walking the halls of Siggraph beside him was always an education – he knows everyone and it takes forever to get anywhere.


50+ Years after Shannon

Download the PDF

Inspired by Michael Unser’s IEEE paper Sampling – 50 Years After Shannon, I set out to explain the ideas and innovations to the gamedev audience. The resulting presentation ended up a bit too math-heavy for my liking and needs more worked examples, but it’s good motivation to learn more about the bi-orthogonality of bases.

If you learned the mathematical basis for Shannon Sampling Theory in college, the latest theory of Generalized Sampling replaces a few of the blocks in the classic Shannon diagram to guarantee perfect reconstruction for finite signals if you have control over both the analysis and synthesis functions.

In the real world you usually don’t have the ability to fine-tune reconstruction (e.g. the point-spread function of a TV image or the acoustics of a cellphone speaker), so a less severe goal called Consistent Reconstruction can be used, where one of the sampling or reconstruction filters is optimized so that feeding the output signal back through the system leads to the same result. To do this we add a filter into the Shannon diagram that we can optimize to give this property, which gets us close to perfect reconstruction for any sampled signal in real-world usage; for example, the 100Hz temporal upsampling in your digital TV works using this theory.

Hugues Hoppe and Diego Nehab also took a look into generalized sampling, and specifically some uses in computer graphics, available as Microsoft Research Technical Report MSR-TR-2011-16, Generalized Sampling in Computer Graphics.


Spherical Harmonic Lighting: The Gritty Details

I wrote the paper Spherical Harmonic Lighting: The Gritty Details before the more general term Precomputed Radiance Transfer (PRT) was coined, which explains the slightly strange title. In 2002 I was working for Sony Computer Entertainment America in Foster City, CA in the R&D group under Dominic Mallinson. I received a preprint from Peter-Pike Sloan of his Siggraph paper Fast, Arbitrary BRDF Shading for Low-Frequency Lighting Using Spherical Harmonics and made a bet with my manager that I could implement SH Lighting on PS2 before Siggraph came around that year. I started by verifying that we could do SH projection and demonstrate successive approximation by coding it up in Maple:

[Figure: successive SH approximations from the Maple worksheet]
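
In C++ rather than Maple, that verification looks something like the sketch below: Monte Carlo projection into bands 0 and 1 of the real SH basis, then reconstruction. The helper names are mine and only the standard constants for the first four real SH basis functions are assumed.

// Monte Carlo SH projection and reconstruction for bands 0 and 1 only.
// Sketch for illustration; helper names are not from the original code.
#include <cmath>
#include <cstdlib>

// first four real SH basis functions for a unit direction d = (x, y, z)
void evalSH4(const double d[3], double sh[4])
{
  sh[0] = 0.282095;          // Y(0, 0)
  sh[1] = 0.488603 * d[1];   // Y(1,-1)
  sh[2] = 0.488603 * d[2];   // Y(1, 0)
  sh[3] = 0.488603 * d[0];   // Y(1, 1)
}

// coeff[i] = (4*pi/N) * sum over N uniform directions of fn(d) * Y_i(d)
void projectSH4(double (*fn)(const double d[3]), int numSamples, double coeff[4])
{
  for (int i = 0; i < 4; ++i) coeff[i] = 0.0;
  int taken = 0;
  while (taken < numSamples) {
    // uniform direction: rejection-sample the unit ball, then normalize
    double d[3];
    for (int k = 0; k < 3; ++k) d[k] = 2.0 * rand() / RAND_MAX - 1.0;
    double len2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
    if (len2 > 1.0 || len2 < 1e-12) continue;
    double inv = 1.0 / sqrt(len2);
    for (int k = 0; k < 3; ++k) d[k] *= inv;

    double sh[4];
    evalSH4(d, sh);
    double v = fn(d);
    for (int i = 0; i < 4; ++i) coeff[i] += v * sh[i];
    ++taken;
  }
  const double sphereArea = 4.0 * 3.14159265358979323846;
  for (int i = 0; i < 4; ++i) coeff[i] *= sphereArea / numSamples;
}

// reconstruct the band-limited approximation in direction d
double reconstructSH4(const double coeff[4], const double d[3])
{
  double sh[4];
  evalSH4(d, sh);
  return coeff[0]*sh[0] + coeff[1]*sh[1] + coeff[2]*sh[2] + coeff[3]*sh[3];
}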

And once the SH projection code was verified I moved on to coding up an app to demo it on Win32 and PS2. The math was straightforward enough, but getting hold of watertight manifold models proved to be more of a challenge than I imagined. Most of the diagrams in the paper came from the realtime ray tracing engine that Gabor Nagy had built for his 3D editor Equinox 3D, and the models were provided through the endless patience of our in-house technical artist Care Michaud-Wideman, who provided an initial model of a Sofa with cushions, and later on a Tree (with double-sided leaves) and a chop-top convertible Audi. The torus scene was knocked up in just a few minutes by Gabor in Equinox 3D.

The sofa was the first object successfully lit using SH. You can see artifacts on the first two images where the regular grid in my simple ray tracer had rejected hits outside of the current voxel, leading to checkerboard shadows under the sofa and strangeness on vertices close to voxel boundaries. The SH vs. Point Lighting diagram was made after fixing this and was used to demonstrate progress and get the go-ahead to continue writing the GDC presentation.

We booked a proposal for a GDC talk, so we’d need a live demo, a presentation plus a paper to back up the slides. That year I had been invited back to present my Faster Math Functions tutorial for a second year, so it was an intense conference run-up. All I remember about the day was that the room at GDC was standing room only. Our time flew by in a blur as we had 60 minutes for 69 slides plus a live demo by Gabor, and at the end of it we received something like 4.8 out of 5 in the feedback reviews. Nothing tops good reviews from the grizzled, war-weary GDC attendees.


I also learned a lesson: never promise future work in a scientific paper, as immediately afterwards the CELL project came round and the SCEA R&D group went into silent lockdown for the next three and a half years, working on PSGL for the release of PS3. Including the infamous month where SCEA R&D was asked “What if we didn’t have a GPU in the next Playstation and instead gave you two CELL chips…”

Additional:

I forgot to add the numerous bugs in the paper, especially the SH rotation equations that were imperfectly transcribed from some terrible photocopies of photocopies of Ivanic’s original paper. The task of discerning between the subscripts m and n, or \ell and 1, plus testing whether the Ivanic method (derived from real values) or the Choi method (derived from complex values) was faster, fell to Don Williamson, graphics lead at Lionhead. His code is still available online:

https://bitbucket.org/dwilliamson/shrotation/src

and contains validated implementations of both methods, Ivanic being the winner. Fast approximate rotations have been proposed, but the ZYZ rotation decomposition still wins for low orders.

 
