Fast fast kd-tree building

Practical and theoretical implementation discussion.

Fast fast kd-tree building

Postby tbp » 09 Apr 2006, 16:05

I've been waving hands on that topic a tad too much to my taste lately, so i'm going to try to clarify what i really meant.

In his note Havran advises the use of double linked lists, i think that's overkill if careful. Then, like Wald, he proposes to work with boundaries (events in Wald terminology) as the basic unit; again i think that's not wise because such events/boundaries are always paired and we'd be losing that connection.

That leads us to a structure like...
Code: Select all
   struct lbox_t {
      struct boundary_t {
         float   pos;
         ptr_t   next;
      } dim[3][2];   // 3 axis, 2 sides.

      union flags_t { ... } flags;
      lbox_t *clone;
      uint_t triangle_id;
      char padding[4];
   };

... which happens to be 64 bytes (2 cache lines) wide on 32 bit platforms.

We have a triangle id to allow for 'perfect clips', a bit field (flags) for transient tagging and some padding - let's put aside the clone pointer for now -.

Now we have to decide how we're going to handle pointers from one boundary to the other, keeping in mind that our goal is to keep the whole lbox_t structure in sight. As our structure is now 64 bytes wide, if aligned we could safely deduce from, say, &dim[1][1].next the underlying lbox_t structure address; sadly that wouldn't work on 64 bit platforms unless we pad the whole thing to 128 bytes. Unacceptable.
Instead we're going to encode sideness into pointers, within the lowest bit, because boundaries are uniquely identified as the triplet lbox_t, axis, side; so lbox_t::boundary_t::ptr_t really is a 'lbox_t *' pointer and a side bit. Extracting either the side or a proper pointer only requires a logical and.

Fast scoring.

Code: Select all
for each axis
   open = 0; close = 0;
   for each linked boundaries
      open += local_open;   local_open = 0;
      close += local_close; local_close = 0;
      
      num_planars = 0;
      split = boundary.pos;
      
      // basically, scan ahead for all boundaries with a matching position.
      while (true) {
         (lbox, is_left) = boundary;
         if (lbox.is_planar(axis, split))
            ++num_planars;
         else {
            local_open += is_left ? 1 : 0;
            local_close += is_left ? 0 : 1;
         }
         
         if (boundary.next.pos != split)
            break;
         boundary = boundary.next;
         split = boundary.pos;
      }

      // completly on the left side.
      num_left   = close+local_close;
      // crossing, local_open(s) end up on the right side.
      num_cross   = open - num_left;
      // completly on the right side.
      num_right   = total_count - (num_left + num_cross + num_planars);
      
      <assign planars to either side>
      <update score>
   end
end

That's the gist of it, in bogus pseudo-code: we recognize 3 zones (< split, == split, > split) and try to minimize the number of score evaluations; storing the point where we transition from '== split' to '> split' allows for some shortcuts when building the left/right node further down the code (but also leads to ugly spaghetti code).

Split clipping & building the left/right lists.
Suppose we have a winning split, we now have to categorize our triangles and build left/right lists for 3 axis.
A naive approach would walk all those triangles 4 times (1 to categorize, 3 to build left/right sub-lists) or worse.
The fact is, if we're careful, we can categorize and build the left/right sub-lists for the winning axis at once (with no merging whatsoever); we just need another property for our boundary lists: a left side should always come before a right side for a given lbox_t. That require an adjusted sort phase and to maintain that order thereafter.

Code: Select all
// k = winning axis, ku/kv = other axis.
for each boundaries on the winning axis
   (is_left, lbox) = boundary;
   if (is_left) {
      // categorize this lbox_t for the given axis as: left/right cross/planar
      // if crossing, both cross and left tags are set.
      // update its flags accordingly.
      categorize(lbox, axis, winning_split);
      
      if (lbox.tag_cross) {
         lboxr = allocate();
         lbox.clone = lboxr;   // store a reference, we'll come back to it.
         <rewire lboxr as following lbox on axis ku and kv>
         lbox.dim[k][side_right].pos = split;
         lboxr.dim[k][side_left].pos = split;
         
         sublist_left.append(lbox, k, side_left);
         sublist_right.append(lboxr, k, side_left);
      }
      else if (lbox_tag_left)
         sublist_left.append(lbox, k, side_left);
      else
         sublist_right.append(lbox, k, side_left);         
   }
   else {
      if (lbox.tag_cross) {
         park_left_end.append(lbox, k, side_right);
         sublist_right.append(lbox->clone, k, side_right);
      }
      else if (lbox_tag_left)
         sublist_left.append(lbox, k, side_right);
      else
         sublist_right.append(lbox, k, side_right);         
   }
end

park_left_end.close();
sublist_left.append(park_left_end);
sublist_right.close();


A bit hairy, but no merging and we categorize only once.
It's then simpler for axis ku & kv, we just build sub-lists as we go according to tags set previously.

You'll notice i'm not updating/clipping those lbox proxy against voxels, first because i'm still tinkering with that code and then it doesn't introduce any new boundaries; it may only update values, in a known order (more precisely increasing a left boundary and decreasing a right one).

Voilà, i hope it's clear enough.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 10 Apr 2006, 14:13

OK, I have a ton of problems even understanding what you wrote. I will begin at the start of your post and quit at 1/3rd. :) Tons of quoting ahead:

i think that's not wise because such events/boundaries are always paired and we'd be losing that connection


You mean that each 'triangle starts here' is paired with a 'triangle ends here' event? Obviously that's true, but why would that be important? I personally keep track of an 'opposing side', so I can easily detect straddling. Is that also why you want to keep the connection between events?

We have a triangle id to allow for 'perfect clips'


Is that used to reclip the triangles in the next iteration?

Now we have to decide how we're going to handle pointers from one boundary to the other, keeping in mind that our goal is to keep the whole lbox_t structure in sight


Now you've lost me completely. Define 'keeping the lbox in sight' and explain why that would be a goal.

As our structure is now 64 bytes wide, if aligned we could safely deduce from, say, &dim[1][1].next the underlying lbox_t structure address


You mean you could deduce the next lbox_t by looking at the address of the first data member, in this case &dim[0][0]? Why [1][1]? Why .next? Why is there a problem on 64bit architectures? Ah I have the answer to that last question: Because the struct grows to 68bytes in that case. Why not make it a 32bit offset then?

OK I'll stop at this point. The rest is dependent on the first part.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 10 Apr 2006, 14:58

Glad you stopped by to point at the gibberish :)

First things first: why the hell do i insist on not losing the connection between events and the underlying triangle.

We are really trying to partition triangles, we track extremae on each axis for that but ultimately we're massaging a whole triangle all the time. If we keep related data together, avoiding unnecessary data duplication or indirection, that's a win both for locality and memory footprint. It also simplifies some phases.

Phantom wrote:You mean that each 'triangle starts here' is paired with a 'triangle ends here' event? Obviously that's true, but why would that be important? I personally keep track of an 'opposing side', so I can easily detect straddling. Is that also why you want to keep the connection between events?

It is important to keep both 'events' close because you always acces both: if they are distinct then you pay a) to store a reference to the other side b) you pay for a near random access pattern.

There's no redundancy in my structure: cross pointers are gone, sideness is explicit etc...
Compare that to an Havran-like Data3D taking (3+3+1+1) pointers + a side bit, let's round that to 32 bytes on 32 bit platforms; you need 2 of them per triangle plus a 24 bytes aabb (addressed by the data pointer), that's about 88 bytes scattered over 3 distinct memory locations.

When you're massaging 10M+ triangles it really makes a difference; i think i get a ~3x speedup that way.


Phantom wrote:
We have a triangle id to allow for 'perfect clips'

Is that used to reclip the triangles in the next iteration?

Yes, those events being just some transient proxy for the real thing (triangle).

Phantom wrote:
As our structure is now 64 bytes wide, if aligned we could safely deduce from, say, &dim[1][1].next the underlying lbox_t structure address


You mean you could deduce the next lbox_t by looking at the address of the first data member, in this case &dim[0][0]? Why [1][1]? Why .next? Why is there a problem on 64bit architectures? Ah I have the answer to that last question: Because the struct grows to 68bytes in that case. Why not make it a 32bit offset then?


I've eluded some gory details. My bad.
I have a single structure holding some triangle related data and 2 linked 'events' per axis. I also have an ordered list per axis with those events all linked together.
The problem is how exactly i link them all in a list.
If using something like (ammended Havran's structure) as the basic block
Code: Select all
struct lbox_t {
   struct Data3D {
     bool leftOrRight; // either left or right
     void *data; // the object data
     Data3D *next[3]; // on the x/y/z-axis the next boundary when sorted
   };
};

... then when walking the list, how could i figure out the address of the enclosing lbox_t structure out of a 'Data3D *' pointer? I need to figure this out because i have no other means to access the other side of that pair.
It happens that my lbox_t structure is 64 bytes long on 32 bit platforms, and if aligned then there wouldn't be any ambiguity: just mask some of the lower bits from that pointer and be done.
But that wouldn't work on 64 bit platforms because lbox_t would need to be padded to the next power of 2, 128 and there's too much waste.

An event is fully qualified with a triplet: lbox_t, axis, side. At any given stage we massage a given axis, so it's known. That means we only need to carry a lbox_t pointer and a side bit on top of that to always be able to figure out where we stand.

So my per axis ordered lists are in fact represented like that: { lbox_t *, side } -> { lbox_t *, side } -> { lbox_t *, side } etc...
I encode the side bit into the lowest bit of the pointer.
That's just one way to represent it.

Oh and i'm not forced to use single link lists, it's just that there's no need for double linked lists to begin with. That saves 3*2 pointers (and code to maintain them).

Is that clearer?


EDIT: You can't use offsets either, because you'll need to introduce new lbox_t when crossing a split plane and indexing would become a real pain.
Note: My scoring can be used with Havran-like boundaries, it doesn't really matter.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby tbp » 10 Apr 2006, 19:08

Phantom wrote:OK, I have a ton of problems even understanding what you wrote. I will begin at the start of your post and quit at 1/3rd. :) Tons of quoting ahead:

I've re-read my previous post, and i guess it's just as obscure as the first one. I'm trying again but this time precisely answering Jacco's.


Phantom wrote:
i think that's not wise because such events/boundaries are always paired and we'd be losing that connection


You mean that each 'triangle starts here' is paired with a 'triangle ends here' event? Obviously that's true, but why would that be important? I personally keep track of an 'opposing side', so I can easily detect straddling. Is that also why you want to keep the connection between events?

It's important not only because they are paired, but you also (mostly) access them both each time; if you store them one next to the other you enhance locality and you save an indirection. Win-win.

Phantom wrote:
We have a triangle id to allow for 'perfect clips'


Is that used to reclip the triangles in the next iteration?

Yes.

Phantom wrote:
Now we have to decide how we're going to handle pointers from one boundary to the other, keeping in mind that our goal is to keep the whole lbox_t structure in sight


Now you've lost me completely. Define 'keeping the lbox in sight' and explain why that would be a goal.

Because that's the only structure we work on and what gives us more local memory access. For each axis, a given lbox_t structure will be linked twice, once with a left side ('triangle starts here') once with a right side ('triangle ends here').

Phantom wrote:
As our structure is now 64 bytes wide, if aligned we could safely deduce from, say, &dim[1][1].next the underlying lbox_t structure address


You mean you could deduce the next lbox_t by looking at the address of the first data member, in this case &dim[0][0]? Why [1][1]? Why .next? Why is there a problem on 64bit architectures? Ah I have the answer to that last question: Because the struct grows to 68bytes in that case. Why not make it a 32bit offset then?

Yes and no. We're going to link each lbox_t twice on each axis (start/end events), one way or another.
We could use pointers to members, but we'd need at least a way to deduce the other boundary/side (remember we don't have any other_boundary pointer). If the underlying structure is aligned and has a power of 2 size, you can wipe some lsb from such a pointer an obtain a pointer to said underlying structure (and then access the other side); but that's not an option for 64 bit platforms. So i use pointers to the lbox_t structure + a side bit. As the axis is always known, we have the required triplet.

I can't use offsets because split plane clipping introduces new events/boundaries/lbox_t, indexing would be a pain.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby fpsunflower » 18 Apr 2006, 02:51

I think I understand the main points of your method. Main problem I see is the extension to 64bit archs. Do you have any plans for that ? Or is there a way to force memory allocators to work in the lower 4Gb - or always in the same "segment" - so you can safely chop the top bits? I doubt we'll get a real scene with enough triangles anytime soon anyway (alignement forces only 28 usable bits in the ptr_t - correct ?) Until someone decides to share the Boeing dataset ;)


lbox_t *clone; is only needed for overlapping triangles correct ? Could you not use a small hash table to keep track of those? In practice it would always be fairly small and that way you'd only pay for the extra bits when you needed them as opposed to once per tri. It messes up your alignment though =)


My compiler is relatively close in how it stores things (ie: tries to be minimal), except it only needs the boundary_t's (which are expressed as packed 64 bit integers). Since I don't use linked lists I use the extra bits of ptr_t to store type (open/close/planar) (2 bits), axis(2 bits) and primitive_id (28 bits). The whole thing can be sorted quite fast (and only once) using radix as you've figured out by now I guess.


The only other memory I need is an array to encode which triangles should go left/right/both at each step. which is just 2 bits per primitive. I don't think I can do it in one pass since I don't have access to the segments - unless I query the triangle itself which would be expensive (?). So I do a total of 3 passes over the splits: cost, categorizing, splitting. All axes are sorted into a single array.


So my total is 48 bytes+2bits per triangle. Of course the arrays grow a little as you go due to overlap, but so do the linked lists. And you can free the array of splits before recursing - so memory usage goes down nicely as your recursion moves to the right of the tree (in practice of course the garbage collection is stupid and hangs on to everything until the last minute).


The main drawback of my method is that it requires allocating new arrays at each step to store the result of the splitting pass. Although you're free to delete the old one right after you've filled up the 2 new ones, there's still that moment when you're using twice the memory that you should be. Clipping is also alot trickier to do - but I've let that aside for now since Wald and co report gains of only ~30% max from clipped kd trees.


Using extra bit-twidling and loosing some speed in how things are fetched/decoded I could actually store the split values only once and only pass around the 32 prim+flag bits which would save some more space. But then clipping becomes impossible.



I hope that made some sense.
fpsunflower
 
Posts: 193

Postby tbp » 18 Apr 2006, 04:24

fpsunflower wrote:I think I understand the main points of your method.

Yes you do. I was beginning to think i was on crack :)

fpsunflower wrote:Main problem I see is the extension to 64bit archs. Do you have any plans for that ?

Only that it's the other way around, my pointers are augmented with a side bit and that works as long as the underlying structure size is even. So as is, it's fine on 64bit archs. What i was ranting about was that it would have been simpler, and a tad faster, to go with the just-chop-some-lsb method; that method requires the use of power-of-2 size for the structure you point at.

So, in fact i'm padding my structure to 64 bytes (at least on 32bit archs) for another reason: cache lines (32 bytes on P4, 64 bytes on K8 ).

fpsunflower wrote:Or is there a way to force memory allocators to work in the lower 4Gb - or always in the same "segment" - so you can safely chop the top bits? I doubt we'll get a real scene with enough triangles anytime soon anyway (alignement forces only 28 usable bits in the ptr_t - correct ?) Until someone decides to share the Boeing dataset ;)

Yes it's possible, but that lower-memory space is a bit cramped :)
Once various dll, your binary etc get mapped you don't have a contiguous 4G chunk anymore. Gcc tries to fit stuff there too, and then also tries to shrink pointers to 32 bits; there's an option to force that i think.

So, you're right. I could say i'm only going to address the lowest 4G, only use 32 bit pointers and go with the lsb-choping method on all archs.
But i've already coded the trickier alternative way (straight pointers + side bit) which doesn't have any addressing limitation and is only marginally slower: all that is really needed is a logical and to extract either the pointer or the side bit.


fpsunflower wrote:lbox_t *clone; is only needed for overlapping triangles correct ? Could you not use a small hash table to keep track of those? In practice it would always be fairly small and that way you'd only pay for the extra bits when you needed them as opposed to once per tri. It messes up your alignment though =)

Yep i only need it for split-clipped triangles, and then only because i try to do everything at once (categorizing + building the main axis list) with only singly linked lists.
The fact is i have bytes to burn because of the cache-line alignment, and that code is supposed to be fast; i mean all that trouble wouldn't be justified if i had to rely on more complex structures like a hash.
As i'm modifying that cache line anyway, i'm happy to put some of that padding to use for transient stuff if i can then take some shortcuts.


fpsunflower wrote:My compiler is relatively close in how it stores things (ie: tries to be minimal), except it only needs the boundary_t's (which are expressed as packed 64 bit integers). Since I don't use linked lists I use the extra bits of ptr_t to store type (open/close/planar) (2 bits), axis(2 bits) and primitive_id (28 bits). The whole thing can be sorted quite fast (and only once) using radix as you've figured out by now I guess.
The only other memory I need is an array to encode which triangles should go left/right/both at each step. which is just 2 bits per primitive.

I don't think I can do it in one pass since I don't have access to the segments - unless I query the triangle itself which would be expensive (?). So I do a total of 3 passes over the splits: cost, categorizing, splitting. All axes are sorted into a single array.


So my total is 48 bytes+2bits per triangle. Of course the arrays grow a little as you go due to overlap, but so do the linked lists. And you can free the array of splits before recursing - so memory usage goes down nicely as your recursion moves to the right of the tree (in practice of course the garbage collection is stupid and hangs on to everything until the last minute).

I also use tags when categorizing triangles (under the flags_t union), so, in fact we use pretty much the same representation, modulo the linked-lists/arrays choice; on 32bit archs i could shrink my structure quite a bit, but then not enough to not pad it to a cache line.

I cream up one pass because i categorize and split one axis at once, but that's a linear cost anyway.

Now for GC, i've thought a bit about it. But then i lump together all the initial lbox_t in one big chunk, so atm garbage collecting those isn't an option; that would be possible for those allocated later on (due to clipping), but i haven't bothered for sure :)

All in all, we're tuned.

fpsunflower wrote:The main drawback of my method is that it requires allocating new arrays at each step to store the result of the splitting pass. Although you're free to delete the old one right after you've filled up the 2 new ones, there's still that moment when you're using twice the memory that you should be. Clipping is also alot trickier to do - but I've let that aside for now since Wald and co report gains of only ~30% max from clipped kd trees.

Ah, i've missed that 30% figure. But yes, it seems to reasonably match what i've experienced with my older compiler.
Now if i were to build a non-clipping compiler, i'd use a much straighter approach. I mean either you bother coding a semi-slow 'correct' (read clipping) compiler or you don't and start cutting corners, toxie proved it's effective. On the other hand, while a pain to code, clipping doesn't slow down an already not-that-fast compiler :)


fpsunflower wrote:I hope that made some sense.

Oh yes it did. Funny to see how people can converge to, more or less, the same solution on their own.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 18 Apr 2006, 12:12

I think I'm finally starting to see what you're doing. Holy crap, that's some data structure. :)

Allow me to describe it in my own words, so you can verify that I got it right this time, and so subsequent readers have less problems grasping it:

I'm building from the ground up here.
- For a single axis, one could simply build a linked list of events, with three members: float pos, int flags, and a next pointer. 'Flags' contains either 'tri start', 'planar' or 'tri end'. This list could easily be sorted, with pos as the primary key, and flags as secondary so that 'tri end' events happen before 'planar' and 'tri start'.
- Stepping up to "Havran's addendum style": The three axii are now stored in a single structure. For each event, we need three next pointers, one for each axis, so that events can now be linked differently for each axis. The event position is thus also an array of three floats, one for each axis.
- TBP style: This time we pair events. We store a full bounding box (that's 6 floats), and also 6 next pointers. The first three pointers are used when this event is interpreted as a 'tri start' event, the other three if this event is a 'tri end' event. A polygon that is planar on one axis, has two equal pointers for that axis.

Building the list is done as follows:
- First, for each triangle, an event structure is allocated and initialized:
* Each of the three 'tri start' next pointers is initialized to point to 'this';
* Each of the three 'tri end' next pointers is initialized to point to the next event that is allocated.
* If the polygon is planar for a given axis, the 'tri end' pointer is copied to the 'tri start' pointer for that axis.
* The 6 extremae are set to the extends of the bounding box of the triangle.

At this point, a huge trick is introduced. We can't just let the next pointers point to a full structure instance, since we wouldn't know which side of that structure we want to use. Allow me to clarify that: A 'tri start' event could point to the opposing 'tri end' event of the same primitive, but it could also point to a 'tri start' event of another primitive that overlaps the first primitive. There's no way to know which side we intended to point to. So, we point directly to the 'next' pointer for the desired axis and side.

This introduces a new problem: How do we determine the address of the structure that this pointer belongs to? Assuming the structure is aligned to an address that is a multiple of two, we can simply 'and' out some bits to get the structure base address. The structure proposed by tbp is 64bytes (after padding) on 32bit machines, but it's larger on 64bit machines, so that's a problem.

An alternative is to encode the side the next pointer is pointing to in the pointer, without changing it's address. So, the pointer (AND 0xFFFFFFFE) points to the start of the structure instance, and the lowest bit (the bit that is masked away) determines wether we want the left side or right side next pointer.

At this point, we can treat this list as three separate lists (one for each axis) by following next pointers; only oddity is that each allocated entry basically represents TWO items in the linked list. I.e., if there's no polygon overlap for a particular axis, we would have to follow the next pointer TWICE to get to the next primitive.

- Next, for each axis, this unsorted list is sorted:
* The list can be treated as a normal linked list, so standard merge sort for linked lists should work. Better sorting algorithms exist (especially radix sort).

The cool part so far is that no pointers are needed to get from one side of a triangle to another side, which is useful in many cases. Also, memory usage is very optimal.

Before I dive into scoring and sublist scoring I would first like to hear 'affirmative' on the above outline.
If the above is correct, I can see why you had problems explaining it. And I also believe you should really write a paper on this approach, as it's vastly different from Wald/Havran's approach and also clearly a lot better, both in speed and in memory usage. I can help you writing it if you wish.
Last edited by Phantom on 18 Apr 2006, 12:15, edited 1 time in total.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby Phantom » 18 Apr 2006, 12:14

@fpsunflower:

If my interpretation is correct (see above), I don't see how you could apply the same idea to a structure that has no linked lists?
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby fpsunflower » 18 Apr 2006, 13:31

Phantom wrote:If my interpretation is correct (see above), I don't see how you could apply the same idea to a structure that has no linked lists?


I think what you said is correct (quick diagonal read - just woke up).

In mine I don't have lists so events aren't paired closely together at all. They are all sorted into one giant list (interleaved axes) as Wald's newest tech report suggests.

That saves me considerable amounts of pointer trickery over tbp's method. The main common point is just that the data is packed pretty tightly (8 bytes per split).

I'll post my code tonight if anyone is interested.
fpsunflower
 
Posts: 193

Postby Phantom » 18 Apr 2006, 13:38

Code is always nice. Is it an O(NlogN) implementation? Perhaps you should post it in a new thread, for easy reference.
BTW, if I grasped tbp's idea correctly, I think it's quite awesome. It's definitely a great way of reducing data and structuring memory access.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 18 Apr 2006, 13:49

Phantom wrote:I think I'm finally starting to see what you're doing. Holy crap, that's some data structure. :)
...
An alternative is to encode the side the next pointer is pointing to in the pointer, without changing it's address. So, the pointer (AND 0xFFFFFFFE) points to the start of the structure instance, and the lowest bit (the bit that is masked away) determines wether we want the left side or right side next pointer.


Bingo ;)
More precisely at any given moment you need the Holy Trinity { lbox_t, axis, side } to walk those lists, tinker with an event, etc... 2 of them come from the pointer itself while the axis is always known.

Phantom wrote:At this point, we can treat this list as three separate lists (one for each axis) by following next pointers; only oddity is that each allocated entry basically represents TWO items in the linked list. I.e., if there's no polygon overlap for a particular axis, we would have to follow the next pointer TWICE to get to the next primitive.


You lost me a bit there.
So let me re-state that while you can always easily access the other part of a paired event, because you always come from the underlying structure, when you look at a list for a given axis, a given triangle/lbox_t will appear twice; there will one entry/pointer for the left event then later on another for the right event (in that order, that's a requirement for the one pass categorization+split).


Phantom wrote:The cool part so far is that no pointers are needed to get from one side of a triangle to another side, which is useful in many cases. Also, memory usage is very optimal.

On top of that, the fact that you get a direct relationship between the triangle, the aabb proxy and its constituing 'events' exposes some simplifications; it would be really unpractical to only use singly linked lists otherwise. Unpractical and inneficient.

Phantom wrote:Before I dive into scoring and sublist scoring I would first like to hear 'affirmative' on the above outline.
If the above is correct, I can see why you had problems explaining it. And I also believe you should really write a paper on this approach, as it's vastly different from Wald/Havran's approach and also clearly a lot better, both in speed and in memory usage. I can help you writing it if you wish.

You got it. But it's not fundamentally different from what Havran has described and Wald rephrased: you sort once and upfront and then strive to maintain that order. My minimalist approach merely further exposes an inherent order.
As you all witnessed my pedagogical-fu is weak, the only thing i remotly do well is coding. So if you think that's worth an article beside those posts, i'd gladly participate; at this point an unexposed mind is required to drive people through the maze :)

edit: quit posting in my back, will you?
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby tbp » 18 Apr 2006, 13:58

fpsunflower wrote:In mine I don't have lists so events aren't paired closely together at all. They are all sorted into one giant list (interleaved axes) as Wald's newest tech report suggests.

That saves me considerable amounts of pointer trickery over tbp's method. The main common point is just that the data is packed pretty tightly (8 bytes per split).

It's tight, but ill suited for clipping. I don't see anything better than lists at this point, arrays are too expensive to massage.

Ok, i'm biased. I guess if there's only some casual clipping they're fine with the added value of simpler code to begin with.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 18 Apr 2006, 14:11

@tbp:

A couple of things I would like more info on:

- Clipping: I can't imagine any cases where you can't get away with simply adjusting the aabb for straddling tris (after cloning; the clone will obviously get a different aabb, or rather: The complimentary aabb).
- Sorting: How on earth did you use radix sort to sort this data nightmare?

And I still think we should do a paper. Your approach is significantly faster and better for the cache. It's also non-trivial to deduct from Havran's/Wal's approach. And, a 3x speedup (which comes from a better algorithm in this case, not just a more optimized implementation) is VERY significant considering the fact that SAH kd-trees are only *just* too slow to use on dynamic scenes. This approach makes the grid based approach insignificant. So I do think it deserves a paper. Besides, it would be really cool to have that paper floating around the internet. :)
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 18 Apr 2006, 14:36

Phantom wrote:- Clipping: I can't imagine any cases where you can't get away with simply adjusting the aabb for straddling tris (after cloning; the clone will obviously get a different aabb, or rather: The complimentary aabb).

Ambigous double negation in the sentence, but i'll answer by the affirmative. Yes, it's pretty trivial: first you cut with the plane on the winning axis (and get 2 complementary bits) and then you may have to shrink segments on the other axis. The only catch is that you only have #!?@!! singly linked lists to work with and it's clearly not the most efficient structure to do that.
I was tinkering with that aspect when i got diverted into radix sortage (that saved me from a cranial explosion).

Phantom wrote:- Sorting: How on earth did you use radix sort to sort this data nightmare?

There's some handy properties when you start, that is before additionnal clipped thingies get in the way. Basically you address initial lbox_t as an array because there's a 1:1 mapping between them and triangles. I hear you say: but you sort events! True, but the same trick that worked for pointers works for indicies.
So that gets us through the histogram production and first few passes.
Then comes the final pass where we produce a singly linked list; the trick is to recognize that somehow counters are really some kind of pointers into sub-lists. Merge/append them all and you're done.

Phantom wrote:And I still think we should do a paper. Your approach is significantly faster and better for the cache. It's also non-trivial to deduct from Havran's/Wal's approach. And, a 3x speedup (which comes from a better algorithm in this case, not just a more optimized implementation) is VERY significant considering the fact that SAH kd-trees are only *just* too slow to use on dynamic scenes. This approach makes the grid based approach insignificant. So I do think it deserves a paper. Besides, it would be really cool to have that paper floating around the internet. :)

True, it's not that trivial. I've sweated a bit. Cough. But it doesn't make grids or simplified (ie binning non-clipping) kd-tree compilers obsolete.

As much as think my crap is kinda fast for what it does, it's not that well suited to small problems. Admitedly that's at least partially because i implemented it with large sets in mind. In fact i wouldn't use anything else for large sets: it scales gracefully, handles tricky topologies with relative ease etc...
But then it takes like 100ms to massage 26k triangles. That's not *slow*. But that's not fast enough either.

There's still some evil details to iron out, but as i've already typed a lot about it (without much effect) i'd rather see it used by others now. If that takes paper, so be it.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby fpsunflower » 18 Apr 2006, 15:21

I agree with phantom. This would be perfect for a small paper at the upcoming rtrt-fest in utah. I might be able to provide a couple "production" scenes (~1-2 million polys) from the movie we're doing at work. Might not be able to post any pics, but it'll provide some interesting datapoints at least.

We should also run through Wald's animation test scenes and see how close we can get to his grid paper.

Oh and I guess there's a way of multi-threading the whole thing ? ;) (*sound of tbp's head exploding*)
fpsunflower
 
Posts: 193

Postby Phantom » 20 Apr 2006, 12:17

It's kinda quiet here all of a sudden, and I'm rather busy with the new game academy, but I have managed to do some coding nevertheless. Current status: First iteration of the NlogN compiler now produces the exact same results as the Nlog2N compiler, so it appears to work. List splitting is the next step, but that shouldn't be too hard, I hope.

Here are my data structures:

Code: Select all
struct ebox
{
   ebox* next( int a ) { return (ebox*)(n[a] & (0xffffffff - 3)); }
   void next( int a, ebox* p ) { n[a] = (n[a] & 3) + (unsigned long)p; }
   int side( int a ) { return n[a] & 3; }
   void side( int a, int s ) { n[a] = (n[a] & (0xffffffff - 3)) + s; }
   unsigned long n[3];
   float pos[3];
};
struct EBox
{
   enum { END, PLANAR, START };
   ebox side[2];      // 48
   Primitive* prim;   // 4
   int flags;         // 4
   int pad1, pad2;      // 8, total = 64
};


Naming is arbitrary, I like these short names, 'E' stands for event.
Note that I am not embedding the ebox thingy in the EBox struct, as I mainly use ebox'es during sorting and building. The Encapsulating EBox is only used when I need to access triangle data (which isn't happening so far, so perhaps I'll change some things later on).

Using these structs, everything is back to normal basically; only thing that has changed is that two eboxes are instantiated simultaneously by instantiating an EBox. And of course there is the pointer/sideness arithmetic, for which I added some get and set methods.

The list of eboxes is encapsulated by a EBoxList class:

Code: Select all
class EBoxList
{
public:
   void sort( int a );
   ebox* head[3], *tail[3];
};


Sorting of the list is done by Tatham's mergesort for linked lists:

Code: Select all
void sort( int a ) // mergesort, by S. Tatham
{
   ebox* p, *q, *e;
   int insize = 1, nmerges, psize, qsize, i;
   while (1)
   {
      p = head[a]; head[a] = tail[a] = 0; nmerges = 0; 
      while (p)
      {
         nmerges++, q = p, psize = 0;
         for ( i = 0; i < insize; i++ ) { psize++; q = q->next(a); if (!q) break; }
         qsize = insize;
         while (psize > 0 || (qsize > 0 && q))
         {
            if (psize == 0) { e = q; q = q->next(a); qsize--; }
            else if (qsize == 0 || !q) { e = p; p = p->next(a); psize--; } else
            {
               float v1 = p->pos[a], v2 = q->pos[a];
               if ((v1 < v2) || ((v1 == v2) && (p->side(a) < q->side(a)))) { e = p; p = p->next(a); psize--; }
               else { e = q; q = q->next(a); qsize--; }
            }
            if (tail[a]) tail[a]->next(a, e); else head[a] = e;
            tail[a] = e;
         }
         p = q;
      }
      tail[a]->next( a, 0 );
      if (nmerges <= 1) break;
      insize *= 2;
   }
}


Radix sort might be better, as mergesort is NlogN and radix approaches O(N).

More code later on.
As you can see, the above code hides the intricacies of tbp's weird structures, while maintaining it's advantages. Basically the problem is reduced to normal linked list sorting, scoring and splitting.

- Jacco.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 20 Apr 2006, 20:35

You're an heretic. The One True Nomenclature is where you postfix anything looking remotly like a type with '_t', à la C99.

Not much to comment, so i'll digress into nitpicking. There 2, exclusive, sides so that's one bit not two; if you decide to also pack in a planar flag then that's 2 bits total. Also you can only hide so much details with syntactic sugar: in that pointer jungle there's lots of potential aliasing and/or undue reloads. Thusly the 'restrict' idiom is useful, and also a method - let's call it 'extract' - with a prototype like lbox_t *extract(ptr_t p, sideness_t &side) that produces both the pointer & side bit at once.

You directly cast pointers to 'unsigned long', that's a bit brutal as C++ comes with ptrdiff_t & intptr_t to help with portability. What is more worrysome is your use of constants: 0xffffffff won't fly on a 64bit arch, on the other hand the shorter equivalent ~0 when properly casted will behave (or directly ~3 in your case).
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 23 Apr 2006, 11:31

I got working output from my NlogN compiler, but it gets stuck sometimes (iterates to max tree depth on scene6), causing severe artifacts, probably because it sometimes (often?) puts far more than the max of 31 prims in a leaf. Did you encounter the same problem? Could it because of the clipping?

I'm going to implement full clipping first, to see if this is the problem. What I intend to do is this: I will remove data for straddling prims from the lists, then clip the prims and reinsert the events. That should give the same output as the nlog2n compiler. If I still get problems, I must have done something wrong... Problem is, the first couple of splits appear to be identical to the output of the nlog2n compiler, so the basis seems to be right.

And another problem: The mergesort is painfully slow, but I don't yet see how radix sort can be implemented on top of the current data structures... My understanding of radix sort is lacking though, so perhaps I should just read up on that first. Right now, quicksort is vastly outperforming the linked list merge sort. Perhaps I should use an intermediate data structure for sorting, then convert it to linked lists afterwards to gain some performance... Tree builds for scene6 are now 'sub-second' even in debug mode on a slow machine, so that looks promising.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 24 Apr 2006, 03:01

With partial clipping, that is with cloning and clamping to the split plane on the winning axis but no 'fixing' on the other 2, on scene6 i get:
compiler_t::compile(): 4 ms
max_lvl 22
num_nodes 3083
num_max_ids 12 (3.98 avg/leaf)
num_leaves 1542
num_leaves_empty 364 (23.6 %)
num_space_cuts 0
num_ids 4683
... and that seems right; of course it would be better with perfect-clipping.

If you do no clipping, or only partially, you have to be more careful in your scoring otherwise it's easy to get stuck in useless cycles where you always produce the same split. It sounds like that's what you're experiencing.

I'd suggest explicitely checking split candidates that shouldn't get scored: splits should observe cell.min < split < cell.max and that's not warranted if you don't clip. More precisely skip scoring when split <= cell.min and consider the scan for the axis being done if split >= cell.max. That should fix most of your problems.

Some references for radix sort:
http://en.wikipedia.org/wiki/Radix_sort
Skip Tardiman's contrived article and code and jump directly into http://www.stereopsis.com/radix.html
That's much clearer and faster.

Like i've said somewhere else, i've benched various versions and checked the effect of the number of passes. On my machine that gives: 3, 2, 4 when sorting a simple float array.
Now that doesn't address the fact that we need to produce singly linked lists as the final output. While going for 3 passes is the fastest option (because it fits caches, i guess) it lacks some useful symetry that reduces the requirement in auxilary storage. So i've gone with 2 passes and only need to temporally hold N indices.
That nice for larger sets but certainly have a negative impact on small scenes (heck i have more counters than there's triangles in toxie's q2 scene).
I'll post code for the final pass later on if you will.

EDIT: forgot to say there's a simple method to really see what's happening with those crowded nodes, export all those triangles + cell for visualization (i export a triangulated box + triangles to .obj); then it's pretty clear when you're having clipping issues
Last edited by tbp on 24 Apr 2006, 04:39, edited 1 time in total.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby tbp » 24 Apr 2006, 03:22

fpsunflower wrote:I might be able to provide a couple "production" scenes (~1-2 million polys) from the movie we're doing at work. Might not be able to post any pics, but it'll provide some interesting datapoints at least.

Gimme Gimme Gimme!
Is that pristine modelled geometry (as opposed to something algorithmically generated)?

fpsunflower wrote:We should also run through Wald's animation test scenes and see how close we can get to his grid paper.

Ah, right. Need to download all of them. I don't have much hope of getting anywhere near tho.

fpsunflower wrote:Oh and I guess there's a way of multi-threading the whole thing ? ;) (*sound of tbp's head exploding*)

Grr. Yeah it's multithreadable and tactics like those Carsten exposed or what we've discussed are in order.
I haven't checked, but i think i'll get some performance hit with my structure for the sorting phase because of cache synchronisation traffic (at least on NUMA), but that's it.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby tbp » 24 Apr 2006, 05:12

I've uploaded a scene known to annoy my frail older compiler which somehow gets stuck on it (or takes an innordinate amount of time, i have no patience) and produce crappy tree if handled without perfect clips: http://ompf.org/forum/viewtopic.php?t=95

381064 triangles
Code: Select all
compiler_t::compile(): 4717 ms
        max_lvl 39
        num_nodes 3201043
        num_max_ids 149 (2.98 avg/leaf)
        num_leaves 1600522
        num_leaves_empty 375962 (23.5 %)
        num_space_cuts 0
        num_ids 3648466
kdlib::writer_t::init: allocated 24.422M for nodes, 13.918M for ids.


Ugly. 39 levels to end up with 149 triangles in some leaves. Of course it performs like <censored>.
Image
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby fpsunflower » 24 Apr 2006, 05:41

tbp wrote:Is that pristine modelled geometry (as opposed to something algorithmically generated)?


Its all modelled (some parts better than others). But thats kinda how it goes in production. Background environments tend to be less carefully modelled since they are static and just need to "look good". Stuff that moves (characters) typically has clean(er) topology.

Hair is procedurally generated but its not rendered as triangles so not much sense in including that. Lots of ways to drive your kdtree crazy with those primitives though. Which opens up the question of triangle-only vs. several base primitives and how that should all work.



I'm afraid I can't post any scenes at this point. =( Maybe after the movie releases I could get legal clearance ? I will let you know. You could always come work for us ;)
fpsunflower
 
Posts: 193

Postby tbp » 24 Apr 2006, 06:18

Damn, 1M not-so-trivially-compressible triangles to play with, i'm already drooling.

Not being an artist i've never thought about hairs, cue in jokes about my hair-dresser being dead, but i can see how they could be a major pain to get through a kd-tree once triangulated and even then raytraced efficiently (and that's not even addressing shading). Seems like an interesting problem.



fpsunflower wrote:I'm afraid I can't post any scenes at this point. =( Maybe after the movie releases I could get legal clearance ? I will let you know. You could always come work for us ;)

I saw you wink, you dirty sadist.
Could you privately hand me a copy of such a hairless scene if i swear to turn my monitor off while tinkering with it? I also promise to gouge out the eye of anyone getting too close from the locked closet where i'll put the machine it will reside on.
Pretty please.

Also i'm no private detective as i'm still pretty clueless as to what is that mysterious company as of yet :)
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Uneducated question ...

Postby Shadow007 » 24 Apr 2006, 08:23

I'm a lurker here and don't know nothing about Raytracing apart from Phantom's articles, but I've got a lingering question : why should the KD-tree compiler output be a linked list ? why not an array ?

Thank U, and please feel free to ignore the question if it feels to annoying/trivial...
Shadow007
 

Postby tbp » 24 Apr 2006, 08:58

Hello lurker,

the output of a kd-tree compiler is a kd-tree :)
If you want to acheive compilation in n.log(n) time you have to sort your primitives once and then maintain that order throughout the rest of the process. As clipping against node's cells will introduce redundancy that means you'll face, one way or another, insertions. That's why lists are favored.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

KDTree output

Postby Shadow007 » 24 Apr 2006, 13:04

Yes, I should have made myself clearer : I had indeed at least understood that the KDtree compiler output would be a KDTree :) , What I now understand is that the Sorting is done as a "pre-processing step for the actual compilation. Have I got that right ?

But do the final KD-tree's leafs need to contain a linked-list of primitives, or is it just the way they are obtained after the compilation (and all the included insertions) ?
Would they not be better arranged in arrays for the actual ray-tracing ?


PS thanks for the quick and clear reply
Shadow007
 

Postby tbp » 24 Apr 2006, 14:05

Yes, that initial sort on the whole set can be viewed as preprocessing: if you sort at every step you'll fall in the n.log²(n) category. Now one could argue that building a kd-tree is nothing but sorting as a kd-tree is a structure allowing faster searches (in our case spatial queries) to begin with ;)


Your other point makes the - false - assumption that you'll be using the same tree used to build at run-time. In fact there's not many compelling reasons, in my opinion, to do that:
. memory required to store a kd-tree is some order of magnitude smaller than the data set it indexes, there would be no point in using such structure otherwise.
. translating/copying such tree is done in linear time.
. the data you need to build a tree have little in common with what you need to traverse it, plus given how tight is the final representation it would be cumbersome to say the least.
. also, the final representation might be quite funky regarding its memory layout.
. in fact while building you don't need to refer to earlier/later nodes, those could be just streamed out to disk.
. but then you also may want to run various passes on the tree once totally built (and might need cross references or auxillary data etc), ie to optimize away silly splits.

So, really, it doesn't matter how your tree is layed out while you build it as you'll cheaply lower the representation before rendering (8 bytes per node, adjacent childs etc).
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 25 Apr 2006, 07:13

If you do no clipping, or only partially, you have to be more careful in your scoring otherwise it's easy to get stuck in useless cycles where you always produce the same split. It sounds like that's what you're experiencing.

I'd suggest explicitely checking split candidates that shouldn't get scored: splits should observe cell.min < split < cell.max and that's not warranted if you don't clip. More precisely skip scoring when split <= cell.min and consider the scan for the axis being done if split >= cell.max. That should fix most of your problems.


I don't get it. How could cell.min > split? Ah I think I just saw the light: Clamping could lead to a negative side on an axis... I'll check for that. Thanks.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby Phantom » 25 Apr 2006, 22:19

Good news: I got a fully working nlogn compiler now with full clipping. Adding the clipping was a bit of a pain, by the way. Right now I have many passes, so I need to do a lot of optimizations, but that will have to wait till tomorrow.

List of passes:

(note: Each primitive is represented by a bounding box structure, like tbp suggested. The bounding box has a flag that denotes wether the primitive is completely to the left, completely to the right or straddling)

- First, there is the normal cost calculation. I do the usual walk of all three axii, nothing special.
- Next, I set all bounding boxes to 'straddling'.
- Then, I walk the list again to identify primitives that are completely left or right of the split plane. The remainder is 'straddling' (thanks to the previous pass). Planars on the split plane are added to left or right, depending on the cost calculation.
- Then, I walk the list again to split it to a left and right list. No clipping is performed; straddlers go to both lists (a clone of the bounding box is created to allow clipping in the next pass).
- Next pass: Clipping. For this I first walk the generated left list, then the generated right list. Left and right primitives are ignored, but straddlers are clipped, their associated events are removed and reinserted as they may have changed. I also test for new planars here.
- In a final pass, I store primitives in the newly created kd-tree nodes. This step will be dropped later on, as this needs to be done only for leafs, but right now it was easier to keep the old structures.

That's a total of 6 passes, of which 2 (cost calculation and list splitting) are performed for all three axii. At least one pass can easily be removed, but that still leaves me with 5. And I am ignoring the node deletion/insertion, which also involves walking the lists, thanks to the single linked list structure...

So there's plenty of work to do, but at least it gives correct results. I need to check if it gives the exact same results as the nlog2n compiler, but theoretically, everything is in place.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 26 Apr 2006, 02:33

It's quite entertaining to only have one way lists, eh? I hope you enjoy the ride :)

I wouldn't expect an exact match if i were you, for instance on the same arch i get a slight diff (a couple of nodes generally) between gcc and msvc binaries. That's better than what it used to be but i'm still a bit depressed, i thought i was careful enough; anyway i'm not going to require compilers to be stricter wrt optimizations.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 26 Apr 2006, 05:20

Well when 'different tree' means 'slower tree', I'm not that forgiving. The nlogn compiler is producing results within 2 or 3% of the nlog2n compiler (in terms of rays per second), so it might be a simple system hick-up. Now that I mention this, I just realized I ad an exceptional large set of background tasks last night.

I'll compare node counts today; if it's close enough I'll declare this compiler 'working' and then it will be optimized, at least slightly. I'm not striving for the fastest compiler, but I do want to be able to build larger sets in reasonable time. Besides, this algorithm hell is fun. ;)
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 26 Apr 2006, 06:39

I haven't formally done an optimizing pass on my compiler either, yet. I mean i've taken care not to write obviously slow code but that's it. From a cursory dissasemblage, it's a branch jungle out there; i think i can weed out some.
Now what i really ought to do is to profile it proper (and maybe try to collect some meaningful cache hit/miss data).
Hmm.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 26 Apr 2006, 13:12

OK, speed is back to normal now (was indeed caused by background processes). Stats for the nlog2n compiler (Toxie's scene6):

Code: Select all
Total node count: 12641
Leaf node count:  6321
Empty leaf count: 2300
Min leaf depth:   2
Max leaf depth:   52
Average depth:    20.540262


Sadly there's no timing in there. It's something like 1 or 2 seconds for the subdivision, but that's mainly caused by drawing splitplanes and updating the window. Stats for the nlogn compiler:

Code: Select all
Sorting took 0.000000 seconds
Subdividing took 0.047000 seconds
Creating triaccels took 0.000000 seconds
Assigning triaccels took 0.000000 seconds

Total node count: 12509
Leaf node count:  6255
Empty leaf count: 2243
Min leaf depth:   2
Max leaf depth:   52
Average depth:    20.501358


Must be something wrong with my sort code timing, but apart from that, things like node and leaf count are close enough to declare this compiler 'working'. Besides, I consider the compile time for scene6 (47ms for the current unoptimized, but fully clipping) compiler on my lowly 1.5Ghz P-M quite decent.

That's all, I'll report back when I have something that's worth showing. :)
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 26 Apr 2006, 13:59

Max leaf depth: 52 ???
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 26 Apr 2006, 14:02

Ehm yeah. :)

That's probably not good, but the point is, it is using the same scoring as the nlog2n compiler, and it's producing virtually the same result. So I'm happy. Improving the compiler means tweaking the scoring; all bookkeeping is working (and still rapidly improving at the moment).

I must say that an nlogn compiler requires a rather daunting amount of hardcore code btw.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 26 Apr 2006, 14:56

I think there's a bit more than just sub-optimalities, here is the result side by side from my partially clipping compiler vs your fully clipping one. There's some clipping in that scene (table/chair vs walls etc), so a perfect clipping compiler definitely should have an easier job.

Total node count: 12509 / 3083
Leaf node count: 6255 / 1542
Empty leaf count: 2243 (35.8%) / 364 (23.6 %)
Min leaf depth: 2
Max leaf depth: 52 / 22
Average depth: 20.501358

Other useful stats with no match,
num_max_ids 12 (3.98 avg/leaf) - that is max triangles in a leaf and average triangle/non-empty-leaf -

I mean what my compiler produce is not even remotly good by any standard.
So, something's still a bit fishy in your scoring (i'll put my bet again on useless split cycles of some sort).

Speaking of kd-tree quality metric, i think it's time to talk again about a way to trade our trees so we can bench via rendering while removing our respective renderer bias.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 27 Apr 2006, 10:23

I hit a pretty major snag with my compiler. For clipping, I need to move events around: The 'left' side of a triangle will shift to the right, and the 'right' side of a triangle will shift to the left. Initially, I implemented this as an event deletion/reinsertion, but this is way too slow, even for the fairy forest scene. So I did it slightly different now: First, I simply clip the primitives, and store the clipped extremae in the events, without moving them. When a primitive gets clipped away, I flag the events as 'INVALID'. This is all in O(N), obviously.

Next, I walk the list of events, and remove any INVALID events. This can only be done in a second pass, as clipping is done while walking one axis (always the x-axis in my case), but it invalidates events in random positions on the other axii. The second pass is O(N) though, and single-linked list disasters are prevented by keeping track of a 'next' pointer.

Finally, some events are in the wrong place, because of the clipping. This is fixed by resorting using mergesort. And that's the problem: While mergesort is pretty fast for lists with only a couple of unsorted nodes, it still makes my compiler nlog2n, strictly speaking...

I believe this problem is mainly introduced by the lack of a double linked list: Clipping would normally require mvoing extremae back and forth through the list, without deletion/insertion or resorting. With a single linked list, insertion is expensive.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby Phantom » 27 Apr 2006, 10:45

So, something's still a bit fishy in your scoring (i'll put my bet again on useless split cycles of some sort).


Here are stats for 'MAXPRIMSPERLEAF=2':

Code: Select all
Total node count: 7125
Leaf node count:  3563
Empty leaf count: 642
Min leaf depth:   2
Max leaf depth:   33
Average depth:    19.837215


Despite the fact that I still have twice the number of total nodes, and 11 extra levels, my average depth is in fact slightly LOWER than yours. So apparently, my compiler is messing up in a few cases, where it suddenly produces deep, but meaningless, trees. How many ray/tri intersections do you get on average for scene6? Mine is at 1.6, currently. I think that's a good indication of final tree quality (besides speed, but that depends on many more factors).
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 27 Apr 2006, 10:56

Phantom wrote:For clipping, I need to move events around: The 'left' side of a triangle will shift to the right, and the 'right' side of a triangle will shift to the left.
...
I believe this problem is mainly introduced by the lack of a double linked list: Clipping would normally require mvoing extremae back and forth through the list, without deletion/insertion or resorting. With a single linked list, insertion is expensive.


Of course using singly linked lists is a hinderance when you try to implement it all, but then like for the other aspects of the compiler the trick is to exploit inherent order.

Ascii art for an axis list where clipping happened with a to-be-clipped pair o(open)/c(close) and its derivatives; this order will always be observed
Code: Select all
.......|....>|....|<....|.....
       o     o'   c'    c


... with a corner where case o and o' and/or c and c' may be counfonded (that is stay at the same place).

There's many ways to do that with relative efficiency. Note that you can also accumulate those clipped segments that need to be moved around.
Last edited by tbp on 27 Apr 2006, 11:15, edited 1 time in total.
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby tbp » 27 Apr 2006, 11:03

Phantom wrote:Despite the fact that I still have twice the number of total nodes, and 11 extra levels, my average depth is in fact slightly LOWER than yours. So apparently, my compiler is messing up in a few cases, where it suddenly produces deep, but meaningless, trees. How many ray/tri intersections do you get on average for scene6? Mine is at 1.6, currently. I think that's a good indication of final tree quality (besides speed, but that depends on many more factors).


That looks better. Now i dunno how you can say "my average depth is in fact slightly LOWER than yours" because i don't compute such a stat ;)

Another thing is that i don't enforce anything like MAXPRIMSPERLEAF, i check for a minimum number of triangles (runtime tweakable and set at 1 by default) before trying to split but that's it. So i guess that's why you produce 11 more lvls because you insists on splitting stuff. But that's just a guess.
I can't measure anything at rendering time atm, as i'm fooling around with postprocessing and it's all borked :P
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 27 Apr 2006, 11:21

That looks better. Now i dunno how you can say "my average depth is in fact slightly LOWER than yours" because i don't compute such a stat


You mentioned an average depth in a previous post, and didn't put your own score next to it. I assumed your score was the same or very close.

The max depth of 33 I showed earlier today is for a 'MAXPRIMSPERLEAF=2' setting, so I don't consider that to be optimal. I'll check sunflowers cost calculation to see what can be improved.

this order will always be observed


Not true, I think. Suppose you have some planars at the split plane, and some straddling primitives. Now consider the left child list: The straddlers will stay in the same place. The right extremae of the straddlers will be moved to the left, but since they are 'closing events' (I call them 'span end'), they should be sorted BEFORE the planars, while they where AFTER the planars in the original sorting. Moving them to the left in a single linked list is non-trivial.

Same goes for the left extremae of straddlers in the right list, only this time they should be moved to the right in the linked list (but only if there are planars at the same position), which is obviously easier.

Note that you can also accumulate those clipped segments that need to be moved around


That's what I do now. In fact, I postpone both sorting and deletion. Deletion of events for primitives that get clipped away is also non-trivial with a single linked list, but when you do it in a second pass, it's easy. This does however require a second pass for all three axii, which is a pitty. Perhaps I should just leave them in place and ignore them in the other passes?
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 27 Apr 2006, 11:37

Phantom wrote:You mentioned an average depth in a previous post, and didn't put your own score next to it. I assumed your score was the same or very close.

My average figure is like i've said above "num_max_ids 12 (3.98 avg/leaf) - that is max triangles in a leaf and average triangle/non-empty-leaf - "



Phantom wrote:
this order will always be observed


Not true, I think. Suppose you have some planars at the split plane, and some straddling primitives. Now consider the left child list: The straddlers will stay in the same place. The right extremae of the straddlers will be moved to the left, but since they are 'closing events' (I call them 'span end'), they should be sorted BEFORE the planars, while they where AFTER the planars in the original sorting. Moving them to the left in a single linked list is non-trivial.

That figure is for the two non-winning axis, clipping for the winning axis is special cased, see posted code above.
For those 2 axis, planars etc don't matter at all.

EDIT: got to run, i'll post something summing all that up in a more formal way later :)
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby tbp » 27 Apr 2006, 14:31

When we start none of triangle's proxies, their aabb, are clipped as the root node by definition contains them all.
Therefore while recursing by splitting those voxels in 2, clipping can only happen against those split planes and nowhere else; on the split axis we'll obtain 2 complementary segments [min, split] & [split, max] for each straddling triangles. On top of that we know exactly where to put each side of those 2 segments: the min/max part won't move and by definition the end of the first segment will appear at the end of the left child node list and the opening of the 2nd segment will appear at the beginning of the right child node list.
Code: Select all
left node    split   right node
.......[.......]|[.......].....
       Ol     Cl Or      Cr


If we look at the other 2 axis, triangles straddling the split axis may also require some adjustments. But as we are basically cutting some piece of those triangles, those pieces cannot be 'larger' than the whole triangle (or than the piece of the triangle we got at this stage); therefore their aabb cannot 'grow'. Therefore segments (min/max on a given axis) of the aabb of those pieces must shrink or stay the same. Segregating the triangle population when we split a node in 2 childs is a stable operation maintaining the order within the resulting left and right population. Therefore for both childs,
Code: Select all
.......[....>[....]<....].....
       o     o'   c'    c

... must be observed.

QED?

EDIT: planars only matter on the split axis, they'll end up either at the end of the left node list or at the beginning of the right node list, it doesn't matter if they come before of after straddling triangles (because when we score we look for all primitives sharing the same position on an axis); on the other 2 axis we'll bin them to the proper side when we'll meet them, in an orderly fashion. Put another way, the only requirement regarding the order of 'events' sharing the same position is that an open always come before a close for a given primitive (otherwise my categorize & split pass doesn't work).
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 28 Apr 2006, 14:17

If we look at the other 2 axis, triangles straddling the split axis may also require some adjustments. But as we are basically cutting some piece of those triangles, those pieces cannot be 'larger' than the whole triangle (or than the piece of the triangle we got at this stage); therefore their aabb cannot 'grow'. Therefore segments (min/max on a given axis) of the aabb of those pieces must shrink or stay the same. Segregating the triangle population when we split a node in 2 childs is a stable operation maintaining the order within the resulting left and right population.


Have a look at the following situation:

Code: Select all
       \
        \ \
         \  \             ______
          \   \          |     /
           \    \        |    /
V-axis      \     \      |   /
      |\     \      \    |  /
^     | \     \_______\  | /
|     |__\               |/
o--> U-axis


In this case, for the U-axis, before splitting, the events are as follows:

Code: Select all
       \                           <== tri_start
        \ \
         \  \             ______   <== tri_start
          \   \          |     /
           \    \        |    /
V-axis      \     \      |   /     <== tri_start
      |\     \      \    |  /
^     | \     \_______\  | /       <== tri_end
|     |__\               |/        <== 2x tri_end
o--> U-axis


Now we split at U = x:

Code: Select all
       \     |                   
        \ \  |
         \  \|            ______ 
          \  |\          |     /
           \ |  \        |    /
V-axis      \|    \      |   /   
      |\     |      \    |  /
^     | \    |\_______\  | /     
|     |__\   |           |/     
o--> U-axis


This results in the following event list for the left child:

Code: Select all
       \     |      <== tri_start                   
        \ \  |
         \  \|
          \  |
           \ |
            \|      <== tri_end
      |\     |      <== tri_start
^     | \    |
|     |__\   |      <== tri_end
o-->


The problem here is that the tri_end for the large sloped triangle on axis u is positioned AFTER the tri_start event of the small tri on the left before clipping; after clipping it is positioned BEFORE that event. Sorting order is thus NOT maintained for axis ku/kv after clipping on axis k.

QED.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 28 Apr 2006, 14:34

Ah! But i've never claimed it was preserved for anything but the primitive we're tinkering with.
Obviously when we 'adjust' extrema on U/V we may change inter-primitive order, but that's not an issue at all. What i was looking into is: what can we say regarding this primitive we're clipping and its extrema positions/movements in the 1D space of the list on those axis.

EDIT: nice ascii art btw :P
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 28 Apr 2006, 14:41

Obviously when we 'adjust' extrema on U/V we may change inter-primitive order, but that's not an issue at all.


Why not? For the next axis, I will be evaluating that axis. If it's not sorted properly, I have a problem.

About the ASCII art: Took me ages, especially since my initial example wasn't good. Had to redraw two times. :)
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby tbp » 28 Apr 2006, 14:47

Phantom wrote:Why not? For the next axis, I will be evaluating that axis. If it's not sorted properly, I have a problem.

Sure it means we have work to do. But now we know we'll only have to nudge the open event a bit forward and the close event backward (and all that in a bounded sub-list), that's a lot easier than not knowing upfront especially when we only have a one way list to walk around.
It would be way easier with doubly linked lists, but that's only an option for the sloppiest ;)
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby Phantom » 28 Apr 2006, 20:53

That's easy to say if you have no clipping at all. :)
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby Phantom » 02 May 2006, 13:36

I'm struggling with those huge trees, can't find a solution. One thing I did notice is that my 'average prims per non-empty leaf' count is much lower than yours (tbp): You mentioned 3.98, I get 1.723. Not sure why, not sure if this means I get a better tree either. I do suspect that building a tree that's 3 times larger is a bad thing for performance (I mean, build time).

It's also worrying that I thought I had a straight forward SAH builder, while in reality my scores are so far off.

Here's my current scoring code, perhaps someone could take a quick peek to see if I'm doing something obvious wrong?

Code: Select all
// best cost calculation
float NA = a_Box.w() * a_Box.d() + a_Box.w() * a_Box.h() + a_Box.d() * a_Box.h();
float iNA = 1.0f / NA;
float bestcost = INTRCOST * a_Prims, bestpos = -1;
int i, a, bestaxis = -1, bestside = -1;
for ( a = 0; a < 3; a++ ) if ((a_Box.GetP2().cell[a] - a_Box.GetP1().cell[a]) > EPSILON)
{
   int TL = 0, TR = a_Prims, TP = 0;
   vector3 lsize = a_Box.GetP2() - a_Box.GetP1();
   vector3 rsize = lsize;
   ebox* el = elist->head[a];
   while (el)
   {
      float pos = el->pos[a];
      int pl = 0, pr = 0;
      while ((el) && (el->pos[a] == pos) && (el->side(a) == EBox::END)) { pl++; el = el->next(a); }
      while ((el) && (el->pos[a] == pos) && (el->side(a) == EBox::PLANAR)) { TP++; el = el->next(a); }
      while ((el) && (el->pos[a] == pos) && (el->side(a) == EBox::START)) { pr++; el = el->next(a); }
      TR -= (TP + pl);
      if ((pos >= a_Box.GetP1().cell[a]) && (pos <= a_Box.GetP2().cell[a]))
      {
         lsize.cell[a] = pos - a_Box.GetP1().cell[a];
         rsize.cell[a] = a_Box.GetP2().cell[a] - pos;
         float LA = lsize.x * lsize.y + lsize.x * lsize.z + lsize.y * lsize.z;
         float RA = rsize.x * rsize.y + rsize.x * rsize.z + rsize.y * rsize.z;
         float w = a_Box.GetP2().cell[a] - a_Box.GetP1().cell[a];
         float f1 = (pos - a_Box.GetP1().cell[a]) / w;
         float f2 = (a_Box.GetP2().cell[a] - pos) / w;
         float bonus1 = 1.0f, bonus2 = 1.0f;
         float minf = 0.1f + 0.01f * a_Depth;
         if (((TL == 0) || ((TR + TP) == 0)) && (f1 > minf) && (f2 > minf))
         {
            if (TL == 0) bonus2 = 0.7f + 0.3f * f2; else bonus2 = 0.7f + 0.3f * f1;
         }
         if ((((TL + TP) == 0) || (TR == 0)) && (f1 > minf) && (f2 > minf))
         {
            if (TR == 0) bonus1 = 0.7f + 0.3f * f1; else bonus1 = 0.7f + 0.3f * f2;
         }
         float cost1 = TRAVCOST + INTRCOST * bonus1 * iNA * (LA * (TL + TP) + RA * TR);
         float cost2 = TRAVCOST + INTRCOST * bonus2 * iNA * (LA * TL + RA * (TR + TP));
         if (cost1 < bestcost)
         {
            bestcost = cost1, bestpos = pos;
            bestside = 0, bestaxis = a;
         }
         if (cost2 < bestcost)
         {
            bestcost = cost2, bestpos = pos;
            bestside = 1, bestaxis = a;
         }
      }
      TL += (pr + TP);
      TP = 0;
   }
}


Some notes:
- Sorting is verified and correct; also end events come before planars come before start events.
- I'm always calculating two scores; one for 'planars go left', one for 'planars go right'.
- There's some fancy code in there that favours larger empty space cut-off over smaller ones.
- The only epsilon in there tests for near-flat cells.
- This code works, only problem is that it creates 3x larger trees than other people report. :(

About the process after scoring:
- After scoring, I check if there's a best candidate at all (by checking bestside for -1); if there isn't one, the current node is turned into a leaf.
- Next, I build the two lists. Nodes completely on one side go to the corresponding list, straddlers are clippedm reinserted and sorted.
- And then the normal recursion...

Since I get the same trees (well almost) as with my nlog2n compiler (which is radically different in the bookkeeping department) I have been primarily focussing on the scoring, but I'm really code blind by now.

Some suspicious observations:
- I get a ton of splits for leafs with e.g. 4 prims that get split to 3 prims for the left branch and 4 for the right branch. I.e., often primitive count doesn't drop after a split (e.g., [6] ==> [5][5]), or one of the childs even keep the full set of input primitives (e.g., [4] ==> [3][4]).
- The odd average prims per non-empty leaf count I mentioned earlier (I get 1.723, tbp 3.98).
- All stats are scaled by a factor 3 (more or less): Empty leafs, non empty leafs and total node count.
- Output tree quality is better than anything I've produced before. Tree size is really the only problem.
- I can simulate the 3.98 average by increasing my 'MAXPRIMSPERLEAF' to 3 or 4, that also gives me smaller trees (close to what tbp reports), but degraded performance.

So if you spot a problem, or can suggest things that I should check, please let me know. Otherwise I'll just ignore this problem and move on to fast tracing again. That would be very unsatisfying though.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby fpsunflower » 02 May 2006, 14:12

Your numbers actually seem to be closer to Wald's than ours I think. Probably due to the robust clipping. Given that your trees are of higher quality than what your previous compilers did, chances are you're actually on the right track. I suppose the kd-tree exchange could help validate this. I personally would be less concerned with the tree size and more with the end performance.

Some suggestions though:
* what are travcost/isectcost set to ? These might not be valid anymore. Did you try measuring these via linear regression as xela suggested ? Or just trying lowering isectcost.
* I set maxprimsper leaf to 0 (to be able to clip empty space inside single prim leaves) - thats only going to give you deeper trees though.
* Wald mentions having to do a final optimization pass to "undo" splits that don't actually help much. It might be worthwhile to undo those splits that produced (3/4 in L/R) (Actually it would be good to maybe dump the content of those leaves + where the split occured as geometry so you could look at it and see why its trying to do that)
* Testing for planars go left/right is a bit of a waste of work in my experience. The intel guys suggest always going to the smaller leaf (one of their heuristics in the MLRTA paper). You can double check by counting the number of times this heuristic doesn't agree with the raw cost optimization I guess.
fpsunflower
 
Posts: 193

Next

Return to General Development

Who is online

Users browsing this forum: Yahoo [Bot] and 2 guests