BIH / BkdTrees ... are they really what we want ?

Practical and theoretical implementation discussion.

BIH / BkdTrees ... are they really what we want ?

Postby Shadow007 » 26 Jun 2006, 14:44

I've read (not experienced) the two papers, and I ask myself a question :
These two data-structures help tremendously with the real-time structure construction, but what effect do they have on FULL raytracing with all these pretty shades/ recursive ray-traces ?

It seems to ma a trade-off between speed and quality, is it not ?

I know I'm in the General Development forum (not the Not Quite Realtime one), but is there some kind of measurement (a number of rays or something), that tells at which point it's better to build fast and trace "slowly", and at which point it would be better to build "slowly" and trace "fast" ?

Not wanting to disminish the work of people who are far wiser/experienced/better than me in this, but that's just a question ...
Shadow007
 
Posts: 351

Postby Phantom » 26 Jun 2006, 14:54

So far BIH gives me trees that render at 65% of the performance of kd/sah, so it's not that bad. Toxie on the other hand reports almost identical performance. We're looking into this difference. In the meantime, compiling Buddha in less than a second is awesome.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

That's right ...

Postby Shadow007 » 26 Jun 2006, 15:03

If the ray-trace perfs are equivalent ... I should'nt have posted :(
Shadow007
 
Posts: 351

Re: That's right ...

Postby fpsunflower » 27 Jun 2006, 05:19

Shadow007 wrote:If the ray-trace perfs are equivalent ... I should'nt have posted :(


:)

I report the exact same performance difference as Phantom (benchmarked across many scenes). The best case was around 71% slower, the worst around 20%. But most fell nicely into the 60-70% range. Note that these were all primary ray only tests. At this point I'm still suspecting the build method as the traversal is really almost the same as the kd-tree. toxie mentioned the BIH requires a more optimized ray/triangle test, but that advantage would theoretically carry over to kdtrees as well.


I think there's some value in testing against totally random ray distributions. They will negate most of the cache advantages of either structure so it should focus more on the geometric/scene partitioning qualities. Something like http://homepages.paradise.net.nz/nickamy/benchmark.html should do. Anyone?
fpsunflower
 
Posts: 193

Postby Phantom » 27 Jun 2006, 05:39

That would mess up packet traversal...

I am considering to try this BIH/SAH approach with incremental updates. Perhaps I could even turn a well-built kd/sah into a BIH?

On a related note: Would it be possible to (ab)use some bits of the mantissa of a floating point value to store some custom bits?
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby toxie » 27 Jun 2006, 06:14

I can only report my own results on the "random" ray topic: There is this Audi A6 in the BIH paper and this uses only single rays and pretty advanced shaders, so not much coherence there. Nevertheless the render speed is almost exactly the same as in our kD.
On the other hand, phantoms kD speed is a bit faster than ours, so maybe this is why he still experiences a 30% slower BIH traversal?!
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby fpsunflower » 27 Jun 2006, 06:19

Phantom wrote:That would mess up packet traversal...


The random sphere ray benchmark? Like I said, I'm mainly interested in testing the geometric properties as opposed to the lower level cache coherence/SIMD leveraging properties.


Phantom wrote:I am considering to try this BIH/SAH approach with incremental updates. Perhaps I could even turn a well-built kd/sah into a BIH?


Same here. Testing with a seperate build method should help narrow down the culprit of lower-than-toxie performance ;)

Phantom wrote:On a related note: Would it be possible to (ab)use some bits of the mantissa of a floating point value to store some custom bits?


I personally don't really like that kind of idea as it requires extra bit twiddling to keep the tests robust, but a full analysis of what happens during traversal probably would reveal that the LSB has little to no impact on anything.

But why would you not want to use bits from the offset field ? If you encode it as an offset rather than a raw pointer (which is what you need to do for 64 bit architectures anyway) then you can store at least one or two extra bits without impacting anything.
fpsunflower
 
Posts: 193

Postby fpsunflower » 27 Jun 2006, 06:34

toxie wrote:I can only report my own results on the "random" ray topic: There is this Audi A6 in the BIH paper and this uses only single rays and pretty advanced shaders, so not much coherence there. Nevertheless the render speed is almost exactly the same as in our kD.


Exactly what do you mean by "single rays" only ? It shoots 1 ray at each pass and you average the passes ? Thats kinda what I remember from the bp video, and it makes alot of sense caching wise, just making sure I understand correctly.

How about for shaders that are more brute force and shoot a whole bunch of rays from the same point all at once (lets say a naive ambient occlusion for example).

toxie wrote:On the other hand, phantoms kD speed is a bit faster than ours, so maybe this is why he still experiences a 30% slower BIH traversal?!


Is it exactly 30% faster by any chance? ;)

I'm seeing the same performance drop though. Surely my implementation isn't as fast (single rays only, Java, plenty of other problems - then again they would be the same for BIH, so the delta might still work out).

Are the kd tree performance numbers in the paper for a tree built using full clipping SAH ? or using the same heuristic / in place sorting as the BIH ?
fpsunflower
 
Posts: 193

Postby Phantom » 27 Jun 2006, 06:48

I'm sorry, but it can't be just my more efficient kd-tree traversal. First, Toxie optimized his traversal also (although I have 4x4 packets), but more importantly: Tree sizes and number of intersections don't match as far as I can see. Although for Buddha it all seems reasonable. Perhaps I'll whip up a 2x2 traversal today and post the traversal loop so you guys can compare.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby Phantom » 27 Jun 2006, 06:56

I just noticed (I mean, read it in Toxie's paper) that the'global heuristic' can also be applied to the construction of kd-trees. That might help to check where the problem lies. If possible, I'll change my kd-tree compiler to use this heuristic, just to see how the trees compare to sah.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby craouette » 27 Jun 2006, 08:37

Phantom wrote:On a related note: Would it be possible to (ab)use some bits of the mantissa of a floating point value to store some custom bits?


Yes it is. It is done in pbrt, have a look at the book (really good by the way) for more info.
craouette
 
Posts: 41
Location: luxembourg

Postby playmesumch00ns » 27 Jun 2006, 11:28

PBRT stores split-axis (I think) info in the lowest two bits of the mantissa of the split position.

This does cause artifacts in some cases - rays ignoring triangles when they should hit.
playmesumch00ns
 
Posts: 224

Postby Phantom » 27 Jun 2006, 12:11

Never mind, I wanted to store much more. The BVH/KD paper mentioned nodes with 4 floats, a split axis and a pointer to a child node. Besides the 4 floats I would thus need about 20 bits. I guess there's no reasonable way to store those nodes, without dedicated hardware.
--------------------------------------------------------------
Arauna - Game-oriented real-time ray tracing
http://igad.nhtv.nl/~bikker
Phantom
Overlord
 
Posts: 1188
Location: Houten, Netherlands

Postby toxie » 07 Jul 2006, 10:57

I think there's some value in testing against totally random ray distributions. They will negate most of the cache advantages of either structure so it should focus more on the geometric/scene partitioning qualities. Something like http://homepages.paradise.net.nz/nickamy/benchmark.html should do. Anyone?


Using single rays: kD: 1,44 Million rays/sec, BIH: 1,61 Million rays/sec
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby toxie » 07 Jul 2006, 11:01

I just noticed (I mean, read it in Toxie's paper) that the'global heuristic' can also be applied to the construction of kd-trees. That might help to check where the problem lies. If possible, I'll change my kd-tree compiler to use this heuristic, just to see how the trees compare to sah


Then you have at least to add some "cool" termination criterion to the heuristic, as otherwise it will split itself to death.
And btw: Only use vertices as splitting planes.
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby toxie » 07 Jul 2006, 11:04

Never mind, I wanted to store much more. The BVH/KD paper mentioned nodes with 4 floats, a split axis and a pointer to a child node. Besides the 4 floats I would thus need about 20 bits. I guess there's no reasonable way to store those nodes, without dedicated hardware.


This is possible. Just drop the lowest bits and increase/decrease (depends on the plane (left/right one)) the highest non-dropped bit. This should work as this increases the volume of the bounding interval/volume, so overall it shouldn't harm the tree too much and results in an error free traversal.
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby tbp » 07 Jul 2006, 12:16

toxie wrote:
I think there's some value in testing against totally random ray distributions. They will negate most of the cache advantages of either structure so it should focus more on the geometric/scene partitioning qualities. Something like http://homepages.paradise.net.nz/nickamy/benchmark.html should do. Anyone?


Using single rays: kD: 1,44 Million rays/sec, BIH: 1,61 Million rays/sec


Eh? Hardware?
Press any key to continue: _
radius | ompf | stuff | tokaspt
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby toxie » 07 Jul 2006, 13:33

P4 2.8 GHz (no HT used)
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby tbp » 07 Jul 2006, 13:43

Then i really can't say i'm impressed.
Press any key to continue: _
radius | ompf | stuff | tokaspt
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby toxie » 07 Jul 2006, 13:49

No one said that my code is the best optimized of the world. :)
I just provided it to see that it can compete with the kD for random rays.
What are you numbers for the bunny (kD)?
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby tbp » 07 Jul 2006, 13:56

"Thierry Berger-Perrin reports preliminary results of 1.67M rays/sec on a 2GHz Opteron, using bounded volume hierarchies (BVH)."

That was many moons ago with a fat BVH and my first try at SAH, if i remember correctly my kd was back then at more than 2M+/s but nick never posted that result.

Anyway, i'm kinda surprised that with hardware in the same league and a nimbler but somewhat similar structure you don't beat silly that crummy result of mine.
Press any key to continue: _
radius | ompf | stuff | tokaspt
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby toxie » 07 Jul 2006, 14:08

2M+ is impressive indeed! But isn't too unexpected for me. SAH was designed for raw traversal speed, so...
Also you have to understand that both my versions (kD and BIH) are NOT optimized to the max like yours & phantom's
are, as the ray tracer is designed to run in a production environment, so it has to function under all circumstances (=f*#ked up data, rays, etc.) and also has to avoid epsilons at any price.

Nevertheless: Newest number for the BIH (including the "trick" posted in the Links&Papers-Thread): 1,73 Mill.Rays/sec
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby tbp » 07 Jul 2006, 14:12

Err, i was comparing the BVH & BIH results because of similarities and the fact that you can't screw up as much on SAH with a BVH as with a kd-tree (again those results are like real old and i was quite clueless about the whole SAH deal).

Really i would have expected BIH to shine if only for it's tighter footprint compared to my bloated old-school bvh with no early exit.
Press any key to continue: _
radius | ompf | stuff | tokaspt
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby toxie » 07 Jul 2006, 14:19

Hmmm.. .. You're right. Cause using random rays the memory footprint should make a difference.
Especially as the bunny does not fit into L2-cache.

On the other hand, it might not matter at all for such a case, as the P4 loads 64 (actually 128) bytes (Athlon: 64b if i remember correctly)
at a time and if the part of the model is not in cache, this is becoming the bottleneck for everything. So it shouldn't matter if a fat BBox has to be loaded
or a slim kD/BIH node. Both should need equal numbers of cycles (memory transfer of 1(2) cachelines).
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby tbp » 07 Jul 2006, 14:26

Hmmk, that's kinda plausible... even if i think the cache-line size is 32 bytes for P4 - now i'm not quite sure it's true on all revisions; k7 & k8 always had a 64 bytes cache-line. So i waste a quarter of cache line loading an aabb on my k8 and you waste some loading your couple of planes.
But i still had to load more, and my bvh didn't partition at all :)

edit: Ok, seems that i'm a bit full of it and/or ignorant about whatever global warmer Intel produce nowdays; your cache (re)load argument holds.
Press any key to continue: _
radius | ompf | stuff | tokaspt
User avatar
tbp
Overlord
 
Posts: 1445
Location: France

Postby toxie » 07 Jul 2006, 14:48

32 bytes was PIII i think.
64 bytes is P4. But as there is some design-issue, it always loads TWO cachelines in sequel (so overall 128 bytes).
Pentium M is different i believe.
Better you leave here with your head still full of kitty cats and puppy dogs.
User avatar
toxie
Overlord
 
Posts: 1403
Location: Germany

Postby tbp » 07 Jul 2006, 14:55

I'm pretty sure at least some P4 had a 32byte cache line; anyway as
a) i never took note of the load-them-by-2 feature
b) all not-so recent P4 have gone with 64 bytes cache lines

... thusly my point was moot to say the least.

Add on top of that, given that K8 have a much better latency at cache refilling than P4, the handicap is clearly on you - even if that's a bit counter intuitive of a result ;)
Press any key to continue: _
radius | ompf | stuff | tokaspt
User avatar
tbp
Overlord
 
Posts: 1445
Location: France


Return to General Development

Who is online

Users browsing this forum: Yahoo [Bot] and 2 guests