TSC Meeting Notes 2023-03-23
Attendance:
Guests:
Ruslan Idrisov (Epic)
Joshua Miller
Mark Leone
Ed Slavin
Lutz Latta
Dhruv Govil
Eric Renaud-Houde
Omprakash Dhyade
Discussion:
Rus: EXR GPU Decompression
The Problem: EXR readers hand back raw planar data, which must be swizzled and have its padding removed before it can be used as a GPU texture. Normally you read into a CPU buffer, convert, then copy to the GPU.
Even uncompressed, this took about 300ms, with the swizzle taking most of the time.
For an 8K 16-bit RGB image, it takes about 250ms on the CPU and 50ms on the GPU
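For context, a minimal sketch of the CPU-side conversion being described, assuming planar half-float RGB in and a padded, interleaved RGBA16 texture out (function and parameter names are illustrative, not from the actual code):

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative sketch: EXR readers produce planar channel data (all R,
// then all G, then all B), but the GPU wants interleaved, padded texels.
void swizzlePlanarToRGBA16(
    const uint16_t* r, const uint16_t* g, const uint16_t* b, // planar inputs
    uint16_t*       dst,            // interleaved RGBA16 output
    size_t          width,
    size_t          height,
    size_t          rowPitchTexels) // padded row length required by the GPU
{
    for (size_t y = 0; y < height; ++y)
    {
        uint16_t* row = dst + y * rowPitchTexels * 4;
        for (size_t x = 0; x < width; ++x)
        {
            const size_t i = y * width + x;
            row[x * 4 + 0] = r[i];
            row[x * 4 + 1] = g[i];
            row[x * 4 + 2] = b[i];
            row[x * 4 + 3] = 0x3C00; // 1.0 in half-float, filler alpha
        }
    }
}
```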
Kimball: Are you using the C++ library or the new Core? With Core you can unpack in place in RAM. Does that affect these timings?
Rus: This was done before the C interface existed; we could get better results with it now.
Optimization 1.0:
Read into shared memory, accessible by GPU and CPU.
Eliminates the swizzling
Increases speed from 3fps -> 15fps, producing textures in 65ms
Remaining problems:
Significant GPU costs
EXRs have to be stored uncompressed
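A minimal sketch of the shared allocation behind Optimization 1.0, assuming a D3D12 upload heap that the CPU decodes into directly (names and error handling are simplified; this is illustrative, not the actual Unreal code):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Allocate a buffer in an upload heap: CPU-writable, GPU-readable.
// Decoding the EXR straight into the mapped pointer removes the
// separate swizzle/copy pass.
void* mapSharedUploadBuffer(ID3D12Device* device, UINT64 sizeBytes,
                            ComPtr<ID3D12Resource>& outBuffer)
{
    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_UPLOAD; // visible to both CPU and GPU

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = sizeBytes;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.Format           = DXGI_FORMAT_UNKNOWN;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    device->CreateCommittedResource(
        &heapProps, D3D12_HEAP_FLAG_NONE, &desc,
        D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
        IID_PPV_ARGS(&outBuffer));

    void* cpuPtr = nullptr;
    D3D12_RANGE noRead = {0, 0}; // the CPU will not read this buffer back
    outBuffer->Map(0, &noRead, &cpuPtr);
    return cpuPtr; // decode the EXR straight into this pointer
}
```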
Optimization 2.0:
Read into a staging buffer
Utilize DirectX 12
15fps texture read, reduced GPU impact
Only load the required tiles
Remaining problem:
Disk space and disk load: 30 seconds to load and 138GB of disk for the 8K 16-bit content
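A sketch of the staging-buffer copy in Optimization 2.0, assuming the pixel data already sits in an upload-heap buffer: record a GPU copy from the buffer into the target texture. Names are illustrative; a real loader would do this per tile so only the required tiles are copied.

```cpp
#include <d3d12.h>

void copyStagingToTexture(ID3D12GraphicsCommandList* cmdList,
                          ID3D12Resource* stagingBuffer, UINT64 srcOffset,
                          ID3D12Resource* texture, UINT width, UINT height)
{
    D3D12_TEXTURE_COPY_LOCATION src = {};
    src.pResource = stagingBuffer;
    src.Type      = D3D12_TEXTURE_COPY_TYPE_PLACED_FOOTPRINT;
    src.PlacedFootprint.Offset           = srcOffset;
    src.PlacedFootprint.Footprint.Format = DXGI_FORMAT_R16G16B16A16_FLOAT;
    src.PlacedFootprint.Footprint.Width  = width;
    src.PlacedFootprint.Footprint.Height = height;
    src.PlacedFootprint.Footprint.Depth  = 1;
    // Buffer-sourced rows must be 256-byte aligned (8 bytes per RGBA16 texel).
    src.PlacedFootprint.Footprint.RowPitch =
        (width * 8 + D3D12_TEXTURE_DATA_PITCH_ALIGNMENT - 1) &
        ~(D3D12_TEXTURE_DATA_PITCH_ALIGNMENT - 1);

    D3D12_TEXTURE_COPY_LOCATION dst = {};
    dst.pResource        = texture;
    dst.Type             = D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX;
    dst.SubresourceIndex = 0;

    cmdList->CopyTextureRegion(&dst, 0, 0, 0, &src, nullptr);
}
```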
Uncompressed EXR via DirectStorage:
DirectStorage: a collaboration between Microsoft and Nvidia.
In theory, skips RAM altogether: reads from storage directly into the GPU, and can do the decompression there
Less memory usage
No need for asynchronous copy
Needs to be read in large chunks for best performance
Compression method: GDeflate
Nvidia’s take on Deflate (LZ77)
nvCOMP library (CUDA)
DirectStorage library
The two implementations are not compatible with each other
Compression works out of the box
Created a custom file format
Data stored planarly
Compression ratio 1:1.5
Better framerate than uncompressed, with less RAM and less disk bandwidth
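A hedged sketch of what enqueueing one such read can look like with the DirectStorage API, assuming the queue, file handle, and destination buffer already exist (helper name and parameters are illustrative; offsets and sizes would come from the custom file's per-chunk headers):

```cpp
#include <dstorage.h>

// Read one compressed chunk straight from the file into a GPU buffer,
// with GDeflate decompression performed in flight.
void enqueueChunk(IDStorageQueue* queue, IDStorageFile* file,
                  ID3D12Resource* gpuBuffer,
                  UINT64 fileOffset, UINT32 compressedSize,
                  UINT32 uncompressedSize, UINT64 dstOffset)
{
    DSTORAGE_REQUEST request = {};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;

    request.Source.File.Source = file;
    request.Source.File.Offset = fileOffset;
    request.Source.File.Size   = compressedSize;

    request.UncompressedSize = uncompressedSize; // from the chunk header

    request.Destination.Buffer.Resource = gpuBuffer;
    request.Destination.Buffer.Offset   = dstOffset;
    request.Destination.Buffer.Size     = uncompressedSize;

    queue->EnqueueRequest(&request);
    // The caller batches many chunks, then calls queue->Submit() once;
    // DirectStorage performs best with large, batched reads.
}
```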
Benchmarks/File structure:
Files broken into chunks and tiles, each chunk compressed separately
Header for each compressed tile, with its uncompressed size
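The exact container layout wasn't shown; a hypothetical sketch of the structures this description implies (field names are guesses, not the real format):

```cpp
#include <cstdint>

// Each chunk holds a group of tiles, compressed separately with GDeflate.
// Each compressed tile carries a small header recording its uncompressed
// size, so the GPU knows how much output to produce before decoding.
struct TileHeader
{
    uint32_t compressedSize;   // bytes of GDeflate data that follow
    uint32_t uncompressedSize; // size after GPU decompression
};

struct ChunkRecord
{
    uint64_t fileOffset; // where this chunk's data starts in the file
    uint32_t tileCount;  // TileHeader + payload repeated tileCount times
};
```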
GPU benchmark:
Staging buffer: higher throughput, 14-16fps
DirectStorage: higher GPU consumption, more L1 and L2 cache usage
Stage 2: Improving Compression Ratio
Split each 16-bit value of the EXR planar data into its two bytes
High ("left-hand") and low ("right-hand") bytes are separated and grouped into corresponding arrays for each chunk
Compression ratio: 1:1.15 to 1:1.85
Decoding is 100% on GPU.
Speed increased to 19fps
Process:
remove padding ->
unpack from uint32 into uint8 ->
assemble bytes into uint16 ->
reinterpret uint16 as float16 ->
swizzle
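A CPU reference sketch of the byte-split step and its inverse, under the reading above (the real decode runs in a GPU shader; names are illustrative):

```cpp
#include <cstdint>
#include <cstddef>

// Encoder: separate each 16-bit value into its high and low byte and
// store all high bytes, then all low bytes, per chunk.
void splitBytes(const uint16_t* src, size_t count,
                uint8_t* hi, uint8_t* lo)
{
    for (size_t i = 0; i < count; ++i)
    {
        hi[i] = static_cast<uint8_t>(src[i] >> 8);   // "left-hand" bytes
        lo[i] = static_cast<uint8_t>(src[i] & 0xFF); // "right-hand" bytes
    }
}

// Decoder: reassemble the 16-bit values from the two byte planes.
void mergeBytes(const uint8_t* hi, const uint8_t* lo, size_t count,
                uint16_t* dst)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = static_cast<uint16_t>(hi[i]) << 8 | lo[i];
    // dst now holds raw half-float bit patterns; the shader reinterprets
    // them as float16 and swizzles into the final texture layout.
}
```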
Kimball: One of the EXR compression types already does this trick. Very similar to Zip compression.
Stage 2: Improving further
Delta encoding: store one value, then deltas.
Each chunk is broken into 8-byte groups
Compression improves from 1:1.85 to 1:1.95
New shader
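A sketch of the delta step under one plausible reading of the notes: within each 8-byte group, keep the first byte as-is and store every later byte as the difference from its predecessor, which clusters values near zero and compresses better under GDeflate (illustrative only):

```cpp
#include <cstdint>
#include <cstddef>

constexpr size_t kGroupSize = 8; // group size from the notes

void deltaEncode(uint8_t* data, size_t count)
{
    for (size_t g = 0; g < count; g += kGroupSize)
    {
        size_t end = (g + kGroupSize < count) ? g + kGroupSize : count;
        // Walk backwards so each delta uses the original predecessor value.
        for (size_t i = end - 1; i > g; --i)
            data[i] = static_cast<uint8_t>(data[i] - data[i - 1]);
    }
}

void deltaDecode(uint8_t* data, size_t count)
{
    for (size_t g = 0; g < count; g += kGroupSize)
    {
        size_t end = (g + kGroupSize < count) ? g + kGroupSize : count;
        // Forward prefix sum restores the original bytes.
        for (size_t i = g + 1; i < end; ++i)
            data[i] = static_cast<uint8_t>(data[i] + data[i - 1]);
    }
}
```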
Demo:
8K 16bit RGB @ 23fps
Mark Leone: An optimization to suggest: warp-wide instructions (“shuffle sync”) can take a value from one lane and broadcast it across the warp.
Kimball: Did you experiment with not having to do the swizzle?
Rus: I went back to planar because interleaved data doesn’t compress well. I took the texture without the padding and pushed it through GDeflate, and the compression rate wasn’t great, so I asked, “Maybe it’s not worth using GDeflate?” But with planar data it works well.
Kimball: You could think about delta encoding in two dimensions
Rus: We had some ideas about how to improve further.
Tiles:
Tile decompression requires a substantially bigger DirectStorage staging buffer, but otherwise performs just as well: allocate double the frame size as a staging buffer and it matches chunk-based reads.
Mark Leone: 64KB tile size
Discussion:
Kimball: You’re only considering lossless compression, not B44 or DWA?
Rod: Yes. That’s interesting, but it's a separate idea. As soon as we expose lossy compression, we have to expose a parameter to say how lossy it is. We wanted to be able to say “lossless EXR playback is possible.”
Rod: EXR stores floating-point numbers; you could do that with fewer bits. PQ gives you something HDR-like. We don’t use the full range of what EXR can do, and high numbers get clamped elsewhere anyway.
Nick: Nothing sacred about FP16.
Rod: FP16 was an independent creation at ILM and Nvidia; we didn’t recognize at the time that they were doing it too. The folks at Nvidia were cooperative. Nvidia had slightly different handling of infinities and NaNs. half.h had an internal version first, but then we settled on an IEEE-based version.
Mark Leone: How do we support DMA? You need to get the file offset of a chunk.
Kimball: You can do that; the Core library has a chunk info struct. It gives you the raw chunk offset, used for file validation. We’re moving it into the C++ API; I’ve been slacking. You can do a random tile order if the layout on disk is important.
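A short sketch against the OpenEXR Core C API, which exposes per-chunk offsets and sizes via exr_chunk_info_t; this is what a DMA/DirectStorage path would consume (error handling trimmed, function name illustrative, exact fields may vary by version):

```cpp
#include <cstdio>
#include <openexr.h> // OpenEXR Core C API, usable from C++

void printTileOffsets(const char* filename)
{
    exr_context_t             ctxt;
    exr_context_initializer_t init = EXR_DEFAULT_CONTEXT_INITIALIZER;
    if (exr_start_read(&ctxt, filename, &init) != EXR_ERR_SUCCESS) return;

    // For a tiled part 0, ask for tile (0,0) at level (0,0).
    exr_chunk_info_t cinfo;
    if (exr_read_tile_chunk_info(ctxt, 0, 0, 0, 0, 0, &cinfo) ==
        EXR_ERR_SUCCESS)
    {
        // data_offset is the absolute byte offset of the chunk on disk;
        // packed_size / unpacked_size are the compressed and raw sizes.
        printf("offset=%llu packed=%llu unpacked=%llu\n",
               (unsigned long long) cinfo.data_offset,
               (unsigned long long) cinfo.packed_size,
               (unsigned long long) cinfo.unpacked_size);
    }
    exr_finish(&ctxt);
}
```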
Leone: With a “mip tail,” you could read the bottom levels in a single call.
Rod: The motivating factor is reading tiles based on what we know about the view, so we can optimize.
Dhruv: Do you pay a decompression cost for parts of the object you don’t see?
Rod: Yes. We went with textured geometry of known shape, and we assume it’s visible if we’re looking that way. It first checks which portions of a texture are visible, then goes and gets them. You have to take into account all the edges of tiles and their neighbors, or things cull differently.
Dhruv: What would your speedups be if you had unified memory?
Rod: The EXRs are external, they have a file path. Not like a game engine, where they’re internal.
Dhruv: like a desktop system.
Leone: Have you profiled on AMD and Intel?
Rus: No.
Leone: PCIe 5 should be faster.
Rod: The hint is that a little bit of moving data around is necessary.
Kimball: We can add a GDeflate compression type if we can get a C implementation of it.
Mark: There is an open source implementation of the Microsoft version; I’ll get it to you.
Kimball: We should do an experiment: what happens if we assume the file is not corrupt? Rely solely on the tiles, remove the checks, and ignore the padding. The uncompressed size is stored at the beginning of the chunk.
Kimball: we might move those chunk sizes into the chunk table.
Peter: The chunk table might give you that already.
Rus: Doesn’t give you the size.
Peter: But you know that.
Rus: The compressed size.
Kimball: You can’t assume the chunks are next to each other. We would have padded the chunk out to the next sector on disk.
Rus: I do use the chunk sizes now.
Peter: The tile size is always the same.
Kimball: But it’s compact, it chops the data.
Rus: You have to do an end-of-file check to figure out where the chunk ends.
Kimball: We could do an experiment, see what the benefit is.
Kimball: The other idea is to support storing integers, which would open the door to storing PQ. But then you have to store the encoding; right now, we know everything is linear. It’s a can of worms.
Nick: I’m working on coupling the OpenEXR Core to Hydra. I’d be interested in channeling this data.
Rod: Not sure if we can share the slides.
Kimball: Can we come up with a benchmark tool?
Rod: Rus had made one; he had done this outside of Unreal.
Kimball: The header reads are faster because I removed the mallocs
Kimball: If we add a compression type, we should add 5.
Rod: If we wanted a different layout, we’d call it a compression type. But Rus’s work showed that layout isn’t the pricey thing.
Should we change the layout to feed GPUs? We’re talking ourselves out of it.
We have been trying to avoid making our own thing.
Kimball: I would not do them all.
Leone: We’re in a better place with the open source version of GDeflate