TSC Meeting Notes 2023-03-23
Attendance:
- Cary Phillips
- Christina Tempelaar-Lietz
- John Mertic
- Joseph Goldstone
- Kimball Thurston
- Larry Gritz
- Nick Porcino
- Peter Hillman
- Rod Bogart
Guests:
- Ruslan Idrisov (Epic)
- Joshua Miller
- Mark Leone
- Ed Slavin
- Lutz Latta
- Dhruv Govil
- Eric Renaud-Houde
- Omprakash Dhyade
Discussion:
- Rus: EXR GPU Decompression
- The Problem: EXR decoding yields raw planar data, which must be swizzled and have its padding removed before it can be used as a GPU texture. Normally the file is read into a CPU buffer and then copied to the GPU (a sketch of the swizzle follows below).
- Uncompressed, a frame took about 300 ms, with the swizzle taking most of the time.
- For an 8K 16-bit RGB image, it takes about 250 ms on the CPU and 50 ms on the GPU.
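- For reference, a minimal sketch of the planar-to-interleaved swizzle being described (function and parameter names are hypothetical, not the presenter's code):

```cpp
#include <cstdint>
#include <cstddef>

// Sketch: interleave planar half-float RGB scanlines into the RGBA layout a
// GPU texture upload expects, dropping per-row padding in the source buffer.
void swizzlePlanarToInterleaved(
    const uint16_t* r, const uint16_t* g, const uint16_t* b,
    size_t width, size_t height,
    size_t srcRowStride,   // elements per source row, >= width (padding)
    uint16_t* dst)         // width * height * 4 elements, RGBA
{
    for (size_t y = 0; y < height; ++y)
    {
        const uint16_t* rRow = r + y * srcRowStride;
        const uint16_t* gRow = g + y * srcRowStride;
        const uint16_t* bRow = b + y * srcRowStride;
        uint16_t*       out  = dst + y * width * 4;
        for (size_t x = 0; x < width; ++x)
        {
            out[4 * x + 0] = rRow[x];
            out[4 * x + 1] = gRow[x];
            out[4 * x + 2] = bRow[x];
            out[4 * x + 3] = 0x3C00; // 1.0 in half-float: opaque alpha
        }
    }
}
```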
- Kimball: Are you using the C++ library or the new Core? With Core you can unpack into RAM in place. Does that affect these timings?
- Rus: This was done before the C interface existed; we could get better results with it now.
- Optimization 1.0:
- Read into shared memory, accessible by both GPU and CPU (see the sketch after this list).
- Eliminates the swizzling
- Increases speed from 3fps -> 15fps, producing textures in 65ms
- Remaining problems:
- Significant GPU costs
- EXRs have to be stored uncompressed
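- The kind of CPU/GPU-shared staging memory described here can be illustrated with a D3D12 upload heap; a minimal sketch under that assumption, not the presenter's code:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: allocate an upload-heap buffer visible to both CPU and GPU, and
// map it persistently so file reads can land directly in GPU-visible memory.
ComPtr<ID3D12Resource> makeSharedBuffer(ID3D12Device* device, UINT64 size, void** cpuPtr)
{
    D3D12_HEAP_PROPERTIES heap = {};
    heap.Type = D3D12_HEAP_TYPE_UPLOAD; // CPU-writable, GPU-readable

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = size;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.Format           = DXGI_FORMAT_UNKNOWN;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    ComPtr<ID3D12Resource> buffer;
    device->CreateCommittedResource(
        &heap, D3D12_HEAP_FLAG_NONE, &desc,
        D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
        IID_PPV_ARGS(&buffer));

    // Empty read range signals the CPU will not read this memory back.
    D3D12_RANGE noRead = {0, 0};
    buffer->Map(0, &noRead, cpuPtr);
    return buffer;
}
```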
- Optimization 2.0:
- Read into a staging buffer
- Utilize DirectX 12
- 15fps texture read, reduced GPU impact
- Only load the required tiles (see the tile-range sketch after this list)
- Remaining problem:
- Disk space and disk load: 30 seconds and 138 GB of disk space for 8K 16-bit
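- A minimal sketch of selecting only the required tiles (all names hypothetical; assumes a regular tile grid, as in tiled EXR parts):

```cpp
#include <algorithm>

// Given a visible pixel rectangle (inclusive corners), compute the inclusive
// range of tile indices that must be loaded from disk.
struct TileRange { int x0, y0, x1, y1; };

TileRange tilesForRegion(int px0, int py0, int px1, int py1,
                         int tileW, int tileH, int numTilesX, int numTilesY)
{
    TileRange r;
    r.x0 = std::max(0, px0 / tileW);
    r.y0 = std::max(0, py0 / tileH);
    r.x1 = std::min(numTilesX - 1, px1 / tileW);
    r.y1 = std::min(numTilesY - 1, py1 / tileH);
    return r;
}
```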
- Uncompressed EXR via DirectStorage:
- DirectStorage: a collaboration between Microsoft and Nvidia.
- Skip RAM altogether: read from media straight into the GPU, which in theory can also do the decompression
- Less memory usage
- No need for asynchronous copy
- Needs to be read in large chunks for best performance
- Compression method: GDeflate
- Nvidia’s take on Deflate (LZ77)
- Two implementations: the nvcomp library (CUDA) and the DirectStorage library
- The two are not compatible with each other
- Compression works out of the box
- But requires creating a custom file format
- Data stored planarly
- Compression ratio 1:1.5
- Better framerate than uncompressed, with less RAM and less disk bandwidth
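- A minimal sketch of what a GDeflate read through DirectStorage looks like, based on the public DirectStorage API (the surrounding setup, queue, and offsets are assumptions, not the presenter's code):

```cpp
#include <cstdint>
#include <d3d12.h>
#include <dstorage.h>

// Sketch: enqueue one DirectStorage request that reads a GDeflate-compressed
// chunk from disk straight into a GPU buffer, decompressing on the GPU.
void enqueueChunk(
    IDStorageFile* file, IDStorageQueue* queue, ID3D12Resource* gpuBuffer,
    uint64_t fileOffset, uint32_t compressedSize, uint32_t uncompressedSize)
{
    DSTORAGE_REQUEST request = {};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source        = file;
    request.Source.File.Offset        = fileOffset;
    request.Source.File.Size          = compressedSize;
    request.UncompressedSize          = uncompressedSize;
    request.Destination.Buffer.Resource = gpuBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = uncompressedSize;
    queue->EnqueueRequest(&request);
    // The caller signals a fence and calls queue->Submit() once a batch of
    // requests has been enqueued.
}
```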
- Benchmarks/File structure:
- Files broken into chunks and tiles, each chunk compressed separately
- Header for each compressed tile, with its uncompressed size
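- A hypothetical reconstruction of the layout described above (not Rus's actual format):

```cpp
#include <cstdint>

// The file is broken into chunks/tiles, each compressed separately with
// GDeflate; each compressed tile carries its uncompressed size so the GPU
// decompressor can size its output.
struct TileHeader
{
    uint32_t compressedSize;   // bytes of GDeflate data that follow
    uint32_t uncompressedSize; // bytes after decompression
};

struct ChunkRecord
{
    uint64_t fileOffset; // where the chunk's first tile header starts
    uint32_t tileCount;  // tiles in this chunk, stored back to back
};
```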
- Benchmark GPU:
- Staging buffer: higher throughput, 14-16fps
- DirectStorage: higher GPU consumption, more L1 and L2 cache use.
- Stage 2: Improving Compression Ratio
- Split each 16-bit value of the planar EXR data into its two bytes
- The left-hand (high) and right-hand (low) bytes are separated and grouped into corresponding arrays for each chunk (see the sketch after this list)
- Compression ratio: 1:1.15 to 1:1.85
- Decoding is 100% on GPU.
- Speed increased to 19fps
- Process:
- Remove padding ->
- Unpack from uint32 into uint8 ->
- Assemble into uint16 ->
- Reinterpret uint16 as float16 ->
- Swizzle
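- A minimal sketch of the byte split (hypothetical helper names); as Kimball notes below, OpenEXR's ZIP compressor applies the same trick:

```cpp
#include <cstdint>
#include <cstddef>

// Split each 16-bit value into its high and low bytes, grouped into two
// separate arrays so similar bytes sit next to each other, which helps
// Deflate-family compressors find matches.
void splitBytes(const uint16_t* src, size_t count, uint8_t* hi, uint8_t* lo)
{
    for (size_t i = 0; i < count; ++i)
    {
        hi[i] = static_cast<uint8_t>(src[i] >> 8);
        lo[i] = static_cast<uint8_t>(src[i] & 0xFF);
    }
}

// Inverse, as the GPU decode stage would reassemble the values.
void mergeBytes(const uint8_t* hi, const uint8_t* lo, size_t count, uint16_t* dst)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = static_cast<uint16_t>(hi[i]) << 8 | lo[i];
}
```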
- Kimball: One of the EXR compression types already does this trick. Very similar to Zip compression.
- Stage 2: Improving further
- Delta encoding: store one value, then deltas.
- Each chunk is broken into 8-byte groups (see the sketch after this list)
- Compression improves from 1:1.85 to 1:1.95
- New shader
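- A minimal sketch of delta encoding over small fixed-size groups (the notes mention 8-byte groups; the exact grouping scheme is an assumption here):

```cpp
#include <cstdint>
#include <cstddef>

// The first byte of each group is stored as-is and the rest as differences
// from their predecessor, clustering values near zero so they compress better.
constexpr size_t kGroup = 8;

void deltaEncode(uint8_t* data, size_t size)
{
    for (size_t g = 0; g < size; g += kGroup)
    {
        size_t end = (g + kGroup < size) ? g + kGroup : size;
        for (size_t i = end - 1; i > g; --i)
            data[i] = static_cast<uint8_t>(data[i] - data[i - 1]);
    }
}

void deltaDecode(uint8_t* data, size_t size)
{
    for (size_t g = 0; g < size; g += kGroup)
    {
        size_t end = (g + kGroup < size) ? g + kGroup : size;
        for (size_t i = g + 1; i < end; ++i)
            data[i] = static_cast<uint8_t>(data[i] + data[i - 1]);
    }
}
```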
- Demo:
- 8K 16bit RGB @ 23fps
- Mark Leone: One optimization to suggest: warp-wide instructions (“shuffle sync”) take a value from one lane and broadcast it across the warp.
- Kimball: Did you experiment with not having to do the swizzle?
- Rus: I went back to planar because interleaved data was not compressing well. I took the texture without the padding and pushed it through GDeflate, and the compression rate wasn't great, so I asked, “Maybe it's not worth using GDeflate?” But with planar data it works well.
- Kimball: Could think about delta-ing in two dimensions
- Rus: We had some ideas about how to improve further.
- Tiles:
- Tile decompression requires a substantially bigger staging buffer for DirectStorage, but otherwise performs just as well: allocate roughly double the frame size as a staging buffer and tiles perform as well as chunks.
- Mark Leone: 64 KB tile size
- Discussion:
- Kimball: You're only considering lossless compression? What about B44 or DWA?
- Rod: Yes. That's interesting, but it's a separate idea. As soon as we expose lossy compression, we have to expose a parameter to say how lossy it is. We wanted to be able to say “lossless EXR playback is possible.”
- Rod: EXR stores floating-point numbers; you could do that with fewer bits. PQ gives you something HDR-like. We don't use the full range of what EXR can do; high numbers get clamped by something else anyway.
- Nick: Nothing sacred about FP16.
- Rod: FP16 was an independent creation at ILM and Nvidia; we didn't originally recognize they were doing it at the time. The folks at Nvidia were cooperative, though Nvidia had slightly different handling of infinities and NaNs. half.h had an internal version first, but then we decided on an IEEE-based version.
- Mark Leone: How do we support DMA?
- Need to get the file offset of a chunk.
- Kimball: You can do that; the Core library has a chunk info struct. It gives you the raw chunk offset, used for file validation. We're moving it into the C++ API; I've been slacking. You can use a custom tile order if layout on disk is important.
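- A minimal sketch of pulling a chunk's file offset from the Core library's chunk info, as Kimball describes (names per the OpenEXRCore C headers; verify against the version in use):

```cpp
#include <openexr.h>
#include <cstdio>

// Query the on-disk location of one scanline chunk; the chunk info exposes
// the file offset and packed size, which is what a DMA/DirectStorage path
// would need to issue its own reads.
int printChunkOffset(const char* filename, int partIndex, int scanlineY)
{
    exr_context_t ctxt;
    if (exr_start_read(&ctxt, filename, nullptr) != EXR_ERR_SUCCESS)
        return 1;

    exr_chunk_info_t cinfo;
    if (exr_read_scanline_chunk_info(ctxt, partIndex, scanlineY, &cinfo) ==
        EXR_ERR_SUCCESS)
    {
        printf("chunk at offset %llu, packed %llu bytes, unpacked %llu bytes\n",
               (unsigned long long) cinfo.data_offset,
               (unsigned long long) cinfo.packed_size,
               (unsigned long long) cinfo.unpacked_size);
    }
    exr_finish(&ctxt);
    return 0;
}
```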
- Leone: For the “mip tail,” read the bottom levels in a single call.
- Rod: The motivating factor is reading tiles based on what we know about the view, so we can optimize.
- Dhruv: Do you pay a decompression cost if it's on a part of the object you don't see?
- Rod: Yes. We went with textured geometry of known shape, and we assume it's visible if we're looking that way. It first checks which portions of a texture are visible, then goes and gets them. You have to take into account all the edges of tiles and their neighbors, or things cull differently.
- Dhruv: What about unified memory? What would your speedups be if you had it?
- Rod: The EXRs are external, they have a file path. Not like a game engine, where they’re internal.
- Dhruv: like a desktop system.
- Leone: Have you profiled on AMD and Intel?
- Rus: No.
- Leone: PCIe 5 should be faster.
- Rod: The hint is that a little bit of moving data around is necessary.
- Kimball: We can add a GDeflate compression type if we can get a C implementation of it.
- Mark: There is an open source implementation of the Microsoft version; I'll get it to you.
- Kimball: We should do an experiment: what happens if we assume the file is not corrupt? Rely solely on the tiles, remove the checks, and ignore the padding. The uncompressed size is stored at the beginning of the chunk.
- Kimball: We might move those chunk sizes into the chunk table.
- Peter: The chunk table might give you that already.
- Rus: Doesn’t give you the size.
- Peter: But you know that.
- Rus: The compressed size.
- Kimball: You can’t assume the chunks are next to each other. We would have padded the chunk out to the next sector on disk.
- Rus: I do use the chunk sizes now.
- Peter: The tile size is always the same.
- Kimball: But it’s compact, it chops the data.
- Rus: You have to do an end-of-file check to figure out where the chunk ends.
- Kimball: We could do an experiment, see what the benefit is.
- Kimball: The other idea is to support storing integers, which would open the door to storing PQ. But then you have to store the encoding; right now we know everything is linear. It's a can of worms.
- Nick: I’m working on coupling the OpenEXR Core to Hydra. I’d be interested in channeling this data.
- Rod: Not sure if we can share the slides.
- Kimball: Can we come up with a benchmark tool?
- Rod: Rus has made one; he did this work outside of Unreal.
- Kimball: The header reads are faster because I removed the mallocs.
- Kimball: If we add a compression type, we should add 5.
- Rod: If we wanted a different layout, we'd call it a compression type. But Rus's work showed that layout isn't the pricey thing.
- Should we change the layout to feed GPUs? We're talking ourselves out of it.
- We have been trying to avoid making our own thing.
- Kimball: I would not do them all.
- Leone: We're in a better place with the open source version of GDeflate.