TSC Meeting Notes 2023-06-29

Attendance:

Cary Phillips
Christina Tempelaar-Lietz
John Mertic
Joseph Goldstone
Kimball Thurston
Larry Gritz
Nick Porcino
Peter Hillman
Rod Bogart

Guests:

  • Ruslan Idrisov, Epic Games

  • Lutz Latta, Lucasfilm

  • Thomas Wilshaw, works on Olive the video editor in his spare time (more at end of notes)

Discussion

  • Open Source Days

    • Kimball: Are we doing a BOF in addition to Town Hall?

    • Cary: Emily is setting that up. BOF is show up and chat, Town Hall is structured discussion.

    • Rod: Rus will talk about optional choices for EXR decoding on the GPU. This is an update on his previous talk; there has been more progress since then.

  • GDeflate in OpenEXR Presentation by Rus Idrisov:

    • Intro 

      • NVIDIA’s GDeflate algorithm is an extension of regular deflate and can achieve a compression ratio of 1:2. Data is delivered directly to the GPU: read into shared memory, upload to a buffer, decompress on the GPU. Still a barrier: the data needs to be put into something readable to use the bandwidth of PCIe.

      • Fork of libDeflate

      • Nvcomp is NVIDIA’s GDeflate lib

      • Epic needs to deliver large 200 MB EXRs; each takes 50 ms to transfer on PCIe 4. PCIe 5 and 6 are better, but there are no GPUs at PCIe 6.

      • PCIe 5 is about 4 GB/s per lane.

      • Reduces disk speed and storage requirements. Disk rating needs to be quite high, min 4800 MB/sec. Storage requirements are high too.

      • Epic's implementation uses the Microsoft-only Direct Storage (DS) option. Similar to NVIDIA’s GPU Direct.

      • MS DS implements GDeflate.

    • Uncompressed 

      • Reading uncompressed EXRs into a staging buffer in shared memory (RAM), only 15 frames per second currently. Load only required tiles.

      • Remaining problems:

        • Disk space and disk load. A 30 s sequence of 8k 16-bit textures takes 138 GB of hard drive space.

        • Slow throughput: 45-50 ms on PCIe 4 (hopefully better on PCIe 5 and 6)
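
A rough check of the disk figures above can be done with a few lines of arithmetic. This is only a sketch: the 8192x4096 resolution, three 16-bit channels, and 24 fps are assumptions, not values stated in the notes, and the function names are made up for illustration.

```python
def frame_bytes(width, height, channels, bytes_per_sample):
    """Size of one uncompressed frame in bytes (pixel data only, no headers)."""
    return width * height * channels * bytes_per_sample


def required_rate_mb_per_s(frame_size_bytes, fps):
    """Sustained read rate needed to stream frames at `fps`, in MB/s (10^6 bytes)."""
    return frame_size_bytes * fps / 1e6


def sequence_gb(frame_size_bytes, fps, seconds):
    """Total size of a sequence in GB (10^9 bytes)."""
    return frame_size_bytes * fps * seconds / 1e9


# Assumed values, not taken from the notes: 8192x4096, RGB, 16-bit half, 24 fps.
size = frame_bytes(8192, 4096, 3, 2)
rate = required_rate_mb_per_s(size, 24)

print(size)                      # bytes per frame
print(rate)                      # MB/s needed uncompressed
print(rate / 2)                  # MB/s needed if compression reaches 1:2
print(sequence_gb(size, 24, 30)) # GB for a 30 s sequence
```

Under these assumptions a frame is roughly 200 MB and the uncompressed stream needs a little over 4800 MB/s, which is in line with the figures quoted in the presentation; the 30 s total lands in the same range as the 138 GB mentioned, though not exactly on it.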

    • GDeflate image via direct storage prototype

      • Broke it down into chunks.

      • Performance up to 23fps.

      • Issue 640kb chunk.

      • Less CPU work, just wait for Direct Storage

      • Reduced bandwidth req by 2x

      • More work on GPU side, can use second GPU if have.

      • Benchmarks

      • Best compression ratio 1:2

      • CPU time negligible now

      • Chunks uploaded sequentially

      • The GPU decompresses the previous chunk while the next chunk is being uploaded.

      • Frametime 50 ms, active GPU 42 ms.

      • 70ms to 50ms frametime improvement.

      • Storage req, disk speed req improve.
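
The overlap described in the benchmarks (chunks uploaded sequentially, with the GPU decompressing one chunk while the next uploads) can be modelled as a simple two-stage pipeline. The sketch below is a timing model only, not Epic's implementation; the per-chunk costs are hypothetical numbers chosen for illustration.

```python
def sequential_time(upload_ms, decompress_ms):
    """Total time if every chunk is uploaded and then decompressed, with no overlap."""
    return sum(upload_ms) + sum(decompress_ms)


def pipelined_time(upload_ms, decompress_ms):
    """Total time when chunk i is decompressed while chunk i+1 uploads.

    Uploads run back to back. Decompression of a chunk starts once its own
    upload has finished and the previous decompression is done.
    """
    upload_done = 0.0
    decompress_done = 0.0
    for up, dec in zip(upload_ms, decompress_ms):
        upload_done += up
        decompress_done = max(upload_done, decompress_done) + dec
    return decompress_done


# Hypothetical per-chunk costs in milliseconds, for illustration only.
uploads = [2.0] * 10
decompressions = [3.0] * 10

print(sequential_time(uploads, decompressions))
print(pipelined_time(uploads, decompressions))
```

In this model the frame time is bounded by the slower stage plus one chunk of the faster stage, which is why chunk size matters: a single large chunk leaves nothing to overlap.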

    • Benefits of Direct Storage

      • Reduced file size, reduced throughput, little CPU time, GPU decompression

      • Still have to use CPU implementation of OpenEXR because Direct Storage is MS only.

    • How does it fit into EXR?

      • DS header has ID, magic number, tile size, padding for data alignment

      • Each chunk about 64kb tile.

      • One header per chunk.

      • How to make EXRs compatible with DS and performant:

      • Store header with pointer to chunks

      • Header - uncompressed chunk size offsets, compressed fragment size

      • Remaining problems:

        • DS tile is 64kb unlikely can change

        • Header “owned” by Microsoft
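
The layout described above (a header that points at fixed-size, independently compressed fragments) can be sketched as follows. This is a toy container under stated assumptions: zlib's deflate stands in for GDeflate, and the magic number, field order, and function names are hypothetical. It is not the Direct Storage header and not an OpenEXR format.

```python
import struct
import zlib

FRAGMENT_SIZE = 64 * 1024        # uncompressed bytes per fragment
MAGIC = 0x47444546               # hypothetical magic number, not Microsoft's
HEADER = struct.Struct("<IIII")  # magic, chunk id, uncompressed size, fragment count
ENTRY = struct.Struct("<II")     # byte offset of fragment, compressed size


def pack_chunk(chunk_id, data):
    """Split `data` into fixed-size fragments and compress each one independently."""
    pieces = [data[i:i + FRAGMENT_SIZE] for i in range(0, len(data), FRAGMENT_SIZE)]
    payloads = [zlib.compress(p) for p in pieces]
    offset = HEADER.size + ENTRY.size * len(payloads)
    table = b""
    for payload in payloads:
        table += ENTRY.pack(offset, len(payload))
        offset += len(payload)
    header = HEADER.pack(MAGIC, chunk_id, len(data), len(payloads))
    return header + table + b"".join(payloads)


def read_fragment(blob, index):
    """Decompress one fragment without touching the others."""
    magic, _chunk_id, _total, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC or not 0 <= index < count:
        raise ValueError("bad header or fragment index")
    offset, size = ENTRY.unpack_from(blob, HEADER.size + ENTRY.size * index)
    return zlib.decompress(blob[offset:offset + size])


def unpack_chunk(blob):
    """Reassemble the whole chunk by decoding every fragment in order."""
    count = HEADER.unpack_from(blob, 0)[3]
    return b"".join(read_fragment(blob, i) for i in range(count))
```

Because each fragment is compressed on its own and located through the offset table, a reader can hand any fragment to a decompressor without decoding its neighbours, which is the property a GPU decompression path relies on.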

  • Discussion 

    • Kimball: This is exactly how OpenEXR works. Did you get to use the decoding pipeline in the core library? Both tiles and scanlines are called chunks. This was precisely the idea behind having the decoding pipeline. The local header was the only thing where I didn’t know whether we needed to do something custom. Depending on zips or zip compression, a chunk is 1 vs 16 scanlines of data. Maybe too large for DS. Maybe we have a zipg with a custom number? Do we need to skip the local header in the EXR file? It is mostly sanity checking in the local header; we could elide some of that. Have you played with that?

    • Rus: that was the plan but never got to it. Even if it was 16 scanlines, it’s not a constant size. Needs to be a constant size. Example 8k texture vs 4k texture.

    • Kimball: We can pad it to be a constant size if that helps.

    • Rus: so instead of 16 scanlines, needs to be 10mb so it’s a constant size.

    • Kimball: for direct storage

    • Rus: yes. When I compress an entire file as one chunk, it is inefficient, becomes slower than uncompressed approach. Chunkage is important to the DS performance. 

    • Kimball: So it is really about how to get GDeflate to give you a consistent size? It’s in each fragment, correct?

    • Rus: Yes, the individual part is what has been compressed. All are 64kb when uncompressed. That header is generated by DS GDeflate when compressing on the GPU.

    • Rod: to what degree do we want to consider making legal EXR files that happen to have this data in it? And what are the extension roots you are considering?

    • Kimball: is direct storage header public information? 

    • Rus: Somewhat; it is not an official guideline. It is available as an example in their repository. Fragment size is “unlikely to change” but the header could be changed by MS. If we support DS 1.2, we need to keep up to date with how they do their headers.

    • Rod: Not a change MS would make lightly. But if they do update it, we would have to know.

    • Kimball: Have we talked to MS to formalize this header definition?

    • Rus: no, that would be next step.

    • Nick: Each fragment independent or share a dictionary?

    • Rus: completely independent. Only common thing is the header.

    • Kimball: Mark Liani has created a branch where he wired in GDeflate as an option into EXR. Have not looked at it yet.

    • Rus: can run on whole data but not as efficient.

    • Kimball: If we define a zipg, we would have to add the ability to break data into chunks.

    • Rod: Also have to add the writing code. Not really MS code as long as the header doesn’t change.

    • Rod: Larry, have you worked on this yet?

    • Larry: no but I’d like to use. But it needs to have a plausible path on all platforms, GPU, CPU.

    • Rod: my understanding is that it is working.

    • Rus: we will be able to take advantage of this and provide each individual chunk to the GPU, skip system memory and CPU side.

    • Larry: I like it a lot.

    • Rus: from point of view of OpenEXR it’s a different way of storing things. For us to use GDeflate would just be a juggling of how data is stored. Lose half a sec on decompression on the CPU side.

    • Larry: My thoughts are: what new APIs do we need in OpenEXR to support this? We haven’t talked about giving a GPU-side destination for things. Do you want uncompressed pixels? Do you want to just get the compressed chunk? There is no notion of async calls yet. None of it is extremely difficult, but we need to think about the overall implications and how it interacts with the other APIs. We haven’t yet gone as far as mapping the core work back to the C++ API to support threading. The timing is good to think this stuff through.

    • Cary: We have also asked the question of how beholden we are to the current C++ API. We could potentially create something new and fresh.

    • Larry: switch from 2 to 3 was very disruptive. We had discussed keeping APIs working and adding a few new things. Significant overhaul is possible but it’s a big ask for everyone downstream.

    • Nick: have had good luck building on top of the C core. 

    • Larry: but you put a lot of Nick-time into that work. A lesser mortal might have struggled. Not easy to figure out the C calls.

    • Rod: Do we have a commitment to sharing those nicely wrapped things?

    • Cary: What is the state of your code currently? How far away is it from adding to the library?

    • Rus: It uses almost nothing of the existing library.

    • Rod: we read and disassemble by ourselves since we know how they are laid out.

    • Cary: informative for you to take a close look at what Kimball has done with the core library. Exactly the intent.

    • Nick: was struck that it would fit elegantly into the pipeline he has made.

    • Rod: this was a prototype to see if it works at all. Yeah now feels like doing it EXR C-core style would be good next move. If you’re reading the whole picture all at once. We are also reading mipmaps and tiles and being very selective, other thing we’ve already done in uncompressed case. But ensuring the right amount of compressed data is there and you don’t have a lot of waste is going to make these tiles a little trickier.

    • Larry: concerned about use case of using them as textures, virtual texturing. Need to worry about using them as tiles and worry about what we pull in.

    • Rod: in that example, chance to introspect on the scene and figure out what you need. But we are doing this realtime. We have things that are maps, e.g. dome, so if isn’t visible / projected into frustum, don’t read it.

    • Rus: Tiles are our main use case. Load lower mip levels when you don’t see the dome. Areas closer to the screen load higher mip levels. For the tile use case, the whole situation becomes simple because you already have a chunk. In the case of tiles, this chunk becomes a tile.

    • Rod: The size is comparable to DS.

    • Rus: A 400x256 tile becomes 393kb, smaller than 640kb but still a good size for compression/decompression. In the tile case, it is fairly straightforward.

    • Rod: Cary’s right we should try to adapt it using the C core.

    • Lutz: How married are you to DS? When we looked at it, we had lots of Linux use cases and were not happy with how they conflate the copy from the disk to the GPU. Have you looked at a way to decouple the storage aspect from the GPU aspect?

    • Rus: DS has automatic GDeflate decompression along the pipeline. No other has these concepts married.

    • Rod: Our main use case is Windows machines on ICVFX stages.

    • Rus: The current implementation uses a shared memory buffer. Only just now available on Vulkan: keep mapped buffers and read and write without locking. It is a constantly locked structured buffer that you read and write like regular memory, but it is available to the GPU almost immediately. Only available on DX 12; not sure how well it works on Vulkan.

    • Lutz: On Vulkan we’ve been doing (?) for a long time.

    • Rus: some synchronization but available to CPU and GPU without locks. 

    • Nick: equivalent thought around RTX (?)

    • Rus: direct storage is GPU direct . NVIDIA themselves mention 

    • Lutz: advantage when it’s separate. GDeflate better controlled in the format. Loader just reads in the raw data, issue command on the buffer to decompress.

    • Rus: Could read compressed data with cuFile (CUDA), then call decompression on the buffer. But there was some trick to it, not ideal to his recollection. GPU Direct is cross-platform so could do this on Linux. I should try this.

    • Cary: For the Virtual Town Hall, would you be willing to give an abbreviated version of this? We’re going to tell people you’re doing it. Along the lines of plans for the future. August 2, before SIGGRAPH. How public are you willing to be?

    • Rod: we need to check our permissions. Epic is on holiday for next few weeks so we will get back to you on what we can present. Would prefer to have an experiment with C core in advance

    • Cary: helpful for community to know what’s going on. 

    • Larry: one of the looming things is figuring out GPU decompression options and how to change the format and API to support those.
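
Rus's point in the discussion above, that a 16-scanline chunk is not a constant size while tiles are, comes down to simple byte counts. The sketch below assumes three 16-bit channels and a 64 KiB fragment; the image widths, the 256x256 tile, and the function names are illustrative assumptions, not values confirmed in the meeting.

```python
FRAGMENT = 64 * 1024  # bytes; assumed fixed fragment size


def scanline_chunk_bytes(width, scanlines, channels=3, bytes_per_sample=2):
    """Uncompressed size of a chunk made of `scanlines` full-width rows."""
    return width * scanlines * channels * bytes_per_sample


def tile_bytes(tile_w, tile_h, channels=3, bytes_per_sample=2):
    """Uncompressed size of one tile."""
    return tile_w * tile_h * channels * bytes_per_sample


def fragments_needed(nbytes, fragment=FRAGMENT):
    """Number of fixed-size fragments, and padding bytes added to the last one."""
    count = (nbytes + fragment - 1) // fragment
    return count, count * fragment - nbytes


# A 16-scanline chunk changes size with image width...
for width in (4096, 8192):
    size = scanline_chunk_bytes(width, 16)
    print(width, size, fragments_needed(size))

# ...while a tile has the same size regardless of image resolution.
size = tile_bytes(256, 256)
print("tile", size, fragments_needed(size))
```

Under these assumptions a 256x256 tile is 393,216 bytes, close to the 393kb figure quoted, and it splits into a whole number of 64 KiB fragments with no padding. Scanline chunks would need either padding or a chunk height that depends on image width.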

  • 3.1.9 DWA read issue 

    • There was a Slack discussion regarding broken DWA file read introduced in 3.1.9 by PR #1439. Nick has a fix that works, shared on the channel.

    • Larry: are we convinced that is the right solution? 

    • Nick: I didn’t exhaustively check each comparison, just changed all of them and now I can load the files. But hoping someone who wrote the code might have an opinion as to whether the change is legitimate. Planning to submit a PR soon.

    • The comparison was added to catch an overrun found in fuzzing. It solves the CVE by preventing writing over the end of the buffer. Think it is fine but it highlights that we might need more checking on the DWA compressed files in general. Unit test didn’t catch that we couldn’t open those files.

    • Larry: spuriously failing by thinking it was corrupt. But difficult to tell that deep in the code whether it was correct.

    • Peter: There were 2 buffers it was checking. If it is at the end of both, it’s ok, but if it is not at the end of both, the file is not ok. Maybe make the change, put it in a PR, have Kimball review it, then re-fuzz it.

    • Larry: reveals gap in our testing policy.

    • Peter: may have worked with some DWA compressed files and not others.

    • Nick: may be rare but the file Larry shared triggered it.

    • Peter: test suite tests the compression work but not with a real file. 

    • Nick: we have 6 images but one per settings permutation. Not sufficient. Not sure if it’s adequate to just generate more compressed images. 

    • Cary: fuzzing is more realistic approach.

    • Nick's PR related to this conversation posted after the meeting: https://github.com/AcademySoftwareFoundation/openexr/pull/1472
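
The kind of failure discussed above, where a bounds check added to stop an overrun also rejects valid files, can be illustrated generically. The sketch below is not the code from PR #1439 or PR #1472; the function names and the two-buffer setup are hypothetical, loosely following Peter's description that a valid file ends exactly at the end of both buffers.

```python
def consume_strict(buffer_len, cursor, n):
    """Over-strict check: also rejects a read that ends exactly at the buffer end."""
    if cursor + n >= buffer_len:
        raise ValueError("buffer overrun")
    return cursor + n


def consume(buffer_len, cursor, n):
    """Rejects only reads that would go past the end of the buffer."""
    if n < 0 or cursor + n > buffer_len:
        raise ValueError("buffer overrun")
    return cursor + n


def decode_ok(len_a, reads_a, len_b, reads_b, step=consume):
    """Valid only if every read stays in bounds and both buffers end fully consumed."""
    try:
        pos_a = 0
        for n in reads_a:
            pos_a = step(len_a, pos_a, n)
        pos_b = 0
        for n in reads_b:
            pos_b = step(len_b, pos_b, n)
    except ValueError:
        return False
    return pos_a == len_a and pos_b == len_b


print(decode_ok(8, [4, 4], 6, [6]))                       # well-formed input
print(decode_ok(8, [4, 4], 6, [6], step=consume_strict))  # same input, strict check
print(decode_ok(8, [4, 5], 6, [6]))                       # genuine overrun
```

A test suite that only round-trips freshly compressed data through the same code path can miss this; a small set of real files whose data ends exactly at a buffer boundary would exercise the edge case directly.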

  • Intro message from Thomas Wilshaw after meeting

    • I work on Olive the video editor in my spare time. We use EXRs as our disk-based cache format and Peter Hillman suggested I come along as I'd mentioned to him we'd been thinking a bit about ways to move decompression onto the GPU (we do pretty well all the timeline compositing on the GPU). It was a really interesting discussion and I learnt a lot. We're currently stuck on OpenGL and an NVIDIA-only solution probably won't work for us but I'm very interested to see where this goes.