Fourier Neural Operators (FNO) are powerful tools for learning partial differential equation solution operators, but they lack architecture-aware optimizations: their Fourier layer executes the FFT, filtering, GEMM, zero padding, and inverse FFT as separate stages, leading to multiple kernel launches and excessive global memory traffic. The FFT -> GEMM -> iFFT computational pattern has received little attention with respect to GPU kernel fusion and memory layout optimization. Existing packages like Quantum ESPRESSO, Octopus, and CP2K make separate calls to FFT and BLAS routines. However, they have three limitations: partial frequency utilization with extra memory copy operations, lack of native frequency filtering capabilities in cuFFT, and excessive memory transactions between processing stages.
FNO implements a pipeline that begins with a forward FFT on input feature maps, applies spectral filtering, and reconstructs the output via an inverse FFT. This process necessitates frequency-domain truncation and zero-padding steps, which current frameworks like PyTorch execute as separate memory-copy kernels because cuFFT does not support native input/output trimming. Major FFT libraries such as cuFFT and VkFFT lack built-in data truncation capabilities. Conventional 2D FFTs apply both 1D-FFT stages along spatial dimensions, but FNO applies its spectral weights across the channel dimension, suggesting an opportunity to decouple the FFT stages by keeping the first 1D FFT along the spatial axes while reinterpreting the second FFT stage along the hidden dimension.
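The baseline pipeline can be sketched in NumPy. The function below is a minimal reference for a 1D FNO Fourier layer, with illustrative names and shapes (not TurboFNO's or PyTorch's actual API); note how truncation and zero padding appear as explicit copy steps around the spectral GEMM, which is exactly what current frameworks execute as separate kernels:

```python
import numpy as np

def fourier_layer_1d(x, weights, n_modes):
    """Reference 1D FNO Fourier layer (names and shapes are illustrative).

    x:       (in_channels, n) real input feature map
    weights: (n_modes, out_channels, in_channels) complex spectral weights
    n_modes: number of low frequencies kept; the rest are truncated
    """
    c, n = x.shape
    xf = np.fft.rfft(x, axis=-1)   # forward FFT along the spatial axis
    xf = xf[:, :n_modes]           # truncation: a separate memory-copy kernel in PyTorch
    # spectral filtering: a complex GEMM over the channel dimension,
    # applied independently at each kept frequency
    yf = np.einsum("koc,ck->ok", weights, xf)
    # zero-pad the truncated frequencies back before the inverse FFT
    yf_full = np.zeros((weights.shape[1], n // 2 + 1), dtype=complex)
    yf_full[:, :n_modes] = yf
    return np.fft.irfft(yf_full, n=n, axis=-1)   # inverse FFT
```

As a sanity check, with identity weights at every frequency and no truncation, the layer reduces to the identity map.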
Researchers from the University of California, Riverside, CA, USA, have proposed TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. The approach begins by developing FFT and GEMM kernels from scratch that achieve performance comparable to or faster than the closed-source state-of-the-art cuBLAS and cuFFT. An FFT variant is introduced to effectively fuse the FFT and GEMM workloads, in which a single thread block iterates over the hidden dimension, aligning with the k-loop in GEMM. Moreover, two shared-memory swizzling patterns are designed to achieve 100% memory bank utilization when forwarding FFT output to GEMM and to enable the iFFT to retrieve GEMM results directly from shared memory.
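The k-loop alignment can be illustrated with a toy NumPy model (illustrative names, not the actual CUDA kernel): instead of materializing the full FFT of the input in global memory before the GEMM, each k-iteration transforms only the slab of channels it is about to consume, which is what a fused kernel would stage through shared memory:

```python
import numpy as np

def fused_fft_gemm(x, w, tile_k=2):
    """Toy model of FFT-GEMM fusion.

    x: (c, n) real feature map; the channel/hidden dim c plays the GEMM k-dim
    w: (o, c) complex weights for a single frequency-independent channel mix
    """
    c, n = x.shape
    acc = np.zeros((w.shape[0], n // 2 + 1), dtype=complex)
    for k0 in range(0, c, tile_k):
        # the "shared memory" tile: the FFT of a slab of rows is computed
        # on the fly inside the k-loop rather than read from a precomputed
        # global buffer
        tile = np.fft.rfft(x[k0:k0 + tile_k], axis=-1)
        acc += w[:, k0:k0 + tile_k] @ tile   # rank-tile_k GEMM update
    return acc
```

The fused loop is numerically identical to the unfused FFT-then-GEMM pipeline; the benefit on a GPU comes from skipping the intermediate global memory round trip.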
TurboFNO integrates optimized implementations of FFT and CGEMM kernels to enable effective fusion and built-in FFT optimizations. The kernel fusion strategy in TurboFNO progresses through three levels: FFT-GEMM fusion, GEMM-iFFT fusion, and full FFT-GEMM-iFFT fusion. Each stage involves aligning the FFT workflow with GEMM, resolving data layout mismatches, and eliminating shared-memory bank conflicts. Key techniques include modifying the FFT output layout to match GEMM's input layout, applying thread swizzling for conflict-free shared memory access, and integrating the inverse FFT as an epilogue stage of the CGEMM kernel to bypass intermediate global memory writes and improve memory locality.
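The effect of swizzling can be modeled with a small bank-index simulation. The XOR swizzle below is a standard technique for this situation; TurboFNO's two actual patterns are not detailed here, and the unswizzled baseline utilization depends on the specific layout:

```python
N_BANKS = 32  # NVIDIA shared memory is organized as 32 four-byte banks

def banks_touched(swizzle):
    """Banks touched when a 32-thread warp reads one column of a 32-row
    tile stored with a row stride of 32 words -- the strided access a
    naive FFT-to-GEMM handoff through shared memory can produce."""
    touched = set()
    for row in range(32):                 # thread i reads element (row=i, col=0)
        col = 0 ^ row if swizzle else 0   # XOR-swizzled column index
        touched.add((row * 32 + col) % N_BANKS)
    return len(touched)

print(banks_touched(False))  # 1  -> all threads hit one bank: 32-way conflict
print(banks_touched(True))   # 32 -> every bank used: conflict-free
```

XORing the column with the row spreads a column access across all banks while keeping each row's elements at distinct addresses, so row accesses stay conflict-free as well.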
TurboFNO shows strong performance in both 1D and 2D FNO evaluations. In 1D FNO tests, the optimized FFT-CGEMM-iFFT workflow achieves up to 100% speedup over PyTorch, averaging a 50% improvement. These gains come from FFT pruning, which reduces computation by 25%-67.5%. The fully fused FFT-CGEMM-iFFT kernel delivers up to 150% speedup over PyTorch and provides an additional 10%-20% improvement over partial fusion strategies. Similarly, in 2D FNO, the optimized workflow outperforms PyTorch with average speedups above 50% and maximum improvements reaching 100%. The 2D fully fused kernel achieves 50%-105% speedup over PyTorch without performance degradation, despite the extra overhead of aligning the FFT workload layout with the CGEMM dataflow.
In this paper, the researchers introduced TurboFNO, the first fully fused GPU kernel that integrates FFT, CGEMM, and iFFT for accelerating Fourier Neural Operators. They developed a series of architecture-aware optimizations to overcome inefficiencies in conventional FNO implementations, such as excessive kernel launches and global memory traffic. These include a custom FFT kernel with built-in frequency filtering and zero padding, a GEMM-compatible FFT variant that mimics the k-loop behavior, and shared-memory swizzling strategies that improve bank utilization from 25% to 100%. TurboFNO achieves up to 150% speedup and maintains an average 67% performance gain across all tested configurations.
Here is the Paper.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.