September 9, 2019 | 11:49
Researchers at the Universitat Politècnica de València (Polytechnic University of Valencia) in Spain have published details of a cache memory management technique which they claim can boost general-purpose GPU (GPGPU) performance by up to 118 percent while roughly halving energy consumption.
The release of the first general-purpose GPU (GPGPU) offload framework, which allows highly parallelisable code to be executed on a many-core graphics processor rather than a CPU with far fewer cores, was a turning point for the high-performance computing (HPC) market. The two fastest systems in the world, the Department of Energy (DOE)'s Summit and Sierra, both use Nvidia's Volta GV100 GPUs, and Nvidia and AMD both produce and sell dedicated GPU-based accelerator boards, lacking video outputs, for the market. Even Intel, which is still working on bringing a high-performance GPU to market, launched GPU-like accelerator boards born from its otherwise-failed Larrabee graphics processor project.
Researchers from the Universitat Politècnica de València, however, claim that the current design of GPUs doesn't work for general-purpose computation as well as it could, and have proposed a new cache access management system which they claim can more than double performance for some workloads.
The issue, researchers Francisco Candel, Alejandro Valero, Salvador Petit, and Julio Sahuquillo claim, comes from the increasing complexity of GPU memory hierarchies. 'The Last Level Cache (LLC) size considerably increases each GPU generation,' the team explains. 'This paper shows that counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size does not scale neither in performance nor energy consumption.'
By implementing a small additional structure, based on a Fetch and Replacement Cache (FRC) concept, to store control and coherence information for blocks being pulled from main memory, the team claims to have dramatically improved GPGPU performance over the current approach. A mid-range GPU saw performance boosted by between 30 and 67 percent depending on workload, while a larger high-end GPU saw a 32 to 118 percent improvement, along with a reduction in energy consumption of between 49 and 57 percent.
'This approach improves performance due to three main reasons: i) the lifetime of blocks being replaced is enlarged, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced,' the team explains. 'The proposal improves the LLC hit ratio, memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and with much less area requirements.'
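The first of those reasons, the longer lifetime of blocks being replaced, can be illustrated with a toy model. The Python sketch below is not the team's design: the `simulate` function, the access trace, the capacity, and the miss latency are all invented for illustration, and it assumes a fully associative LRU cache, one access per cycle, and an unbounded table of in-flight fetches standing in for the small FRC structure. It only compares evicting the victim as soon as a miss is detected against keeping the victim resident until the fetched block actually arrives.

```python
from collections import OrderedDict


def simulate(trace, capacity, miss_latency, deferred):
    """Count hits for a toy LRU cache, one access per cycle.

    deferred=False: the victim is evicted as soon as a miss is detected.
    deferred=True:  the victim stays resident until the fetched block arrives
                    (an assumption standing in for the FRC idea).
    """
    cache = OrderedDict()  # block -> cycle its data is ready, oldest first
    in_flight = {}         # deferred mode only: block -> arrival cycle
    hits = 0
    for now, block in enumerate(trace):
        # Deferred mode: install fetched blocks whose data has arrived,
        # choosing the victim only at this point.
        for pending, ready in list(in_flight.items()):
            if ready <= now:
                del in_flight[pending]
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict the LRU block
                cache[pending] = ready
        if block in cache:
            if cache[block] <= now:            # data is actually resident
                hits += 1
            cache.move_to_end(block)           # mark as most recently used
        elif block in in_flight:
            pass                               # secondary miss, fetch already issued
        elif deferred:
            in_flight[block] = now + miss_latency
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)      # evict immediately, reserve the slot
            cache[block] = now + miss_latency
    return hits


trace = ["A", "B", "A", "B", "C", "A", "B", "A"]  # hypothetical access pattern
immediate_hits = simulate(trace, capacity=2, miss_latency=2, deferred=False)
deferred_hits = simulate(trace, capacity=2, miss_latency=2, deferred=True)
print(immediate_hits, deferred_hits)
```

On this hand-picked trace the immediate-eviction run scores three hits and the deferred run four, because block A is still resident while C is being fetched. The numbers say nothing about real GPU workloads; they only show the direction of the effect the researchers describe.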
There is a cost, of course: the technique requires additional hardware within the GPU, and can't be implemented as a software or firmware upgrade to current GPUs. It also increases the area required by the LLC by 7.3 percent, meaning less room for other components.
The team's work has been published in the journal IEEE Transactions on Computers.