Linus Torvalds: ‘I Hope AVX512 Dies a Painful Death’

Skylake (6000 Series)

Skylake was Intel’s last architectural overhaul. All CPUs since have been fundamentally based on its architecture. Skylake added DX 12_1 support, deeper out-of-order buffers, improved overall execution, improved AES encryption, support for Intel SpeedShift, and an improved video decode/encode unit. Performance compared to Haswell was test- and part-dependent; some Skylake chips were faster than Haswell clock-for-clock, but ran at lower clock speeds and thus ended up more-or-less equivalent. The largest performance improvements were concentrated in the mobile parts.

Linus Torvalds has written several forum posts discussing his dislike of many SIMD instruction sets, as well as his hatred of both FPU benchmarks and AVX-512, Intel’s 512-bit vector extensions. Linus, as per usual, pulls absolutely no punches on this one. Here’s a short sample:

I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on…

I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It’s a pet peeve of mine. It’s a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market.

Torvalds admits to his own bias on this topic and even recommends, at one point, taking his own opinion with a pinch of salt. He does, however, back up his argument with some solid talking points, one of which met with near-universal agreement: A key problem with AVX-512 is the way support is fragmented across the entire market.

Developers, as a rule, do not like rewriting and hand-tuning code for specific architectures, particularly when that hand-tuning will only apply to a subset of the CPUs intended to run the relevant application. If you work in HPC or machine learning, where AVX-512 servers are common, this is not an issue — but that’s statistically very few people. Most software runs on a wide range of Intel CPUs, most of which do not support AVX-512. The weaker the support across Intel’s product line, the less reason developers have to adopt AVX-512 in the first place.
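To make the fragmentation cost concrete, here is a minimal sketch of the kind of runtime dispatch a developer ends up maintaining when only some target CPUs support AVX-512. The function names (`pick_code_path`, `read_cpu_flags`) and the path labels are hypothetical; the flag strings assume the Linux `/proc/cpuinfo` naming convention, and the sketch only checks the AVX-512 Foundation flag rather than the many AVX-512 subsets.

```python
# Preference order: widest vector extension first.
# Flag names assume the Linux /proc/cpuinfo convention (an assumption,
# other operating systems expose CPU features differently).
PREFERENCE = [("avx512f", "avx512"), ("avx2", "avx2"), ("avx", "avx")]


def pick_code_path(cpu_flags):
    """Return the label of the widest code path the CPU flags allow."""
    for flag, path in PREFERENCE:
        if flag in cpu_flags:
            return path
    return "scalar"  # fallback that every CPU can run


def read_cpu_flags(path="/proc/cpuinfo"):
    """Best-effort read of CPU feature flags; empty set if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


if __name__ == "__main__":
    print("Selected code path:", pick_code_path(read_cpu_flags()))
```

Every label in `PREFERENCE` stands for a separately written, tuned, and tested kernel, which is the maintenance burden described above.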

Image by Colfax Research

But the problems don’t stop there. One reason developers might be reluctant to use AVX-512 is that the CPU takes a heavy frequency hit when this mode is engaged. Travis Downs has written a fabulous deep dive into how the AVX-512 unit of a Xeon W-2104 behaves under load.

What he found was that, in addition to the known performance drop due to decreased frequency, there’s also a small additional penalty of about three percent when switching into and out of 512-bit execution mode. This also seems to be the case when AVX2 is used in his benchmark payloads, so this part of the penalty may not be specific to AVX-512. The W-2104 runs at 3.2GHz (non-AVX Turbo), at 2.8GHz (AVX2), and at 2.4GHz when executing AVX-512. There’s a 12.5 percent frequency hit from using AVX2 as opposed to not, and a 25 percent penalty for invoking AVX-512.
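The percentages follow directly from the clock speeds quoted above. A small sketch of the arithmetic, where the helper name `frequency_penalty` is ours and not taken from Downs’ write-up:

```python
# Clock speeds for the Xeon W-2104 as quoted in the text, in GHz.
FREQ_GHZ = {"non-AVX": 3.2, "AVX2": 2.8, "AVX-512": 2.4}


def frequency_penalty(baseline_ghz, reduced_ghz):
    """Return the fractional clock-speed loss relative to the baseline."""
    return (baseline_ghz - reduced_ghz) / baseline_ghz


if __name__ == "__main__":
    for mode in ("AVX2", "AVX-512"):
        penalty = frequency_penalty(FREQ_GHZ["non-AVX"], FREQ_GHZ[mode])
        print(f"{mode}: {penalty:.1%} lower clock than non-AVX code")
```

This covers only the clock-speed drop; the roughly three percent transition penalty is a separate effect and is not modeled here.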

One of the problems with AVX-512, and the reason it can hurt performance, is that using it lightly really isn’t a good idea. When activating part of the CPU requires you to take a 25 percent frequency hit, the last thing you’d ever want is to hit that block lightly but consistently, invoking it for a handful of advantageous uses that slow the CPU down so much that your net overall performance is lower than it would have been with AVX2, or even without AVX at all, depending on the scenario.
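A back-of-the-envelope model shows why light use loses. This is a simplified sketch under loud assumptions, not a measurement: it assumes the core stays at the reduced AVX-512 clock for the whole run (the “lightly but consistently” case), that a fraction of the original runtime can be vectorized, and that the vectorized part gets a fixed per-clock speedup. The function names and the example speedup of 2x are hypothetical.

```python
def net_speedup(vector_fraction, vector_speedup, clock_ratio):
    """Overall speedup versus the non-AVX baseline.

    vector_fraction: share of baseline runtime that gets vectorized (0..1)
    vector_speedup:  per-clock speedup of the vectorized share (assumed > 1)
    clock_ratio:     reduced clock / baseline clock (2.4 / 3.2 = 0.75 here)
    Assumes the reduced clock applies to ALL code, vectorized or not.
    """
    scaled_time = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    return clock_ratio / scaled_time


def break_even_fraction(vector_speedup, clock_ratio):
    """Vectorized share needed for the net speedup to reach exactly 1.0."""
    return (1.0 - clock_ratio) / (1.0 - 1.0 / vector_speedup)


if __name__ == "__main__":
    ratio = 2.4 / 3.2  # AVX-512 clock over non-AVX clock, from the text
    for fraction in (0.05, 0.25, 0.50, 0.90):
        print(f"{fraction:.0%} vectorized: {net_speedup(fraction, 2.0, ratio):.2f}x")
    print(f"break-even share: {break_even_fraction(2.0, ratio):.0%}")
```

Under these assumptions, with a hypothetical 2x per-clock gain, the vectorized code has to account for half of the original runtime just to break even; at 5 percent usage the program runs at roughly 0.77x its original speed. Real chips recover their clock after a delay, so the actual loss depends on how often the wide instructions recur.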

Torvalds dives into some of the specific technical issues that make AVX-512 a poor choice, including the “occasional use” use-case that AVX-512 is a very poor fit for. Others in the thread such as David Kanter contest the idea that AVX-512 is a poor use of silicon, pointing out that the instructions are very well-suited to AI and HPC applications. The fragmentation issue, however, is something no one likes.

I agree, wholeheartedly, that fragmentation has hurt AVX-512. Because the space required for its implementation is quite large, there’s basically no reason to ever add it to smaller CPU cores like Atom, which doesn’t even support AVX/AVX2 yet. As for whether it’ll find specific uses outside of AI/ML/HPC applications, we’ll have to wait for Intel to actually ship the feature on consumer CPUs.
