A reader recently contacted us and asked a question worth answering in an article.
How does Windows (and perhaps all OS’s) take advantage of multiple cores? Alternatively, if this function is built into the hardware, how do the cores know which apps to execute, and when? I assume that more cores are better, but how does this work, exactly? And are there ways that one could configure apps/Windows to better take advantage of more cores?
When you turn on a PC, before the OS has even loaded, your CPU and motherboard handshake, for lack of a better term. Your CPU passes certain information about its own operating characteristics over to the motherboard UEFI, which then uses this information to initialize the motherboard and boot the system.
In computer science, a thread is defined as the smallest unit of execution managed by the OS scheduler. If you wanted to make an analogy, you could compare a thread to one step on an assembly line. One level above the thread, we have the process. Processes are computer programs that are executed in one or more threads. In this factory analogy, the process is the entire procedure for manufacturing the product, while each thread is an individual task within it.
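The relationship between one process and its threads can be sketched in a few lines of Python. This is only an illustration: the names `assembly_step` and `run_factory` are hypothetical, and the barrier exists solely to keep all three threads alive at once so their IDs are distinct.

```python
import os
import threading


def assembly_step(name, results, barrier):
    # Wait until every worker thread has started, so all are alive together.
    barrier.wait()
    # Record which process and which thread carried out this step.
    # Each thread writes its own key, so no lock is needed here.
    results[name] = (os.getpid(), threading.get_ident())


def run_factory(step_names):
    """Run each named step in its own thread inside the current process."""
    results = {}
    barrier = threading.Barrier(len(step_names))
    workers = [
        threading.Thread(target=assembly_step, args=(name, results, barrier))
        for name in step_names
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    for step, (pid, tid) in run_factory(["cut", "weld", "paint"]).items():
        print(f"{step}: process {pid}, thread {tid}")
```

Every step reports the same process ID but a different thread ID: one factory, several stations.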
Problem: CPUs can only execute one thread at a time. Each process requires at least one thread. How do we improve computer performance?
Solution: Clock CPUs faster.
For decades, Dennard Scaling was the gift that kept on giving. Moore’s Law declared we’d be able to pack transistors into a smaller and smaller space, but Dennard Scaling is what allowed them to hit higher and higher clock speeds on lower voltages.
If the computer is running quickly enough, its inability to handle more than one thread at a time becomes much less of a problem. While there is a distinct set of problems that cannot be calculated in less time than the expected lifetime of the universe on a classical computer, there are many, many, many problems that can be calculated just fine that way.
As computers got faster, developers created more sophisticated software. The simplest form of multithreading is coarse-grained multithreading, in which the operating system switches to a different thread rather than sitting around waiting for the results of a calculation. This became important in the 1980s, when CPU and RAM clocks began to separate, with memory speed and bandwidth both increasing much more slowly than CPU clock speed. The advent of caches meant that CPUs could keep small collections of instructions nearby for immediate number crunching, while multithreading ensured the CPU always had something to do.
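A toy simulation can make the benefit of switching on a stall concrete. The sketch below assumes a single core, zero switching cost, and made-up workloads expressed as (compute, stall) cycle pairs. The function names are hypothetical, and real hardware and schedulers are far more involved.

```python
def serial_cycles(threads):
    """Cycles needed if each thread runs to completion and the core idles during stalls."""
    return sum(compute + stall for bursts in threads for compute, stall in bursts)


def switch_on_stall_cycles(threads):
    """Cycles needed if the core switches to another ready thread whenever one stalls."""
    pending = [list(bursts) for bursts in threads]
    ready_at = [0] * len(pending)  # cycle at which each thread can run again
    now = 0
    finish = 0
    while any(pending):
        runnable = [i for i, p in enumerate(pending) if p and ready_at[i] <= now]
        if not runnable:
            # Every remaining thread is stalled: the core idles until one wakes up.
            now = min(ready_at[i] for i, p in enumerate(pending) if p)
            continue
        i = runnable[0]
        compute, stall = pending[i].pop(0)
        now += compute                 # the core is busy with this thread
        ready_at[i] = now + stall      # the stall overlaps with other threads' work
        finish = max(finish, ready_at[i])
    return finish


if __name__ == "__main__":
    # Two hypothetical threads: 2 cycles of work, then a 4-cycle memory stall, twice.
    workload = [[(2, 4), (2, 4)], [(2, 4), (2, 4)]]
    print("no switching:   ", serial_cycles(workload))
    print("switch on stall:", switch_on_stall_cycles(workload))
```

The total amount of computation is identical in both cases. The saving comes entirely from doing useful work during time that would otherwise be spent waiting on memory.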
Important point: Everything we’ve discussed so far applies to single-core CPUs. Today, the terms multithreading and multiprocessing are often colloquially used to mean the same thing, but that wasn’t always the case. Symmetric Multiprocessing (SMP) and Simultaneous Multithreading (SMT) are two different things. To put it simply:
SMT = Simultaneous multithreading. The CPU can execute more than one thread simultaneously by scheduling a second thread onto the execution units not currently in use by the first thread. Intel calls this Hyper-Threading Technology; AMD just calls it SMT. Currently, both AMD and Intel use SMT to boost CPU performance. Both companies have historically deployed it strategically, offering it on some products but not on others. These days, the majority of CPUs from both companies offer SMT. In consumer systems, this means you have support for twice as many threads as CPU cores, written as 8C/16T, for example.
SMP = Symmetric multiprocessing. The system contains more than one CPU core (or uses a multi-socket motherboard). Each CPU core executes only one thread at a time, so the number of threads you can execute simultaneously is limited to the number of cores you have. Written as 6C/6T, for example.
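The core/thread shorthand used above is simple arithmetic, sketched here in Python. The helper names are hypothetical. The final line uses the standard `os.cpu_count()` call, which reports the number of logical processors the OS exposes (or `None` if it cannot tell), so on an SMT machine it counts hardware threads rather than physical cores.

```python
import os


def logical_processors(cores, threads_per_core):
    """Hardware threads the OS can schedule onto at the same instant."""
    return cores * threads_per_core


def describe(cores, threads_per_core):
    """Format a CPU in the usual 'cores/threads' shorthand."""
    return f"{cores}C/{logical_processors(cores, threads_per_core)}T"


if __name__ == "__main__":
    print(describe(8, 2))  # an 8-core part with SMT enabled
    print(describe(6, 1))  # a 6-core part without SMT
    print("Logical processors visible to this OS:", os.cpu_count())
```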
Multithreading in a mainstream single-core context used to mean “How fast can your CPU switch between threads,” not “Can your CPU execute more than one thread at the same time?”
“Could your OS please run more than one application at a time without crashing?” was also a frequent request.
Workload Optimization and the OS
Modern CPUs, including the x86 chips built 20 years ago, implement what’s known as Out of Order Execution, or OoOE. All modern high-performance CPU cores, including the “big” smartphone cores in big.LITTLE, are OoOE designs. These CPUs re-order the instructions they receive in real time for optimal execution.
The CPU executes the code the OS dispatches to it, but the OS doesn’t have anything to do with the actual execution of the instruction stream. This is handled internally by the CPU. Modern x86 CPUs both re-order the instructions they receive and convert those x86 instructions into smaller, RISC-like micro-ops. The invention of OoOE helped engineers guarantee certain performance levels without relying entirely on developers to write perfect code. Allowing the CPU to reorder its own instructions also helps multithreaded performance, even in a single-core context. Remember, the CPU is constantly switching between tasks, even when we aren’t aware of it.
The CPU, however, doesn’t do any of its own scheduling. That’s entirely up to the OS. The advent of multithreaded CPUs doesn’t change this. When the first consumer dual-processor board came out (the ABIT BP6), would-be multiprocessing enthusiasts had to run either Windows NT or Windows 2000. The Win9X family did not support more than one processor.
Supporting execution across multiple CPU cores requires the OS to perform all of the same memory management and resource allocation tasks it uses to keep different applications from crashing the OS, with additional guard banding to keep the CPUs from blundering into each other.
A modern multi-core CPU does not have a “master scheduler unit” that assigns work to every core or otherwise distributes workloads. That’s the role of the operating system.
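To illustrate the idea that the operating system, not the CPU, decides which thread runs where, here is a deliberately simplified round-robin scheduler. It is a toy model under stated assumptions: the `schedule` function, the thread names, and the work units are all hypothetical, and the real Windows scheduler is preemptive and priority-based rather than anything this simple.

```python
from collections import deque


def schedule(threads, num_cores, time_slice):
    """Hand out time slices to threads across cores, round-robin.

    threads: dict mapping thread name -> units of work remaining.
    Returns a timeline: one dict per tick mapping core index -> thread name.
    """
    run_queue = deque(threads.items())
    timeline = []
    while run_queue:
        tick = {}
        unfinished = []
        for core in range(num_cores):
            if not run_queue:
                break  # fewer runnable threads than cores: this core idles
            name, remaining = run_queue.popleft()
            tick[core] = name
            remaining -= time_slice
            if remaining > 0:
                unfinished.append((name, remaining))
        # Threads that still have work rejoin the back of the queue.
        run_queue.extend(unfinished)
        timeline.append(tick)
    return timeline


if __name__ == "__main__":
    work = {"a": 2, "b": 1, "c": 1}
    for tick, assignment in enumerate(schedule(work, num_cores=2, time_slice=1)):
        print(f"tick {tick}: {assignment}")
```

In this run, thread "a" executes on core 0 during the first tick and on core 1 during the second. The cores simply run whatever they are handed, and the bookkeeping lives entirely in the scheduler.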
Can You Manually Configure Windows to Make Better Use of Cores?
As a general rule, no. There have been a handful of specific cases in which Windows needed to be updated in order to take advantage of the capabilities built into a new CPU, but this has always been something Microsoft had to perform on its own.
The exceptions to this rule are few and far between:
New CPUs sometimes require OS updates in order for the OS to take full advantage of the hardware’s capabilities. In this case, there’s not really a manual option, unless you mean manually installing the update.
The AMD Threadripper 2990WX is something of an exception to this rule. The CPU performs quite poorly under Windows because Microsoft didn’t contemplate the existence of a CPU with more than one NUMA node, and Windows doesn’t utilize the 2990WX’s resources very well. In some cases, there are demonstrated ways to improve the 2990WX’s performance through manual thread assignment, though I’d frankly recommend switching to Linux if you own one, just for general peace of mind on the issue.
The 3990X is an even more theoretical outlier. Because Windows 10 limits each processor group to 64 logical processors (threads), and the 3990X exposes 128 of them, you can’t devote more than 50 percent of the 3990X’s execution resources to a single workload unless the application implements a custom scheduler. This is why the 3990X isn’t really recommended for most applications. It works best with renderers and other professional apps that have taken this step.
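The 50 percent figure falls straight out of the group limit. The sketch below works through the arithmetic for the 3990X, a 64-core, 128-thread part. The function names are hypothetical, and the group count is the minimum implied by the 64-thread limit, not a statement of how Windows actually lays out groups on a given machine.

```python
import math

GROUP_LIMIT = 64  # logical processors per Windows processor group


def processor_groups(logical_processors, group_limit=GROUP_LIMIT):
    """Minimum number of processor groups needed to cover every logical processor."""
    return math.ceil(logical_processors / group_limit)


def share_within_one_group(logical_processors, group_limit=GROUP_LIMIT):
    """Fraction of the CPU reachable by an application confined to a single group."""
    return min(logical_processors, group_limit) / logical_processors


if __name__ == "__main__":
    threads_3990x = 64 * 2  # 64 cores with SMT enabled
    print("groups needed:", processor_groups(threads_3990x))
    print("share per group:", share_within_one_group(threads_3990x))
```

A 16-thread desktop chip fits in one group and is unaffected, which is why this limit only matters at the very top of the core-count range.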
Outside of the highest core-count systems, where some manual tuning could theoretically improve performance because Microsoft hasn’t really optimized for those use-cases yet, no, there’s nothing you can do to really optimize how Windows divides up workloads. To be honest, you really don’t want there to be. End users shouldn’t need to be concerned with manually assigning threads for optimum performance, because the optimum configuration will change depending on which tasks the CPUs are processing in any given moment. The long-term trend in CPU and OS design is towards closer cooperation between the CPU and operating system in order to better facilitate power management and turbo modes.
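For the curious, this is roughly what manual assignment looks like from a program’s point of view. The sketch uses Python’s `os.sched_setaffinity`, which exists only on some platforms (Linux, notably) and is absent from Windows builds of Python, where the equivalent is Task Manager’s affinity setting or a Win32 call. The helper name `pin_to_cpus` is hypothetical, and this is an illustration rather than a tuning recommendation.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the current process to the given logical CPUs where supported.

    Returns the resulting set of allowed CPUs, or None if this platform's
    Python build does not expose affinity control.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. Windows and macOS builds of Python
    allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
    target = set(cpus) & allowed        # never request CPUs we cannot use
    if not target:
        return allowed                  # nothing valid requested: leave as-is
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print("allowed CPUs after pinning:", pin_to_cpus([0, 1]))
```

Note that pinning only narrows where the OS may place the process. The scheduler still decides when each thread actually runs.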
Editor’s Note: Thanks to Bruce Borkosky for the article suggestion.