AI is everywhere these days. SoC vendors are falling over themselves to bake AI capabilities into their products. From Intel and Nvidia at the top of the market to Qualcomm, Google, and Tesla, everyone is talking about building new chips to handle various workloads related to artificial intelligence and machine learning.
While these companies have shown their own products racking up impressive scores in various tests, there’s a need for tools that independent third parties can use to compare and evaluate chips. MLPerf is a joint effort between dozens of developers, academics, and interested individuals from a number of companies and organizations involved in various aspects of AI/DL. The goal of the project is to create a test framework that can be used to evaluate a huge range of potential products and use-cases.
In this case, we’re discussing inferencing, not training. Inferencing is the application of a model to a task. Training refers to creating the model in the first place. Models are typically trained on high-end PCs, servers, or the equivalent of an HPC cluster, while inferencing workloads can run on anything from a cell phone to a high-end server.
According to David Kanter, who co-chairs development of the inference benchmark, MLPerf’s design team has settled on four key scenarios that they evaluate. Edge devices, like cell phones, focus on reading data from one stream at a time. These devices emphasize low latency and are benchmarked against those criteria. The next class of products reads multiple streams of data at once, like a Tesla with its eight separate cameras. In these cases, the ability of the system to handle all of the data streams in question within acceptable latencies becomes important.
On the backend, there’s the question of whether or not the server can maintain an adequate number of queries per second within a defined response envelope, while the “Offline” test is intended for tasks like photo sorting that don’t have a time dimension attached to them. These four tasks conceptually bracket the areas where inferencing is expected to be used.
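To make the difference between the four scenarios concrete, here is a minimal Python sketch of how each one might be scored. This is an illustration only, not MLPerf's actual harness: the function names, the nearest-rank percentile method, and the default percentiles (90 and 99) are all assumptions made for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def single_stream_metric(latencies_ms, pct=90):
    """Single stream: score is a tail latency (percentile is an assumption)."""
    return percentile(latencies_ms, pct)


def multi_stream_ok(per_stream_latencies_ms, bound_ms, pct=99):
    """Multiple streams: every stream must stay within the latency bound."""
    return all(
        percentile(stream, pct) <= bound_ms
        for stream in per_stream_latencies_ms
    )


def server_ok(latencies_ms, duration_s, target_qps, bound_ms, pct=99):
    """Server: sustain the target queries per second within the bound."""
    achieved_qps = len(latencies_ms) / duration_s
    return achieved_qps >= target_qps and percentile(latencies_ms, pct) <= bound_ms


def offline_throughput(num_samples, duration_s):
    """Offline: no latency constraint, only samples processed per second."""
    return num_samples / duration_s
```

The point of the sketch is that the first three scenarios are judged on latency in some form, while the offline case cares only about raw throughput.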
MLPerf will allow users to submit benchmark results in both a closed and an open division. The closed division will have stricter rules around its submissions and tests, while the open division will allow for more experimentation and customization. One small feature I particularly like is how MLPerf results are categorized. There are three categories: Available, Preview, and RDO (Research, Development, Other). Available means you can buy hardware today, Preview means it’ll be available within 180 days or before the next release of MLPerf, and RDO is for prototypes and systems not intended for general production. This type of distinction will allow users to understand which hardware platforms should be compared.
The results in the Closed performance division span a huge range of solutions. The fact that these results differ by up to 10,000x was quite a challenge. Kanter told us this made the inferencing benchmark particularly tricky: MLPerf needed to devise tests that could scale over four orders of magnitude. My colleague David Cardinal has referred to some of the specific Nvidia comparisons, but I wanted to highlight the larger release of the information set.
To give you an idea of how difficult this is: When Maxon released a new version of Cinebench, they did so partly because the old Cinebench scene took so little time to render that it was no longer a meaningful CPU test. The new CB20 test takes more time to render than CB15 did. But this makes it quite annoying to run the single-core version of the test, which now takes significantly longer. That longer delay is irritating when running a test from, say, 1-16 cores. Trying to build a benchmark that can complete in an acceptable amount of time across chips separated by a vastly larger performance gap is a difficult prospect.
I’m excited to see projects like MLPerf under development and I’m looking forward to working with the project when it’s ready for public release. MLPerf has a large development team from around the globe and a blend of academic and commercial support. It’s not an effort controlled by any single company and it wasn’t designed to favor any single company’s products. As AI and ML-enabled hardware filters to market, we’re going to need tests that can evaluate its capabilities. So far, MLPerf looks like one of the best.