[wp_tech_share]

The 3rd AI Hardware Summit took place virtually earlier this month and it was exciting to see how quickly the ecosystem has evolved and to learn of the challenges the industry has to solve in scaling artificial intelligence (AI) infrastructure. I would like to share highlights of the Summit, along with other notable observations from the industry in the area of accelerated computing.

The proliferation of AI has emerged as a disruptive force, enhancing applications such as image and speech recognition, security, real-time text translation, autonomous driving, and predictive analytics. AI is driving the need for specialized solutions at the chip and system level in the form of accelerated compute servers optimized for training and inference workloads at the data center and the edge.

The Tier 1 Cloud service providers in the US and China lead the way in the deployment of these accelerated compute servers. While these deployments still occupy a fraction of the Cloud service providers’ overall server footprint, this market is projected to grow at a double-digit compound annual growth rate over the next five years. Most accelerated server platforms shipped today are based on GPUs and FPGAs from Intel, Nvidia, and Xilinx, but the number of new entrants, especially for the edge AI market, is growing.

However, these Cloud service providers, or enterprises deploying AI applications, simply cannot scale the number of these accelerated compute servers without addressing bottlenecks at the system and data center level. I have identified some notable technology areas that need to be addressed to advance the proliferation of AI:

    • Rack Architecture: We have observed a trend of these accelerated processors shifting from a distributed model (i.e., one GPU in each server) to a centralized model consisting of an accelerated compute server with multiple GPUs or accelerated processors. These accelerated compute servers have demanding thermal dissipation requirements, oftentimes requiring unique solutions in form-factor, power distribution, and cooling. Some of these systems are liquid-cooled at the chip level, as we have seen with the Google TPU, while more innovative solutions such as liquid immersion cooling of entire systems are being explored. As these accelerated compute servers become more centralized, resources are pooled and shared among many users through virtualization. NVIDIA’s recently launched A100 Ampere GPU takes virtualization a step further, allowing a single A100 to be partitioned into as many as seven GPU instances.
    • CPU: The GPU and other accelerated processors are complementary to, and are not intended to replace, the CPU for AI applications. The CPU can be viewed as the taskmaster of the entire system, managing a wide range of general-purpose computing tasks, with the GPU and other accelerated processors performing a narrower range of more specialized tasks. The number of CPUs also needs to be balanced with the number of GPUs in the system; adequate CPU cycles are needed to run the AI application, while sufficient GPU cores are needed to parallel-process large training models. Successive CPU platform refreshes, whether from Intel or AMD, are better optimized for inference frameworks and libraries, and support higher I/O bandwidth within and out of the server.
    • Memory: My favorite session from the AI Hardware Summit was the panel discussion on memory and interconnects. During that session, experts from Google, Marvell, and Rambus shared their views on how memory performance can limit the scaling of large AI training models. The abundance of data that must be processed in memory for large training models is driving demand for ever-greater memory capacity on these accelerated compute servers. More memory capacity means more modules and interfaces, which ultimately degrades chip-to-chip latencies. One proposed solution is the use of 3D stacking to package chips closer together. High Bandwidth Memory (HBM) also helps to minimize the trade-off between memory bandwidth and capacity, but at a premium cost. Ultimately, the panel agreed that there needs to be an optimal balance between memory bandwidth and capacity within the system, while adequately addressing thermal dissipation challenges.
    • Network Connectivity: As these accelerated compute nodes become more centralized, a high-speed fabric is needed to ensure the flow of huge amounts of unstructured AI data over the network to accelerated compute servers for in-memory processing and training. These connections can be server-to-server as part of a large cluster, using NVIDIA’s NVLink and InfiniBand (which NVIDIA acquired with Mellanox). Ethernet, now available at up to 400 Gbps, is an ideal choice for connecting storage and compute nodes within the network fabric. I believe that these accelerated compute servers will be the most bandwidth-hungry nodes within the data center, and will drive the implementation of next-generation Ethernet. Innovations such as Smart NICs could also be used to minimize packet loss, optimize network traffic for AI workloads, and enable the scaling of storage devices within the network using NVMe over Fabrics.
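To put the panel’s memory concern in perspective, here is a rough back-of-envelope sketch of my own. It assumes plain fp32 training with an Adam-style optimizer (about 16 bytes of state per parameter: weights, gradients, and two optimizer moments) and ignores activation memory, which often dominates in practice; the 1.5-billion-parameter model is purely hypothetical.

```python
def training_memory_gb(num_params, bytes_per_param=16):
    """Rough training-state footprint assuming fp32 Adam:
    4 B weights + 4 B gradients + 8 B optimizer moments per parameter.
    Activation memory, often the larger term, is ignored here."""
    return num_params * bytes_per_param / 1e9

# A hypothetical 1.5-billion-parameter model carries ~24 GB of
# training state alone, already stretching a single HBM-equipped GPU.
print(f"{training_memory_gb(1.5e9):.0f} GB")
```

Even this simplified estimate shows why the panelists see 3D stacking and HBM as necessary, if costly, levers for scaling training.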
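The bandwidth argument can also be made concrete with simple arithmetic. The sketch below is mine, not from the Summit; the 10 TB dataset size and the 90% effective-throughput figure are assumptions.

```python
def transfer_seconds(data_gb, link_gbps, efficiency=0.9):
    """Time to move a dataset over an Ethernet link, assuming ~90%
    effective throughput after protocol overhead (an assumption)."""
    return (data_gb * 8) / (link_gbps * efficiency)

# Feeding a hypothetical 10 TB training dataset to a compute node:
for speed_gbps in (100, 400):
    print(f"{speed_gbps} GbE: {transfer_seconds(10_000, speed_gbps):.0f} s")
```

Roughly 15 minutes at 100 GbE shrinks to under 4 minutes at 400 GbE, which is why I expect these accelerated servers to drive next-generation Ethernet adoption.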

I anticipate that specialized solutions in the form of accelerated computing servers will scale with the increasing demands of AI, and will comprise a growing portion of the data center capital expenditures. Data centers could benefit from the deployment of accelerated computing, and would be able to process AI workloads more efficiently with fewer, but more powerful and denser accelerated servers. For more insights and information on technology drivers shaping the server and data center infrastructure market, take a look at our Data Center Capex report.

[wp_tech_share]

Unlike many enterprises that have been migrating their IT infrastructure to the Public Cloud in recent years, the financial sector continues to be an important contributor to purchasing and deploying IT, which includes data center infrastructure such as servers, storage, and networking equipment. Last year, we estimated that roughly 24% of the on-premise enterprise server installed base among Fortune 2000 firms was attributable to the financial sector, with high-frequency trading a major use case. We expect the financial sector to gain share in the on-premise enterprise server installed base, as traditional enterprises in the areas of oil & gas, offline retail, travel and hospitality, and healthcare are projected to reduce spending on IT this year due to a weak economic outlook. Growth in those traditional enterprise industries tends to mirror that of the broader economy, and could face a sluggish road to recovery ahead.

On Premise Server Unit Installed Base by Fortune 2000 Industry Segment (2019)

 

Other observations related to data center capex trends in the financial sector:

  • Data Center IT is moving to the Public Cloud (i.e., AWS, Microsoft Azure, and Google Cloud), and the pandemic has accelerated Cloud adoption as companies prefer an operating expenditure-based IT consumption model to conserve capital in uncertain economic times, along with the ease with which cloud applications can be provisioned to a remote workforce. However, we expect the financial sector to be one of the last enterprise sectors to move its workloads to the Cloud. Applications such as high-frequency trading will continue to be owned and operated by these enterprises due to proximity requirements and the need for an ultra-low-latency network. Regulations and data sovereignty requirements also dictate that certain data and workloads remain in on-premise data centers (and not move to the Cloud). We estimate that over 80% of the financial sector’s workloads have remained on-premise, versus 20% outsourced to the Cloud.
  • Some of these major financial institutions, such as JP Morgan and Goldman Sachs, own and operate their own data centers, which are complemented by colocation data centers operated by providers such as Equinix. In the first half of 2020, we estimate a 2% average Y/Y growth in revenue generated by the financial sector among the colocation providers, versus an estimated 5% average Y/Y growth in 2019. While the 1H20 growth rate is modest, the financial sector is expected to outperform other enterprise sectors when it comes to colocation infrastructure spending this year.
  • The large financial institutions are innovators of server and data center architecture. While some have offloaded certain requirements to the Public Cloud, they continue to invest in their own data centers. Some of the large firms operate hundreds of thousands of servers and can benefit from economies of scale similar to those of a Cloud service provider. The more servers these firms have, the more transactions and customers they can support. While this segment still purchases IT equipment from branded server and storage system vendors such as HPE, Dell, and IBM, some of the more sophisticated firms have customized the servers in their data centers in an effort to streamline their architecture. They have also deployed accelerators such as Smart NICs (network interface cards) to reduce the network latency critical for high-frequency trading and to ensure secure connections. Specialized servers equipped with accelerators such as GPUs are also deployed for AI-centric applications pertaining to predictive analysis and risk management computations.

For additional insights and our outlook for spending on servers and other data center equipment by Cloud and Enterprise segments, please check out our Data Center Capex report.

[wp_tech_share]

At this week’s NVIDIA GPU Technology Conference, the company announced its groundbreaking Bluefield-2X DPU (Data Processing Unit), which combines key elements of the existing Bluefield DPU and NVIDIA’s Ampere GPU. The new DPU enables the use of AI to perform real-time security analytics and identify abnormal traffic and malicious network activities. The building blocks NVIDIA needs to enable high-performance and AI workloads in the data center and edge are coming together—this includes the GPU for application-specific workloads, the DPU to facilitate data I/O processing, and finally, the CPU for compute, as NVIDIA seeks to complete its acquisition of ARM.

The DPU, a term coined by NVIDIA, falls within Dell’Oro Group’s definition of the Smart NIC market that we track in the Ethernet Controller and Adapter research. Smart NICs are fully programmable network interface cards designed to accelerate key data center security, networking, and storage functions, freeing up valuable CPU cores to run business-oriented applications.

According to Dell’Oro Group’s latest forecast, the Smart NIC market will grow at a 26% compound annual growth rate, from $270 M in 2020 to $848 M by 2024, vastly outpacing the overall Ethernet controller and adapter market. We believe this new class of Smart NIC with integrated AI that NVIDIA has introduced could be disruptive, expanding the range of applications available to Smart NICs.
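For readers who want to check the arithmetic, the CAGR formula is straightforward. Note that the endpoints above reproduce the ~26% figure when treated as five compounding periods (e.g., measured from a 2019 base year); that interpretation is mine, not from the report.

```python
def cagr(start, end, periods):
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / periods) - 1

# $270 M growing to $848 M over five compounding periods:
print(f"{cagr(270, 848, 5):.0%}")  # → 26%
```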

As Cloud and Enterprise data centers continue to scale, we believe that Smart NICs could be a solution in achieving high network throughput, low latency, and minimal packet loss demanded by emerging applications such as high-performance computing and AI. However, there are some notable constraints vendors would need to address before we see stronger adoption of Smart NICs:

  • The scalability of Smart NICs depends on key considerations such as price and power consumption. While we are still awaiting details from NVIDIA, I believe that the inclusion of both the ARM processor and the GPU in the Bluefield-2X DPU could result in higher unit cost and power consumption compared to alternatives. NVIDIA announced that future generations of the Bluefield, such as the Bluefield-4X DPU, will have integrated ARM and GPU cores, which could result in total cost of ownership improvements.
  • Most Smart NICs shipped today are based on the ARM architecture, including the NVIDIA Bluefield DPU family. However, other Smart NIC vendors such as Intel, Napatech, and Xilinx have released FPGA-based solutions that have demonstrated benefits in adapting to a wide range of applications, including AI inferencing. I predict that both ARM-based and FPGA-based solutions will coexist and be optimized for different use cases.
  • Extensive engineering resources and lead-time are required to bring Smart NICs to market. While the major Cloud service providers have the engineering resources devoted to Smart NIC application development, the smaller Cloud service providers and enterprises do not. It is crucial for vendors to provide value-added services and application development toolkits to customers for software implementation. NVIDIA announced the availability of the NVIDIA DOCA SDK, which enables developers to rapidly create applications and services on top of the Bluefield DPU.

Ultimately, customers will need to find a balance between the benefits Smart NICs could bring and the aforementioned considerations. However, I am excited that the vendor community is bringing new innovations to the market that could give customers more choices to implement the solutions that fit their requirements. As servers continue to become more commoditized over time, Smart NICs could shift more control of the data center architecture back to the customers.

[wp_tech_share]

 

We just finished our first Data Center Capex 5-Year Forecast Report (July 2020). Below are some highlights from the report. If you need to access the full report, please contact us at dgsales@delloro.com.

Data center capex, which includes capex for servers and other data center infrastructure equipment, is forecasted to grow at a 6% CAGR to just over $200 B over the next five years. Growth is forecasted to be mixed depending on the customer segment. The Cloud, which already accounts for more than 60% of the worldwide data center capex, will continue to gain momentum over Enterprise/On-premise data center deployments. Edge data centers deployed over Telco networks could emerge in the longer-term horizon.

Capex on servers, which generally accounts for nearly half of the data center capex, may be influenced by the following factors:

    • Change in server unit demand from Cloud capacity and digestion cycles.
    • Market volatility of commodity pricing of components such as memory.
    • Server refresh cycles, which could prompt the replacement of aged servers, drive new deployments, and impact server architecture and pricing.

Servers also drive the demand for auxiliary data center infrastructure equipment such as networking switches, storage systems, and facilities.

The COVID-19 pandemic is expected to profoundly disrupt global demand for data center infrastructure equipment in 2020. Impacted vertical industries, especially brick-and-mortar retail, travel, hospitality, and small and medium enterprises, have seen a pull-back in IT spending as they wait for the business climate to stabilize. As enterprises seek to conserve capital, Public Cloud, which offers a flexible and consumption-based infrastructure, could help meet the growing demands of remote work and distance learning. The COVID-19 pandemic and the ensuing recession may have the long-lasting effect of accelerating the permanent migration of certain industries and workloads to the Cloud.

Market and Technology Trends to Watch Out For

  • The Top 4 U.S. Cloud service providers—Amazon, Facebook, Google, and Microsoft—are positioned to continue their momentum of expansion over the next five years. Servers will continue to be consolidated in fewer mega Cloud data centers that could potentially provide greater capacity than the same number of servers spread out across thousands of Enterprise data centers.
  • The Top 4 U.S. Cloud service providers have been prolonging the useful life of servers in an effort to lower server depreciation expense while maintaining high efficiencies and reliability of their server fleet.
  • The Intel server processor refresh cycles have historically influenced IT spending. While the major Cloud service providers typically ramp server capacity outside of the processor refresh cycle, the upcoming Intel 10 nm Whitley server platform refresh due later this year could generate an uplift on server spending. Viable alternatives to Intel processors, AMD EPYC and ARM, for server and storage system applications are starting to materialize in certain markets.
  • Various open-source organizations have come together to share and standardize best practices in the design of efficient, scalable, and sustainable data center infrastructure. The Open Compute Project (OCP), in particular, has introduced various technological innovations in the areas of server and server connectivity, rack architecture, and networking switches, which could shape the future development of data center infrastructure.

To learn more about the COVID-19 impact on Data Center Infrastructure and Server spending, please click here to watch my latest video.

About the Data Center Capex 5 Year Forecast Report

Dell’Oro Group’s Data Center Capex 5-Year Forecast Report details the data center infrastructure capital expenditures of each of the ten largest Cloud service providers, as well as the Rest-of-Cloud, Telco, and Enterprise customer segments. Allocation of the data center infrastructure capex for servers, storage systems, and other auxiliary data center equipment is provided. The report also discusses the market and technology trends that can shape the forecast. Highlights from the Server and Storage System Report (now discontinued) were transitioned to this report. Click here to learn more about the report or contact us (dgsales@delloro.com) for the full report.

 

Related video about Data Center Capex:

Sign up to Dell’Oro Analyst Talk channel at BrightTalk to watch the full video

Analyst Talk - COVID-19 Impact on Data Center Infrastructure Capex
| 11 min watch |
[wp_tech_share]

Since 2017, no fewer than ten vendors have launched Smart Network Interface Cards, or Smart NICs. The Smart NIC market is projected to become a $600 M market by 2024, or 23% of the total Ethernet adapter market. Vendors have developed or are developing innovative solutions to gain entry into the expanding Cloud data center market and the emerging telco edge market.

Smart NICs, with an on-board processor, can provide a wide range of offload benefits in the following scenarios:

  • For a Public Cloud service provider operating a large-scale data center, Smart NICs could free up valuable CPU cores to run business applications for the end-user, potentially enabling higher server utilization.
  • Smart NICs are offered to meet a wide range of offloads, including transport and storage protocol offloads such as RoCE, TCP, and NVMe over Fabrics.
  • Certain classes of Smart NICs are programmable and can be tailored for a wide range of applications and retooled to meet new requirements.
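As an illustration of the first point, the utilization benefit can be modeled crudely: if a fraction of a server’s cores is consumed by networking, storage, and security housekeeping, offloading that work to the NIC returns those cores to tenants. The 64-core server and the 25% overhead fraction below are hypothetical figures for illustration only.

```python
def cores_freed(total_cores, infra_overhead_frac):
    """Cores returned to revenue-generating workloads once a Smart NIC
    absorbs infrastructure tasks (overhead fraction is an assumption)."""
    return total_cores * infra_overhead_frac

# A hypothetical 64-core Cloud server spending 25% of its cycles on
# infrastructure tasks gets ~16 cores back for customer workloads.
print(cores_freed(64, 0.25))
```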

Smart NICs, however, are not without drawbacks and the following areas would need to be addressed before we see broader adoption:

  • Smart NICs are priced at a significant premium over standard NICs. This price premium can be 5-10X for the same port speed and needs to come down, especially for volume production.
  • Smart NICs can draw anywhere from 20W up to 80W of power, which is not inconsequential on a per-server basis.
  • Given the programmability and complexity of Smart NICs, they can consume significant engineering resources to develop and debug, resulting in a lengthy and costly implementation.
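Weighing the price and power drawbacks above lends itself to a simple per-NIC cost-of-ownership sketch. The figures below (a 7X price premium, 60 W draw, $0.10/kWh electricity, a PUE of 1.5, and a four-year life) are my illustrative assumptions, anchored only loosely to the ranges cited above.

```python
def nic_tco(unit_price, watts, years=4, usd_per_kwh=0.10, pue=1.5):
    """Per-NIC cost of ownership: purchase price plus electricity,
    scaled by an assumed data center PUE of 1.5."""
    hours = years * 365 * 24
    energy = watts / 1000 * hours * usd_per_kwh * pue
    return unit_price + energy

standard = nic_tco(unit_price=100, watts=15)  # hypothetical standard NIC
smart = nic_tco(unit_price=700, watts=60)     # 7X premium, mid-range power
print(f"standard: ${standard:.0f}, smart: ${smart:.0f}")
```

Under these assumptions, the Smart NIC’s lifetime cost is still dominated by its purchase price, which is why I expect unit pricing, more than power, to gate volume adoption.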

Given the above considerations, the major Cloud service providers and IC vendors have developed Smart NIC adapters based on different IC solutions: ARM-based SoCs, field-programmable gate arrays (FPGAs), and custom ASICs. Each of these solutions offers varying degrees of offload and programmability. In general, ARM-based SoCs and FPGAs are fabricated with programmable cores and can be adapted to a wide range of applications. However, the drawback of this programmability is the greater extent of engineering resources and lead time needed to bring products to market. Custom ASICs tend to be hard-coded, with customization generally limited to vendor-provided application tool sets. As products start to ramp and the market reaches consensus on product definition, we expect the following three categories of Ethernet adapters to emerge: 1) traditional or standard NICs, 2) non-programmable Smart NICs that are ASIC-based, and 3) programmable Smart NICs that are ARM- or FPGA-based.

Network IC vendors that either currently have Smart NIC adapters or are planning to launch one offer a wide range of solutions and target market segments. Notable vendors include Broadcom, Ethernity Networks, Intel, Marvell, Mellanox, Napatech, Netronome, Pensando, and Xilinx. The major Cloud service providers have developed their own solutions, further fragmenting the market.

Our long-term outlook of the Smart NIC market by market segments is as follows:

  • Top 4 U.S. Cloud: In 2019, the Top 4 U.S. Cloud service providers, with Amazon in particular, drove more than 90% of the Smart NIC market by port shipments. This may be a challenging market for Ethernet adapter vendors to enter given that some of these Cloud service providers are likely to continue to develop their own solutions.
  • Other Cloud: These segments include Chinese Cloud service providers, such as Alibaba and Tencent, and Tier 2 Cloud service providers such as Apple and Oracle. As these companies scale data center capacity higher, Smart NICs could be a solution to enable higher utilization. These companies may not necessarily have the resources to develop their own Smart NICs, and are likely to seek 3rd party solutions from adapter vendors.
  • Telco Operators: This segment is increasingly looking to shift core network services to run on x86 servers; thus, Smart NICs could be used to offload network function virtualization. Certain adapter vendors are also targeting the emerging edge computing market, as Smart NICs are complementary to multi-access edge computing (MEC) nodes that must satisfy low-latency requirements.
  • Enterprise: Generally, enterprise data centers tend to operate at a smaller scale, and would have less incentive to maximize utilization. Many enterprises also rely on vendors to provide a solution with software implementation in place. Certain workloads, mainly enterprise storage arrays, are being developed with Smart NICs to facilitate NVMe-over-Fabrics connectivity.

As the Smart NIC market continues to evolve, we believe the success of each vendor depends on whether its solution is a worthwhile upgrade over standard NICs from a performance, price, power consumption, and implementation perspective in its respective target markets. In 2020, activity levels are high, as vendors work with end-users to complete product evaluation cycles. We expect to see volume ramp from a greater mix of vendors next year, as some of the shortcomings mentioned above are addressed and the benefits of Smart NICs are realized.