As pandemic-related headwinds started to ease, we were optimistic about a return to higher growth in data center infrastructure spending in 2021. The Cloud was entering an expansion cycle, and demand signals in the Enterprise were gaining momentum. Data center capex did grow 9% in 2021, in line with our prior projections, but growth was driven mainly by the higher cost of data center equipment rather than by unit volume. Server unit shipments, flat for the year, were constrained by component shortages and long lead times, and deliveries of networking and physical infrastructure equipment faced a mounting backlog. Furthermore, higher supply chain costs—from increased commodity, expedite, and logistics expenses—pushed up system prices. Our 2022 outlook is more optimistic: we project data center capex growth of 17%, accompanied by double-digit growth in server unit shipments. We identify the following key trends that could shape the dynamics of data center capex in 2022.
Hyperscale Cloud on Expansion Cycle
The Top 4 Cloud service providers—Amazon, Google, Meta (formerly Facebook), and Microsoft—are expected to increase data center capex by over 30% in 2022. Investments will go towards the replacement of aged servers, increased deployment of accelerated computing, and servers for new data centers in more than 30 regions scheduled to launch in 2022. Furthermore, infrastructure planned last year but not deployed due to extended equipment lead times has created an additional growth tailwind as those deliveries are fulfilled in 2022.
Supply Chain Stabilizing
Generally, the major Cloud service providers have weathered this tough supply chain climate better than the rest of the market: with strong visibility into their demand, they can proactively increase inventory levels of crucial components and build redundancy into their supply chains. Data center capex growth among Tier 2 and 3 Cloud service providers and the Enterprise, on the other hand, has been supply-constrained. There is some consensus that supply chain disruptions are starting to stabilize and could ease by the second half of 2022. Lead times for servers could improve sooner than for other data center equipment, such as networking, given servers’ relatively larger scale and lower product mix.
Metaverse Could Drive Opportunities In AI Infrastructure
Some of the major Cloud service providers, such as Apple, Meta, Microsoft, and Tencent, have announced plans to enrich their metaverse offerings for both enterprise and consumer applications. This will require increased investment in new infrastructure, such as servers with accelerated co-processors, low-latency networking, and enhanced thermal management solutions. Chip manufacturers and major Cloud service providers will be developing specialized processors for AI applications, and the ecosystem will need to evolve to enable the community of AI application developers to broaden the reach of AI into enterprises. AI infrastructure is costly and will be a major capex driver. For instance, we estimate that the cost of AI infrastructure is largely responsible for Meta’s plan to increase capex by approximately 60% this year.
New Server Architectures On The Horizon
Intel is releasing a new processor platform, Sapphire Rapids, later this year. Sapphire Rapids will feature the latest server interconnect technologies, such as PCIe 5, DDR5, and, more importantly, CXL. These new high-speed interfaces could alleviate system bandwidth constraints, enabling more processor cores and memory to be packed into a single server. CXL would enable memory sharing between the CPU and other co-processors within the server and rack, allowing data-intensive applications such as AI to access memory more efficiently and at lower latencies. AMD and ARM are expected to incorporate these new interfaces in their processor platforms as well. We expect these enhancements to kick off a multi-year cycle of new server architecture development.
Let’s Not Forget About Server Connectivity
Last but not least, server connectivity will also need to evolve continuously so that the link between the server and the rest of the network does not become a bottleneck. The hyperscale Cloud service providers have been deploying in production the latest generation of network interface cards (NICs), based on 56 Gbps PAM-4 SerDes, at up to 100 Gbps for general-purpose workloads and up to 200 Gbps for advanced workloads such as AI. The Enterprise is fully embracing 25 Gbps NICs, and we anticipate that 25 Gbps port shipments will overtake 10 Gbps later this year. Smart NICs, or data processing units (DPUs), are being deployed by the major Cloud service providers across their infrastructure to improve server utilization and to accelerate latency-sensitive applications such as AI. Outside of the hyperscalers, Smart NIC adoption is still in its nascent stage. However, given that most network adapter vendors now have a Smart NIC solution on the market, enterprises have a wide range of choices to fit their applications and budgets.
The Nvidia GTC Fall 2021 virtual event I attended last week highlighted some exciting developments in the field of AI and machine learning, most notably, in new applications for the metaverse. A metaverse is a digital universe created by the convergence of the real world and a virtual world abstracted from virtual reality, augmented reality, and other 3D visual projections.
Several leading Cloud service providers recently laid out their visions of the metaverse. Facebook, which changed its name to Meta to align its focus on the metaverse, envisions people working, traveling, and socializing in virtual worlds. Microsoft already offers holograms and mixed-reality on its Microsoft Mesh platform and announced plans to bring holograms and virtual avatars to Microsoft Teams next year. Tencent recently shared its metaverse plan to leverage its strengths in multiplayer gaming on its social media platform.
In order to recreate an accurate virtual representation of the real world, massive amounts of AI training data would need to be acquired, captured, and processed. This would stretch the limits of the compute infrastructure. During GTC, Nvidia highlighted various solutions in three areas that could help pave the way for the proliferation of the metaverse in the near future:
- Compute Architecture: During the Q&A session, I asked Nvidia CEO Jensen Huang how the data center would need to evolve to meet the needs of the metaverse. Jensen emphasized that computer vision, graphics, and physics simulation would need to converge in a coherent architecture and be scaled out to millions of people. In a sense, this would be a new type of computer, a fusion of various disciplines with the data center as the new unit of computing. In my view, such an architecture would be composed of a large cluster of accelerated servers with multiple GPUs within a network of tightly coupled, general-purpose servers. The servers would run applications and store massive amounts of data. Memory-coherent interfaces, such as CXL, NVLink, or their future iterations, offered on x86- and ARM-based platforms, would enable memory sharing across racks and pods. These interfaces would also improve connectivity between CPUs and GPUs, reducing system bottlenecks.
- Network Architecture: As the unit of computing continues to scale, new network architectures will need to be developed. During GTC, Nvidia introduced Quantum-2, a networking solution that pairs 400 Gbps InfiniBand with the Bluefield-3 DPU (data processing unit) Smart NIC. This combination will enable high-throughput, low-latency networking in a dense, tightly coupled cluster scaling up to the one million nodes needed for metaverse applications. At 400 Gbps, it offers the fastest server access speed available today, which could double to 800 Gbps within several years. The ARM processor in the Bluefield DPU can access the network interface directly, bypassing the host CPU and benefiting time-sensitive AI workloads. Furthermore, we can expect these scaled-out computing clusters to be shared across multiple users; with a Smart NIC such as the Bluefield DPU, isolation can be provided among users, thereby enhancing security.
- Omniverse: The compute and network infrastructure can only be effectively utilized with a solid software development platform and ecosystem in place. Nvidia’s Omniverse provides the platform that enables developers and enterprises to create and connect virtual worlds for various use cases. During GTC, Jensen described how the Omniverse could be applied to build a digital twin of an automotive factory, with the manufacturing process simulated and optimized by AI; the twin would later serve as the blueprint for the physical construct. Potential applications range from education to healthcare, retail, and beyond.
We are still in the initial developmental stages of the metaverse; the technology building blocks and ecosystem are still coming together. Furthermore, as we have seen recently with certain social media platforms and the gaming industry, new regulations could emerge to reset the boundaries between the real and virtual worlds. Nevertheless, I believe that the metaverse has the potential to unlock new use cases for both consumers and enterprises and to drive investments in data center infrastructure in the Cloud and Enterprise. To access the full Data Center Capex report, please contact us at dgsales@delloro.com.
Dell’Oro Group projects that spending on accelerated compute servers targeted at artificial intelligence (AI) workloads will grow at a double-digit rate over the next five years, outpacing other data center infrastructure. An accelerated compute server, equipped with accelerators such as GPUs, FPGAs, or custom ASICs, can generally handle AI workloads with much greater efficiency than a general-purpose server (one without accelerators). Numerically, these servers still represent only a fraction of Cloud service providers’ overall server footprint. Yet, at ten or more times the cost of a general-purpose server, accelerated compute servers are becoming a substantial portion of data center capex.
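A quick back-of-the-envelope sketch shows why a small unit share can still mean a substantial spend share when each accelerated server costs ten times a general-purpose one. The 5% unit share below is an assumed illustration, not a Dell’Oro estimate:

```python
# Illustration of the unit-share vs. spend-share effect described above.
# All input numbers are assumptions for the sake of the example.

def spend_share(unit_share: float, price_multiple: float) -> float:
    """Fraction of total server spend going to accelerated servers,
    given their unit share and price relative to a general-purpose server."""
    accel = unit_share * price_multiple      # spend on accelerated units
    general = 1.0 - unit_share               # spend on general-purpose units
    return accel / (accel + general)

# A hypothetical 5% of units at 10x the price of a general-purpose server:
share = spend_share(0.05, 10)
print(f"{share:.0%} of server spend")  # -> 34% of server spend
```

The asymmetry is the point: the spend share grows much faster than the unit share as the price multiple rises.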
Tier 1 Cloud service providers are increasing their spending on new infrastructure tailored for AI workloads. In its 3Q21 earnings call, Facebook announced plans to increase capex by more than 50% in 2022. Investments will be driven by AI and machine learning to improve ranking and recommendations across Facebook’s platform. In the longer term, as the company shifts its business model to the metaverse, capex investments will be driven by video and compute-intensive applications such as AR and VR. At the same time, Tier 1 Cloud service providers such as Amazon, Google, and Microsoft also aim to increase spending on AI-focused infrastructure to enable their enterprise customers to deploy applications with enhanced intelligence and automation.
It has been a year since my last blog on AI data center infrastructure. Since that time, new architectures and solutions have emerged that could pave the way for the further proliferation of AI in the data center. Following are three innovations I’ll be watching closely:
New CPU Architectures
Intel is scheduled to launch its next-generation Sapphire Rapids processor next year. With its AMX (Advanced Matrix Extensions) instruction set, Sapphire Rapids is optimized for AI and ML workloads. CXL, which will be offered with Sapphire Rapids for the first time, will establish a memory-coherent, high-speed link over the PCIe Gen 5 interface between the host CPU and accelerators. This, in turn, will reduce system bottlenecks by enabling lower latencies and more efficient sharing of resources across devices. AMD will likely follow on the heels of Intel and offer CXL on EPYC Genoa. For ARM, competing coherent interfaces will also be offered, such as CCIX with Ampere’s Altra processor and NVLink on Nvidia’s upcoming Grace processor.
Faster Networks and Server Connectivity
AI applications are bandwidth-hungry. For this reason, the fastest networks available will need to be deployed to connect host servers to accelerated servers, facilitating the movement of large volumes of unstructured data and training models (a) between the host CPU and accelerators, and (b) among accelerators in a high-performance computing cluster. Some Tier 1 Cloud service providers are deploying 400 Gbps Ethernet networks and beyond. The network interface card (NIC) must also evolve to ensure that server connectivity does not become a bottleneck as data sets grow larger. 100 Gbps NICs have been the standard server access speed for most accelerated compute servers; most recently, however, 200 Gbps NICs are increasingly used for these high-end workloads, especially by Tier 1 Cloud service providers. Some vendors have added a further layer of performance by integrating accelerated compute servers with Smart NICs, or Data Processing Units (DPUs). For instance, Nvidia’s DGX system can be configured with two Bluefield-2 DPUs to facilitate packet processing of large datasets and provide multi-tenant isolation.
Rack Infrastructure
Accelerated compute servers, generally equipped with four or more GPUs, tend to be power-hungry. For example, an Nvidia DGX system with 8 A100 GPUs is rated at a maximum system power usage of 6.5 kW. Extra consideration is needed to ensure efficient thermal management. Today, air-based thermal management infrastructure is predominantly used. However, as rack power densities rise to support accelerated computing hardware, the efficiency limits of air cooling are being reached. Novel liquid-based thermal management solutions, including immersion cooling, are under development to further improve the thermal efficiency of accelerated compute servers.
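The rack-density pressure described above can be illustrated with simple arithmetic. The 6.5 kW system figure comes from the DGX A100 example; the 40 kW rack power budget is an assumed value for illustration, since real budgets vary widely by facility:

```python
# Rough rack-power arithmetic for accelerated servers. The 40 kW rack
# budget is an assumption for illustration; only the 6.5 kW DGX A100
# figure comes from the text above.

def systems_per_rack(rack_budget_kw: float, system_kw: float) -> int:
    """Number of whole systems that fit within a rack's power budget."""
    return int(rack_budget_kw // system_kw)

DGX_POWER_KW = 6.5       # max system power, per the DGX A100 example
RACK_BUDGET_KW = 40.0    # hypothetical rack power budget

print(systems_per_rack(RACK_BUDGET_KW, DGX_POWER_KW))  # -> 6
```

Six systems per rack already implies roughly 40 kW of heat to remove, which is well beyond what conventional air cooling comfortably handles and motivates the liquid-cooling work mentioned above.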
These technology trends will continue to evolve and drive the commercialization of specialized hardware for AI applications. Please stay tuned for more updates from the upcoming Data Center Capex reports.
Dell’Oro published an update to the Ethernet Controller & Adapter 5-Year Forecast Report in July 2021. Revenue for the worldwide Ethernet controller and adapter market is projected to increase at a 4% compound annual growth rate (CAGR) from 2020 to 2025, reaching nearly $3.2 billion. The increase is partly driven by the migration to server access speeds of 100 Gbps and higher.
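As a quick sanity check on the projection, the implied 2020 base can be back-computed from the $3.2 billion target and the 4% CAGR. The ~$2.63 billion base below is derived, not a figure from the report:

```python
# Back-of-the-envelope check of the 4% CAGR projection above. The base
# revenue is back-computed from the $3.2 B target; it is not reported data.

def cagr_project(base: float, rate: float, years: int) -> float:
    """Project a value forward at a compound annual growth rate."""
    return base * (1 + rate) ** years

base_2020 = 3.2 / (1.04 ** 5)   # implied 2020 revenue, $ billions
print(round(base_2020, 2))      # -> 2.63
print(round(cagr_project(base_2020, 0.04, 5), 1))  # -> 3.2
```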
The ramp of 25 Gbps port shipments has been strong since 28 Gbps SerDes became available in 2016, and 25 Gbps has already displaced 10 Gbps as the dominant speed by revenue as it gains broad adoption across Cloud service providers (SPs) and high-end enterprises. However, we project that 100 and 200 Gbps ports will overtake 25 Gbps in revenue as early as 2023.
We identify below the market and technology drivers that are likely to propel the adoption of next-generation server connectivity at 100 Gbps and beyond:
- 50 Gbps ports, based on two 28 Gbps SerDes lanes, have been deployed in the mainstream among some of the major Cloud SPs. However, with the exponential growth of network traffic and the proliferation of cloud computing, the Top 4 US Cloud SPs are demanding even higher server access speeds than the rest of the market. The availability of 56 Gbps SerDes since late 2018 has prompted some of the Top 4 US Cloud SPs to upgrade their networks to 400 Gbps, with upgrades of server network connectivity to 100 Gbps for general-purpose computing in progress.
- Higher server access speeds of up to 200 Gbps, based on two lanes of 112 Gbps SerDes, could begin to ramp for general-purpose computing among the Top 4 US Cloud SPs following network upgrades to 800 Gbps as early as 2022.
- The increasing demand of bandwidth-hungry AI applications will continue to push the boundaries of server connectivity. Today, 100 Gbps is commonly used to interconnect accelerated servers, while general-purpose servers are connected at 25 or 50 Gbps. As 100 Gbps becomes the standard connection for general-purpose servers at the major Cloud SPs in several years, accelerated servers may be connected at twice that rate, 200 Gbps.
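The lane arithmetic running through the bullets above can be sketched as follows. A port's speed is simply the number of SerDes lanes times the usable Ethernet lane rate; the mapping of nominal SerDes class to lane rate mirrors the pairings named in the text (28 Gbps SerDes → 25 Gbps lane, 56 Gbps → 50 Gbps, 112 Gbps → 100 Gbps):

```python
# Sketch of the SerDes-lane arithmetic from the bullets above. The
# nominal-class-to-lane-rate mapping follows the pairings in the text.

LANE_RATE_GBPS = {28: 25, 56: 50, 112: 100}  # nominal SerDes -> Ethernet lane

def port_speed_gbps(serdes_class_gbps: int, lanes: int) -> int:
    """Ethernet port speed from SerDes class and number of lanes."""
    return LANE_RATE_GBPS[serdes_class_gbps] * lanes

print(port_speed_gbps(28, 2))   # -> 50  (two 28 Gbps SerDes lanes)
print(port_speed_gbps(56, 2))   # -> 100 (basis of today's 100 Gbps NICs)
print(port_speed_gbps(112, 2))  # -> 200 (the next ramp described above)
```

The same two-lane port thus doubles in speed with each SerDes generation, which is why the server-access roadmap tracks SerDes availability so closely.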
To learn more about the Ethernet Controller and Adapter market, or if you need to access the full report, please contact us at dgsales@delloro.com.
About the Report
The Dell’Oro Group Ethernet Controller and Adapter 5-Year Forecast Report provides a complete, in-depth analysis of the market, with tables covering manufacturers’ revenue; average selling prices; and unit and port shipments by speed (1 Gbps, 10 Gbps, 25 Gbps, 40 Gbps, 50 Gbps, and 100 Gbps) for Ethernet and Fibre Channel over Ethernet (FCoE) controllers and adapters. The report also covers Smart NIC and InfiniBand controllers and adapters. To purchase this report, please contact us at dgsales@delloro.com.
