[wp_tech_share]

Last month was incredibly exciting, to say the least! We had the opportunity to attend two of the most impactful and prominent events in the industry: NVIDIA's GTC, followed by OFC.

As discussed in my pre-OFC show blog, we anticipated that AI networks would be in the spotlight at OFC 2024 and would accelerate the development of innovative optical connectivity solutions tailored to address the explosive growth in bandwidth within AI clusters while tackling cost and power consumption challenges. GTC 2024 further intensified this focus. During GTC 2024, NVIDIA announced the latest Blackwell B200 Tensor Core GPU, designed to power trillion-parameter AI large language models. The Blackwell B200 demands advanced 800 Gbps networking, aligning with the predictions outlined in our AI Networks for AI Workloads report. With traffic in AI workloads anticipated to grow 10X every two years, AI workloads are expected to outpace traditional front-end networks by at least two speed upgrade cycles.
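To make that gap concrete, here is a rough back-of-the-envelope sketch. The 10X-per-two-years traffic rate is the figure cited above; the assumption that each network speed upgrade cycle roughly doubles link speed is ours, purely for illustration:

```python
# Back-of-the-envelope sketch of the bandwidth gap described above.
# The 2X-per-upgrade-cycle assumption is illustrative, not report data.

def traffic(years, growth_per_2yr=10):
    """AI traffic multiplier after `years`, at 10X every two years."""
    return growth_per_2yr ** (years / 2)

def speed_upgrades(years, cycle_years=2):
    """Number of speed upgrade cycles completed in `years`."""
    return years // cycle_years

for y in (2, 4, 6):
    print(f"after {y} yrs: traffic x{traffic(y):.0f}, "
          f"link speed x{2 ** speed_upgrades(y)} (at ~2X per cycle)")
```

Even under these generous assumptions, traffic growth pulls away from per-link speed gains within a couple of cycles, which is why back-end networks lean on higher radix, more parallel links, and faster refresh cadences rather than lane speed alone.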

While a multitude of topics and innovative solutions were discussed at OFC regarding inter-data center applications, as well as compute interconnects for scaling up the number of accelerators within the same domain, this blog will primarily focus on intra-data center applications. Specifically, it will focus on scaling out the network needed to connect the various accelerated nodes in large AI clusters with thousands of accelerators. This network is commonly referred to in the industry as the 'AI Back-end Network' (also referred to by some vendors as the network for East-West traffic). Some of the topics and solutions explored at the show are as follows:

1) Linear Drive Pluggable Optics vs. Linear Receive Optics vs. Co-Packaged Optics

Pluggable optics are expected to account for an increasingly significant portion of power consumption at the system level, an issue that will only be amplified as Cloud SPs build their next-generation AI networks featuring a proliferation of high-speed optics.

At OFC 2023, the introduction of Linear Drive Pluggable Optics (LPOs), which promise significant cost and power savings through the removal of the DSP, initiated a flurry of testing activities. Fast forward to OFC 2024: we witnessed nearly 20 demonstrations, featuring key players including Amphenol, Eoptolink, Hisense, Innolight, and others. Conversations during the event revealed industry-wide enthusiasm for the high-quality 100G SerDes integrated into the latest 51.2 Tbps network switch chips, with many eager to capitalize on this advancement to remove the DSP from optical pluggable modules.

However, despite the excitement, the hesitancy from hyperscalers (with the exception of ByteDance and Tencent, who have announced plans to test the technology by the end of this year) suggests that LPOs may not be poised for mass adoption just yet. Interviews highlighted hyperscalers' reluctance to shoulder the responsibility of qualifying LPOs and owning their potential failures. Instead, they expressed a preference for switch suppliers to handle those responsibilities.

In the interim, early deployments of 51.2 Tbps network chips are expected to continue leveraging pluggable optics, at least through the middle of next year. However, if LPOs can demonstrate safe deployment at mass scale while offering significant power savings for hyperscalers (enabling them to deploy more accelerators per rack), the temptation to adopt may prove irresistible. Ultimately, the decision hinges on whether LPOs can deliver on these promises.

Furthermore, Half-Retimed Linear Optics (HALO), also known as Linear Receive Optics (LROs), were discussed at the show. LRO integrates the DSP chip only on the transmitter side (as opposed to removing it entirely, as in LPOs). Our interviews revealed that while LPOs may prove feasible at 100G-PAM4 SerDes speeds, they may become challenging at 200G-PAM4, and that is when LROs may be needed.

Meanwhile, Co-Packaged Optics (CPOs) remain in development, with large industry players such as Broadcom showcasing ongoing development and progress in the technology. While we believe current LPO and LRO solutions will certainly have a faster time to market with similar promises as CPOs, the latter may eventually become the sole solution capable of enabling higher speeds at some point in the future.

Before closing this section, let's not forget that, when possible, copper is a much better alternative to all of the optical connectivity options discussed above. Put simply: use copper when you can, use optics when you must. Interestingly, liquid cooling may facilitate the densification of accelerators within the rack, enabling increased usage of copper for connecting the various accelerator nodes within the same rack. The recent announcement of the NVIDIA GB200 NVL72 at GTC perfectly illustrates this trend.

2) Optical Circuit Switches

OFC 2024 brought some interesting Optical Circuit Switch (OCS) announcements. OCS can bring many benefits, including high bandwidth and low network latency, as well as significant capex savings. That is because OCS switches can significantly reduce the number of electrical switches required in the network, which in turn eliminates the expensive optical-to-electrical-to-optical (O-E-O) conversions associated with electrical switches. Additionally, unlike electrical switches, OCS switches are speed-agnostic and don't necessarily need to be upgraded when servers adopt next-generation optical transceivers.
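A minimal sketch of the O-E-O argument (the function and path lengths are ours, purely for illustration): every electrical switch a packet traverses converts the signal from optics to electronics and back, while an OCS simply redirects the light.

```python
# Illustrative count of optical-electrical-optical (O-E-O) conversions on a
# switching path. An OCS passes light through without conversion, so an OCS
# tier on the path contributes zero conversions. Path shapes are assumptions.

def oeo_conversions(path_switches, ocs_switches_on_path=0):
    """O-E-O conversions on a path: one per electrical switch traversed."""
    return path_switches - ocs_switches_on_path

# leaf -> spine -> leaf path (3 switches on the path)
print("all-electrical spine:", oeo_conversions(3))                         # 3
print("OCS spine:           ", oeo_conversions(3, ocs_switches_on_path=1)) # 2
```

Each conversion avoided saves a transceiver pair and its power draw, which is where the capex and opex savings cited above come from; the speed-agnostic property follows from the same fact, since there is no electronic signal to re-time.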

However, OCS is a novel technology, and so far only Google, after many years of development, has been able to deploy it at scale in its data center networks. Additionally, OCS switches may require a change in the installed base of fiber. For that reason, we are still watching to see whether any other Cloud SP, besides Google, has plans to follow suit and adopt OCS switches in the network.

3) The Path to 3.2 Tbps

At OFC 2023, numerous 1.6 Tbps optical components and transceivers based on 200G per lambda were introduced. At OFC 2024, we witnessed further technology demonstrations of such 1.6 Tbps optics. While we don't anticipate volume shipments of 1.6 Tbps until 2025/2026, the industry has already begun exploring various paths toward 3.2 Tbps.

Given the complexity encountered in transitioning from 100G-PAM4 electrical lane speeds to 200G-PAM4, initial 3.2 Tbps solutions may utilize 16 lanes of 200G-PAM4 within an OSFP-XD form factor, instead of 8 lanes of 400G-PAMx. It's worth noting that OSFP-XD, initially explored and demonstrated two years ago at OFC 2022, may be brought back into action given the urgency stemming from AI cluster deployments. 3.2 Tbps solutions in the OSFP-XD form factor offer superior faceplate density and cost savings compared to 1.6 Tbps solutions. Ultimately, the industry is expected to find a way to enable 3.2 Tbps based on 8 lanes of 400G-PAMx SerDes, although it may take some time to reach that target.
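The lane arithmetic behind these two options is straightforward; a minimal sketch (the function name is ours, for illustration):

```python
# Lane math for the two 3.2 Tbps module options discussed above.

def module_bw_gbps(lanes, lane_rate_gbps):
    """Aggregate module bandwidth: lane count times per-lane rate."""
    return lanes * lane_rate_gbps

# Near-term option: 16 lanes of 200G-PAM4 (OSFP-XD form factor)
print(module_bw_gbps(16, 200), "Gbps")
# Eventual target: 8 lanes of 400G-PAMx
print(module_bw_gbps(8, 400), "Gbps")
```

Both paths land at 3,200 Gbps; the trade-off is that the 16-lane option reuses proven 200G-PAM4 lanes at the cost of a wider, denser form factor, while the 8-lane option waits on a 400G-per-lane SerDes generation that does not yet exist.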

In summary, OFC 2024 showcased numerous potential solutions aimed at addressing common challenges: cost, power, and speed. We anticipate that different hyperscalers will make distinct choices, leading to market diversification. However, one of the key considerations will be time to market. It's important to note that the refresh cycle in the AI back-end network is typically around 18 to 24 months, significantly shorter than the 5 to 6 years seen in the traditional front-end networks used to connect general-purpose servers.

For more detailed views and insights on the Ethernet Switch—Data Center report or the AI Networks for AI Workloads report, please contact us at dgsales@delloro.com.

[wp_tech_share]

Happy New Year! We couldn’t hope for a more exciting start to the year than with the groundbreaking announcement that HPE has entered into a definitive agreement to acquire Juniper. In this blog, we delve into the potential impact of this acquisition on the market, along with additional predictions for what 2024 may have in store for us:

1. The Campus Switch Market Is on the Verge of a Correction in 2024

Right before the holidays, we published our 3Q23 reports, which provided an overview of the market performance for the first nine months of 2023. Based on those results, the market is estimated to have grown strong double digits in 2023, marking the third consecutive year of very robust growth. As a reminder, the typical growth rate in the campus switch market, pre-pandemic, was in the low-to-mid single digits. The outstanding performance of the last couple of years raises the question: where do we go from here? Based on our interviews with vendors as well as value-added resellers (VARs) and system integrators (SIs), we believe the market is poised for a correction in 2024. We anticipate demand in the market to slow down significantly as customers absorb and digest existing capacity. Additionally, conversations with key vendors indicate their anticipation of a return to normal backlog levels by the beginning of 2024. Once the backlog is restored to its typical state, sales performance will more accurately mirror organic market demand, eliminating the potential for backlog-driven inflation.

2. The HPE/Juniper Acquisition Will Create a Tectonic Shift in the Market

While the HPE/Juniper acquisition may not be finalized until the end of the year, we anticipate witnessing its impact on the market and competitive landscape throughout 2024. We foresee other vendors accelerating the pace of innovation and product introductions, anticipating potential synergies created by the combined HPE/Juniper entity, as explained in my HPE/Juniper blog. Additionally, we expect employees to transition between companies, fostering cross-pollination. Monitoring customer reactions will be crucial throughout the year. We believe the HPE/Juniper deal may further amplify the anticipated pause in market demand as customers will be seeking clarity on how the acquisition will impact future roadmaps.

3. AI Capabilities Will Increasingly Define the Competitive Landscape in the Market

In the midst of intense competition and an expected slowdown in market demand, vendors find themselves compelled to enhance their offerings with AI capabilities. The addition of these AI capabilities brings several benefits, including product differentiation, increased demand for new use cases and applications, acceleration in product refresh cycles, and higher customer retention. However, it remains intriguing to observe vendors’ ability to effectively monetize these features. Furthermore, as customers weigh the options between on-premises and cloud-managed solutions, as well as subscription versus perpetual consumption models, we believe that AI features will play a pivotal role in influencing these choices. Customers are likely to opt for the model that allows them to benefit the most from these AI features.

For more detailed views and insights on the campus switch market, please contact us at dgsales@delloro.com.

[wp_tech_share]

Happy New Year! As usual, we’re excited to start the year by reflecting on the developments in the Ethernet data center switch market throughout 2023 and exploring the anticipated trends for 2024.

First, looking back at 2023, the market performed largely in line with our expectations as outlined in our 2023 prediction blog published in January of last year. As of January 2024, data center switch sales are set to achieve double-digit growth in 2023, based on the data collected through 3Q23. Shipments of 200/400 Gbps ports nearly doubled in 2023. While Google, Amazon, Microsoft, and Meta continue to dominate deployments, we observed a notable increase in 200/400 Gbps port shipments destined for Tier 2/3 Cloud Service Providers (SPs) and large enterprises. In the meantime, 800 Gbps deployments remained sluggish throughout 2023, with expectations for acceleration in 2024. Notably, 2023 marked a transformative moment in the history of AI with the emergence of generative AI applications, propelling meaningful impact and change in modern data center networks.

Now as we look into 2024, below are our top 3 predictions for the year:

1. The Data Center Switch market to slow down in 2024

Following three consecutive years of double-digit growth, the Ethernet data center switch market is expected to slow down in 2024 and grow at less than half the rate of 2023. We expect 2024 sales performance to be suppressed by normalization of backlog, digestion of existing capacity, and optimization of spending caused either by macroeconomic conditions or a shift in focus to AI and budgets diverted away from traditional front-end networks used to connect general-purpose servers.

2. The 800 Gbps adoption to significantly accelerate in 2024

We predict 2024 to be a tremendous year for 800 Gbps deployments, as we expect a swift adoption of a second wave of 800 Gbps (based on 51.2 Tbps chips) from a couple of large Cloud SPs. The first wave of 800 Gbps (based on 25.6 Tbps chips) started back in 2022/2023 but has been slow as it has been adopted only by one Cloud SP. In the meantime, we expect 400 Gbps port shipments to continue to grow as 51.2 Tbps chips will also enable another wave of 400 Gbps adoption. We expect 400 Gbps/800 Gbps speeds to achieve more than 40% penetration by 2027 in terms of port volume.

3. AI workloads to drive new network requirements and to expand the market opportunity for both Ethernet and InfiniBand

The enormous appetite for AI is reshaping the data center switch market. Emerging generative AI applications deal with trillions of parameters that drive the need for thousands or even hundreds of thousands of accelerated nodes. To connect these accelerated nodes, there is a need for a new fabric, called the AI back-end network, which is different from the traditional front-end network mostly used to connect general-purpose servers. Currently, InfiniBand is dominating the AI back-end networks, but Ethernet is expected to gain significant share over the next five years. We provide more details about the AI back-end network market in our recently published Advanced Research Report: 'AI Networks for AI Workloads.' Among many other requirements, AI back-end networks will accelerate the migration to high speeds. As noted in the chart below, the majority of switch ports in AI back-end networks are expected to be 800 Gbps by 2025 and 1600 Gbps by 2027.

Migration to High-speeds in AI Clusters (AI Back-end Networks)

For more detailed views and insights on the Ethernet Switch—Data Center report or the AI Networks for AI Workloads report, please contact us at dgsales@delloro.com.

[wp_tech_share]

I would like to share some initial thoughts about the groundbreaking announcement that HPE has entered into a definitive agreement to acquire Juniper for $14 billion. My thoughts focus mostly on the switch businesses of the two firms. The WLAN and security aspects of the acquisition are covered by our WLAN analyst Sian Morgan and security analyst Mauricio Sanchez.

My initial key takeaways and thoughts on the potential upside and downside impact of the acquisition are:

Pros:

  • In the combined data center and campus switch market, Cisco has consistently dominated as the major incumbent vendor, with a 46% revenue share in 2022. HPE held the fourth position with approximately 5%, and Juniper secured the fifth spot with around 3%. A consolidated HPE/Juniper entity would climb to the fourth position, capturing 8% market share, trailing closely behind Huawei and Arista.
  • Juniper’s standout performer is undeniably their Mist portfolio, recognized as the most cutting-edge AI-driven platform in the industry. As AI capabilities increasingly define the competitive landscape for networking vendors, HPE stands to gain significantly from its access to the Mist platform. We believe that Mist played a pivotal role in motivating HPE to offer a premium of about 30% for the acquisition of Juniper. In other words, Juniper brings better “AI technology for networking” to the table.
  • In the data center space, HPE has predominantly focused on the compute side, with a relatively modest presence in the Data Center switch business (HPE Data Center switch sales amounted to approximately $150 M in 2022, in contrast to Juniper’s sales that exceeded $650 M). Consequently, we anticipate that HPE stands to gain significantly from Juniper’s data center portfolio. Nonetheless, a notable contribution from HPE lies in their Slingshot Fabric, which serves as a compelling alternative to InfiniBand for connecting large GPU clusters. In other words, HPE brings better “Networking technology for AI” to the table.
  • Juniper would definitely benefit from HPE’s extensive channels and go-to-market strategy (about 95% of HPE’s business goes through channels). Additionally, HPE has made great progress driving their as-a-service GreenLake solution. However, GreenLake has been so far mostly dominated by compute. With the Juniper acquisition, we expect to see more networking components pushed through GreenLake.
  • In campus, and with the Mist acquisition in particular, Juniper has been focusing mostly on high-end enterprises, whereas HPE has been playing mainly in the commercial and mid-market segments. From that standpoint, there should be little overlap in the customer base and more cross-selling opportunities.

Cons:

  • Undoubtedly, a significant challenge arises from the substantial product overlap, evident across various domains such as data center switching, campus switching, WLAN, and security. Observing how HPE navigates the convergence of these diverse product lines will be intriguing. Ideally, the merged product portfolio should synergize to bolster the market share of the consolidated entities. Regrettably, history has shown that not all product integrations and consolidations achieve that desired outcome.
[wp_tech_share]

We’ve been participating in the Open Compute Project (OCP) Global Summit for many years, and while each year has brought pleasant surprises and announcements, as described in previous OCP blogs from 2022 and 2021, this year stands out in a league of its own. 2023 marks a significant turning point, notably with the advent of AI, which many speakers referred to as a tectonic shift in the industry and a once-in-a-generation inflection point in computing and in the broader market. This transformation has unfolded within just the past few months, sparking a remarkable level of interest at the OCP conference. In fact, this year the conference was completely sold out, demonstrating the widespread eagerness to grasp the opportunities and confront the challenges that this transformative shift presents to the market. Furthermore, OCP 2023 featured a new track dedicated solely to AI. This year marks the beginning of a new era in the age of AI. AI is here! The race is on!

This new era of AI is marked and defined by the emergence of new generative AI applications and large language models. Some of these applications deal with billions and even trillions of parameters and the number of parameters seems to be growing 1000X every 2 to 3 years.

This complexity and size of the emerging AI applications dictate the number of accelerated nodes needed to run the AI applications as well as the scale and type of infrastructure needed to support and connect those accelerated nodes. Regrettably, as illustrated in the chart below presented by Meta at the OCP conference, a growing disparity exists between the requirements for model training and the available infrastructure to facilitate it.

This predicament poses the pivotal question: How can one scale to hundreds of thousands or even millions of accelerated nodes? The answer lies in the power of AI networks purpose-built and tuned for AI applications. So, what are the requirements that these AI networks need to satisfy? To answer that question, let’s first look at the characteristics of AI workloads, which include, but are not limited to, the following:

  • Traffic patterns consist of a large portion of elephant flows
  • AI workloads require a large number of short remote memory accesses
  • Because all nodes transmit at the same time, links saturate very quickly
  • The progression of all nodes can be held back by any delayed flow. In fact, Meta showed last year that 33% of elapsed time in AI/ML jobs is spent waiting for the network.

Given these unique characteristics of AI workloads, AI networks have to meet certain requirements, such as high speed, low tail latency, and lossless, scalable fabrics.
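The tail-latency point deserves emphasis: in a synchronized training step, every node waits for the slowest flow, so step time tracks the maximum of the flow completion times, not the mean. A toy simulation (all numbers are illustrative assumptions, not measured data) shows how a rare delayed flow dominates at scale:

```python
import random

# Toy model: one training step launches n parallel flows; the step completes
# only when the SLOWEST flow completes. A rare tail event delays one flow.
random.seed(0)

def step_time(n_flows, base_ms=1.0, tail_prob=0.01, tail_ms=10.0):
    """Step completion time = max over flows; a rare delayed flow stalls all."""
    flows = [base_ms + (tail_ms if random.random() < tail_prob else 0.0)
             for _ in range(n_flows)]
    return max(flows)

steps = [step_time(1000) for _ in range(100)]
print(f"typical flow ~1 ms, but mean step time = {sum(steps)/len(steps):.1f} ms")
```

With 1,000 flows per step, a 1% tail probability means almost every step hits at least one delayed flow, so the whole cluster runs at tail speed; this is why lossless fabrics and tail-latency control matter far more here than average latency.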

In terms of high-speed performance, the chart below, which I presented at OCP, shows that by 2027, we anticipate that nearly all ports in the AI back-end network will operate at a minimum speed of 800 Gbps, with 1600 Gbps comprising half of the ports. In contrast, our forecast for the port speed mix in the front-end network reveals that only about a third of the ports will be at 800 Gbps speed by 2027, while 1600 Gbps ports will constitute just 10%. This discrepancy in port speed mix underscores the substantial disparity in requirements between the front-end network, primarily used to connect general-purpose servers, and the back-end network, which primarily supports AI workloads.

In the pursuit of low tail latency and a lossless fabric, we are witnessing numerous initiatives aimed at enhancing Ethernet and modernizing it for optimal performance in AI workloads. For instance, the Ultra Ethernet Consortium (UEC) was established in July 2023 with the objective of delivering an open, interoperable, high-performance, full-communications-stack architecture based on Ethernet. Additionally, OCP has formed a new alliance to address significant networking challenges within AI cluster infrastructure. Another groundbreaking announcement at the OCP conference came from Google, which unveiled the opening of its Falcon chips, a low-latency hardware transport, to the ecosystem through the Open Compute Project.

Source: Google

At OCP, there was a huge emphasis on adopting an open approach to address the scalability challenges of AI workloads, aligning seamlessly with the OCP 2023 theme: ‘Scaling Innovation Through Collaboration.’ Both Meta and Microsoft have consistently advocated, over the years, for community collaboration to tackle scalability issues. However, we were pleasantly surprised by the following statement from Google at OCP 2023: “A new era of AI systems design necessitates a dynamic open industry ecosystem”.

The challenges presented by AI workloads to network and infrastructure are compounded by the broad spectrum of workloads. As illustrated in the chart below showcased by Meta at OCP 2023, the diversity of workloads is evident in their varying requirements.

Source: Meta at OCP 2023

This diversity underscores the necessity of adopting a heterogeneous approach to build high-performance AI Networks and infrastructure capable of supporting a wide range of AI workloads. This heterogeneous approach will entail a combination of standardized as well as proprietary innovations and solutions. We anticipate that Cloud service providers will make distinct and unique choices, resulting in market bifurcation. In the upcoming Dell’Oro Group’s AI Networks for AI Workloads report, I delve into the various network fabric requirements based on cluster size, workload characteristics, and the distinctive choices made by cloud service providers.

Exciting years lie ahead of us! The AI journey is just 1% finished!



Save the date: a free OCP educational webinar on November 9 at 8 AM PT will explore AI-driven network solutions and their market potential, featuring Juniper Networks and Dell’Oro Group. Register now!