CXL 2.0 - GPU Memory Sharing and Expansion

CXL memory has been widely discussed for its ability to increase memory bandwidth and capacity, benefits that matter a great deal to emerging AI/ML applications. Besides memory, the CXL specification also covers accelerators. Because we have seen little discussion of CXL accelerators, and because of our experience with PCIe switches, we decided to share some thoughts on CXL 2.0 and on disaggregated architectures that incorporate not only memory but also accelerators.

Under the CXL specification, a memory expander is classified as a “type 3 device”, and a GPGPU with device-attached memory would be a “type 2 device”. A memory expander is straightforward: its only purpose is to provide more memory to the CPU, whether volatile (e.g., DDR) or non-volatile (e.g., persistent memory). CXL memory raises no cache coherence issue of its own; the host CPU simply accesses the media through the CXL.mem protocol. An accelerator, however, is much more complex. Cache coherence is critical in heterogeneous computing, and CXL helps maintain cache coherency between multiple processors and accelerators through the CXL.cache protocol.

CXL device types and usages:

Image from https://www.computeexpresslink.org/
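
The device types differ in which CXL sub-protocols they implement. The following is a minimal Python sketch of that mapping; the names `Protocol`, `DEVICE_TYPES` and `supports` are our own illustrative choices and do not come from any CXL software stack.

```python
from enum import Flag, auto


class Protocol(Flag):
    """The three CXL sub-protocols."""
    IO = auto()      # CXL.io: discovery, configuration, PCIe-style I/O
    CACHE = auto()   # CXL.cache: device coherently caches host memory
    MEM = auto()     # CXL.mem: host accesses device-attached memory


# Protocols implemented by each CXL device type.
DEVICE_TYPES = {
    1: Protocol.IO | Protocol.CACHE,                 # e.g. accelerator without its own memory
    2: Protocol.IO | Protocol.CACHE | Protocol.MEM,  # e.g. GPGPU with device-attached memory
    3: Protocol.IO | Protocol.MEM,                   # e.g. memory expander
}


def supports(device_type: int, protocol: Protocol) -> bool:
    """Return True if the given device type implements the protocol."""
    return protocol in DEVICE_TYPES[device_type]


if __name__ == "__main__":
    for dev_type in sorted(DEVICE_TYPES):
        names = [p.name for p in Protocol if supports(dev_type, p)]
        print(f"type {dev_type}: {', '.join(names)}")
```

A memory expander (type 3) has no CXL.cache entry in this table, which matches the point above that it adds no coherence traffic of its own.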

If we put a CXL GPU (type 2 device) and a CXL memory expansion module (type 3 device) under the same CXL fabric, connected to a CPU serving as the home agent, we can expect the GPU to access data directly from the memory module through the CXL.mem protocol. Likewise, if two CXL GPUs are put under the same CXL fabric, we can expect the two GPUs to communicate peer to peer through CXL.mem, with cache coherence maintained by the home agent CPU through the CXL.cache protocol.

The proposed CXL GPU and memory system

Within the system, users can compose cache coherence domains of any size. For example, host 1 in the illustration is shown in yellow, and the memory module and GPU assigned to host 1 are also shown in yellow. The host 1 CPU serves as the home agent and maintains cache coherence between host 1 and the GPU and memory module assigned to it. While host 1 forms one cache coherence domain, host 2 and host 3 can form separate cache coherence domains.
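
To make the idea of composing domains concrete, here is a small Python sketch of the bookkeeping involved. It is only a model under our own assumptions: the `Fabric` and `CoherenceDomain` classes and the device names (`gpu0`, `mem0`, ...) are hypothetical, and no real fabric manager API is being described.

```python
from dataclasses import dataclass, field


@dataclass
class CoherenceDomain:
    home_agent: str                              # host CPU acting as home agent
    devices: set = field(default_factory=set)    # devices assigned to this host


class Fabric:
    """Toy model of a CXL fabric whose devices are assigned to hosts."""

    def __init__(self, hosts, devices):
        self.domains = {host: CoherenceDomain(host) for host in hosts}
        self.free = set(devices)                 # devices not yet assigned

    def assign(self, device, host):
        """Move a free device into the coherence domain of `host`."""
        if device not in self.free:
            raise ValueError(f"{device} is not available")
        self.free.remove(device)
        self.domains[host].devices.add(device)

    def release(self, device, host):
        """Return a device from `host` to the free pool."""
        self.domains[host].devices.remove(device)
        self.free.add(device)

    def home_agent_of(self, device):
        """Return the host that keeps `device` coherent, or None if unassigned."""
        for domain in self.domains.values():
            if device in domain.devices:
                return domain.home_agent
        return None


# Hypothetical setup: three hosts, two GPUs and two memory modules.
fabric = Fabric(["host1", "host2", "host3"], ["gpu0", "gpu1", "mem0", "mem1"])
fabric.assign("gpu0", "host1")   # host 1 domain: one GPU and one memory module
fabric.assign("mem0", "host1")
fabric.assign("gpu1", "host2")   # host 2 forms a separate domain
```

In this model each device belongs to at most one domain at a time, so `fabric.home_agent_of("gpu0")` names the single CPU responsible for its coherence.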

Now let’s extend this a little further: what if the GPUs could fetch data directly from the CXL memory? Currently, NVIDIA’s CUDA allows GPUs to access a certain portion of host memory as their own memory, and CXL memory can be used to expand host memory. So the GPU should be able to recognize the CXL memory as host memory, and the host should be able to allocate a portion of the CXL memory to the GPU devices, creating a direct path for the GPUs to access the CXL memory using the CXL.mem protocol.

CXL GPU memory access and cache coherence

In the article CXL: Simplifying server, Siddharth Bhatla notes that “Scaling the server memory beyond a point becomes less attractive when using the DDR or HBM memory (due to physical, power and cost limitations).” With CXL, PCIe-attached DRAM can give the CPU byte-level memory access just like DDR DRAM. This benefits CPU memory expansion, but can the same concept be applied to GPUs?

NVIDIA introduced NVLink to allow the memory of multiple GPUs to be combined into a larger pool. With CXL memory expansion, we can further extend the amount of memory available to a GPU, beyond the physical, power and cost limitations of on-GPU memory. We think this is possible because, according to CXL.org, a CXL type 1 or type 2 device is able to target memory in a peer CXL device as long as the target device supports the CXL.mem protocol. In the context of memory expansion, CXL does not specify that a home agent is necessary. Nonetheless, cache coherence between CXL GPUs is still managed by a CPU home agent.

A useful property of CXL is that it allows processors to issue read/write commands directly to the media, without the doorbells or interrupts that DMA requires, so latency can be further reduced. If GPUs can fetch data directly from CXL memory without much performance drop, we can allocate any amount of memory to the GPUs and would not have to worry as much about batch size and training efficiency when designing AI models or similar applications.
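
As a rough illustration of why extra memory relaxes the batch size constraint, the sketch below computes the largest batch that fits in a given amount of memory. Every number in it is a hypothetical assumption chosen for the arithmetic (80 GiB on the GPU, 256 GiB of added CXL memory, 30 GiB of fixed model state, 200 MiB per sample), and it treats all memory as equally fast, which CXL memory is not.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def max_batch_size(memory_bytes, fixed_bytes, per_sample_bytes):
    """Largest batch with fixed_bytes + batch * per_sample_bytes <= memory_bytes."""
    if per_sample_bytes <= 0:
        raise ValueError("per_sample_bytes must be positive")
    available = memory_bytes - fixed_bytes
    return max(available // per_sample_bytes, 0)


# Hypothetical figures, for illustration only.
on_gpu_memory = 80 * GiB     # memory on the GPU itself
cxl_memory = 256 * GiB       # extra capacity allocated from CXL memory
fixed_state = 30 * GiB       # weights, optimizer state, etc.
per_sample = 200 * MiB       # activations per training sample

batch_gpu_only = max_batch_size(on_gpu_memory, fixed_state, per_sample)
batch_with_cxl = max_batch_size(on_gpu_memory + cxl_memory, fixed_state, per_sample)

print(f"GPU memory only: batch size up to {batch_gpu_only}")
print(f"GPU + CXL memory: batch size up to {batch_with_cxl}")
```

Whether the larger batch actually trains efficiently depends on how much slower accesses to the CXL portion are, which this capacity-only calculation does not capture.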

As is well known, CXL leverages the PCIe Gen 5 physical layer, and its performance is still somewhat lower than that of direct-attached DIMMs. As a result, the DIMMs local to the CPU sockets remain important in current computing architectures. As CXL memory expansion technology evolves, we may eventually see servers without DIMM slots, or accelerator devices without RAM soldered onto them.
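
For a sense of scale on the bandwidth side, the following Python sketch works out the raw, per-direction bandwidth of a PCIe Gen 5 link (32 GT/s per lane with 128b/130b encoding) and the peak bandwidth of a DIMM with a 64-bit data bus. The choice of an x16 link and of DDR5-4800 is our assumption for illustration; protocol overhead and access latency, where much of the difference from local DIMMs lies, are not modeled.

```python
def pcie_gen5_gb_per_s(lanes):
    """Raw per-direction bandwidth of a PCIe Gen 5 link in GB/s."""
    transfers_per_second = 32e9    # 32 GT/s per lane
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    bits_per_second = transfers_per_second * encoding_efficiency * lanes
    return bits_per_second / 8 / 1e9


def ddr_gb_per_s(mega_transfers_per_second, bus_bytes=8):
    """Peak bandwidth in GB/s of a DIMM with a 64-bit (8-byte) data bus."""
    return mega_transfers_per_second * 1e6 * bus_bytes / 1e9


x16_link = pcie_gen5_gb_per_s(16)   # assumed x16 CXL link
ddr5_4800 = ddr_gb_per_s(4800)      # assumed DDR5-4800 DIMM

print(f"PCIe Gen 5 x16, per direction: {x16_link:.1f} GB/s")
print(f"DDR5-4800, one 64-bit DIMM:    {ddr5_4800:.1f} GB/s")
```

A CPU socket usually has several DDR channels, so the aggregate local DIMM bandwidth is a multiple of the single-DIMM figure, while each CXL device is limited by its own link.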

Possible heterogeneous computing architecture with CXL

References:

https://dl.acm.org/doi/fullHtml/10.1145/3533737.3535090

https://www.computeexpresslink.org/post/__q-a

https://www.counterpointresearch.com/cxl-simplifying-server-fabric/



