<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>RDMA on martinuke0&#39;s Blog</title>
    <link>https://martinuke0.github.io/tags/rdma/</link>
    <description>Recent content in RDMA on martinuke0&#39;s Blog</description>
    <generator>Hugo -- 0.152.2</generator>
    <language>en</language>
    <lastBuildDate>Fri, 03 Apr 2026 20:01:11 +0000</lastBuildDate>
    <atom:link href="https://martinuke0.github.io/tags/rdma/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Optimizing Distributed Model Training on Bare‑Metal Clusters with RDMA and Low‑Latency Interconnects</title>
      <link>https://martinuke0.github.io/posts/2026-04-03-optimizing-distributed-model-training-on-baremetal-clusters-with-rdma-and-lowlatency-interconnects/</link>
      <pubDate>Fri, 03 Apr 2026 20:01:11 +0000</pubDate>
      <guid>https://martinuke0.github.io/posts/2026-04-03-optimizing-distributed-model-training-on-baremetal-clusters-with-rdma-and-lowlatency-interconnects/</guid>
      <description>&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Training state‑of‑the‑art deep‑learning models now routinely requires &lt;strong&gt;hundreds of GPUs&lt;/strong&gt; working in concert. While public cloud providers offer convenient, on‑demand clusters, many research labs and enterprises still prefer &lt;strong&gt;bare‑metal clusters&lt;/strong&gt; for three core reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Predictable performance&lt;/strong&gt; – no noisy neighbors, no hypervisor overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost efficiency at scale&lt;/strong&gt; – amortized CAPEX and lower per‑GPU price.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full control over hardware and software&lt;/strong&gt; – ability to fine‑tune network stacks, install custom drivers, and leverage specialized interconnects.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When you combine bare‑metal hardware with &lt;strong&gt;RDMA (Remote Direct Memory Access)&lt;/strong&gt; and &lt;strong&gt;low‑latency interconnects&lt;/strong&gt; such as InfiniBand or RoCE (RDMA over Converged Ethernet), you can dramatically reduce the communication overhead that traditionally limits distributed training speed. This article walks through the entire optimization stack—from networking fundamentals to concrete PyTorch code—so you can extract the maximum throughput from your cluster.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
