A member of Nvidia’s Infrastructure Specialists team has raised concerns internally about Microsoft’s cooling design for its Blackwell GPU installations. The employee, who asked to remain anonymous, described the setup at one of Microsoft’s data centres as wasteful in an email circulated within Nvidia. The memo referenced a Blackwell deployment intended for OpenAI, for which Microsoft provides cloud infrastructure. While the criticism focused on overall cooling efficiency, it has prompted researchers and Microsoft to explain how such systems work and why certain design choices are made.
Shaolei Ren, an associate professor at the University of California, Riverside, who studies data centre resources, said the Nvidia employee was likely referring to the building-level cooling that operates alongside liquid-cooling systems installed directly on GPU racks. Ren explained that even with liquid cooling at the server level, data centres still require a second layer of infrastructure to move heat out of the facility. In some cases, operators opt for air-based cooling at the building level instead of water-based systems. Air cooling consumes more energy but avoids large-scale water usage, a trade-off Ren said is increasingly important in regions sensitive to water consumption.
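To put rough numbers on that trade-off, the sketch below compares air-based and evaporative (water-based) building-level cooling for a hypothetical facility. Every figure in it, including the IT load, the power usage effectiveness (PUE) values, and the water usage effectiveness (WUE) value, is an illustrative assumption, not data from Microsoft's site.

```python
# Illustrative comparison of air-based vs evaporative (water-based)
# building-level cooling. All figures are assumptions for the example,
# not measurements from any real facility.

IT_LOAD_MW = 10.0        # assumed IT heat load of the facility
HOURS_PER_YEAR = 8760

# Assumed power usage effectiveness (PUE): total facility power / IT power.
# Air-based cooling typically carries more fan and chiller overhead.
PUE_AIR = 1.4
PUE_EVAPORATIVE = 1.2

# Assumed water usage effectiveness (WUE) for evaporative cooling,
# in litres per kWh of IT energy; air cooling uses roughly none on site.
WUE_EVAPORATIVE_L_PER_KWH = 1.8

it_energy_kwh = IT_LOAD_MW * 1000 * HOURS_PER_YEAR

# Energy penalty of air cooling vs the water the evaporative design consumes.
extra_energy_kwh = it_energy_kwh * (PUE_AIR - PUE_EVAPORATIVE)
water_used_litres = it_energy_kwh * WUE_EVAPORATIVE_L_PER_KWH

print(f"Extra energy for air cooling: {extra_energy_kwh / 1e6:.1f} GWh/year")
print(f"Water avoided by air cooling: {water_used_litres / 1e6:.1f} million litres/year")
```

Under these assumed figures, air cooling costs roughly 17.5 GWh of additional energy per year but avoids well over a hundred million litres of on-site water consumption, which is the trade Ren describes operators weighing in water-stressed regions.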
Microsoft’s response aligned with this two-tier explanation. The company said its liquid-cooling heat exchanger is a closed-loop unit deployed across air-cooled data centres to increase cooling capacity without reworking entire facilities. According to Microsoft, the design enhances heat dissipation and supports AI hardware requirements while making efficient use of its global data centre footprint. The company reiterated its long-term environmental targets, including goals to be carbon negative, water positive, and zero waste by 2030. Microsoft also highlighted its work on zero-water cooling designs for future facilities and ongoing research into on-chip cooling.
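A closed-loop unit of this kind circulates the same coolant between the racks and the building's heat-rejection side, so its capacity comes down to basic heat-transfer arithmetic, Q = ṁ·c_p·ΔT. The sketch below illustrates that relationship with an assumed flow rate and assumed temperatures; none of these figures are specifications of Microsoft's hardware.

```python
# Minimal heat-exchanger capacity estimate: Q = m_dot * c_p * delta_T.
# Flow rate and temperatures are assumptions for illustration only,
# not specifications of Microsoft's closed-loop unit.

CP_WATER_J_PER_KG_K = 4186   # specific heat of water
DENSITY_KG_PER_L = 1.0       # approximate density of water

flow_l_per_min = 300.0       # assumed coolant flow through the loop
coolant_in_c = 45.0          # assumed return temperature from the racks
coolant_out_c = 35.0         # assumed supply temperature back to the racks

m_dot_kg_per_s = flow_l_per_min * DENSITY_KG_PER_L / 60.0
delta_t = coolant_in_c - coolant_out_c

heat_removed_w = m_dot_kg_per_s * CP_WATER_J_PER_KG_K * delta_t
print(f"Heat rejected by the loop: {heat_removed_w / 1000:.0f} kW")
```

With these assumed values the loop rejects about 209 kW, which shows why such units can add meaningful cooling capacity to an air-cooled hall without rebuilding the facility.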
The Nvidia employee’s memo detailed multiple challenges encountered during the Blackwell installation. The setup used two GB200 NVL72 racks, each holding seventy-two GPUs, and relied on Microsoft’s liquid-cooling technology to handle the intense heat the hardware generates. The employee said substantial time was spent documenting and validating installation steps, especially for staff less familiar with cluster and system validation processes. They also noted that coordination between Nvidia and Microsoft required stronger handover procedures than those used in earlier deployments.
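The scale of the heat involved helps explain why liquid cooling is non-negotiable for this hardware. A rough sketch of the thermal load for the two-rack setup, treating the roughly 120 kW per NVL72 rack that Nvidia has cited publicly as an assumption (the per-GPU share is a derived illustration, not a specification):

```python
# Rough thermal-load arithmetic for the two-rack deployment in the memo.
# The ~120 kW per-rack figure is a published ballpark for GB200 NVL72,
# used here as an assumption rather than a measured value.

RACKS = 2
GPUS_PER_RACK = 72
RACK_POWER_KW = 120.0   # assumed per-rack power draw, ~all emitted as heat

total_gpus = RACKS * GPUS_PER_RACK
total_heat_kw = RACKS * RACK_POWER_KW
per_gpu_share_kw = total_heat_kw / total_gpus

print(f"GPUs in the installation: {total_gpus}")
print(f"Heat to remove:           {total_heat_kw:.0f} kW")
print(f"Per-GPU share:            {per_gpu_share_kw:.2f} kW")
```

Even this small two-rack deployment concentrates roughly a quarter of a megawatt of heat, far beyond what air alone can carry away at the rack.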
However, the note was not entirely critical. The staffer reported that production units of the GB200 NVL72 hardware showed clear improvements over earlier samples and achieved a full pass rate in compute performance tests. Nvidia later said publicly that Blackwell systems deliver high performance, reliability, and energy efficiency across workloads. The company added that customers, including Microsoft, have already deployed hundreds of thousands of GB200 and GB300 NVL72 systems to meet rapidly rising demand for AI training and inference.
Nvidia introduced the Blackwell architecture in 2024, positioning it as a major step up from the previous Hopper generation. CEO Jensen Huang said at launch that Blackwell chips would roughly double performance. The GB200 led the first wave, followed by GB300 models now in circulation.
