麦田来客
 · Anhui

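I came across a technical article that says: “The integration of multiple dies inherently increases thermal resistance and amplifies localized cooling problems. This added resistance is due to the interfaces between the dies, the interposer (if present), and the package substrate. This increased resistance, coupled with non-uniform power distribution from die to die, leads to significant temperature differentials across the processor surface.” Reading this helped me understand why the “atomic-level” requirement in the patent is necessary.---------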

As anyone in the industry knows, the need for greater computational power drives evolutionary changes in processor design. Today we’re seeing an accelerating shift from traditional, monolithic single-die configurations toward more advanced, chiplet-based multi-die processor assemblies, with growth at up to an 86% CAGR depending on the market segment. Beyond raw performance, this shift offers other advantages, including flexibility, modularity, and cost optimization, so the big silicon suppliers, including Nvidia, Intel, and AMD, have publicly committed to multi-die processor assemblies. They’re clearly the future.

However, there’s a downside to multi-die processors. While multi-die architectures offer remarkable performance gains, they also introduce significant thermal management challenges, in part because 2.5D and 3D physical designs increase heat density in ways conventional air-cooled technologies can’t easily address. The increasing complexity of multi-die designs, the ways they impede airflow, and heat differentials from die to die, coupled with the accelerating increase in processor thermal design power (TDP) demands, are pushing silicon suppliers, server manufacturers, and cloud service providers to explore innovative cooling solutions capable of addressing millimeter-scale localized heat flux and temperature gradients.

The core problem in cooling multi-die processors revolves around an intrinsic design issue: inherent non-uniformity in power distribution and the resulting localized hotspots. For example, multi-die packages like AMD’s Threadripper series combine dies of varying power densities within a single package, creating a complex thermal landscape unlike the thermal distribution of a typical single-die processor.
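
To make "varying power densities" concrete, here is a minimal Python sketch that computes average power density per die and the ratio between the densest and the sparsest die. The die names, powers, and areas are hypothetical values chosen for illustration; they do not describe Threadripper or any other real part.

```python
# Illustrative only: die names, powers, and areas are made-up values,
# not measurements of any real processor.
from dataclasses import dataclass


@dataclass
class Die:
    name: str
    power_w: float    # power dissipated by the die, in watts (assumed)
    area_mm2: float   # die footprint, in square millimetres (assumed)

    @property
    def power_density(self) -> float:
        """Average power density of the die in W/mm^2."""
        return self.power_w / self.area_mm2


def density_spread(dies: list[Die]) -> float:
    """Ratio of the highest to the lowest die power density."""
    densities = [d.power_density for d in dies]
    return max(densities) / min(densities)


package = [
    Die("compute-0", power_w=90.0, area_mm2=75.0),
    Die("compute-1", power_w=60.0, area_mm2=75.0),
    Die("io", power_w=40.0, area_mm2=400.0),
]

for die in package:
    print(f"{die.name}: {die.power_density:.2f} W/mm^2")
print(f"max/min density ratio: {density_spread(package):.1f}")
```

Even with a modest total power, a small compute die can run at many times the power density of a large I/O die, which is the kind of non-uniformity the paragraph describes.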

We conducted a study, “Thermal Performance of Modular Microconvective Heat Sinks for Multi-Die Processor Assemblies,” which determined that the integration of multiple dies inherently increases thermal resistance and amplifies localized cooling problems. This added resistance is due to the interfaces between the dies, the interposer (if present), and the package substrate. This increased resistance, coupled with non-uniform power distribution from die to die, leads to significant temperature differentials across the processor surface. These differentials create localized, non-uniform areas of high heat density (thermal gradients) that can result in:

Limits on total operating power

Increases in leakage currents

Mechanical stress due to differential thermal expansion

Restricted performance

Reduced processor longevity

Left unchecked, these issues reduce both the overall efficiency of multi-die-based systems and their long-term operational value. Excess heat causes problems such as cracking and delamination, which raise failure rates; the resulting downtime reduces ROI.
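
The thermal-resistance argument from the study summary can be sketched with a lumped series model. The Python example below assumes each die has its own die-to-lid resistance (covering the die, interface, and TIM path) and that all dies share a single lid-to-coolant resistance. This is a deliberate simplification, not the method used in the study, and every number in it is an assumed, illustrative value.

```python
# Illustrative lumped model: each die has its own die-to-lid resistance,
# and all dies share one lid-to-coolant resistance. All values are assumed.

def junction_temps(die_power_w, die_r_k_per_w, r_shared_k_per_w, coolant_c):
    """Return {die name: junction temperature in deg C} for a series model.

    T_j = T_coolant + P_total * R_shared + P_die * R_die
    """
    total_power = sum(die_power_w.values())
    # The shared path carries the heat of every die in the package.
    lid_temp = coolant_c + total_power * r_shared_k_per_w
    return {
        name: lid_temp + power * die_r_k_per_w[name]
        for name, power in die_power_w.items()
    }


powers = {"compute-0": 120.0, "compute-1": 80.0, "io": 30.0}      # W
resistances = {"compute-0": 0.10, "compute-1": 0.10, "io": 0.15}  # K/W

temps = junction_temps(powers, resistances, r_shared_k_per_w=0.05, coolant_c=30.0)
for name, t in temps.items():
    print(f"{name}: {t:.1f} C")
spread = max(temps.values()) - min(temps.values())
print(f"die-to-die spread: {spread:.1f} K")
```

In this model every extra interface adds to a die's own resistance term, and uneven per-die power turns that resistance into a temperature spread across the package. Real packages also spread heat laterally, which a one-dimensional model like this ignores.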

Of course, processor designers aren’t unaware of these problems; silicon manufacturers add design features to improve cooling. The principal feature, the thermal interface material (TIM) layer, helps address the heat challenge of these processors. While TIM layers play a crucial role in reducing thermal resistance between the die and the heat spreader or cold plate, they often struggle to manage the concentrated heat fluxes generated by individual dies within a multi-die package. This is especially true as TDPs continue to climb, pushing the limits of what conventional TIM layers can handle. The TIM layer, though important, is already insufficient to consistently cool the hottest processors under demanding conditions.
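
A rough way to see why concentrated heat flux strains a TIM layer is one-dimensional conduction (Fourier's law), where the temperature drop across the layer is the heat flux times the thickness divided by the thermal conductivity. The Python sketch below uses assumed, illustrative values for thickness, conductivity, and heat flux; it ignores contact resistance and lateral heat spreading.

```python
# One-dimensional conduction through a TIM layer (Fourier's law).
# Thickness, conductivity, and heat-flux values are illustrative assumptions.

def tim_temperature_drop_k(heat_flux_w_per_cm2, thickness_um, conductivity_w_per_m_k):
    """Temperature drop across the TIM in kelvin: dT = q'' * t / k."""
    heat_flux_w_per_m2 = heat_flux_w_per_cm2 * 1e4   # W/cm^2 -> W/m^2
    thickness_m = thickness_um * 1e-6                # micrometres -> metres
    return heat_flux_w_per_m2 * thickness_m / conductivity_w_per_m_k


for label, flux in [("package average", 50.0), ("local hotspot", 300.0)]:
    drop = tim_temperature_drop_k(flux, thickness_um=50.0, conductivity_w_per_m_k=5.0)
    print(f"{label}: {flux:.0f} W/cm^2 -> {drop:.1f} K across the TIM")
```

Because the drop scales linearly with heat flux, a hotspot at several times the package-average flux loses several times as many kelvin across the same TIM layer, which is consistent with the point that a uniform TIM struggles with localized heat.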

It’s become clear that silicon manufacturers, server manufacturers, and cloud service providers need a better way to ensure processors aren’t damaged or destroyed by heat.
