Sightline

Medium

scikit-learn

Sensitive Data Leakage via TfidfVectorizer in Machine Learning Library

A vulnerability in scikit-learn's TfidfVectorizer (versions up to and including 1.4.1.post1) causes sensitive data to be stored unintentionally in the `stop_words_` attribute of the fitted object, leading to potential data leakage. The issue was fixed in version 1.5.0. The flaw arises when the vocabulary size is limited during fitting: every unique token is then kept on the fitted object, including tokens that the vectorization process does not need.

Available publicly on Jun 01 2024 | Available with Premium on May 22 2024

CVE:

CVE-2024-5206

CWE:

CWE-921: Storage of Sensitive Data in a Mechanism without Access Control

CVSS:

CVSS:3.1/AV:N/AC:H/PR:L/UI:N/S:U/C:H/I:N/A:N

Credit:

kemalty

Threat Overview

TfidfVectorizer converts text data into numeric vectors for machine learning models. When the vocabulary is limited during fitting, the affected implementation keeps the tokens it excludes from the vocabulary in the `stop_words_` attribute, so the fitted object holds every unique token it saw rather than only the subset needed for vectorization. As a result, potentially sensitive strings in the training data, such as passwords or confidential keys, are retained after fitting even though they play no role in transforming new text. The leakage occurs regardless of how the vocabulary is limited (`max_df`, `min_df`, or `max_features`), and anyone who can read the fitted object can read those tokens.
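The behaviour described above can be checked on a small example. The following is a minimal sketch, assuming scikit-learn is installed; the corpus, the token `hunter2secret`, and the choice of `max_features=3` are made up for illustration. It reads `stop_words_` defensively, so that it also runs on releases where the attribute is no longer populated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus; "hunter2secret" stands in for a sensitive token.
docs = [
    "reset the password for the account",
    "the account password is hunter2secret",
    "please reset the account",
]

# Limit the vocabulary to the 3 most frequent terms.
vectorizer = TfidfVectorizer(max_features=3)
vectorizer.fit(docs)

# Tokens actually used for vectorization.
kept = set(vectorizer.vocabulary_)

# Tokens discarded during fitting. The attribute may be missing on
# patched releases, so fall back to None instead of raising.
discarded = getattr(vectorizer, "stop_words_", None)

print("kept:", sorted(kept))
if discarded is None:
    print("stop_words_ is not present on this scikit-learn release")
else:
    print("discarded but still stored:", sorted(discarded))
```

On an affected release, the rare token should appear in the second list even though it is not part of the vocabulary.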

Attack Scenario

An attacker with read access to a stored vectorizer object, for example through a data breach or a publicly exposed dataset, could extract its `stop_words_` attribute. This attribute may contain sensitive tokens from the training data. The attacker could then attempt to reconstruct secrets or other sensitive information from these tokens, especially if the vectorizer was fitted on data containing confidential or critical information.
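Since this scenario depends on the attribute surviving in a persisted object, one possible mitigation for vectorizers fitted on an affected release is to clear the attribute before the object is serialized or shared. The following is a minimal sketch, not an official scikit-learn utility: the helper name `scrub_stop_words` is hypothetical, and it relies only on the attribute name `stop_words_` and on pipeline-like objects exposing `steps` as a list of `(name, estimator)` pairs.

```python
def scrub_stop_words(estimator):
    """Delete stop_words_ from an estimator and from any nested pipeline steps.

    Returns the number of attributes that were cleared.
    """
    cleared = 0

    # Remove the attribute only if it is present and populated.
    if getattr(estimator, "stop_words_", None) is not None:
        del estimator.stop_words_
        cleared += 1

    # Pipeline-like objects keep their stages in `steps`; recurse into each one.
    for _name, step in getattr(estimator, "steps", None) or []:
        cleared += scrub_stop_words(step)

    return cleared
```

Calling `scrub_stop_words(fitted_object)` before pickling removes the retained tokens from that object. It does not replace upgrading to a fixed release, and copies that were already persisted or distributed still contain the tokens.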

Who is affected

Entities using scikit-learn's TfidfVectorizer for processing text data, especially those fitting the vectorizer on datasets containing sensitive or confidential information, are at risk. The severity of the impact varies based on the nature of the data processed; it ranges from minimal for public datasets to critical for datasets containing secrets or confidential company information.
