Sensitive Data Leakage via TfidfVectorizer in Machine Learning Library
A vulnerability in scikit-learn's TfidfVectorizer (affecting versions up to and including 1.4.1.post1) allows sensitive data from the training corpus to be stored unintentionally in the `stop_words_` attribute, leading to potential data leakage. The issue was patched in version 1.5.0. The flaw arises when the vocabulary is limited during fitting (for example via max_features, min_df, or max_df): tokens pruned from the vocabulary are retained verbatim in `stop_words_`, so they persist in the fitted object even though they are not needed for vectorization.
Available publicly on Jun 01 2024
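The behavior is easy to observe on an affected release. The sketch below is illustrative only: the corpus and the max_features value are assumptions chosen to show pruned tokens surviving in `stop_words_`, not content from the advisory.

```python
# Illustrative sketch (hypothetical corpus, arbitrary max_features): on
# affected versions (<= 1.4.1.post1), tokens pruned from the vocabulary are
# retained verbatim in the fitted vectorizer's stop_words_ attribute.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "invoice 4821 for patient john doe",
    "invoice 4821 overdue contact billing",
    "routine invoice processing complete",
]

# Limiting the vocabulary forces less-frequent tokens out of vocabulary_ ...
vectorizer = TfidfVectorizer(max_features=3)
vectorizer.fit(corpus)

print(vectorizer.vocabulary_)   # only the retained terms
# ... but the pruned tokens (e.g. "john", "doe", "4821") remain stored here,
# so pickling or sharing the fitted object can leak training-corpus strings.
print(vectorizer.stop_words_)
```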
Remediation Steps
- Upgrade to scikit-learn version 1.5.0 or later.
- For versions prior to 1.5.0, manually clear the `stop_words_` attribute after fitting the TfidfVectorizer so that pruned tokens are not persisted (see the sketch after this list).
- Review and sanitize datasets for sensitive information before fitting the vectorizer.
- Implement access controls and encryption for stored vectorizer objects to mitigate unauthorized access risks.
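A minimal sketch of the manual workaround for versions prior to 1.5.0 follows; the placeholder corpus and the max_features value are assumptions. Clearing `stop_words_` does not change vectorization results, since transform() relies on `vocabulary_` and `idf_` rather than `stop_words_`.

```python
# Minimal sketch of the pre-1.5.0 workaround: clear stop_words_ after fitting
# so pruned training tokens are not persisted with the fitted object.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["example document one", "example document two"]  # placeholder corpus

vectorizer = TfidfVectorizer(max_features=1000)
vectorizer.fit(documents)

# Drop the retained tokens before serializing or sharing the fitted object.
# transform() is unaffected because it uses only vocabulary_ and idf_.
vectorizer.stop_words_ = None
```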
Patch Details
- Fixed Version: 1.5.0
- Patch Commit: https://github.com/scikit-learn/scikit-learn/commit/70ca21f106b603b611da73012c9ade7cd8e438b8
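Where upgrading is the chosen remediation, a simple runtime guard can confirm the patched release is in use. The check below is a sketch that assumes the third-party `packaging` library is installed.

```python
# Sketch of a runtime guard: fail fast if an affected scikit-learn release
# (anything older than the fixed version 1.5.0) is installed.
from packaging.version import Version
import sklearn

if Version(sklearn.__version__) < Version("1.5.0"):
    raise RuntimeError(
        f"scikit-learn {sklearn.__version__} is affected; upgrade to 1.5.0+ "
        "or clear stop_words_ after fitting."
    )
```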