Medium

scikit-learn

Sensitive Data Leakage via TfidfVectorizer in Machine Learning Library

scikit-learn's TfidfVectorizer (reported against version 1.4.1.post1) unintentionally stores tokens from the training data in its `stop_words_` attribute, which can lead to sensitive data leakage. The flaw arises when the vocabulary size is limited during fitting: every unique token that is excluded from the vocabulary is still kept in `stop_words_`, even though the vectorizer does not need those tokens to transform text. As a result, all unique tokens from the training data remain stored in the fitted object. This issue was patched in version 1.5.0.
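The behaviour described above can be reproduced with a short sketch. It assumes an affected release is installed. The sample documents, the `hunter2secret` token, and the helper names `discarded_tokens`, `fit_with_limited_vocabulary`, and `demo` are hypothetical and chosen only for illustration. `max_features` is used here as one way of limiting the vocabulary size.

```python
def discarded_tokens(vectorizer):
    """Return the tokens a fitted vectorizer kept in `stop_words_`, sorted.

    Returns an empty list when the attribute is missing or empty.
    """
    return sorted(getattr(vectorizer, "stop_words_", None) or [])


def fit_with_limited_vocabulary(documents, max_features=2):
    """Fit a TfidfVectorizer whose vocabulary is capped at `max_features`."""
    # Imported lazily so discarded_tokens() can be reused on its own.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(max_features=max_features)
    vectorizer.fit(documents)
    return vectorizer


def demo():
    # Hypothetical training data; the rare token stands in for a secret.
    documents = [
        "reset password for alice",
        "reset password for bob",
        "reset password hunter2secret",
    ]
    vectorizer = fit_with_limited_vocabulary(documents)
    # Only the most frequent terms are needed for vectorization.
    print("vocabulary:", sorted(vectorizer.vocabulary_))
    # Tokens outside the vocabulary that the fitted object still holds.
    print("discarded but stored:", discarded_tokens(vectorizer))
```

Calling `demo()` on an affected version is expected to list the rare tokens, including `hunter2secret`, under "discarded but stored", even though only two terms remain in the vocabulary. On a patched version the list should be empty.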

Available publicly on Jun 01 2024

CVSS score: 5.3

CVSS vector: CVSS:3.1/AV:N/AC:H/PR:L/UI:N/S:U/C:H/I:N/A:N

Credit: kemalty
Remediation Steps
  • Upgrade to scikit-learn version 1.5.0 or later.
  • For versions prior to 1.5.0, manually clear the `stop_words_` attribute after fitting the TfidfVectorizer so that discarded tokens are not stored with the fitted object.
  • Review and sanitize datasets for sensitive information before fitting the vectorizer.
  • Implement access controls and encryption for stored vectorizer objects to mitigate unauthorized access risks.
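For installations that cannot be upgraded yet, the manual clearing step might look like the following minimal sketch. The helper names `clear_discarded_tokens` and `dump_sanitized` are hypothetical, and the use of `pickle` is an assumption about how the vectorizer is stored. The sketch relies on `stop_words_` being an inspection-only attribute that is not used when transforming text.

```python
import pickle


def clear_discarded_tokens(vectorizer):
    """Remove the `stop_words_` attribute from a fitted vectorizer.

    Returns the number of tokens that were dropped (0 if there were none).
    """
    stored = getattr(vectorizer, "stop_words_", None)
    removed = len(stored) if stored else 0
    if hasattr(vectorizer, "stop_words_"):
        # Assumption: the attribute exists only for model inspection,
        # so deleting it does not change the output of transform().
        del vectorizer.stop_words_
    return removed


def dump_sanitized(vectorizer):
    """Serialize a vectorizer after dropping its discarded tokens."""
    clear_discarded_tokens(vectorizer)
    return pickle.dumps(vectorizer)
```

A typical use would be to call `clear_discarded_tokens(vectorizer)` immediately after `vectorizer.fit(...)`, or to route every save through `dump_sanitized`. Vectorizers that were fitted and stored before this step was introduced still contain the discarded tokens and would need to be cleared and saved again.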
Patch Details
  • Fixed Version: 1.5.0
  • Patch Commit: https://github.com/scikit-learn/scikit-learn/commit/70ca21f106b603b611da73012c9ade7cd8e438b8