Sensitive Data Leakage via TfidfVectorizer in Machine Learning Library
A vulnerability in scikit-learn's TfidfVectorizer (affecting versions up to and including 1.4.1.post1) allows sensitive data from the training corpus to be stored unintentionally in the `stop_words_` attribute, leading to potential data leakage. The issue was patched in version 1.5.0. The flaw arises when the vocabulary is limited during fitting (for example via max_features, min_df, or max_df): tokens pruned from the vocabulary are retained verbatim in `stop_words_`, so they persist in the fitted object even though they are not needed for vectorization.
Available publicly on Jun 01 2024
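The behavior is easy to observe on an affected release. The sketch below is illustrative only: the corpus and the max_features value are assumptions chosen to show pruned tokens surviving in `stop_words_`, not content from the advisory.

```python
# Illustrative sketch (hypothetical corpus, arbitrary max_features): on
# affected versions (<= 1.4.1.post1), tokens pruned from the vocabulary are
# retained verbatim in the fitted vectorizer's stop_words_ attribute.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "invoice 4821 for patient john doe",
    "invoice 4821 overdue contact billing",
    "routine invoice processing complete",
]

# Limiting the vocabulary forces less-frequent tokens out of vocabulary_ ...
vectorizer = TfidfVectorizer(max_features=3)
vectorizer.fit(corpus)

print(vectorizer.vocabulary_)   # only the retained terms
# ... but the pruned tokens (e.g. "john", "doe", "4821") remain stored here,
# so pickling or sharing the fitted object can leak training-corpus strings.
print(vectorizer.stop_words_)
```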
Remediation Steps
- Upgrade to scikit-learn version 1.5.0 or later.
- For versions prior to 1.5.0, manually clear the `stop_words_` attribute after fitting the TfidfVectorizer so that pruned tokens are not persisted (see the sketch after this list).
- Review and sanitize datasets for sensitive information before fitting the vectorizer.
- Implement access controls and encryption for stored vectorizer objects to mitigate unauthorized access risks.
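A minimal sketch of the manual workaround for versions prior to 1.5.0 follows; the placeholder corpus and the max_features value are assumptions. Clearing `stop_words_` does not change vectorization results, since transform() relies on `vocabulary_` and `idf_` rather than `stop_words_`.

```python
# Minimal sketch of the pre-1.5.0 workaround: clear stop_words_ after fitting
# so pruned training tokens are not persisted with the fitted object.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["example document one", "example document two"]  # placeholder corpus

vectorizer = TfidfVectorizer(max_features=1000)
vectorizer.fit(documents)

# Drop the retained tokens before serializing or sharing the fitted object.
# transform() is unaffected because it uses only vocabulary_ and idf_.
vectorizer.stop_words_ = None
```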
Patch Details
- Fixed Version: 1.5.0
- Patch Commit: https://github.com/scikit-learn/scikit-learn/commit/70ca21f106b603b611da73012c9ade7cd8e438b8
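Where upgrading is the chosen remediation, a simple runtime guard can confirm the patched release is in use. The check below is a sketch that assumes the third-party `packaging` library is installed.

```python
# Sketch of a runtime guard: fail fast if an affected scikit-learn release
# (anything older than the fixed version 1.5.0) is installed.
from packaging.version import Version
import sklearn

if Version(sklearn.__version__) < Version("1.5.0"):
    raise RuntimeError(
        f"scikit-learn {sklearn.__version__} is affected; upgrade to 1.5.0+ "
        "or clear stop_words_ after fitting."
    )
```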