
Memorization in Self-Supervised Learning Improves Downstream Generalization

conference contribution
posted on 2024-02-26, 11:07 authored by Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch
Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data, often scraped from the internet. This data can still be sensitive, and empirical evidence suggests that SSL encoders memorize private information from their training data and can disclose it at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by encoders that were trained on these data points and by encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets, we show that, even though SSL relies on large datasets and strong augmentations, both known in supervised learning as regularization techniques that reduce overfitting, significant fractions of training data points still experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
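The alignment-difference idea behind the definition can be illustrated with a short sketch. The Python snippet below is a minimal, illustrative implementation only: the cosine-distance alignment loss, the function names, and the stand-in encoders and augmentation are assumptions for illustration, and the paper's exact loss, augmentation set, and training setup may differ.

```python
# Illustrative sketch of an alignment-based memorization score in the spirit of SSLMem.
# Assumptions (not from the paper's code): cosine distance as the alignment measure,
# simple linear stand-in encoders, and additive-noise "augmentations".
import numpy as np

def alignment_loss(encoder, x, augment, n_views=8):
    """Average distance between the representation of x and representations of
    its augmented views under a given encoder (lower = better aligned)."""
    z = encoder(x)
    dists = []
    for _ in range(n_views):
        z_aug = encoder(augment(x))
        cos = np.dot(z, z_aug) / (np.linalg.norm(z) * np.linalg.norm(z_aug) + 1e-12)
        dists.append(1.0 - cos)  # cosine distance between original and augmented view
    return float(np.mean(dists))

def memorization_score(encoder_with_x, encoder_without_x, x, augment):
    """Memorization of x: how much better an encoder trained on x aligns x with
    its augmented views than an encoder trained without x."""
    return (alignment_loss(encoder_without_x, x, augment)
            - alignment_loss(encoder_with_x, x, augment))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_f = rng.normal(size=(16, 32))  # stands in for an encoder trained on x
    W_g = rng.normal(size=(16, 32))  # stands in for an encoder trained without x
    f = lambda x: W_f @ x
    g = lambda x: W_g @ x
    augment = lambda x: x + 0.1 * rng.normal(size=x.shape)
    x = rng.normal(size=32)
    print("memorization score:", memorization_score(f, g, x, augment))
```

In this sketch, a large positive score means the encoder trained on x aligns x with its augmented views much better than a comparable encoder that never saw x, which is how memorization of individual training points is quantified.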

History

Primary Research Area

  • Trustworthy Information Processing

Name of Conference

International Conference on Learning Representations (ICLR)

Journal

ICLR

BibTeX

@conference{Wang:Kaleem:Dziedzic:Backes:Papernot:Boenisch:2024,
  title     = "Memorization in Self-Supervised Learning Improves Downstream Generalization",
  author    = "Wang, Wenhao and Kaleem, Muhammad Ahmad and Dziedzic, Adam and Backes, Michael and Papernot, Nicolas and Boenisch, Franziska",
  year      = 2024,
  month     = 5,
  booktitle = "International Conference on Learning Representations (ICLR)"
}
