Privacy and Confidentiality,
in particular, computational disclosure control

by Latanya Sweeney

Summary

Maintaining privacy and confidentiality in an electronic setting has allowed me to explore a new area of computer science, which I term computational disclosure control. The goal of this work is to build and study computational techniques for controlling the disclosure of data such that the identity of any individual contained in the released data cannot be recognized. Producing anonymous data that remains specific enough to be useful is often very difficult, and practice today tends either to assume incorrectly that confidentiality is maintained when it is not or to produce data that is practically useless. The objective of my work is to discover and control inferences that can be drawn regarding the identities of entities contained in released data.

In 1996, I presented the Scrub System, which locates and replaces personally identifying information in unrestricted text. Letters between physicians and notes written by clinicians often contain nicknames, phone numbers and references to other caretakers and family members. The Scrub System found 99-100% of these references, while the straightforward approach of global search-and-replace properly located no more than 30-60% of them. The Scrub System merely de-identifies information, however, and cannot guarantee anonymity. In de-identified data, all explicit identifiers, such as Social Security number, name, address and phone number, are removed, generalized or replaced with a made-up alternative. Anonymity is a stronger property: it implies that the data cannot be manipulated or linked to identify any individual. Even when information shared with secondary parties is de-identified, it is far from anonymous.
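The gap between de-identified and anonymous data can be illustrated with a small linking sketch. This is not part of the Scrub System. All records, names, field names and the choice of linking fields (`zip`, `birth_date`, `sex`) are invented for illustration. The sketch assumes a recipient who holds a separate list that still carries names and shares those fields with the de-identified release.

```python
# Hypothetical data: every name and value below is invented for illustration.
deidentified_records = [
    {"zip": "02138", "birth_date": "1950-03-14", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-11-02", "sex": "M", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1962-11-02", "sex": "M", "diagnosis": "diabetes"},
]

# A second, identified list that shares some fields with the release.
public_list = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1950-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1962-11-02", "sex": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_date": "1962-11-02", "sex": "M"},
]

# Assumed linking fields; none of them is an explicit identifier.
LINK_FIELDS = ("zip", "birth_date", "sex")


def link_key(row):
    """Build the tuple of linking-field values for one row."""
    return tuple(row[field] for field in LINK_FIELDS)


def reidentify(records, identified):
    """Return (name, record) pairs whose linking fields match exactly one named person."""
    names_by_key = {}
    for person in identified:
        names_by_key.setdefault(link_key(person), []).append(person["name"])

    matches = []
    for record in records:
        candidates = names_by_key.get(link_key(record), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0], record))
    return matches


reidentified = reidentify(deidentified_records, public_list)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy data the first record links to exactly one named person even though it contains no explicit identifier, while the other two records each match two people and so remain ambiguous.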

In 1997, I presented the Datafly System, whose goal is to provide the most general information useful to the recipient. Datafly maintains anonymity in data by automatically aggregating, substituting and removing information as appropriate. Decisions are made at the field and record level at the time of database access, so the approach can be incorporated into role-based security within an institution as well as into exporting schemes for data leaving an institution. The end result is a subset of the original database that permits only minimal linking and matching of data, because each record matches as many people as the user has specified.
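The idea of making each released record match a user-specified number of people can be sketched as follows. This is a simplified illustration, not the Datafly algorithm itself. The field names, the digit-masking rule for ZIP codes, the `max_suppressed` threshold and the sample records are all assumptions made for the example.

```python
from collections import Counter


def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a ZIP code with '*'."""
    if level == 0:
        return zip_code
    return zip_code[:-level] + "*" * level


def anonymize(records, k, max_suppressed=1):
    """Generalize ZIP codes step by step, then suppress the few remaining outliers.

    A record is released only if its (zip, birth_year) combination is shared
    by at least k released records. Hypothetical helper for illustration only.
    """
    zip_len = 5
    for level in range(zip_len + 1):
        generalized = [{**r, "zip": generalize_zip(r["zip"], level)} for r in records]
        counts = Counter((r["zip"], r["birth_year"]) for r in generalized)
        kept = [r for r in generalized if counts[(r["zip"], r["birth_year"])] >= k]
        # Stop once few enough records would have to be removed,
        # or when the ZIP code cannot be generalized any further.
        if len(records) - len(kept) <= max_suppressed or level == zip_len:
            return kept


# Hypothetical records; all values are invented.
records = [
    {"zip": "02138", "birth_year": 1950, "diagnosis": "a"},
    {"zip": "02139", "birth_year": 1950, "diagnosis": "b"},
    {"zip": "02141", "birth_year": 1962, "diagnosis": "c"},
    {"zip": "02142", "birth_year": 1962, "diagnosis": "d"},
    {"zip": "90210", "birth_year": 1971, "diagnosis": "e"},
]

released = anonymize(records, k=2)
for row in released:
    print(row)
```

With `k=2`, masking one ZIP digit is enough for four of the five sample records to share their combination with another record; the fifth is removed rather than forcing further generalization on the rest. The threshold that trades suppression against generalization is a design choice of this sketch.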

Despite the possible effectiveness of these systems and others not mentioned here, completely anonymous data may not contain sufficient detail for all uses. Care must therefore be taken when released data can identify individuals, and such care must be enforced by coherent policies and procedures. The harm to individuals can be extreme and irreparable and can occur without the individual's knowledge. Remedy against abuse, however, lies outside these systems and resides in contracts, operating procedures and laws. For this reason, I also work on policy issues. Maintaining privacy and confidentiality in an electronic setting requires a symbiotic relationship between technology and policy.

Related Publications

  • Protection models for anonymous databases. Under review for publication.

  • Towards the collection of all the data on all the people. MIT Artificial Intelligence Working Paper, 1998.

  • Foundations of computational disclosure control. Under review for publication.

  • Commentary: researchers need not rely on consent or not. New England Journal of Medicine, 1998. (forthcoming)

  • Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression (with Pierangela Samarati). Unpublished.

  • Generalizing data to provide anonymity when disclosing information (with Pierangela Samarati). ACM Principles of Database Systems, Seattle, WA, USA, 1998. (forthcoming)

  • Towards the optimal suppression of details when disclosing medical data, the use of sub-combination analysis. Under review for publication.

  • Three computational systems for disclosing medical data in the year 1999. Proceedings, MEDINFO 98. International Medical Informatics Association. Seoul, Korea. North-Holland, 1998 (forthcoming).

  • Datafly: a system for providing anonymity in medical data. Database Security XI: Status and Prospects, T.Y. Lin and S. Qian, eds. IEEE, IFIP. New York: Chapman & Hall, 1998.
    Postscript file (238 KB)

  • Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.

  • Sweeney, L. Maintaining anonymity when sharing medical data, the datafly system. MIT Artificial Intelligence Laboratory Working Paper. Cambridge: AIWP-WP344 (1997).
    Long, technical paper.
    Postscript file (1.4 MB)
    Postscript file, Compressed (380 KB)

  • Sweeney, L. Computational disclosure control for medical microdata. Record Linkage Workshop Bureau of the Census. Washington: (1997).
    Coming to the web soon.

  • Sweeney, L. Guaranteeing anonymity when sharing medical data, the datafly system. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc, 1997.
    Short paper. Coming to the web soon.

  • Sweeney, L. Replacing personally-identifying information in medical records, the scrub system. In: Cimino, JJ, ed. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc, 1996:333-337.
    This paper was awarded First Prize at AMIA 1996.
    Postscript file (300 KB)



    Last modified 2/24/98 by latanya@andrew.cmu.edu