Maintaining privacy and confidentiality in an electronic setting has allowed me to explore a new area of computer science, which I term computational disclosure control. The goal of this work is to build and study computational techniques for controlling the disclosure of data so that the identity of any individual contained in the released data cannot be recognized. Producing anonymous data that remains specific enough to be useful is often very difficult, and practice today tends either to assume incorrectly that confidentiality is maintained when it is not, or to produce data that is practically useless. The objective of my work is to discover and control the inferences that can be drawn about the identities of entities contained in released data.
In 1996, I presented the Scrub System, which locates and replaces personally identifying information in unrestricted text. Letters between physicians and notes written by clinicians often contain nicknames, phone numbers and references to other caretakers and family members. The Scrub System found 99-100% of these references, while the straightforward approach of global search-and-replace properly located no more than 30-60% of them. However, the Scrub System merely de-identifies information and cannot guarantee anonymity. In de-identified data, all explicit identifiers, such as Social Security number, name, address and phone number, are removed, generalized or replaced with a made-up alternative. Anonymity, however, implies that the data cannot be manipulated or linked to identify any individual. Even when information shared with secondary parties is de-identified, it is often far from anonymous.
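To make the contrast with global search-and-replace concrete, here is a minimal sketch in Python. It is not the Scrub System itself; the patterns, the placeholder tokens, the sample note and the function names `naive_replace` and `deidentify` are all assumptions made for illustration. It shows why replacing only a known full name leaves a nickname and a phone number behind.

```python
import re

# Hypothetical patterns for two kinds of explicit identifiers (US formats assumed).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def naive_replace(text, full_name):
    """Global search-and-replace on a single known name."""
    return text.replace(full_name, "[NAME]")


def deidentify(text, names):
    """Replace SSNs, phone numbers and every listed name variant.

    `names` is an assumed, caller-supplied list that includes nicknames.
    """
    text = SSN.sub("[SSN]", text)
    text = PHONE.sub("[PHONE]", text)
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, "[NAME]", text, flags=re.IGNORECASE)
    return text


# Fictional sample note.
note = "Robert Smith (Bob) can be reached at 412-555-0134."

print(naive_replace(note, "Robert Smith"))
print(deidentify(note, ["Robert Smith", "Bob"]))
```

Even the second function only de-identifies: the remaining text may still be linked to a person through details that are not explicit identifiers.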
In 1997, I presented the Datafly System, whose goal is to provide the most general information useful to the recipient. Datafly maintains anonymity in data by automatically aggregating, substituting and removing information as appropriate. Decisions are made at the field and record level at the time of database access, so the approach can be incorporated into role-based security within an institution as well as into exporting schemes for data leaving an institution. The end result is a subset of the original database that permits minimal linking and matching of data, since each record matches as many people as the user has specified.
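The idea that each released record should match a user-specified number of people can be sketched as follows. This is an illustrative Python sketch, not the Datafly algorithm: the choice of fields (ZIP code and birth year), the strategy of masking trailing ZIP digits, the reading of the user's number as a minimum `k`, and the names `generalize_zip` and `release` are all assumptions.

```python
from collections import Counter


def generalize_zip(zip_code, level):
    """Mask the last `level` characters of a ZIP code with '*'."""
    keep = max(len(zip_code) - max(level, 0), 0)
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def release(records, k, max_level=5):
    """Generalize ZIP codes until every (zip, birth_year) pair occurs at least k times.

    If that is never reached, records whose pair is still too rare at
    `max_level` are removed from the release.
    """
    for level in range(max_level + 1):
        candidate = [(generalize_zip(z, level), year) for z, year in records]
        counts = Counter(candidate)
        if all(count >= k for count in counts.values()):
            return candidate
    return [record for record in candidate if counts[record] >= k]


# Fictional records: (ZIP code, birth year).
records = [
    ("15213", "1965"),
    ("15217", "1965"),
    ("15213", "1965"),
    ("15232", "1970"),
    ("15237", "1970"),
]

for row in release(records, k=2):
    print(row)
```

With `k=2`, one masked digit is enough here: every released row shares its ZIP prefix and birth year with at least one other row, so no row points to a single record.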
Despite the possible effectiveness of these systems and others not mentioned here, completely anonymous data may not contain sufficient detail for all uses, so care must be taken when released data can identify individuals, and such care must be enforced by coherent policies and procedures. The harm to individuals can be extreme and irreparable, and can occur without the individual's knowledge. Remedy against abuse, however, lies outside these systems and resides in contracts, operating procedures and laws. For this reason, I also work on policy issues. Maintaining privacy and confidentiality in an electronic setting requires a symbiotic relationship between technology and policy.
Last modified 2/24/98 by latanya@andrew.cmu.edu