Balancing Privacy and Security: The Privacy Implications of Government Data Mining Programs


Chairman Leahy, Members of the Committee:

It is a pleasure and an honor to be with you today to speak about the privacy implications of government data mining. You have chosen a very important issue to lead off what I know will be an aggressive docket of hearings and oversight in the Senate Judiciary Committee during the 110th Congress.

We all want the government to secure the country using methods that work. And we all want the government to cast aside security methods that do not work. The time and energy of the men and women working in national security are too important to be wasted, and law-abiding American citizens should not give up their privacy to government programs and practices that do not materially improve their security.

For the reasons I will articulate below, data mining is not, and cannot be, a useful tool in the anti-terror arsenal. The incidence of terrorism and terrorism planning is too low for there to be statistically sound modeling of terrorist activity.

The use of predictive data mining in an attempt to find terrorists or terrorism planning among Americans can only be premised on using massive amounts of data about Americans' lifestyles, purchases, communications, travels, and many other facets of their lives. This raises a variety of privacy concerns. And the high false-positive rates that would be produced by predictive data mining for terrorism would subject law-abiding Americans to scrutiny and investigation based on entirely lawful and innocent behavior.

I am director of information policy studies at the Cato Institute, a non-profit research foundation dedicated to preserving the traditional American principles of limited government, individual liberty, free markets, and peace. In that role, I study the unique problems in adapting law and policy to the information age. I also serve as a member of the Department of Homeland Security's Data Privacy and Integrity Advisory Committee, which advises the DHS Privacy Office and the Secretary of Homeland Security.

My most recent book is entitled Identity Crisis: How Identification Is Overused and Misunderstood. I am editor of Privacilla.org, a Web-based think tank devoted exclusively to privacy, and I maintain an online resource about federal legislation and spending called WashingtonWatch.com. At Hastings College of the Law, I was editor-in-chief of the Hastings Constitutional Law Quarterly. I speak only for myself today and not for any of the organizations with which I am affiliated or for any colleague.

There are many facets to data mining and privacy issues, of course, and I will discuss them below, but it is important to start with terminology. The words used to describe these information age issues tend to have fluid definitions. It would be unfortunate if semantics preserved disagreement when common ground is within reach.

What is Privacy?

Everyone agrees that privacy is important, but people often mean different things when they talk about it. There are many dimensions to "privacy" as the term is used in common parlance.

One dimension is the interest in control of information. In his seminal 1967 book Privacy and Freedom, Alan Westin characterized privacy as "the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others." I use and promote a more precise, legalistic definition of privacy: the subjective condition people experience when they have power to control information about themselves and when they have exercised that power consistent with their interests and values. The "control" dimension of privacy alone has many nuances, but there are other dimensions.

The Department of Homeland Security's Data Privacy and Integrity Advisory Committee has produced a privacy "framework" document that usefully lists the dimensions of privacy, including control, fairness, liberty, and data security, as well as sub-dimensions of these values. This "framework" document helps our committee analyze homeland security programs, technologies, and applications in light of their effects on privacy. I recommend it to you and have attached a copy of it to my testimony.

Fairness is an important value that is highly relevant here. People should be treated fairly when decisions are made about them using stores of data. This requires consideration of both the accuracy and integrity of data, and the legitimacy of the decision-making tool or algorithm.

Privacy is sometimes used to refer to liberty interests as well. When freedom of movement or action is conditioned on revealing personal information, such as when there is comprehensive surveillance, this is also a privacy problem. "Dataveillance" -- surveillance of data about people's actions -- is equivalent to video camera surveillance. The information it collects is not visual, but the consequences and concerns run closely parallel.

Data security and personal security are also important dimensions of "privacy" in its general sense. People are rightly concerned that information collected about them may be used to harm them in some way. We are all familiar with the information age crime of identity fraud, in which people's identifiers are used in remote transactions to impersonate them, debts are run up in their names, and their credit histories are polluted with inaccurate information. The Driver's Privacy Protection Act, Pub. L. No. 103-322, was passed by Congress in part due to concerns that public records about drivers could be used by stalkers, killers, and other malefactors to locate them.

Privacy Issues in Terms Familiar to the Judiciary Committee

I have spoken about privacy in general terms, but these concepts can be translated into language that is more familiar to the Judiciary Committee.

For example, if government data mining will affect individuals' life, liberty, or property -- including the recognized liberty interest in travel -- the questions whether information is accurate and whether an algorithm is legitimate go to Fifth Amendment Due Process. Using inaccurate information or unsound algorithms may violate individuals' Due Process rights if they cannot contest decisions that government officials make about them.

If officials search or seize someone's person, house, papers, or effects because he or she has been made a suspect by data mining, there are Fourth Amendment questions. A search or seizure premised on bad data or lousy math is unlikely to be reasonable and thus will fail to meet the crucial standard set by the Fourth Amendment.

I hasten to add that the Supreme Court's Fourth Amendment doctrine has rapidly fallen out of step with modern life. Information that people create, transmit, or store in online and digital environments is just as sensitive as the letters, writings, and records that the Framers sought protection for through the Fourth Amendment, yet a number of Supreme Court precedents suggest that such information falls outside of the Fourth Amendment because of the mechanics of its creation and transmission, or its remote storage with third parties.

A bad algorithm may also violate Equal Protection by treating people differently or making them suspects based on characteristics the Equal Protection doctrine has ruled out.

There are a number of different concerns that the American people rightly have with government data mining. The protections of our Constitution are meant to provide them security against threats to privacy and related interests. But before we draw conclusions about data mining, it is important to work on a common terminology to describe this field.

What is Data Mining?

There is little doubt that public debate about data mining has been hampered by the fact that people often do not use common terms to describe the concepts under consideration. Let me offer the way I think about these issues, first by dividing the field of "data analysis" or "information analysis" into two subsets: link analysis (also called subject-based analysis) and pattern analysis.

Link Analysis

Link analysis is a relatively unremarkable use of databases. It involves following known information to other information. For example, a phone number associated with terrorist activity might be compared against lists of phone numbers to see who has called that number, who has been called by that number, who has reported that number as their own, and so on. When the number is found in another database, a link has been made. It is a lead to follow, wherever it goes.
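In software terms, link analysis is little more than a set of lookups across data sets. The following is a minimal sketch in Python; the databases, field names, and phone numbers are all invented for illustration and stand in for no actual system:

    # Hypothetical sketch of link analysis: start from one known item (a
    # "seed" phone number) and collect every record that mentions it.
    def link_analysis(seed, databases):
        """Return the records in each database that mention the seed."""
        links = {}
        for name, records in databases.items():
            hits = [r for r in records if seed in r.values()]
            if hits:
                links[name] = hits
        return links

    databases = {
        "call_records": [
            {"caller": "555-0100", "callee": "555-0199"},
            {"caller": "555-0142", "callee": "555-0100"},
        ],
        "subscribers": [
            {"name": "J. Doe", "phone": "555-0100"},
        ],
    }

    # Each hit is a new link -- a lead to follow, wherever it goes.
    print(link_analysis("555-0100", databases))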

This is all subject to common sense and (often) Fourth Amendment limitations: The suspiciousness or importance of the originating information and of the new information dictates what is appropriate to do with, or based on, the new information.

Following links is what law enforcement and national security personnel have done for hundreds of years. We expect them to do it, and we want them to do it. The exciting thing about link analysis in the information age is that observations made by different people at different times, collected in databases, can now readily be combined. As Jeff Jonas and I wrote in our recent paper on data mining:

"Data analysis adds to the investigatory arsenal ofnational security and law enforcement by bringing together moreinformation from more diverse sources and correlating the data.Finding previously unknown financial or communications linksbetween criminal gangs, for example, can give investigators moreinsight into their activities and culture, strengthening the handof law enforcement."

Jonas is a distinguished engineer and chief scientist with IBM's Entity Analytic Solutions Group. I have attached our paper, Effective Counterterrorism and the Limited Role of Predictive Data Mining, to my testimony.

Following links from known information to new information is distinct from pattern-based analysis, which is where the concerns about "data mining" are most merited.

Pattern Analysis

Pattern analysis is looking for a pattern in data that has two characteristics: 1) It is consistent with bad behavior, such as terrorism planning or crime; and 2) it is inconsistent with innocent behavior.

In our paper, Jonas and I wrote about the classic Fourth Amendment case, Terry v. Ohio, where a police officer saw Terry walking past a store multiple times, looking in furtively. This was 1) consistent with criminal planning ("casing" the store for robbery) and 2) inconsistent with innocent behavior -- it didn't look like shopping, curiosity, or unrequited love of a store clerk. The officer's "hunch" in Terry can be described as a successful use of pattern analysis before the age of databases.

There are three ways that seem to be used (or, at least, have been proposed) to develop similar "hunches" -- or suitable patterns in data: 1) historical information; 2) red-teaming; and 3) anomaly.

Historical Patterns

As Jonas and I discuss in our paper, marketers use historical information to find the patterns that they use as their basis for action. They try to figure out which combinations of variables among current customers make them customers. When the combinations of variables are found again, this points them to potential new customers, and it merits sending a mailer to the prospects' homes, for example. Credit issuers do the same thing, and there is a fascinating array of different ways that they slice and dice information in search of good credit risks that other credit issuers have not found. Historical data is widely accepted in these areas as a tool for finding patterns, and consumers enjoy economic benefits from these processes.

Historical patterns can also form the basis for discovery of relatively common crimes, such as credit card fraud. With many thousands of examples per year, credit card networks are in a position to develop patterns of fraud based on historical evidence. Finding these patterns in current data, they are justified in calling their customers to ask whether certain charges are theirs. Jonas and I call this "predictive data mining" because the historical pattern predicts with suitable accuracy that a certain activity or condition (credit card fraud, a willing buyer, etc.) will be found when the pattern is found.
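To make the technique concrete, here is a toy sketch of predictive data mining in Python. The transactions and fraud labels are invented; a real credit card network would use far richer data and models, but the shape is the same: derive a pattern from many labeled historical examples, then look for that pattern in new data.

    from collections import Counter

    # Invented historical transactions: (country, amount band, was it fraud?)
    history = [
        ("US", "low", False), ("US", "low", False), ("US", "high", False),
        ("XX", "high", True), ("XX", "high", True), ("US", "low", False),
    ]

    # Learn the pattern: how often has each combination been fraudulent?
    totals, frauds = Counter(), Counter()
    for country, band, is_fraud in history:
        totals[(country, band)] += 1
        frauds[(country, band)] += is_fraud

    def fraud_rate(country, band):
        """Historical fraud rate for this combination of features."""
        seen = totals[(country, band)]
        return frauds[(country, band)] / seen if seen else 0.0

    # A new transaction matching a high-fraud pattern justifies a cheap,
    # low-stakes response: a phone call to the customer.
    print(fraud_rate("XX", "high"))  # 1.0 on this toy data

Note that the approach depends entirely on having many labeled historical examples, which is exactly what the terrorism context lacks.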

However, the terrorism context has a distinct lack of historical patterns to go on. In our paper, Jonas and I write:

"With a relatively small number of attempts every yearand only one or two major terrorist incidents every few years--eachone distinct in terms of planning and execution--there are nomeaningful patterns that show what behavior indicates planning orpreparation for terrorism."

The lack of historical patterns is just half of the problem with finding terrorists using pattern analysis.

False Positives

The rarity of terrorists and terrorist acts is good news, to be sure, but it further compounds the problem of data mining to find them: When a condition is rare, even a very accurate test for it will result in a high number of false positives. Even a highly accurate test is often inappropriate to use in searching for a rare condition among a large group.

In our paper, Jonas and I illustrate this using a hypothetical test for a disease that would accurately detect it 99 percent of the time and yield a false positive only 1 percent of the time. If the test indicated the disease, the protocol would call for a doctor to perform a biopsy on the patient to confirm or falsify the test result.

If 0.1 percent of the U.S. population had the disease, running the test on the entire population would identify 297,000 of the 300,000 people who have it. But doing so would also falsely identify nearly 3 million healthy people as having the disease and subject them to an unnecessary biopsy. Running the test multiple times would drive false positives even higher.
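The arithmetic is easy to verify. A short calculation, using the figures from our paper's illustration and an assumed population of 300 million:

    # Base-rate arithmetic behind the disease-test illustration.
    population = 300_000_000
    prevalence = 0.001           # 0.1 percent actually have the disease
    sensitivity = 0.99           # the test catches 99 percent of true cases
    false_positive_rate = 0.01   # and wrongly flags 1 percent of the healthy

    sick = population * prevalence                                # 300,000
    true_positives = sick * sensitivity                           # 297,000
    false_positives = (population - sick) * false_positive_rate   # ~2,997,000

    # Of everyone flagged, what fraction actually has the disease?
    precision = true_positives / (true_positives + false_positives)
    print(f"{true_positives:,.0f} true vs. {false_positives:,.0f} false positives")
    print(f"Only {precision:.0%} of positive results are real")   # about 9 percent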

The rarity of terrorists and terrorism planning in the U.S. means that even a highly accurate test for terrorists would have very high false positives. This, we conclude, would render predictive data mining for terrorism more harmful than beneficial. It would cost too much money, occupy too much investigator time, and do more to threaten civil liberties than is justified by any improvement in security it would bring.

"Red-Teaming"

A second way to create patterns is "red-teaming." This is the idea that one can create patterns to look for by planning an attack and then watching what data is produced in that planning process, or in preliminaries to carrying out the attack. That pattern, found again in data, would indicate planning or preparation for that type of attack.

This technique was not a subject of our paper, but many of the same problems apply. The pattern developed by red-teaming will match terrorism planning -- it is, after all, synthesized planning. But, to work, it must also not fit a pattern of innocent behavior.

Recall that after 9/11 people were questioned and even arrested for taking pictures of bridges, monuments, and buildings. By common intuition, photographing landmarks fits a pattern of terrorism planning. After all, terrorists need to case their targets. But photographing landmarks fits many patterns of innocent behavior also, such as tourism, photography as a hobby, the study of architecture, and so on. This clumsy, improvised "red-teaming" failed the second test of pattern development.

Formal red-teaming would surely be more finely tuned, but it still would have to overcome the false positive problem. Given an extremely small number of terrorists or terrorist activities in a large population, near perfection would be required in the pattern, or it would yield massive error rates, invite waste of investigative energy, and threaten privacy and civil liberties.

It seems doubtful that red teams would be able to devise an attack with a data profile so narrow that it does not create excessive false positives, yet so broad that it matches some group's plan for a terror attack. To me, using red-teaming this way has all the plausibility of stopping a fired bullet with another bullet.

Red-teaming can be useful, it seems, but not for data analysis. If red-teaming were to come up with a viable attack, the means of carrying out that attack should be foreclosed directly with new security measures applied to the tool or target of the attack -- never mind who might carry it out. It would be gross malpractice for anyone in our national security services to conceive of an attack on our infrastructure or people, and then fail to secure against the vulnerability directly while watching for the attack's pattern in data.

Anomaly

Without historical or red-team patterns, some have suggested that anomaly should be the basis of suspicion. Given the patterns in data of "normal" behavior, things deviating from that might be regarded as suspicious. (This is actually a version of historical patterning, but the idea is to find deviation from a pattern rather than matching to a pattern.)
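A sketch may show what anomaly detection looks like in practice. The data and the three-standard-deviation cutoff below are illustrative assumptions, not any agency's actual method:

    import statistics

    # Invented "normal" history -- say, daily counts of some activity.
    normal_history = [42, 40, 41, 43, 39, 41, 42, 40, 44, 41]
    mean = statistics.mean(normal_history)
    stdev = statistics.stdev(normal_history)

    def is_anomalous(value, threshold=3.0):
        """Flag values more than `threshold` standard deviations from normal."""
        return abs(value - mean) / stdev > threshold

    # The outlier is flagged, but nothing connects it to wrongdoing -- and
    # an adversary who simply behaves "normally" is never flagged at all.
    for value in [41, 75, 40]:
        print(value, is_anomalous(value))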

It is downright un-American to think that acting differently could make a person a suspect. On a practical level, one-in-a-million things happen a million times a day. Looking for anomalies will turn up lots of things, but none relevant. And terrorists could avoid this technique by acting as normally as possible. In short, anomaly is not a legitimate basis for forming suspicion.

Historical-pattern-based data analysis -- what Jeff Jonas and I call "predictive data mining" -- has many uses in things such as medical research, marketing and credit scoring, many forms of scientific inquiry, and other searches for knowledge. It is not useful in the terrorist discovery problem. Searching for "red-teamed" patterns and for anomalies has many of the same flaws.

Data Mining for Terrorists Does Not Work

The conclusion whether a type of data analysis "works" turns on the most important question in this debate: What action does a "match" create a predicate for? When a link, pattern, or deviation from a pattern has been established, and then it is found in the data, what action will be taken?

When marketers use a historical pattern to determine who will receive a promotional flyer, this predictive data mining "works" even if it is wrong 95 percent of the time. The cost of being wrong may be 50 cents for mailing it, and a few moments of time for the person wrongly identified as a potential customer.

Predictive data mining is also appropriate for detecting credit card fraud. A call from the credit issuer will quickly reassure the customer, whether he or she was correctly targeted or not.

Predictive data mining and other forms of pattern analysis might be used to send beat cops to a certain part of town. The harm from being wrong is some wasted resources -- which nobody wants, of course -- but there is no threat to individual rights.

If, on the other hand, government officials are using data mining to pull U.S. citizen travelers out of line, if they are using patterns to determine that phones in the United States should be tapped, and so on, data mining does not "work" unless it is quite a bit more accurate.
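One way to summarize the difference is as an expected-cost question: how often is a match right, and what does acting on a wrong match cost? The figures below are illustrative only:

    def expected_cost(precision, fp_cost, tp_cost=0.0):
        """Average cost of acting on one match, weighting errors by their rate."""
        return precision * tp_cost + (1 - precision) * fp_cost

    # Marketing: wrong 95 percent of the time, but an error costs ~50 cents.
    print(expected_cost(precision=0.05, fp_cost=0.50))    # 0.475 per flyer

    # Opening an investigation on an innocent person costs vastly more, in
    # dollars, investigator time, and liberty; the same error rate fails.
    print(expected_cost(precision=0.05, fp_cost=10_000))  # 9,500 per match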

The question whether data mining works is not a technical one. It is not a question for computer or database experts to answer. It is a question of reasonableness under the Fourth Amendment, to be determined by the courts, by Congress, and, broadly speaking, by society as a whole.

Because of the near statistical impossibility of catching terrorists through data mining, and because of its high costs in investigator time, taxpayer dollars, lost privacy, and threatened liberty, I conclude that data mining does not work in the area of terrorism.

But my conclusion should not be determinative. Rather, it should be an early part of a national conversation about government data analysis, the applications in which data analysis and data mining "work," and those in which they do not.

Fairness, Reasonableness, and Transparency

One of the most important places for that conversation to happen is in Congress -- here in this Committee -- and in the courts. This hearing begins to shed light on the questions involved in data mining.

But government data mining programs must also be subjected to the legal controls imposed by the Constitution. The question whether a data analysis program affecting individuals passes constitutional muster brings us to the final important question: whether the program provides redress.

"Redress" is data-analysis jargon for Due Process. If a datamining or other data analysis system is going to affectindividuals' rights or liberty, Due Process requires that theperson should be able to appeal or contest the decision made usingthe system, ultimately -- if not originally -- in a court oflaw.

This requires two things, I think: access to the data that was analyzed in determining that the person should be singled out, and access to the pattern or link information that was used to determine that the person should be singled out.

Access to data is like asking the police officer in Terry v. Ohio what he saw when he determined that he should pat down the defendant. Was the officer entitled to look where he looked? Was he paying sufficient attention to the defendant's actions? We would not deny defendants the chance to explore these questions in a criminal court, and we should not let data mining that affects individuals' liberties escape similar scrutiny.

Access to the pattern/algorithm allows review analogous to determining whether the officer's decision to pat down Terry was, as required by the Fourth Amendment, reasonable. Was the pattern of behavior he saw so consistent with wrongful behavior, and so inconsistent with innocent behavior, that it justifies having law enforcement intervene in the privacy and repose of the presumed innocent? This question can and should be asked of data mining programs.

Government data mining and data analysis may seem to involve highly technical issues, reserved for computer and database experts. But, again, the most important questions are routinely addressed by this Committee, by Congress, by the press, and by the American people. The questions are embedded in the Constitution's Fourth and Fifth Amendments and the Supreme Court's precedents. They are about simple fairness: Do these systems use accurate information? Do they draw sensible conclusions? And do their findings justify the actions officialdom takes because of them?

Citizens must have full redress/Due Process when their rights or liberties are affected by government data mining or other data analysis programs, just as when their rights or liberties are affected by any program. This requires transparency, which to date has not been forthcoming.

Many data-intensive programs in the federal government -- data mining or not -- have been obscured from the vision of the press, the public, and Congress. Often, these programs are hidden by thick jargon and inadequate disclosure.

This hearing, and your continued oversight, will help clear the fog. Proponents of these programs should make the case for them, forthrightly and openly.

In some cases, data-intensive programs have been obscured by direct claims to secrecy. These claims would prevent the courts, Congress, and the public from determining whether the programs are fair and reasonable.

The secrecy claims suggest that these systems are poorly designed. It is well known that "security by obscurity" is a weak security practice. It amounts to hiding weaknesses, rather than repairing them, in the hope that your attacker does not find them. Data-intensive systems that require secrecy to function -- that do not allow people to see the data used or review the algorithm -- are premised on security by obscurity.

These systems have weaknesses. We just do not know what they are. Because people on our side in the press, the public, Congress, and elsewhere cannot probe these systems and look for their flaws, they will tend to have more flaws than systems that are transparent, and subject to criticism and testing. We will not know when an attacker has discovered a flaw and is preparing to exploit it.

The best security systems are available for examination and testing -- by good people and bad people alike -- and they still work to secure. Locks on doors are a good, familiar example. Anyone can study locks and learn how to break them, yet they serve the purpose they are designed for, and we know enough not to use them for things they will not protect.

As long as we are unable to examine government data analysis systems the same way we examine locks and other security tools, these systems will not provide reliable security. Worse, they will pose an ongoing threat to privacy and civil liberties.

Conclusion

I have devoted my testimony to the question whether government data mining can work to discover terrorism. The security issues are paramount. I believe it is clear that data mining does not work for this purpose.

Government data mining relies on access to large stores of data about Americans -- from federal government files, from state public records, from telecommunications company databases, from banks and payment processors, from health care providers, and so on. Predictive data mining, in particular, hungers for Americans' personal information because it uses data both in the development of patterns and in the search for those patterns.

There is a growing industry that collects consumer data for useful purposes like marketing and consumer credit. But this industry also appears to see the government as a lucrative customer. Most Americans are probably still unaware that a good deal of information about them in the data-stream of commerce may be used by their government to make decisions that coercively affect their lives, liberty, and property.

Here, again, the answer is transparency. Along with the transparency that will give this Committee the ability to do effective oversight of programs and practices, there should be transparency of the type that empowers individuals.

The data used in government data mining programs should be subject to the protections of the Privacy Act, no matter where the data is housed or by whom it is processed. Data in these programs cannot be exempted from the Privacy Act under national security or law enforcement exemptions without treating all citizens like suspects.

The data sources should be made known, especially when data or analyses are provided to the government by private providers. This would allow the public to better understand where the information economy may work against their interests.

Many things must be done to address the privacy implications of government data mining. This hearing provides an important start, commencing a needed conversation on the issues. Transparency, and much more examination of government data mining, is the most important first step toward making sure that this information age practice is used to the maximum benefit of the American people.

Jim Harper

Committee on the Judiciary
United States Senate