Data Matching: Some Ethical Considerations
Michael McFarland, SJ
Matching is the comparison of personal data from two or more different sources in a search for anomalous conditions. The anomaly might be the existence of a person in two different data sets where that is not expected or allowed. For example, persons drawing welfare or unemployment benefits should not also appear in a list of workers currently employed by the federal government. If they do, the presumption is that they are "double-dipping," drawing benefits fraudulently. Or the anomaly might be the presence of the person in one set of data and absence in another, where any person in one should be in the other. When registration for the draft was mandatory for 18-year-old males, for example, a list of males over 18 years of age holding driver's licenses might be checked against a list of draft registrants to see if any in the former list were missing from the latter. Finally, the anomaly might be an inconsistency in the data for the same individual in two different data sets. The Internal Revenue Service, for example, might be interested in a discrepancy between the income declared on an individual's tax return and the income listed for the same individual in an employer's records.
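The three kinds of anomaly described above can be illustrated with a minimal sketch. All identifiers, figures, and function names here are made up for illustration; real matching programs run against large administrative files and are far more involved.

```python
# Hypothetical records keyed by a shared identifier (made-up values, not real SSNs).
benefit_recipients = {"id-001", "id-002", "id-003"}
federal_employees = {"id-003", "id-004"}

licensed_males_over_18 = {"id-010", "id-011", "id-012"}
draft_registrants = {"id-010", "id-012"}

declared_income = {"id-020": 40000, "id-021": 52000}
employer_reported_income = {"id-020": 40000, "id-021": 61000}


def unexpected_overlap(a, b):
    """Identifiers present in both sets, where presence in both is not allowed."""
    return sorted(a & b)


def missing_from(required, actual):
    """Identifiers in `required` that fail to appear in `actual`."""
    return sorted(required - actual)


def inconsistent(first, second):
    """Identifiers found in both mappings whose recorded values disagree."""
    return sorted(k for k in first.keys() & second.keys() if first[k] != second[k])


print(unexpected_overlap(benefit_recipients, federal_employees))  # ['id-003']
print(missing_from(licensed_males_over_18, draft_registrants))    # ['id-011']
print(inconsistent(declared_income, employer_reported_income))    # ['id-021']
```

Each function returns only a list of identifiers. A "hit" of this kind is a flag for further review, not a finding about the person behind the identifier.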
These searches often involve looking for matches among millions, even hundreds of millions, of records. It is the processing power of the digital computer and the existence of voluminous computerized databases that make them feasible. In order to identify matches reliably, it is also necessary to have a way of identifying individuals that is unique and consistent across all the databases to be compared. Names do not work because different individuals can have the same name. In the United States the identifier most often used is the Social Security number (SSN). That is one of the reasons why so many different institutions, both in government and in the private sector, require that clients supply their SSNs. Most other countries have some kind of national identification numbers that all citizens are required to have.
In the United States, computer matching has been used extensively since the 1970s to check recipients of benefit programs for possible fraud and abuse. By 1982, there were about 500 different matching searches carried out regularly by state and federal agencies. 1 These were tied to government attempts to make benefit programs less wasteful and thereby to reduce their drain on state and federal budgets. A number of federal laws, for example, require that states implement matching programs if they are to receive federal funding for welfare programs. 2 As a result, welfare recipients are often subject to matching. For example, those receiving Aid to Families with Dependent Children (AFDC) are checked against "data from state income tax, motor vehicle registration, school records, correction files on inmate status, veteran records, worker's compensation, and low income home compilations together with bureau records from employers, banks and credit agencies," 3 in order to find conditions that might make them ineligible. The welfare reform bill passed in 1996, responding to the popular concern over the welfare costs caused by parents, mostly fathers, who abandon their children, requires that lists of parents who default on child support payments be checked against "Directories of New Hires" to be set up in every state and nationally. 4 In both these cases, it should be noted, the search includes records from the private sector as well as government records. The Social Security number is almost always the linking identifier.
Matches are used to try to increase revenue as well as control costs. The Internal Revenue Service checks tax forms against local government and private employer records to try to catch undeclared and underreported income. 5
Matching can also be used for enforcement in areas not directly tied to the budget. During the Vietnam War, matches were run against various lists of males who should be eligible for the draft to try to catch draft evaders. Since 9/11, the Department of Homeland Security, the FBI and the Defense Intelligence Agency, among others, have carried out numerous searches that sift through massive amounts of data from a wide variety of sources, including Internet searches, looking for anomalies that might be signs of terrorists and terrorist activities. These searches involve U.S. citizens as well as foreign nationals. This activity continues even though Congress recently killed a similar effort because of concerns about privacy and civil rights. 6
With the development of facial-recognition technology, the use of matching has extended into databases of photos and videos. For example, one type of program, used by at least 34 states, combs through databases of driver's license photos, looking for matches that might indicate that someone has created a second, false identity. Unfortunately, the program creates false positives that cast suspicion on innocent people, sometimes with dire consequences. The same technology is being adopted in many other areas as well. The State Department, for example, uses it to check visa applications, and now some police departments are preparing to use facial recognition to take pictures of people they encounter and check them against databases of known offenders. 7
The United States is not alone in its use of computer matching. In Germany personal data linked through Social Security numbers is used to expose fraud and regulate the work permits issued to foreigners. In Australia a national identification system permits the government to match data relevant to taxation. Sweden also uses extensive data collection to support the tax system. 8
It is easy to see why matching is attractive to government authorities. They cannot investigate individually the millions of taxpayers and welfare and Social Security recipients they deal with every year. Matching provides a very efficient way of detecting possible cases of fraud and abuse. It has helped catch people who have been receiving welfare benefits for which they were not eligible, found health-care providers who were double-billing Medicare, and detected thousands of cases where Social Security checks were being sent to deceased beneficiaries. In 1984 Richard Kusserow, the inspector general of the federal Department of Health and Human Services, claimed that in one year matching programs contributed to $1.4 billion in savings by his department. 9 The potential of matching to uncover terrorist plots is also attractive to many.
Nevertheless, matching raises some serious problems. One is reliability. Whatever the problems are with maintaining a single data file on a group of individuals, they are far worse when trying to relate two or more separate ones with different origins and administrators. Social Security numbers and other identifiers can be misread, mistyped, intentionally falsified or missing altogether. This can lead to many missed matches and false matches. Files can have different formats or coding systems, so that data in one or both files are incomparable or subject to misinterpretation. Data can be out of date, incomplete or in error. Moreover the problem of taking the data out of context is far more serious. What comes out of a matching program is usually just a list of identifiers for records that matched (or did not match) in the two data files. Therefore not only is the data divorced from its original context and meaning, but the list of matches is removed from the data that gave rise to it. This makes it even more likely that the users will draw false conclusions from it. 10 Trying to match photos of faces has its own set of problems and inaccuracies. Even the best-trained human experts are often wrong.
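The identifier and format problems described above can be sketched in a few lines. The two files, the names, and the `normalize` helper are hypothetical; the identifiers are deliberately fake and only shaped like SSNs.

```python
import re


def normalize(identifier):
    """Keep digits only, so '000-00-0001' and '000000001' compare equal."""
    return re.sub(r"\D", "", identifier)


# Hypothetical files from two different administrators.
# File B stores one identifier without dashes and mistypes another (0002 -> 0003).
file_a = {"000-00-0001": "A. Example", "000-00-0002": "B. Example"}
file_b = {"000000001": "A. Example", "000-00-0003": "B. Example"}

# Comparing raw keys finds nothing because the formats differ.
raw_matches = sorted(file_a.keys() & file_b.keys())

# Normalizing repairs the format difference but cannot repair the typo.
normalized_matches = sorted(
    {normalize(k) for k in file_a} & {normalize(k) for k in file_b}
)

print(raw_matches)         # []
print(normalized_matches)  # ['000000001']
```

In this sketch the second person is missed entirely because of a single mistyped digit. If a third person in file A legitimately held the mistyped number, the same error would produce a false match instead.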
An incident some years ago in Massachusetts indicates how unreliable matching programs can be. In 1994 the state ran a search to find families that were receiving welfare benefits simultaneously from Massachusetts and another state. They found 642 "hits," which supposedly meant that those families were receiving benefits illegally. However, when the state tried to cut off benefits, several of the families sued, and the court found that at least 378 of the 642 families identified were mistakenly accused. 11 Unfortunately cases such as this with error rates of 50 percent or more are not that unusual. When the US Department of Health, Education and Welfare ran its list of welfare recipients against a list of its own employees to find welfare cheaters, it generated 33,000 matches. After a year of investigation, this was reduced to 638 cases of possible fraud, and of these only 55 were taken to court. 12
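The arithmetic behind these two cases can be checked directly. The figures below are the ones quoted above; the `share` helper is only an illustrative name.

```python
def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Massachusetts, 1994: at least 378 of 642 flagged families were wrongly accused,
# so this is a lower bound on the error rate.
massachusetts_error = share(378, 642)

# HEW employee match: 33,000 raw matches, 638 possible fraud cases, 55 taken to court.
hew_possible_fraud = share(638, 33000)
hew_taken_to_court = share(55, 33000)

print(f"{massachusetts_error:.2f}")  # 58.88
print(f"{hew_possible_fraud:.2f}")   # 1.93
print(f"{hew_taken_to_court:.2f}")   # 0.17
```

By this count, nearly 59 percent of the Massachusetts hits were mistaken, while fewer than 2 percent of the HEW matches survived investigation as even possible fraud.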
The above cases highlight another problem with matching programs: lack of due process. Matches in themselves are often interpreted as evidence of wrongdoing. Therefore in welfare investigations, for instance, authorities will move to cut off benefits for those identified by the matching program. The accused must then go to court to get the benefits restored, an expensive and time-consuming process, during which the accused may have no means of support. Thus those identified by the program lose their right to the presumption of innocence based on the outcome of a machine sifting through millions of records, without any human being looking at their case or checking the validity of the outcome. The burden of proof is shifted to the accused, which is unfair, especially given the unreliability of the matching procedure.
There is a more fundamental objection to matching, which addresses the nature of the activity, quite apart from any consequences. The problem is that a matching program accesses the personal information of large numbers of people, mostly innocent, without their knowledge or consent. Often information given by the subject for the purposes of credit, banking, employment, education or health care is used to check on compliance with government programs. On the face of it, this is a violation of the subjects' autonomy. Furthermore people are accused of wrongdoing and subject to sanctions based on the outcome of a computer program looking for certain conditions in a large volume of data. This reduces people to collections of facts, or, worse yet, computer codes, which is a violation of the obligation to treat them as persons. Some critics, like John Shattuck, also argue that matching violates the right to freedom from "unreasonable searches and seizures" protected by the Fourth Amendment of the United States Constitution, which decrees that the government is only allowed to investigate people when it has an indication that they are involved in wrongdoing. It was directed against mass house-to-house searches and arbitrary "stop-and-frisk" operations on the street. The argument is that combing through the personal data of large populations looking for anomalies is the same kind of arbitrary intrusion into innocent people's lives. 13
Those who defend matching, on the other hand, claim that when people apply for welfare or Social Security, or pay their taxes, they implicitly give their consent to using relevant data for checking on their eligibility. This is the argument made, for example, by Rubin E. Cruse, Jr., in an article in the Computer/Law Journal. 14 According to Cruse, when people apply for benefits, they must reveal their incomes and assets. Therefore when the government checks records of such data, it is simply looking at information that the applicants have already consented to give. Furthermore the government only gains new information on people who gave false information on their applications. It is only the wrongdoers who are identified by the matching process. The others and their data go unremarked. And those who have been shown by the matching process to have discrepancies in their data can legitimately be investigated, because there is now evidence that they have been involved in wrongdoing. Implicit in this argument is the assumption that when personal data is run through a computer with no human monitoring, there is no violation of confidentiality, except for the data that is reported out.
The case of taxpayers is similar, Cruse argues. When they file their tax returns they disclose to the IRS their incomes. When the IRS checks these figures against income reports from employers, banks and so on, it is not gaining access to any information that it does not already have unless there is a discrepancy between self-reported income and income listed in other sources. But in that case there are grounds for an investigation.
This clever argument does capture the sense most people have that privacy and confidentiality should not be a cover for fraud. It depends, however, on an interesting philosophical assumption that not everyone would grant. It assumes that there is no access to records except when someone looks at them. But when a government agency runs a match on data from another source, particularly in the private sector, the government does appropriate the personal data of many thousands or even millions of people, most of whom have no reason to be subject to an investigation. And it is only when the run is finished that government knows which subjects are suspect. Furthermore, the argument that the investigators only look at the records of those whose data is suspicious assumes a high degree of reliability in the matching process and is a difficult assumption to justify. If the error rate were 1 percent, or even 10 percent, it might be possible to claim that a "hit" was enough to throw suspicion on the subject. But when the error rate is 50 percent, or 90 percent, that is not such a convincing assertion.
Even if we grant its underlying assumptions, the argument that matching has the implicit consent of its subjects is valid only under limited circumstances. First it must be used only on people who have provided information, either freely in return for some benefit, or in response to some legitimate government mandate, such as paying taxes. Second it can be used only to verify information that the subjects have provided. And third, the possibility of a matching program and accompanying investigation must be known and understood by the subjects when they provide the information.
Michael McFarland, S.J., a computer scientist with extensive liberal arts teaching experience and a special interest in the intersection of technology and ethics, served as the 31st president of the College of the Holy Cross.
1. Clarke, "Information Technology and Dataveillance," p. 504.
2. Ibid., p. 504.
3. Simitis, "Reviewing Privacy in an Information Society," p. 715.
4. Boyd, "In Cyberspace, Private Files are Becoming an Open Book," p. 3.
5. Simitis, p. 716.
6. Declan McCullagh, "Government Data-Mining Lives On," CNet, (June 1, 2004), http://news.cnet.com/Government-data-mining-lives-on/2010-1028_3-5223088.html
7. Robert Charette, "Here's Looking at You, and You, and You …," IEEE Spectrum, (July 25, 2011), http://spectrum.ieee.org/riskfactor/computing/it/heres-looking-at-you-and-you-and-you-.
8. Simitis, pp. 716-17.
9. Richard P. Kusserow, "The Government Needs Computer Matching to Root Out Waste and Fraud," Communications of the ACM, 27(6) (June, 1984): 542-545.
10. Clarke, p. 506.
11. Judy Rakowsky, "Weld Officials Err in Welfare Crackdown," The Boston Globe (October 20, 1994): 29-30.
12. Clarke, p. 508.
13. John Shattuck, "Computer Matching is a Serious Threat to Individual Rights," Communications of the ACM, 27(6) (June, 1984): 538-541.
14. Rubin E. Cruse, Jr., "Invasions of Privacy and Computer Matching Programs: A Different Perspective," Computer/Law Journal XI (1992): 461-480.
Jun 1, 2012