Markkula Center for Applied Ethics

Dataset Catalog for Data Ethics Projects

The Markkula Center for Applied Ethics hosts a list of datasets that can be used to introduce and address considerations of ethics in graduate and undergraduate machine learning and data analytics courses.
 

Markkula Center contact for Dataset release for SCU students: Subbu Vincent (svincent@scu.edu)

 

Home Mortgage Disclosure Act

This research-ready dataset of U.S. home mortgage loan applications is based on data collected under the federally mandated Home Mortgage Disclosure Act. In 2014, the most recent year for which data are available, 7,062 financial institutions reported about 11.7 million loan records. These records include applications for home purchase, home improvement, and refinancing.

Potential ethics-linked analysis:

Explore this US national mortgage dataset and build predictions for loan approvals or denials.

Check a) whether the data can be used to predict loan approvals; b) if so, identify any data bias effects and how they impact fairness; and c) if there is unfairness, outline the specifics and what could be changed.
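One starting point for part (b) is comparing approval rates across applicant groups. The sketch below is a minimal, dataset-agnostic version: the field names ("race", "approved") are hypothetical stand-ins and would need to be mapped to the actual HMDA columns.

```python
# Sketch: approval-rate disparity across applicant groups.
# Field names ("race", "approved") are hypothetical; map them to the
# actual HMDA columns before use.
from collections import defaultdict

def approval_rates(records, group_key="race", outcome_key="approved"):
    """Return the approval rate per group as a dict."""
    counts = defaultdict(lambda: [0, 0])  # group -> [approvals, total]
    for r in records:
        counts[r[group_key]][0] += 1 if r[outcome_key] else 0
        counts[r[group_key]][1] += 1
    return {g: a / t for g, (a, t) in counts.items()}

def disparity_ratio(rates):
    """Demographic-parity ratio: lowest group rate / highest group rate.
    Values well below 0.8 (the 'four-fifths rule') flag potential bias."""
    return min(rates.values()) / max(rates.values())

# Toy illustration with made-up records:
records = [
    {"race": "A", "approved": 1}, {"race": "A", "approved": 1},
    {"race": "A", "approved": 0}, {"race": "A", "approved": 1},
    {"race": "B", "approved": 1}, {"race": "B", "approved": 0},
    {"race": "B", "approved": 0}, {"race": "B", "approved": 0},
]
rates = approval_rates(records)   # {"A": 0.75, "B": 0.25}
print(disparity_ratio(rates))     # 0.25 / 0.75 ≈ 0.333
```

The same rate comparison can be re-run on a model's *predicted* approvals to see whether a trained model amplifies disparities already present in the raw data.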

Availability:

For SCU Students: Upon request (Licensed by the Markkula Center)

For others:  https://www.propublica.org/datastore/dataset/home-mortgage-disclosure-act

 

Criminal recidivism risk analysis for parole decisions

Across the nation, judges and probation and parole officers are increasingly using algorithms to assess a criminal defendant’s likelihood of re-offending. The linked data include: a database containing the criminal history, jail and prison time, demographics, and COMPAS risk scores for defendants from Broward County from 2013 and 2014; code in R and Python; a Jupyter notebook; and other files needed for the analysis.

Potential ethics-linked analysis:

A commercial algorithm that calculates recidivism risk was found to be biased against certain groups of people. Do your own exploration of the data, identify risks, and see what could be changed in your data, algorithm, or models (including what the data owners could be asked to provide) to make your predictions perform more fairly. (You need to define the question sharply.)

Caveats/Notes: ProPublica reported its analysis in March 2016. In November 2018, Duke researchers published a paper identifying an issue with ProPublica's approach to reverse engineering the COMPAS algorithm. They also pointed out deeper issues with the definition of fairness, the lack of transparency in COMPAS's algorithm, and more. If you are using this dataset, review this paper:

Cynthia Rudin, Caroline Wang, and Beau Coker, "The Age of Secrecy and Unfairness in Recidivism Prediction"

https://arxiv.org/abs/1811.00731
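A core metric in ProPublica's analysis was the false positive rate by race: how often people who did *not* re-offend were labeled high risk. The sketch below computes it from rows shaped like ProPublica's published CSV; the field names (decile_score, two_year_recid, race) are believed to match that file, but verify them against the actual download.

```python
# Sketch: false-positive-rate comparison by group, in the spirit of
# ProPublica's COMPAS analysis. Field names (decile_score,
# two_year_recid, race) should be verified against the released CSV.
from collections import defaultdict

HIGH_RISK = 5  # ProPublica treated decile scores of 5+ as higher risk

def fpr_by_group(rows):
    """FPR = share of non-recidivists labeled high risk, per group."""
    fp = defaultdict(int)   # high-risk non-recidivists
    neg = defaultdict(int)  # all non-recidivists
    for row in rows:
        if row["two_year_recid"] == 0:
            neg[row["race"]] += 1
            if row["decile_score"] >= HIGH_RISK:
                fp[row["race"]] += 1
    return {g: fp[g] / n for g, n in neg.items() if n}

# Toy illustration with made-up rows:
rows = [
    {"race": "A", "decile_score": 7, "two_year_recid": 0},
    {"race": "A", "decile_score": 3, "two_year_recid": 0},
    {"race": "B", "decile_score": 2, "two_year_recid": 0},
    {"race": "B", "decile_score": 8, "two_year_recid": 1},
]
print(fpr_by_group(rows))  # {"A": 0.5, "B": 0.0}
```

Comparing this against the false negative rate per group is what surfaces the fairness-definition conflict the Rudin et al. paper discusses: equalizing one error rate across groups generally unbalances another.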

Availability: (this dataset is free)

Dataset: https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis

Github site: https://github.com/propublica/compas-analysis/

 

Small Business Loans

The Small Business Administration’s 7(a) program provides loans to small business owners who can't obtain financing through traditional channels. The program operates through private-sector lenders who provide loans that are, in turn, guaranteed by the SBA; the program itself has no funds for direct lending or grants. The data contain information on the business receiving the loan, including its address and industry code; the bank lending the money; the amount loaned; and (where applicable) whether the loan was paid in full or charged off.

Potential ethics-linked analysis:

Explore this small business loan dataset to check a) whether it can be used to predict loan approvals; b) if so, identify any data bias effects and how they impact fairness; and c) if there is unfairness, outline the specifics and what could be changed.
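Because this dataset records outcomes (paid in full vs. charged off) rather than protected attributes, one bias angle is whether historical charge-off rates differ sharply by borrower segment, since a model trained on that history may learn to deny whole segments. A minimal sketch, with hypothetical field names ("naics_prefix", "status") that would need to be mapped to the real columns:

```python
# Sketch: charge-off rate by borrower segment in SBA-7(a)-style records.
# Field names ("naics_prefix", "status") are hypothetical stand-ins.
from collections import defaultdict

def chargeoff_rates(loans, segment_key="naics_prefix"):
    """Fraction of loans charged off, per segment."""
    tallies = defaultdict(lambda: [0, 0])  # segment -> [charged_off, total]
    for loan in loans:
        t = tallies[loan[segment_key]]
        t[0] += loan["status"] == "CHGOFF"
        t[1] += 1
    return {seg: c / n for seg, (c, n) in tallies.items()}

def flag_segments(rates, overall, factor=2.0):
    """Segments whose charge-off rate is at least `factor` times the
    overall rate; candidates for a closer fairness look."""
    return sorted(s for s, r in rates.items() if r >= factor * overall)

# Toy illustration with made-up loans:
loans = [
    {"naics_prefix": "72", "status": "CHGOFF"},
    {"naics_prefix": "72", "status": "PIF"},
    {"naics_prefix": "54", "status": "PIF"},
    {"naics_prefix": "54", "status": "PIF"},
]
print(flag_segments(chargeoff_rates(loans), overall=0.25))  # ['72']
```

Industry code and address can also proxy for protected attributes (e.g., neighborhood demographics), which is worth examining even though those attributes are absent from the file.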

Availability:

For SCU Students: Upon request (Licensed by the Markkula Center)

For others:  https://www.propublica.org/datastore/dataset/home-mortgage-disclosure-act

 

Predicting Factuality of Reporting and Bias of News Media Sources

A team at MIT has developed an SVM-based machine learning model (paper) to detect factuality and partisanship (bias) in news at the site level. For a given domain, the model predicts the site's partisan slant (far-left, left, centrist, right, far-right) and its factuality. To create their dataset, the researchers crawled https://mediabiasfactcheck.com/, a human-reviewed rating service that provides detailed qualitative summaries of news and disinformation sites.

Potential analysis and ethics-linked work:

Replicate their study. Identify alternative models to SVM that improve accuracy. Create additional features in the dataset by scraping web sources about those domains, and explore whether that improves accuracy. Explore whether any particular types of legitimate news sites are disadvantaged by this approach.
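As a scaled-down starting point for replication, the sketch below wires text features into a linear SVM with scikit-learn. This is not the authors' pipeline: the site snippets and labels are invented stand-ins, and the real study uses richer site-level features than raw text.

```python
# Sketch: a toy site-level factuality classifier (TF-IDF + linear SVM),
# standing in for the MIT pipeline. Texts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

site_texts = [
    "peer reviewed sources cited corrections issued",
    "editorial standards transparent ownership disclosed",
    "anonymous claims no sources sensational headlines",
    "conspiracy unverified viral outrage clickbait",
]
factuality = ["high", "high", "low", "low"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(site_texts, factuality)
print(model.predict(["sources cited and corrections issued"]))
```

Swapping LinearSVC for another classifier in the same pipeline is the natural way to run the "alternative models" comparison, and per-class error rates on held-out sites would show whether certain legitimate site types are systematically misclassified.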

Availability:

For SCU Students: Upon request (With the Markkula Center)

For others:  The data are hosted on MIT’s GitHub and require secure credentials to access.

 

Credits:

The first pilot was run in Sanjiv Das’s FNCE 3490 class at Santa Clara University in Spring 2018, with a second iteration following in Spring 2019. Students in CSCI 180 (Sukanya Manna, Winter 2019) also used this inventory.

What students typically do:

  1. Define their question well, especially the ethical considerations: identifying bias, de-biasing the data, adjusting the pipeline, and measuring fairness-versus-accuracy tradeoffs
  2. Execute their analysis across multiple models and dataset variations
  3. Present their findings and recommendations, or write a project report (depending on the faculty)
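The fairness-versus-accuracy tradeoff in step 1 can be made concrete by sweeping a decision threshold over model scores and reporting overall accuracy alongside the demographic-parity gap (the spread in positive-prediction rates between groups). The scores, labels, and groups below are synthetic:

```python
# Sketch: accuracy vs. demographic-parity gap across thresholds,
# on synthetic scores, labels, and group labels.
def tradeoff(scores, labels, groups, thresholds):
    results = []
    for t in thresholds:
        preds = [s >= t for s in scores]
        acc = sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)
        rate = {}
        for g in set(groups):
            idx = [i for i, gg in enumerate(groups) if gg == g]
            rate[g] = sum(preds[i] for i in idx) / len(idx)
        gap = max(rate.values()) - min(rate.values())  # parity gap
        results.append((t, acc, gap))
    return results

scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.6, 0.2, 0.1]
labels = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
for t, acc, gap in tradeoff(scores, labels, groups, [0.25, 0.5, 0.75]):
    print(f"threshold={t:.2f}  accuracy={acc:.2f}  parity_gap={gap:.2f}")
```

Tabulating these pairs across thresholds (or across de-biased dataset variants, as in step 2) gives students a quantitative basis for the recommendations in step 3.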