3. Replicating ProPublica’s COMPAS Audit
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from sklearn.metrics import roc_curve
import warnings
warnings.filterwarnings('ignore')
We will reload the cleaned data
clean_data_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
df = pd.read_csv(clean_data_url).set_index('id')
and look at the top to remember what it looks like.
df.head(1) # only 1 row to save space when zoomed in
3.1. What does set_index do?
How can you figure out what it does from the code alone?
To find out, we can re-run the data loading step without set_index
pd.read_csv(clean_data_url).head(1)
If we compare this to the one above, we note that when we use set_index('id') the 'id' column is treated differently: it becomes the row index (the row labels) instead of a regular data column.
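To make the difference concrete, here is a minimal sketch on a small made-up table. The values and the `toy` and `indexed` names are hypothetical and not taken from the COMPAS file; only the `set_index('id')` call mirrors the one above.

```python
import pandas as pd

# Toy data with hypothetical values (not from the COMPAS file)
toy = pd.DataFrame({'id': [10, 20, 30],
                    'age': [25, 41, 33],
                    'decile_score': [3, 7, 1]})

indexed = toy.set_index('id')

# Before: 'id' is an ordinary column and the index is the default 0, 1, 2
print(list(toy.columns))
print(list(toy.index))

# After: 'id' is no longer a column; its values label the rows instead
print(list(indexed.columns))
print(indexed.index.name)

# Rows can now be looked up by their id with .loc
print(indexed.loc[20, 'age'])
```

Note that `set_index` returns a new DataFrame, so `toy` itself is left unchanged.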
3.2. Computing Statistics
df['decile_score'].mean()
We can compute the same mean for each group using groupby
df.groupby('race')['decile_score'].mean()
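One way to see what `groupby` is doing is to compare it with computing each group's mean by hand. The sketch below uses a made-up table with placeholder group labels 'A' and 'B'; the `toy`, `group_means`, and `manual_means` names are illustrative only.

```python
import pandas as pd

# Toy data with hypothetical values and placeholder group labels
toy = pd.DataFrame({'race': ['A', 'A', 'B', 'B', 'B'],
                    'decile_score': [2, 4, 6, 8, 10]})

# Split the rows by 'race', take the mean of 'decile_score' in each
# group, and combine the results into one Series indexed by group
group_means = toy.groupby('race')['decile_score'].mean()

# The same computation written out by hand, one group at a time
manual_means = {
    label: toy.loc[toy['race'] == label, 'decile_score'].mean()
    for label in toy['race'].unique()
}

print(group_means)
print(manual_means)
```

Both approaches give one mean per group; `groupby` simply does the filtering and combining in one step.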
3.3. Plotting by race
df.groupby('race')['decile_score'].value_counts().plot(kind='bar')
df.groupby('race')['decile_score'].value_counts().unstack().plot(kind='bar')
df.groupby('race')['decile_score'].value_counts().unstack().T.plot(kind='bar')
df.groupby('decile_score')['race'].value_counts().unstack().plot(kind='bar')
race_score_counts = df.groupby('decile_score')['race'].value_counts().unstack()
race_score_counts.plot(kind='bar',figsize=(12,7))
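The plots above differ mainly in how the counts are arranged before plotting. As a rough sketch on made-up data (the `toy`, `counts`, `wide`, and `wide_filled` names and all values are hypothetical), `value_counts()` after a `groupby` gives a long Series with a two-level index, and `unstack()` pivots the inner level into columns so that each column is drawn as its own set of bars.

```python
import pandas as pd

# Toy data with hypothetical values and placeholder group labels
toy = pd.DataFrame({'decile_score': [1, 1, 1, 2, 2],
                    'race': ['A', 'B', 'A', 'A', 'A']})

# Long format: one count per (decile_score, race) pair that occurs
counts = toy.groupby('decile_score')['race'].value_counts()
print(counts)

# Wide format: rows are decile_score, columns are race
wide = counts.unstack()
print(wide)

# Pairs that never occur show up as NaN; fill_value=0 turns them into zeros
wide_filled = counts.unstack(fill_value=0)
print(wide_filled)
```

Adding `.T` to the wide table swaps rows and columns, which changes which variable ends up on the x axis and which is shown as color.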
df.groupby('priors_count')['race'].value_counts().unstack().plot(kind='bar',figsize=(12,7))
df.groupby(['race','decile_score'])['two_year_recid'].mean()
race_score_recid = df.groupby(['race','decile_score'])['two_year_recid'].mean().unstack()
race_score_recid
race_score_recid.T.plot(kind='bar',figsize=(12,7))
race_score_recid
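Assuming `two_year_recid` is coded as 0 or 1, the mean within each group is the fraction of people in that group who recidivated. The sketch below illustrates this on made-up data; the `toy`, `recid_rate`, and `group_size` names and all values are hypothetical. It also computes group sizes, since a rate based on very few people is less reliable.

```python
import pandas as pd

# Toy data with hypothetical values; two_year_recid is assumed to be 0/1
toy = pd.DataFrame({
    'race': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'decile_score': [1, 1, 2, 1, 1, 2, 2],
    'two_year_recid': [0, 1, 1, 0, 0, 1, 0],
})

grouped = toy.groupby(['race', 'decile_score'])['two_year_recid']

# Mean of a 0/1 column = share of ones in each (race, decile_score) cell
recid_rate = grouped.mean().unstack()

# Number of rows behind each rate
group_size = grouped.count().unstack()

print(recid_rate)
print(group_size)
```

Reading the two tables side by side shows which rates rest on only one or two rows.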
race_counts_df = df.groupby(['race','decile_score'])['two_year_recid'].count().reset_index().rename(
columns={'two_year_recid':'count'})
race_counts_df
sns.set_theme(palette='colorblind',context='poster')
sns.catplot(data=race_counts_df,x='decile_score',row='race',y='count',
kind='bar',hue='race',aspect=2)
df.groupby(['race','decile_score'])['two_year_recid'].apply(pd.Series.mode)
dfQ = pd.read_csv('https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_cq.csv')
dfQ.head()
df.head(1)
dfQ[['two_year_recid','score_text']].corr()
dfQ.groupby('race')[['two_year_recid','score_text']].corr()
dfQ.groupby(['race','score_text'])[['two_year_recid']].mean()