3. Replicating ProPublica’s COMPAS Audit

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from sklearn.metrics import roc_curve
import warnings
warnings.filterwarnings('ignore')

We will reload the cleaned data

clean_data_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
df = pd.read_csv(clean_data_url).set_index('id')

and look at the top to remember what it looks like.

df.head(1) # show only 1 row to save space when zoomed in

3.1. What does set_index do?

How can you figure out what it does using only code?

To find out, we can re-run the data loading step without set_index

pd.read_csv(clean_data_url).head(1)

If we compare this to the output above, we see that set_index('id') treats the 'id' column differently: it becomes the row index (the row labels) instead of remaining a regular data column.
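As a small illustration of the same behavior, here is a sketch on a made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the COMPAS data.

```python
import pandas as pd

# Hypothetical toy data, not the COMPAS dataset
toy = pd.DataFrame({'id': [10, 20, 30], 'score': [3, 7, 5]})

# set_index returns a new DataFrame; toy itself is unchanged
indexed = toy.set_index('id')

# 'id' is a regular column before, and the row index after
print(list(toy.columns))
print(list(indexed.columns))
print(indexed.index.name)

# rows can now be looked up by their id with .loc
print(indexed.loc[20, 'score'])
```

Because the index labels each row, later operations can refer to a person by 'id' rather than by row position.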

3.2. Computing Statistics

df['decile_score'].mean()

We can get the same statistic for each group using groupby

df.groupby('race')['decile_score'].mean()
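To see what groupby does to the mean, here is a minimal sketch on a made-up table small enough to check by hand. The `group` and `score` columns are hypothetical stand-ins for 'race' and 'decile_score'.

```python
import pandas as pd

# Hypothetical toy data, not the COMPAS dataset
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'b'],
    'score': [2, 4, 6, 8, 10],
})

# one mean over every row
overall_mean = toy['score'].mean()

# one mean per group: the rows are split by 'group' first,
# then the mean is computed within each split
group_means = toy.groupby('group')['score'].mean()

print(overall_mean)
print(group_means)
```

The overall mean can hide differences between groups, which is why the audit looks at each group separately.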

3.3. Plotting by race

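# count each decile score within each race: one bar per (race, score) pair
df.groupby('race')['decile_score'].value_counts().plot(kind='bar')
# unstack so each decile score is a column: one cluster of bars per race
df.groupby('race')['decile_score'].value_counts().unstack().plot(kind='bar')
# transpose so each cluster is a decile score and each bar is a race
df.groupby('race')['decile_score'].value_counts().unstack().T.plot(kind='bar')
# a similar table, built by grouping on decile score first
df.groupby('decile_score')['race'].value_counts().unstack().plot(kind='bar')
# save the table of counts and draw a larger figure
race_score_counts = df.groupby('decile_score')['race'].value_counts().unstack()
race_score_counts.plot(kind='bar',figsize=(12,7))
# repeat with priors_count in place of decile_score
df.groupby('priors_count')['race'].value_counts().unstack().plot(kind='bar',figsize=(12,7))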
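The plots above all rely on the same reshaping step, so it may help to see it without the plotting. This is a sketch on made-up data; `toy`, `score`, and `group` are hypothetical names, not columns of the COMPAS data.

```python
import pandas as pd

# Hypothetical toy data, not the COMPAS dataset
toy = pd.DataFrame({
    'score': [1, 1, 1, 2, 2],
    'group': ['a', 'a', 'b', 'a', 'b'],
})

# long form: a Series with a two-level index (score, group)
counts = toy.groupby('score')['group'].value_counts()

# wide form: unstack moves the inner index level (group) into columns
table = counts.unstack()

# .T swaps rows and columns
flipped = table.T

print(counts)
print(table)
print(flipped)
```

When a DataFrame is drawn with `.plot(kind='bar')`, each row becomes a cluster on the x-axis and each column becomes one bar in that cluster, so choosing between `table` and `flipped` decides which variable goes on the axis and which goes in the legend.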
df.groupby(['race','decile_score'])['two_year_recid'].mean()
race_score_recid = df.groupby(['race','decile_score'])['two_year_recid'].mean().unstack()
race_score_recid
race_score_recid.T.plot(kind='bar',figsize=(12,7))
race_score_recid
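Taking the mean of a 0/1 column gives the fraction of rows equal to 1, so the table above can be read as a rate for each combination of the two grouping columns. A minimal sketch on made-up data follows; `outcome` is a hypothetical stand-in for 'two_year_recid', and the values are invented.

```python
import pandas as pd

# Hypothetical toy data, not the COMPAS dataset
toy = pd.DataFrame({
    'group':   ['a', 'a', 'a', 'b', 'b', 'b'],
    'score':   [1, 1, 2, 1, 2, 2],
    'outcome': [0, 1, 1, 0, 0, 1],
})

# the mean of a binary column within each (group, score) cell is the
# share of rows in that cell where outcome == 1
rates = toy.groupby(['group', 'score'])['outcome'].mean().unstack()

print(rates)
```

Each row of `rates` is a group and each column is a score, so reading across a row shows how the rate changes with the score for that group.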
race_counts_df = df.groupby(['race','decile_score'])['two_year_recid'].count().reset_index().rename(
    columns={'two_year_recid':'count'})
race_counts_df
sns.set_theme(palette='colorblind',context='poster')
sns.catplot(data=race_counts_df,x='decile_score',row='race',y='count',
           kind='bar',hue='race',aspect=2)
df.groupby(['race','decile_score'])['two_year_recid'].apply(pd.Series.mode)
dfQ = pd.read_csv('https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_cq.csv')
dfQ.head()
df.head(1)
dfQ[['two_year_recid','score_text']].corr()
dfQ.groupby('race')[['two_year_recid','score_text']].corr()
dfQ.groupby(['race','score_text'])[['two_year_recid']].mean()