Case Study: AI and Bias in Hiring Practices

February 19, 2019


Predicting personality traits and performance potential from video content is a compelling topic in hiring circles right now. But what’s the best methodology and technology to deploy? Which types of data are best used for predictive hiring? As with any machine learning problem, achieving success here all comes down to getting data that is not biased and is tied to the outcome we are trying to predict.
ChaLearn LAP, a group focused on “looking at people” (LAP) in images, initiated a competition using the ChaLearn First Impressions dataset with the goal of using machine learning to predict personality traits. The group focused on the Big Five Personality Traits, a framework commonly used to assess candidates: openness, conscientiousness, extraversion, agreeableness and neuroticism (OCEAN). Typically, psychologists measure these traits with extensive multiple-choice questionnaires that often take hours for each individual. But because both gathering video and administering those tests are time-consuming and expensive, the dataset used for training was created using YouTube videos coupled with personality traits identified by human evaluators.
The dataset includes 10,000 clips (average duration of 15 seconds) extracted from more than 3,000 different YouTube videos of English speakers. The human evaluators of these videos, Amazon Mechanical Turk workers, were given some training and then shown pairs of videos and asked to compare the two people on each of the Big Five traits (which person is more open, agreeable, etc.), along with an interview flag (which person they would rather invite in for a job interview). These comparisons were repeated over many different video pairs and shown to several evaluators. Finally, an algorithm was used to translate these comparisons into an overall score for each video on each of the six measures (Big Five + interview progression).
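The article does not say which algorithm converted the pairwise judgments into per-video scores. As one plausible illustration only, the sketch below fits a Bradley-Terry model, a standard technique for turning "A was preferred over B" records into a single strength per item. The function name `bradley_terry_scores`, the clip ids, and the judgments are all hypothetical.

```python
from collections import defaultdict


def bradley_terry_scores(comparisons, n_iter=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumption: an illustrative stand-in, not the dataset's documented
    scoring method. Each pair must contain two different items.
    Returns {item: strength}, with strengths summing to 1.
    """
    wins = defaultdict(int)         # times each item was preferred
    pair_counts = defaultdict(int)  # times each unordered pair was judged
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    items = {item for pair in pair_counts for item in pair}
    strength = {item: 1.0 / len(items) for item in items}

    # Standard minorize-maximize update for the Bradley-Terry likelihood.
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, count in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += count / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {item: s / total for item, s in updated.items()}
    return strength


# Hypothetical judgments: each tuple is (preferred clip, other clip).
judgments = [
    ("clip_a", "clip_b"), ("clip_a", "clip_b"), ("clip_b", "clip_a"),
    ("clip_a", "clip_c"), ("clip_a", "clip_c"), ("clip_c", "clip_a"),
    ("clip_b", "clip_c"), ("clip_b", "clip_c"), ("clip_c", "clip_b"),
]
scores = bradley_terry_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Running one such fit per measure (the five traits plus the interview flag) would yield six scores per clip. Whatever the method, any bias in the evaluators' preferences passes straight through into the scores.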
Given our depth of experience studying human judgment of job candidates, we were curious about what was actually being measured through these apparent personality traits and whether it was truly fair and accurate. The truth is, even with highly trained human evaluators and consistent conditions and questions, human-evaluated personality assessment is difficult. Fifteen seconds of a random video likely does not contain the information necessary to accurately and fairly assess these traits. What evaluators are then left to rely on is truly a “first impression,” with the information they are gathering informed by what the person looks like and sounds like in just a few seconds.
While this dataset is described as a “personality dataset,” in reality the data says more about how humans perceive personality than about the subjects' true personality and job fit potential. In this case, how do Mechanical Turk workers intuit personality traits from only 15 seconds of video?
To investigate, we used our trained deep learning models to predict age, race, gender, and attractiveness for the subject of each of these videos. These models are trained with self-identified age, race and gender, and average attractiveness as evaluated by other people. We then looked at Mechanical Turk-generated score distributions for each of the measured attributes for different groups. The results were striking.
Age. The score differences showed that older people are seen as more conscientious and less neurotic, which could be considered as positive, but are also seen as less agreeable, open, extraverted, and, ultimately and most importantly, less likely to be recommended for a job interview.
Gender. Looking at male/female differences, we see that female scores were generally distributed more towards the higher score end, with the exception of agreeableness.

Race. Score distributions by race showed that white and Asian subjects were consistently rated higher than Black subjects and others in all six dimensions.
Attractiveness. Splitting the attractiveness rating into three tiers, we see what is probably the strongest trend in the data. This is especially interesting because fairness based on looks is not addressed in most processes (“unattractive” people are not a legally protected class). In the First Impressions dataset, better looking people are seen as more everything, below average looking people are seen as less, and average looking people have a pretty flat distribution.
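The group comparisons above can be reproduced in spirit with a few lines of code. The following is a minimal sketch, assuming each clip has been reduced to a (group label, score) pair for one measure; the function names, the group labels, and the numbers are hypothetical and are not taken from the First Impressions data. It reports per-group summaries and a standardized mean difference (Cohen's d with a pooled standard deviation).

```python
from math import sqrt
from statistics import mean, variance


def group_summary(records):
    """Return {group: {"n": count, "mean": mean score}} for (group, score) pairs."""
    by_group = {}
    for group, score in records:
        by_group.setdefault(group, []).append(score)
    return {g: {"n": len(s), "mean": mean(s)} for g, s in by_group.items()}


def standardized_gap(records, group_a, group_b):
    """Cohen's d: (mean_a - mean_b) divided by the pooled standard deviation.

    Each group needs at least two scores so a variance can be computed.
    """
    a = [score for group, score in records if group == group_a]
    b = [score for group, score in records if group == group_b]
    pooled_var = (
        (len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)
    ) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical "interview" scores, made up purely for illustration.
records = [
    ("older", 0.4), ("older", 0.5), ("older", 0.6),
    ("younger", 0.6), ("younger", 0.7), ("younger", 0.8),
]
summary = group_summary(records)
gap = standardized_gap(records, "older", "younger")
```

A negative `gap` means the first group scored lower on average. With real data one would repeat this for each of the six measures and each demographic split, and look at the full distributions rather than means alone.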
These results are illuminating on several levels. It is well documented that first impressions play a big role in the interviewing process. A 2015 study found that 30 percent of interviewers make their decision about an interviewee within the first five minutes of the interview. More than that, the first impression interviewers form of a candidate greatly influences how they perceive the candidate's responses throughout the interview.
On the data science practitioner side of things, these results are a powerful reminder of the importance of auditing algorithmic assessments for adverse impact to ensure that AI-driven evaluation is not mimicking bias from the training data.
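One common starting point for such an audit is the adverse impact ratio behind the "four-fifths rule": each group's selection rate is divided by the highest group's rate, and ratios below 0.8 are flagged for closer review. The sketch below is a minimal illustration under that assumption; the function name, group labels, and counts are hypothetical, and a real audit would add significance testing and legal review.

```python
def adverse_impact_ratios(outcomes):
    """Compute each group's selection rate relative to the top group's rate.

    outcomes maps group -> (number selected, number assessed).
    """
    rates = {g: selected / total for g, (selected, total) in outcomes.items()}
    top_rate = max(rates.values())
    if top_rate == 0:
        raise ValueError("no group has any selections; ratios are undefined")
    return {g: rate / top_rate for g, rate in rates.items()}


# Hypothetical counts of candidates advanced by a model, per group.
outcomes = {"group_a": (50, 100), "group_b": (30, 100)}
ratios = adverse_impact_ratios(outcomes)

# Flag groups whose ratio falls below the four-fifths (0.8) threshold.
flagged = sorted(g for g, ratio in ratios.items() if ratio < 0.8)
```

Here `group_b` is advanced at 60 percent of the rate of `group_a`, so it would be flagged. The threshold is a screening heuristic, not proof of bias or of its absence.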
Unfortunately, these results are not particularly shocking for those of us combating bias in the hiring space. While humans add value to the decision-making process, on their own they are often a source of bias. It is thus incredibly important for recruiters and hiring managers to have an objective evaluation of each candidate against which they can check their "first impression" and "gut feeling." Properly vetted, AI can be that objective decision support, providing crucial insight so humans can make better, less biased hiring decisions.

The Authors: 

Dr. Lindsey Zuloaga, Ph.D., is Director of Data Science at HireVue.