Hi all! I have moved over to www.tortureddata.com, so please change any bookmarks that you may have to that address, and check it out for the latest updates!
After a long hiatus, I’m back with a new post! I really wanted to write about Chicago traffic for this post, due to my recent observation of new “bus only” lanes running down Madison and Clinton in the Loop, which I feel would have been better utilized elsewhere (for example LaSalle, which has 4 bus routes and takes 20 minutes to travel 4 blocks during rush hour). On a related note, there is a project in the works for a new “flyover” L track at the Belmont stop to reduce train delays caused by the Red and Brown lines crossing, but there are downsides to this development, including cost and the aesthetics of the neighborhood. Ideally I was hoping to find data sets on ridership and traffic to investigate both the Loop Link and the Flyover projects, but unfortunately the necessary data is not currently available. I made a request which hasn’t been moderated yet, but when it is posted you can help me get the data by signing in and up-voting my request here.
Since my data set of choice was not an option, I needed to come up with a new topic for analysis, and I approached this decision in a very scientific manner (googling “interesting data sets”). Luckily I found an amazing resource, 100+ Interesting Data Sets for Statistics. The second data set on the list is a compilation of the last statements of all inmates who have been executed in Texas since the 1980s. That seemed like an interesting corpus for some exploratory text analysis, so I decided to work with that.
Getting the data proved to be a bit of a challenge. On the website’s main page there is a chart with some demographic information about each inmate, and in this chart is a link to another page with the text of each inmate’s last words. There are 500+ rows and links, so I definitely did not want to click through each link and copy/paste all that data. This gave me an opportunity to test out a tool I found a while ago called import.io. This tool enabled me to provide the web addresses of the data I wanted, and then it would sift through each link and harvest the text into a file that I could download as a .csv.
Once I had all the data in a flat file format, I started familiarizing myself with the content. Reading through a few of the last statements, I noticed that certain themes appear. Some offenders ask for forgiveness. Some maintain innocence, or speak to mitigating circumstances around their crime. Some talk about their faith. Many say thank you to family and friends. I thought it might be interesting to see if these themes have any correlation to the demographic data in the data set, like age, race, location, etc. However, since the themes are not in the data set, I would need to create them. I decided to use topic modeling to score the statements for themes, and then I could use the results of that model to do the demographic analysis.
Of the 512 records that I was able to import (19 of the 531 on the website failed import), 153 inmates either declined to make a statement or made one of fewer than 15 words, so these were excluded from the data set. That left me with 359 last statements, along with the race, age, date of execution and county of the prisoners who made them.
The next step was to clean the text data and put it in a format that can be used for analysis. The text is cleaned for punctuation, numbers and whitespace. Stopwords are removed (words like “a”, “and”, “but”, etc.) and stemming is performed (grouping words that have the same root but different endings, for example run, runs and running). From there I created a term-document matrix, which transforms the data from a compilation of documents to a matrix that contains the number of times each word appears in each document.
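The post doesn’t show its code, so here is a minimal sketch of the cleaning and counting steps in plain Python, not the author’s implementation. The stopword list and the two example statements are invented for illustration, and stemming is left out to keep the sketch short.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "but", "the", "to", "i", "my", "you", "for", "of"}

def clean(text):
    """Lowercase, replace punctuation and digits with spaces, split on
    whitespace, and drop stopwords. Stemming is not performed here."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]

def term_document_matrix(docs):
    """Return (sorted vocabulary, matrix), where matrix[i][j] is the number
    of times term i appears in document j."""
    counts = [Counter(clean(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

# Two invented statements, not taken from the real corpus.
docs = [
    "I love my family. Thank you for the love!",
    "Forgive me, and thank you, family.",
]
vocab, tdm = term_document_matrix(docs)
for term, row in zip(vocab, tdm):
    print(term, row)
```

Each printed row is one term followed by its count in each document, which is the shape of the matrix described above.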
Looking at this matrix, I can see that by far the most frequently used word in an offender’s last statement is love. Other frequently used words include family, thank, forgive, god/lord, and hope (some of the words have unexpected endings, like familia instead of family, due to the stem completion function).
Next I transformed the values in the matrix from counts into term frequency inverse document frequency (tf-idf) values. This statistic gives higher weight to terms in a particular document if they are frequent in that document, but not in the corpus. Words that are frequent in the corpus overall get lower weights.
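To make the weighting concrete, here is a small sketch of one common tf-idf variant (term count divided by document length, times log2 of the inverse document frequency). The exact formula used for the post isn’t stated, so treat this variant and the made-up counts as assumptions.

```python
import math

def tf_idf(matrix):
    """Convert a term-document count matrix (rows = terms, columns = documents)
    into tf-idf weights: (count / document length) * log2(n_docs / doc_freq)."""
    n_docs = len(matrix[0])
    doc_lengths = [sum(row[j] for row in matrix) for j in range(n_docs)]
    weighted = []
    for row in matrix:
        # Number of documents in which this term appears at least once.
        doc_freq = sum(1 for count in row if count > 0)
        idf = math.log2(n_docs / doc_freq) if doc_freq else 0.0
        weighted.append([
            (count / doc_lengths[j]) * idf if doc_lengths[j] else 0.0
            for j, count in enumerate(row)
        ])
    return weighted

# Invented counts. Rows: "love", "family", "innocent"; columns: three statements.
counts = [
    [2, 1, 1],  # "love" appears in every document
    [1, 1, 0],  # "family" appears in two documents
    [0, 0, 3],  # "innocent" appears in only one
]
weights = tf_idf(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy matrix, “love” gets a weight of zero everywhere because it appears in every document, while “innocent” gets the largest weight because it is frequent in one document and absent from the rest.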
With the matrix prepared, I could finally perform the topic analysis. I used a package called “topicmodels”, which contains a function to perform LDA (Latent Dirichlet Allocation), a type of topic modeling. In LDA, the terms in the documents are tested for co-occurrence, and words that often occur together are grouped into a topic. This method is unsupervised, meaning that the user does not provide outcome data; instead, the user needs to review the words in each group to give meaning to the topics.
This package requires that the number of topics be specified. To choose a number of topics, I used three different measures: log-likelihood (measures how well the model generalizes to a test sample), entropy (measures how the topics are distributed in the documents), and human judgement of semantic meaning. First, I ran the LDA model for 2 through 20 topics (for each topic count I ran 20 iterations and averaged the measures) and collected the log-likelihood and entropy measurements. I then plotted these:
The plots on the left show the average log-likelihood (top, higher is better) and entropy (bottom, lower is better) for each topic count. The plots on the right are the distance of each point from the diagonal, which is helpful for determining the point of diminishing returns. From these, it appears that log-likelihood is optimized at 8 topics, and entropy is optimized at 7. From there I ran models with 7 and 8 topics, and used my judgement to determine which version produced topics with the most semantic meaning. I ended up selecting a model with 8 topics:
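As an aside on the “distance from the diagonal” plots described above, here is one way that calculation could be sketched. It assumes the diagonal is the straight line joining the first and last points after both axes are rescaled to 0..1; the post doesn’t spell out its exact method, and the log-likelihood values below are made up.

```python
def distance_from_diagonal(xs, ys):
    """Rescale both axes to [0, 1], then return each point's perpendicular
    distance from the straight line joining the first and last points."""
    x_span = xs[-1] - xs[0]
    y_span = ys[-1] - ys[0]
    nx = [(x - xs[0]) / x_span for x in xs]
    ny = [(y - ys[0]) / y_span for y in ys]
    # After rescaling, the line runs from (0, 0) to (1, 1), i.e. y = x,
    # and the distance of a point (a, b) from it is |b - a| / sqrt(2).
    return [abs(b - a) / 2 ** 0.5 for a, b in zip(nx, ny)]

# Invented log-likelihood values for 2..10 topics (not the post's data).
topic_counts = [2, 3, 4, 5, 6, 7, 8, 9, 10]
log_lik = [-1000, -940, -880, -820, -760, -700, -640, -630, -620]

dist = distance_from_diagonal(topic_counts, log_lik)
best = topic_counts[dist.index(max(dist))]
print(best)  # topic count where the curve bends most sharply
```

The point farthest from the diagonal is where adding more topics stops paying off, which is the “point of diminishing returns” idea.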
Here are a few examples of statements in the corpus, and how the topic model scored them:
Now that I’ve created the topics, given them meaningful labels based on the common words and verified that these categories make sense by looking through a few examples, I want to visualize the documents and topics in the data set to see how they relate to each other:
In this diagram each node is a document, and the nodes are colored to demonstrate which topic was most prevalent in that particular document. The nodes are connected to other nodes that are most similar in topical makeup, so this allows you to see which topics are often mentioned together, and which are more discrete. It appears that topics 7 (Encouragement) and 2 (Our Father) are often mentioned together, while topics 1 (Mitigating Circumstances) and 8 (Message to Family) are rarely grouped with any other topic.
Next I would like to look at the breakdown of the topics by a few variables that I have in my data set.
Here are two charts showing the topics through the two time variables, date and age:
There aren’t many clear trends across time. It looks like topics 3 and 8 have generally decreased, and topics 4 and 6 have generally increased, but there is so much variability that it is difficult to draw conclusions from these charts.
Looking at the prevalence of the topics by offender age, it does seem that two of the topics have general patterns. As an offender gets older, there is a clear increase in topic 1 and a decrease in topic 7. This is interesting as these are the only two topics that included the word “innocent” with high probability. It seems that the number of offenders who mention innocence (the combination of these two topics) remains relatively constant, but the messaging changes with age. When an offender is younger, they state their innocence along with pleas to their friends and family to stay strong and continue the fight for justice. As the age of the offenders increases, they use this messaging less and instead detail their own circumstances and how it is that they ended up where they are.
Next I looked into regional differences. I had county data in my data set, which I was able to tie to larger public health regions to reduce the number of pivots.
From there I combined regions with sparse data into larger areas to bolster sample size, which resulted in 6 regions for the analysis:
The chart above shows the deviation from the mean for each topic in the various regions. The Northeast and Southeast are relatively average in terms of which topics are in offenders’ last statements. Offenders in the South use topic 6 (Apology and Forgiveness) more than average, and topic 1 (Mitigating Circumstances) less, so it seems that this is the most empathetic region. Offenders in the Northwest use topic 6 less than average, so this might be the least empathetic region. Offenders in the East use topic 5 (Religion) less than most, so this seems to be the least religious region. The most variable region is the Central region, in which offenders use topics 2 (Our Father) and 5 (Religion) more than average, and topics 6 (Apology and Forgiveness) and 7 (Encouragement) less than average, which seems to imply that this region is more focused on the afterlife and less on the current life than the others.
To check whether there are differences in topic usage by race, I ran ANOVAs on all 8 topics. I found that topics 1 and 3 show marginally significant differences between races (p < .1), and topics 6 and 7 show differences that are significant at the p < .005 level. Looking at the two innocence-related topics (1 and 7), this tells us that both white and black offenders speak of innocence more often than Hispanic offenders, although they use different language. White offenders tend to refer to the circumstances, while black offenders tend to offer words of encouragement to their loved ones. Hispanic offenders apologize and ask for forgiveness (topic 6) by far the most often of the three races.
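For anyone curious what a one-way ANOVA computes, here is a minimal sketch of the F statistic in plain Python. The group names and topic scores are invented, and turning F into a p-value needs the F distribution from a statistics library, which is left out here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA,
    where `groups` is a list of lists of scores."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical topic scores for three groups of statements (invented numbers).
scores = {
    "group_a": [0.10, 0.12, 0.08, 0.11],
    "group_b": [0.09, 0.13, 0.10, 0.12],
    "group_c": [0.30, 0.28, 0.35, 0.31],
}
f_stat, df_b, df_w = one_way_anova_f(list(scores.values()))
print(round(f_stat, 2), df_b, df_w)
```

A large F means the group means differ by more than the spread within the groups would suggest, and running this once per topic mirrors the eight tests described above.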
With this methodology, these statements could be reviewed along with other statistics for bias in conviction rates. Specifically when it comes to the topics related to innocence, determining whether a specific population (race, age, region) is more likely than others to maintain innocence could help to research whether wrongful convictions are occurring at a higher rate in that population. In this case, I would only use these algorithmically generated topics as a starting point, as the number of documents is relatively low and the most accurate way to determine the content of the statements would be to have human raters scoring them. Topic modeling however allows for expedited initial research, and in cases where the number of documents would be too large to have human raters, it may be the only way to generate an outcome variable.
I know the subject matter was a bit dark, especially considering my last post was on “The Bachelor”, but I hope that this post was interesting and informative. I welcome any feedback in the comments, and I hope to have a new post out soon about Chicago traffic data!
The short answer is probably…if you’re female.
If you are a fan of the TV shows The Bachelor/The Bachelorette, then you know that a central theme on the shows is which of the contestants is there for “the right reason”, the right reason being finding love. However, considering the completely unnatural setup of the show, I have my doubts about how much of the final choice is really due to “finding love”. I recently found some data on the contestants on Wikipedia to start exploring this question: is it really love, or are there other factors at play in the final choice?
From Wiki, I was able to get the names of the contestants from some of the more recent seasons, their ages, hometowns, occupations, and when they were eliminated. I then pulled a data set from the Bureau of Labor Statistics which gave me average salaries for various occupations by state. I had to match the occupation titles from this data set to those given by the Bachelor/Bachelorette contestants, and I will admit sometimes things got a bit fuzzy (for example, one of the Bachelorettes is reportedly a “Mineral Coordinator”; the closest I could get from occupations available in the labor data was a “Geological and Petroleum Technician”). However most of the time the matches were pretty clear, and from there I was able to get an approximation of each Bachelor/Bachelorette’s salary.
Before I get too far into what I found regarding salaries, here is a diagram of the various job titles that the contestants hold by gender:
In the wordcloud above, an occupation is attributed to the gender in which it occurs more frequently. The magnitude of the gender discrepancy is depicted by the size of the word. For example, “No Income” is attributed to the female contestants, and shows up in relatively large font, indicating that there are far more females than males with no income (the reported occupations in this category tended to be students, however there were also a few odd occupations in this category like “Dog Lover” and “Free Spirit”). Female contestants tended to be models, dancers, hairstylists, teachers and nurses more often than males, while the male contestants tended to be athletes, sales reps, business owners, executives, and financial advisers.
Considering the job titles that occur most frequently for each gender, it won’t be too surprising that there is a gap in the salaries of the men vs. women. Below is a density plot depicting this difference:
This chart shows that the most common female salary is around 50k (the purple peak), while the most common male salary is closer to 60k (the green peak). Additionally we can see that the female contestants have more salaries under 65k (where the purple line is above the green), and the male contestants beat out the females in salaries over 65k (where the green line is above the purple).
Next I wanted to look into the interaction between salary and how far the contestants got on the show. I labeled the contestants as “Finalists” (either the winner or the runner-up), “Early Elimination” (eliminated in episode 1 or 2 of the season) or “Late Elimination” (everyone else after throwing out those who quit or were removed). I was expecting to see that contestants with higher salaries tended to get further; however, that was not actually the case for both genders.
It looks like the bachelorettes tend to choose men with higher salaries in the end, however the pattern in the chart above is not strong enough to be statistically significant. It appears that the bachelorettes do not discriminate during the eliminations, as the men who are eliminated early versus late have the same average salary. It’s only when comparing those who were eliminated early or late to the finalists that we see a salary difference (80k to 90k, although again, this difference was not significant and could be due to chance variations in the data).
However, I found the opposite effect for the bachelors. The bachelors tend to choose women with lower salaries; women who are eliminated early have significantly higher salaries (64k on average) when compared to women who are finalists (54k on average, p < .05), and the women who are eliminated late in the process fall in between.
Before concluding that salary impacts success on the show, I wanted to check whether the result could be confounded by another variable that I had access to, age. A bar chart of the contestants’ ages follows a very similar pattern to the chart above:
Looking at this chart, it does seem that age and salary could be confounded. While it appears that the bachelors prefer women with lower salaries, it could just be that they prefer younger women who are earlier in their careers and therefore may make less. The same goes for the bachelorettes: their apparent preference for men with higher salaries could just be a preference for men who are older and therefore more advanced in their careers. To check whether this variable could be a confound, I looked at the relationship between salary and age for the men and women.
For male contestants, age does seem to be correlated with salary (the older you are, the higher your salary), so the bachelorettes’ apparent preference for men with higher salaries could be confounded by this variable and could actually be a preference for men who are older. This makes it very difficult to say whether your salary would impact your success on the bachelorette if you were a male contestant, or whether age would be the more important factor. Put differently, if you were on the show competing against other contestants of your same age, it is unclear from this data whether salary would play a factor in who is selected to continue.
However, the regression line for the female contestants seems to show that there is no relationship between age and salary, indicating that the effect of the bachelors choosing women with lower salaries is independent of their preference for younger women. Therefore, if you were a female contestant on the bachelor competing with other women of your same age, you would be likely to stick around longer if you make less than the other contestants. The bachelors on the show may have the best intentions to find love with one of the contestants, but what the data seems to show is that “love” blossoms when the contestant is younger with a lower than average salary! Perhaps these bachelors should re-examine their “right reasons.”
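The age versus salary check described above boils down to a correlation and a regression slope. Here is a minimal sketch with invented ages and salaries (in thousands), so the numbers illustrate the idea only and are not the show’s data.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ols_slope(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Invented example data: salary rises with age in one group but not the other.
ages = [24, 26, 28, 30, 32]
men_salary = [60, 70, 75, 85, 95]
women_salary = [58, 52, 61, 50, 57]

print(round(pearson_r(ages, men_salary), 2), ols_slope(ages, men_salary))
print(round(pearson_r(ages, women_salary), 2), ols_slope(ages, women_salary))
```

A correlation near zero, as in the second group, is what would support treating the salary effect as separate from the age effect.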
*While this is all in good fun, there are quite a few caveats to this analysis:
- There are very few observations in each category, especially the finalists, which makes it difficult to generate legitimate statistical results from the data.
- There are many factors that I do not have access to, so while I may have been able to check for a correlation between salary and age, I was not able to check for a correlation with any other factors.
- The salary data that I used are averages for the contestants’ home states, not the contestants’ actual salaries, and many of the careers have huge salary ranges. For example, a model could be working a few odd jobs a month, or making multiple millions as a supermodel; an average does not convey where in the spectrum a contestant may fall.
For anyone who isn’t familiar with Kaggle, it is a website that provides data and modeling challenges from various organizations to data scientists who want to practice their skills, and occasionally win money. I recently participated in a Kaggle challenge sponsored by the African Soil Information Service entitled African Soil Property Prediction. The goal was to predict five soil properties (soil organic carbon, pH, calcium, phosphorus and sand content) based on diffuse reflectance infrared spectroscopy measurements along with a few spatial predictors for each soil sample. The spectroscopy measurements were discretized into 3578 points, and adding the 16 spatial predictors resulted in 3594 columns in the data. There were 1157 soil samples in the training set, and 727 in the testing set that was provided.
I like to have at least 10 samples per variable, so my first step was to reduce the number of columns to be more in line with the number of samples I had, ideally from 3594 to around 115 (if anyone else has thoughts on rules of thumb for sample:predictor ratios please share in the comments!). To do this, I used a Haar wavelet transform to clean up and reduce the size of the spectroscopy signal. Here is a visual of how the wavelet transform cleaned the signal (original signal on the left, cleaned signal on the right):
With each run of this transform, the signal decreased in size by half, so in the last frame the signal has been reduced to 111 data points from the original 3578, which is much more in line with the 10 samples per variable ratio I was aiming for. The wavelet transform reduced the size and cleaned out much of the noise, but most of the characteristics of the signal itself are still intact. I then fed those points into a few different algorithms using the caret package in R. Unfortunately none of the methods that I tried resulted in very good performance, and in the end I did not finish with a very impressive ranking.
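To show how the halving works, here is a minimal sketch of repeated Haar approximation steps in plain Python. It keeps only the low-pass (averaging) coefficients and drops a trailing unpaired sample when the length is odd; that odd-length handling is an assumption, and the post’s exact settings aren’t stated. The input is a synthetic sine wave standing in for a spectrum.

```python
import math

def haar_approximation(signal):
    """One level of the Haar transform, keeping only the approximation
    coefficients: each output is (a + b) / sqrt(2) for a neighboring pair.
    A trailing unpaired sample is dropped (assumption for odd lengths)."""
    pairs = len(signal) // 2
    return [(signal[2 * i] + signal[2 * i + 1]) / 2 ** 0.5 for i in range(pairs)]

def reduce_signal(signal, levels):
    """Apply the approximation step `levels` times, halving the length each time."""
    for _ in range(levels):
        signal = haar_approximation(signal)
    return signal

# Synthetic stand-in for one 3578-point spectrum (not real soil data).
raw = [math.sin(i / 50) for i in range(3578)]
reduced = reduce_signal(raw, 5)
print(len(reduced))  # 111
```

Under these assumptions, five halvings take the length through 1789, 894, 447 and 223 down to 111, which happens to match the figure quoted above.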
At the end of the contest, the winner is required to post his/her code. In this contest, I was surprised to see that the winner used this same methodology to pre-process his data considering how poorly it worked for me. However, the winner used this methodology in combination with other pre-processing methodologies and various different algorithms in a 50 or so algorithm ensemble model.
I have to applaud his efforts. It was certainly an impressive model, however I don’t necessarily feel that it should have been the winning solution. A 50 algorithm ensemble is not really a practical solution to the problem. Many of the contests on Kaggle are run for fun and for flexing analytical muscle, in which case more power to the person who can construct the most models in the least amount of time. However, for a contest in which the sponsor intends to use and hopefully implement the result, this method isn’t ideal.
For example, take the Netflix prize. This was a competition to improve upon the Netflix recommendation engine by 10%. The competition ran for three years before someone finally achieved the 10% improvement, and the winning team was awarded $1 million. However, the winning team’s algorithm was never implemented by Netflix, because the winning team created a 100 algorithm ensemble. Netflix ended up implementing only two of the algorithms that the team came up with. As far as the rest of the algorithms in the ensemble, Netflix made this statement in a blog post about the competition:
“…the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
Personally I would be in favor of some sort of complexity penalty that could be incorporated into the evaluation for contests in which the sponsor would like to actually implement the results. This way, the final performance measure wouldn’t come down to just model accuracy (which generally increases as you increase the number of algorithms in your ensemble), but instead it would be a measure of which algorithm (and pre-processing steps) can provide the most accuracy with the least complexity, making it more likely that it would be possible to implement in a production environment.
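As a purely hypothetical illustration of what such a penalty could look like, the sketch below adds a fixed cost per model in the ensemble to the prediction error. The penalty size, the entry names and the error values are all invented, and counting models is only one crude proxy for complexity.

```python
def penalized_score(error, n_models, penalty=0.001):
    """Lower is better: prediction error plus a fixed cost per model
    in the ensemble. The penalty value is an arbitrary illustration."""
    return error + penalty * n_models

# Invented competition entries (errors and model counts are made up).
entries = {
    "single_model": {"error": 0.500, "n_models": 1},
    "small_ensemble": {"error": 0.480, "n_models": 5},
    "large_ensemble": {"error": 0.470, "n_models": 50},
}

ranked = sorted(entries, key=lambda name: penalized_score(**entries[name]))
for name in ranked:
    print(name, round(penalized_score(**entries[name]), 3))
```

With these numbers the large ensemble has the lowest raw error but ranks last once the penalty is applied, while the small ensemble comes out on top.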
It would be a challenge to come up with a good way to measure complexity of a solution, and automate this measure so that scoring competition entries would be as simple as submitting your code. Any thoughts on this would be very welcome in the comments!
I’ve been working on a project presentation that utilized quite a few maps as visual aids, so I thought I would post about the different mapping tools I’ve found throughout this project and their advantages and disadvantages. Please leave me a comment if you know of others I should look into!
Probably my favorite of the mapping options that I have come across is Tableau. Tableau automatically recognizes several different location tags in a data set like zip codes, states, cities, etc., draws the shapes around those regions, and allows the user to display attributes of those regions by varying the fill color of the shapes to reflect a continuous metric (gradients) or a categorical metric (various color palettes). Alternatively you could use a location marker and vary the size of that marker to represent your data (useful if you have addresses rather than regions). One drawback of this tool is that users who do not purchase a license have to use Tableau Public, which requires that any data you put into the program becomes publicly available for download by any other Tableau Public user, so if you have proprietary or sensitive data then Tableau Public would not be a good option. Another limitation with this program is that it can get complicated to map multiple different data types on the same map, for example a map that has both specific address markers and region polygons.
Map created with Tableau displaying a categorical attribute at the zip code level (see this post for explanation of this map):
Map created with Tableau displaying a continuous attribute at the zip code level (see this post for explanation of this map):
Google Fusion Tables are also really useful for mapping data. This tool can geocode lists of addresses even if your data is a bit messy. One difficulty with this tool is that it does not have the repository of region boundaries that Tableau has. If you want to draw regions using Google Fusion Tables, you have to find (or create) and upload a kml file that contains the boundaries for each polygon you want to draw. Luckily it isn’t too difficult to find kml files for various geographic regions just by searching the web (kml for the Chicago wards is available through the Chicago data portal, kml for US zip codes and US states I have found through other various sources). One advantage of this tool over Tableau is that it is very simple to map both regions and points on the same map.
Map created with Google Fusion Tables with a gradient representing the per capita rat complaints for each Chicago ward (dark red indicates relatively more complaints, dark green indicates relatively fewer complaints). This map also includes a marker for the Chicago Streets and Sanitation Office:
My Maps allows you to create and save maps by adding markers and various free form lines and shapes. This tool will also easily let you change marker style and color (and there are many options), so for example if you wanted to include a map on a wedding website, you can create a map with a bed symbol for the hotel, a heart symbol for the ceremony location, and a music symbol for the reception, and draw lines connecting all of these events.
The last tool I will mention is Radius Around a Point Maps, which allows you to plug in a location and a radius around the location and the tool will draw that radius. You can either do this by hand if you have a few, or you can upload a .csv with the latitude and longitude of each center and the distance of the radii you would like drawn around each. It is pretty limited in other functionality, but this was the best free tool I could find for drawing radii and it is definitely user friendly.
Map created using the radius around a point mapping tool showing a 50mi radius around each US State capital city:
After a brief tangent to explore the Twitter data, I’m back to the city of Chicago neighborhood data. I was originally planning to use other data sets from the Chicago data portal to profile the neighborhoods, but instead I found a few other helpful data sources:
Census Fact Finder Download Center – This site has a lot of demographic data from the US Census Bureau, and many of the data sets can be downloaded at varying levels of detail, which is super helpful when you are specifically looking for zip code.
Federal Election Commission – I wanted to look at the political leanings of the various Chicago zip codes but I couldn’t find election result data at the zip code level, so this data set was a nice substitute. It contains all individual contributions to any political committee, with the individual’s zip code included.
I created a master data set with data from the above sources, including things like age, race, income, political leaning, employment, % of population in various industries, commuting methods, etc. I then used k-means clustering to create groupings of neighborhoods that are similar demographically and this is what came out:
see interactive viz here
Cluster 1: Not pictured, tiny zip codes with sparse data
Cluster 2: Orange – Sporadic
Cluster 3: Yellow – South and West
Cluster 4: Green – Surrounding Downtown
Cluster 5: Blue – Downtown
Cluster 6: Pink – Sporadic
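For readers new to k-means, here is a minimal sketch of the algorithm in plain Python. It uses a simple deterministic farthest-point initialization so the result is reproducible (an assumption for this sketch; real implementations usually use random restarts or k-means++), and the two-feature “zip code” data is invented.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and moving
    each centroid to the mean of its members, until assignments stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Invented features per zip code: [share aged 25-34, share walking to work].
zips = [
    [0.10, 0.05], [0.12, 0.07], [0.11, 0.04],
    [0.40, 0.35], [0.42, 0.38], [0.38, 0.33],
    [0.20, 0.60], [0.22, 0.58],
]
labels, centroids = kmeans(zips, k=3)
print(labels)
```

With many demographic variables on different scales, the columns would normally be standardized before clustering so that no single variable dominates the distance calculation.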
Below are some charts displaying the demographic breakdowns of the various clusters:
Cluster 5 has the largest percentage of 25-34 year olds and the lowest percentage of children under age 15. This makes sense considering these are largely the downtown zip codes. Cluster 4 follows this trend to a lesser degree, and clusters 2, 3 and 6 are all very similar in their age demographics with higher percentages of children and lower percentages of people aged 25-34.
Looking at the race breakdown of each cluster, we can see that clusters 2 and 6 are the most diverse, while clusters 3, 4 and 5 tend to be dominated by one race.
Cluster 5 is the wealthiest and also has the largest difference between mean and median salaries, which means that the wealth in these zip codes is not normally distributed. Instead this distribution has a long tail to the right, implying that there are some people in these zip codes that are making considerably more than most and pulling the mean up.
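The mean versus median point can be seen with a handful of made-up incomes (in thousands); the values below are invented and only illustrate how a few very high earners pull the mean upward.

```python
from statistics import mean, median

# Invented incomes for one zip code: most are near 60k, two are far higher.
incomes = [45, 50, 55, 58, 60, 62, 65, 70, 250, 400]

print(mean(incomes))    # 111.5
print(median(incomes))  # 61.0
```

The two largest values nearly double the mean, while the median barely notices them, which is why a big gap between the two suggests a long right tail.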
Cluster 5 has the largest percentage of people in the labor force, which given the age demographic of this cluster is not too surprising. Cluster 3 has the lowest, but this cluster also had the largest percentage of the population under 19, so also not too surprising.
Cluster 5 contains the largest walking population by far, which isn’t too surprising considering these people live and most likely work downtown. Clusters 2 and 6 are very similar, mostly driving alone, and clusters 3 and 4 are similar in their transportation patterns as well, still mostly driving alone but with larger percentages of people choosing to take public transportation than clusters 2 and 6.
This chart displays the percentage of the population in each cluster that is employed in various industries. Clusters 2 and 3 are nearly identical in their industry breakdown. Cluster 4 is a middle ground between clusters 2 and 3 and cluster 5, with fewer people in manufacturing, retail and transportation than clusters 2 and 3 (though not as few as cluster 5) and more people in finance and professional industries (though not as many as cluster 5). Cluster 5 has the largest percentage of the population in the professional and finance realms and the fewest in construction, manufacturing, retail, transportation and entertainment. Cluster 6 has the most equal distribution between industries.
This chart displays the percentage of each cluster that is republican leaning (defined by the percent of contributions to political committees that went to republican committees) and the wage gap (defined by the male to female median salary ratio). Cluster 5 is both the most republican leaning and has the largest wage gap, with a male to female median salary ratio of 131.8% (in other words, males in the work force make about 31.8% more than females in the work force). Generally the wage gap decreases as the republican population decreases, however cluster 6 is an outlier to this pattern with the lowest wage gap and the second highest percent of republicans (tied with cluster 4).
I know I mentioned in the last post that I was planning to keep working on the Chicago data, but I got distracted by a cool bit of code that April Chen sent to me, which pulls in posts from Twitter for text analysis in R. You can enter a specific search term, or pull in a specific user’s past posts. It seems to be a bit limited in that you can only receive up to 199 tweets per request, and it also seems to only allow you to go back in time by one week. I was able to get decent sample sizes of ~1000 tweets or so by requesting a sample from each day of the past week for each search term.
Here are a few word clouds I created which show which other words are commonly used with certain hashtags:
I also looked at a few comparison clouds, which show the differences in how often words are used between searches.
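The idea behind a comparison cloud can be sketched as a difference in word rates between two sets of tweets. This is a simplified Python stand-in for what the plotting function computes, not its actual code, and the tweets are invented.

```python
from collections import Counter

def word_rates(tweets):
    """Share of all words in `tweets` accounted for by each word."""
    counts = Counter(word for t in tweets for word in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def rate_differences(tweets_a, tweets_b):
    """Positive values mean a word is relatively more common in tweets_a,
    negative values mean it is relatively more common in tweets_b."""
    ra, rb = word_rates(tweets_a), word_rates(tweets_b)
    return {w: ra.get(w, 0.0) - rb.get(w, 0.0) for w in set(ra) | set(rb)}

# Invented example tweets for two search terms.
search_a = ["freedom taxes vote", "taxes budget"]
search_b = ["vote rights", "rights health vote"]

diffs = rate_differences(search_a, search_b)
for word, diff in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(diff, 2))
```

Words at the top of the list would be drawn large on one side of the cloud and words at the bottom large on the other, while words used equally often in both searches would barely show up.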
Here is #liberal vs. #conservative:
And here is a cloud showing the differences between the tweets from the Progressive Insurance Corporate twitter account and the customers who tweet something @Progressive:
This cloud shows that Flo is a popular term among customers, which at first glance would lead me to believe that customers love Flo. However, the word “bitch” also shows up in the cloud. Could these be related?
The chart above displays the top 15 words in the cloud, and the connections between the words illustrate which words are correlated to each other. Indeed, customers think Flo is a bitch 😦