Content and Source of Data

This dataset is rooted in a 10-question questionnaire, issued by the US Census Bureau, that every single American citizen should have answered. The US Census Bureau has organized these answers in several ways and hosts them in several formats here. We are using the 2012-2016 Detailed Tables, Block Groups, California dataset, which is in a geodatabase format. The geodatabase format holds information about both the geography and the metadata of all the block groups in a specific state. A block group is a collection of several blocks: a small, geographically defined area, typically with a population of around 1000 people. Unfortunately, the geographic data was omitted when we exported the metadata to CSV files.

The US census questionnaire asks for sex, age, gender, annual income, civil status, education, employment status, and a couple more questions. The US Census Bureau has then restructured these answers into anonymous features describing the averages of some answers and the counts of people fitting certain characteristics. With over 7500 features there are a lot of variables to consider, and we want you to identify interesting correlations within the dataset that you think could be valuable to the global community. Feel free to go outside this dataset to find inspiration and support for your hypotheses.

Objectives

The goal of this challenge is to look for interesting correlations within the dataset. With features describing everything from average income to education to age distribution and living arrangements, there are a lot of opportunities. We want you to find insights that you think could be used to improve general living conditions in the areas we are investigating. The challenge is therefore set up like this:

  1. Pick one or multiple variables (y) from the dataset that you think would be valuable to predict.
  2. Create or train a model to predict said variables using either the entire dataset or subsets of the dataset (X) that you are given.
  3. Try to identify causal relationships between the different variables.
  4. Make sure to explain why you chose both your X’s and y’s.
  5. Present your findings in an interesting manner.
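
Steps 1 and 2 could be sketched roughly as below. This is only an illustration under stated assumptions: the data is synthetic, every column name is hypothetical, and a plain least-squares fit stands in for whatever model you actually choose.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a merged subset; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "share_bachelor_degree": rng.uniform(0.05, 0.7, n),
    "share_unemployed": rng.uniform(0.02, 0.25, n),
    "mean_commute_minutes": rng.uniform(10, 60, n),
})
# Made-up target so the example has a known relationship to recover.
data["per_capita_income"] = (
    30000
    + 50000 * data["share_bachelor_degree"]
    - 100000 * data["share_unemployed"]
    + rng.normal(0, 2000, n)
)

# Step 1: choose y, then rank candidate features by absolute correlation with it.
target = "per_capita_income"
correlations = data.corr()[target].drop(target)
ranked = correlations.abs().sort_values(ascending=False)

# Step 2: fit an ordinary least-squares baseline on the two strongest features.
features = list(ranked.index[:2])
train, test = data.iloc[:150], data.iloc[150:]
X_train = np.column_stack([np.ones(len(train)), train[features].to_numpy()])
X_test = np.column_stack([np.ones(len(test)), test[features].to_numpy()])
coef, *_ = np.linalg.lstsq(X_train, train[target].to_numpy(), rcond=None)

# Evaluate on the held-out rows with the coefficient of determination (R^2).
pred = X_test @ coef
y_test = test[target].to_numpy()
r2 = 1 - ((y_test - pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
print(features, round(r2, 3))
```

Note that a correlation ranking and a regression fit only show association. Step 3 still needs a separate argument for why a relationship might be causal.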

Data Overview

Every dataset has two identification tags, GEOID and OBJECTID. The tags are in separate formats but convey the same information, so you can use either one to relate data between datasets.
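
Relating two subsets through a shared tag could look like the sketch below. The two small frames, their value columns, and the GEOID values are made-up stand-ins for subsets loaded with pd.read_csv; only the GEOID column name comes from the description above.

```python
import pandas as pd

# Hypothetical stand-ins for two subsets; in practice these come from pd.read_csv.
counts = pd.DataFrame({
    "GEOID": ["A1", "A2", "A3"],
    "total_population": [1200, 950, 1430],
})
income = pd.DataFrame({
    "GEOID": ["A3", "A1", "A2"],
    "per_capita_income": [31000, 45000, 27500],
})

# Inner join on the shared identifier; validate raises if a GEOID is duplicated.
merged = counts.merge(income, on="GEOID", how="inner", validate="one_to_one")
print(merged)
```

If both frames also carry OBJECTID, pandas will suffix the duplicate columns with _x and _y, so it may be simpler to drop that column from one frame before merging.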

The rest of the features are typically integers signifying the count of people from each block group who match the respective feature. However, there are exceptions to this, so you should read the long variable names carefully.

One feature could be “PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2016 INFLATION-ADJUSTED DOLLARS): Total: Total population – (Estimate)”. This would also have a corresponding variable with the same name, just ending in “(Margin of Error)”. As the name suggests, this gives you the margin of error for that specific feature. For every actual feature, there are two variables: one estimate and one margin of error. (It might be an idea to drop all margin of error variables before you try to fit your models.)
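
Dropping the margin of error variables might look like the following sketch. It assumes the columns have already been renamed to their long names, and the example frame with its shortened column names is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with long column names following the pattern described above.
df = pd.DataFrame({
    "GEOID": ["A1", "A2"],
    "PER CAPITA INCOME: Total population - (Estimate)": [45000, 27500],
    "PER CAPITA INCOME: Total population - (Margin of Error)": [3100, 2800],
})

# Keep every column whose name does not end in "(Margin of Error)".
is_moe = df.columns.str.endswith("(Margin of Error)")
estimates_only = df.loc[:, ~is_moe]
print(list(estimates_only.columns))
```

Matching on the suffix rather than on column position avoids relying on estimates and margins of error alternating in a fixed order.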

The data is split into 25 subsets, each describing different socioeconomic and geographical characteristics of the block groups.

Name                                  Size (columns x rows)  Description
BG_METADATA                           3 x 7730               Dataset mapping short variable names to long variable names
TOTAL DATASET                         7730 x 23123           Combined data of all the subsets listed below
X00_COUNTS                            6 x 23123              Total population and number of households in each block group
X01_AGE_AND_SEX                       162 x 23123            Population distribution with regards to age and sex
X02_RACE                              72 x 23123             Population distribution with regards to race
X03_HISPANIC_OR_LATINO                50 x 23123             Population distribution with regards to Hispanic or Latino origin
X07_MIGRATION                         160 x 23123            Population distribution with regards to geographical mobility
X08_COMMUTING                         580 x 23123            Various aspects of commuting
X09_CHILDREN_HOUSEHOLD_RELATIONSHIPS  232 x 23123            Population distribution based on various aspects of internal relations in households
X11_HOUSEHOLD_FAMILY_SUBFAMILIES      630 x 23123            More internal household relations; see for yourselves
X12_MARITIAL_STATUS_AND_HISTORY       40 x 23123             Population distribution with regards to marital status and history
X14_SCHOOL_ENROLLMENT                 538 x 23123            Population distribution with regards to how many are enrolled in school
X15_EDUCATIONAL_ATTAINMENT            352 x 23123            Population distribution with regards to education level of population
X16_LANGUAGE_SPOKEN_AT_HOME           164 x 23123            Population distribution with regards to language spoken at home
X17_POVERTY                           298 x 23123            Population distribution with regards to poverty levels
X19_INCOME                            410 x 23123            Population distribution with regards to income levels, plus average income per capita and per household
X20_EARNINGS                          108 x 23123            Population distribution with regards to earnings
X21_VETERAN_STATUS                    174 x 23123            Population distribution with regards to veteran status
X22_FOOD_STAMPS                       16 x 23123             Population distribution with regards to food stamps
X23_EMPLOYMENT_STATUS                 664 x 23123            Population distribution with regards to employment status
X24_INDUSTRY_OCCUPATION               680 x 23123            Population distribution with regards to industry of occupation
X27_HEALTH_INSURANCE                  134 x 23123            Population distribution with regards to health insurance
X99_IMPUTATION                        568 x 23123            Population distribution with regards to imputation

The data has been cleaned by the US Census Bureau to some extent but still needs some cleaning. Also be aware that half of the variables represent margins of error.

Quick tip: use BG_METADATA to convert your column names from short strings to long strings, like this:

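import pandas as pd

data_dir = "./"  # assumption: folder containing the CSV files, adjust as needed
label_df = pd.read_csv(data_dir + "BG_METADATA_2016.csv")
# Map short names (second column) to long names (third column), by position
columnnames = dict(zip(label_df.iloc[:, 1], label_df.iloc[:, 2]))
df = pd.read_csv(data_dir + "X15_EDUCATIONAL_ATTAINMENT.csv")
df.rename(columns=columnnames, inplace=True)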

Deliverables

The assessment of this challenge is based on your overall performance. Therefore, we consider both your approach and your results. We expect you to deliver a model capable of predicting y based on X.

You should output your findings and results in an appropriate format and upload them, together with all the code used, to the designated folder, and then zip it. Methodology and approach will be extremely important, so make sure to describe why you chose to approach the problem the way you did. Please use good variable names.

Tips and tricks

As you will present your findings to a panel of judges, we encourage you to present them from a different angle than just “We are trying to win this challenge”. Find a narrative and try to answer questions like these: Why did you choose the challenge at all? Why are your findings of significance? What can you do to leverage your findings? Is there any way you can use your findings to solve a problem?

Questions like these help you show that you have done more than just solve a simple regression problem, and they leave the judges with a lasting impression.