This dataset is rooted in a 10-question questionnaire, issued by the US Census Bureau, that every American citizen should have answered. The answers have been organized in several ways by the US Census Bureau and are hosted in several formats here. We are using the 2012-2016 detailed tables - Block Groups - California dataset, which is in a geodatabase format. The geodatabase format contains information about both the geography and the metadata of all the block groups in a specific state. A block group is a collection of several blocks and forms a small geographically defined area, typically with a population of around 1,000 people. Unfortunately, the geographic data was omitted when we exported the metadata over to CSV files.
The US census questionnaire asks for sex, age, annual income, civil status, education, employment status, and a couple of other things. The US Census Bureau has then restructured these answers into anonymous features describing the averages of some answers and the counts of people fitting certain characteristics. With over 7,500 features there are a lot of variables to consider, and we want you to identify interesting correlations within the dataset that you think could be valuable to the global community. Feel free to go outside this dataset to find inspiration and support your hypothesis.
The goal of this challenge is to look for interesting correlations within the dataset. With features describing everything from average income to education, age distribution, and living arrangements, there are plenty of opportunities. We want you guys to find insights that you think could be used to improve general living conditions in the areas we are investigating. The challenge is therefore set up like this.
Every dataset has two identification tags, GEOID and OBJECTID. The tags are in different formats but convey the same information, so you can use either one to relate data between datasets.
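As a sketch of what joining on GEOID looks like in practice, the snippet below merges two of the subsets listed further down. The file names follow the naming pattern used elsewhere in this document, and `data_dir` is a placeholder for wherever you keep the exported CSVs.

```python
import pandas as pd

data_dir = "./data/"  # placeholder: point this at the exported CSV files

# Join two subsets on the shared GEOID key; an inner join keeps only the
# block groups present in both files.
income = pd.read_csv(data_dir + "X19_INCOME.csv")
education = pd.read_csv(data_dir + "X15_EDUCATIONAL_ATTAINMENT.csv")
merged = income.merge(education, on="GEOID", how="inner")
```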
The rest of the features are typically integers signifying the count of people in each block group who match the respective feature. However, there are exceptions to this, so you should read the long variable names carefully.
One feature could be “PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2016 INFLATION-ADJUSTED DOLLARS): Total: Total population – (Estimate)”. This would have a corresponding variable with the same name, just ending in “(Margin of Error)”. As the name suggests, this gives you the margin of error for that specific feature. For every actual feature there are thus two variables: one estimate and one margin of error. (It might be an idea to drop all margin-of-error variables before you try to fit your models.)
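For instance, once the columns carry their long names (see the renaming tip further down), the margin-of-error columns can be filtered out by their suffix. A minimal sketch, assuming every such column name ends in “(Margin of Error)” as described above:

```python
import pandas as pd

df = pd.read_csv(data_dir + "X19_INCOME.csv")  # any subset; data_dir as above
# ... rename the short column names to long names first (see the BG_METADATA tip) ...

# Keep only the estimate columns by dropping everything flagged as a margin of error.
estimate_cols = [c for c in df.columns if not c.endswith("(Margin of Error)")]
df = df[estimate_cols]
```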
The data is split into 25 subsets, each describing different socioeconomic and geographical characteristics of the block groups.
Name | Size | Description |
---|---|---|
BG_METADATA | 3 x 7730 | Dataset mapping short variable names to long variable names |
TOTAL DATASET | 7730 x 23123 | This dataset combines the data from all subsets described below |
X00_COUNTS | 6 x 23123 | Total population and number of households in each block group |
X01_AGE_AND_SEX | 162 x 23123 | Population distribution with regards to age and sex |
X02_RACE | 72 x 23123 | Population distribution with regards to race |
X03_HISPANIC_OR_LATINO | 50 x 23123 | Population distribution with regards to Hispanic or Latino origin |
X07_MIGRATION | 160 x 23123 | Population distribution with regards to geographical mobility |
X08_COMMUTING | 580 x 23123 | Describing various aspects of commuting |
X09_CHILDREN_HOUSEHOLD_RELATIONSHIPS | 232 x 23123 | Population distribution based on various aspects of internal relations in households |
X11_HOUSEHOLD_FAMILY_SUBFAMILIES | 630 x 23123 | More internal relations, see for yourselves |
X12_MARITIAL_STATUS_AND_HISTORY | 40 x 23123 | Population distribution with regards to marital status and history |
X14_SCHOOL_ENROLLMENT | 538 x 23123 | Population distribution with regards to how many are enrolled in school |
X15_EDUCATIONAL_ATTAINMENT | 352 x 23123 | Population distribution with regards to education level of population |
X16_LANGUAGE_SPOKEN_AT_HOME | 164 x 23123 | Population distribution with regards to language spoken at home |
X17_POVERTY | 298 x 23123 | Population distribution with regards to poverty levels |
X19_INCOME | 410 x 23123 | Population distribution with regards to income levels, also average income per capita and household. |
X20_EARNINGS | 108 x 23123 | Population distribution with regards to earnings |
X21_VETERAN_STATUS | 174 x 23123 | Population distribution with regards to veteran status |
X22_FOOD_STAMPS | 16 x 23123 | Population distribution with regards to food stamps |
X23_EMPLOYMENT_STATUS | 664 x 23123 | Population distribution with regards to employment status |
X24_INDUSTRY_OCCUPATION | 680 x 23123 | Population distribution with regards to industry of occupation |
X27_HEALTH_INSURANCE | 134 x 23123 | Population distribution with regards to health insurance |
X99_IMPUTATION | 568 x 23123 | Population distribution with regards to imputation |
The data has been cleaned by the US Census Bureau to some extent but still needs some cleaning. Also be aware that half of the variables represent margins of error.
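What “some cleaning” means will depend on the subset, but a couple of generic checks are usually worthwhile. A minimal sketch, assuming `df` is one of the subsets loaded as in the snippets above:

```python
# Inspect which columns have the most missing values.
print(df.isna().sum().sort_values(ascending=False).head())

# Columns that never vary carry no signal and can safely be dropped.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
```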
Quick tip: use BG_METADATA to convert your column names from short strings to long strings, like this:
```python
import pandas as pd

# Columns 1 and 2 of BG_METADATA hold the short and long variable names,
# respectively; build a dict mapping one to the other.
label_df = pd.read_csv(data_dir + "BG_METADATA_2016.csv")
column_names = dict(zip(label_df.iloc[:, 1], label_df.iloc[:, 2]))

df = pd.read_csv(data_dir + "X15_EDUCATIONAL_ATTAINMENT.csv")
df.rename(columns=column_names, inplace=True)
```
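With the long names in place (and the margins of error dropped), a quick way to surface candidate correlations is to rank the strongest pairwise correlations. A minimal sketch, assuming `df` now holds only numeric estimate columns:

```python
import numpy as np

# Compute pairwise correlations, mask the diagonal so self-correlations are
# ignored, and print the ten strongest relationships.
corr = df.corr(numeric_only=True)
pairs = corr.where(~np.eye(len(corr), dtype=bool)).abs().unstack().dropna()
print(pairs.sort_values(ascending=False).head(10))
```

Note that each pair appears twice in the ranking, once as (a, b) and once as (b, a), so deduplicate before reporting.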
The assessment of this challenge is based on your overall performance; we consider both your approach and your results. We expect you to deliver a model capable of predicting y based on X.
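As an illustration of the expected deliverable, here is a minimal end-to-end regression sketch; the synthetic data, model choice, and split parameters are placeholders, not part of the challenge:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in practice X would be block-group features and y whatever
# variable you choose to predict.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(f"Held-out R^2: {model.score(X_test, y_test):.3f}")
```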
Export your findings and results in an appropriate format, place them together with all the code used in the designated folder, and zip it before uploading. Methodology and approach will be extremely important, so make sure to describe why you chose to approach the problem the way you did. Please use good variable names.
As you will present your findings to a panel of judges, we encourage you to present them from a different angle than just “we are trying to win this challenge”. Find a narrative and try to answer questions like these: Why did you choose this challenge at all? Why are your findings significant? What can you do to leverage your findings? Is there any way to use your findings to solve a problem?
Questions like these help you show that you have done more than just solve a simple regression problem, and they leave the judges with a lasting impression.
There are no restrictions on what you can use, but we would prefer the code to be written in either R or Python, simply because these are the languages we can provide the most assistance with.
Personally, I prefer to work with pandas, numpy, sklearn, and matplotlib, either in Spyder or in Jupyter notebooks running on Python 3.6.
If you decide to use more advanced models you could try PyTorch, TensorFlow, or Keras, but these will probably prove to be overkill for most situations.