Citadel Datathon Team 11: An exploration of the effects of human activity on water pollution

Executive Summary


When working with the chemicals dataset, what struck us was that we had no data for certain chemicals in certain counties and, even worse, that entire states had no water pollution data at all. A possible reason is that some counties lack the necessary resources, or the motivation, to collect water quality data. This is extremely concerning given recent events in Flint, where more frequent water monitoring could have mitigated the disaster. To this end, we decided that it would be meaningful to see whether we could predict water quality in other regions by extrapolating from the data that we did have.

Technical Models

We tried to predict pollutant values from explanatory factors of industry size, with counties as observations. However, this was a poor fit for the data, and a better model should be implemented. The poor fit may be due to distributional issues in our dataset that violate the model's assumptions.

We investigate the exposure of populations to contamination (for above-average levels of contaminant concentration) and the distribution of chemicals across the USA. We then look for a possible explanation for these levels of contamination by investigating industrial and political factors. Visually, we identify Connecticut, California and Florida as our most heavily polluted states. We build a linear model on the values of contamination by arsenic and nitrates, and classifiers on whether a county is “contaminated” by these chemicals or not. In particular, we tried ANOVA, random forest, XGBoost, logistic regression and LDA.


In addition to the given dataset, we used a dataset detailing which major political party held each state governorship to try to explain further the changes in water quality. Two conclusions were drawn from this. First, Republican governorships tend to lead to larger variances in year-on-year improvements in water quality. This can be interpreted as a product of their laissez-faire approach to environmental policy. Second, the effect of political affiliation on water quality differed vastly depending on the state investigated. For example, a linear regression gave a much more strongly significant result in Connecticut than in California. A large part of the contamination in California comes from arsenic released by wildfires, which stay constant whatever the administration. The signals from political allegiance were found to be no match for the force of mother nature.

1. Exploratory data analysis

1.1 Exposure to pollution

Let \(C = \{0,1\}\) be the states of contamination, where 1 denotes above-mean levels of contamination and 0 otherwise. We first identify which states are most polluted. Let \(M_{county,state}\) be the number of people in a given county and state who are exposed to level 1 of contamination (above mean), and let \(N_{county,state}\) be the total population of that county, across both states of contamination. The geomap below illustrates the population exposed to above-mean levels of water contamination over the years 1999 - 2016.

Since the above geomap is not weighted, it is more insightful to consider the ratio

\[ R_{county} = \frac{M_{county, state}}{N_{county, state}} , \mbox{ and } R_{state} = \frac{M_{\cdot, state}}{N_{\cdot, state}}, \] where the \(\cdot\) subscript denotes summation over the counties of a state, i.e. \(M_{\cdot, state} = \sum_{county} M_{county, state}\), and likewise for \(N_{\cdot, state}\). The plot below shows these ratios.
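
As a minimal sketch of how these two ratios could be computed, the snippet below aggregates exposed and total populations per county and per state. The record layout, county names and numbers are made up for illustration and are not taken from the datathon dataset.

```python
from collections import defaultdict

# Hypothetical records: (state, county, exposed population M, total population N).
# All values are invented purely for illustration.
records = [
    ("CA", "Alpha", 50_000, 200_000),
    ("CA", "Beta", 10_000, 100_000),
    ("FL", "Gamma", 5_000, 50_000),
]


def county_ratios(rows):
    """R_county = M / N for every (state, county) pair."""
    return {(state, county): m / n for state, county, m, n in rows}


def state_ratios(rows):
    """R_state = (sum of M over counties) / (sum of N over counties)."""
    m_total = defaultdict(int)
    n_total = defaultdict(int)
    for state, _county, m, n in rows:
        m_total[state] += m
        n_total[state] += n
    return {state: m_total[state] / n_total[state] for state in m_total}


r_county = county_ratios(records)
r_state = state_ratios(records)
print(r_county)
print(r_state)
```

Note that the state ratio is a population-weighted quantity: it sums the counts before dividing, rather than averaging the county ratios.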

1.2 Exposed to types of chemicals

Once we have identified our most polluted states, the natural questions concern the types of pollution and their sources. Here we generate three graphs that show considerable variation in the distribution of chemicals across our polluted states, in this case Florida, California and Connecticut.

Forest fires release large amounts of arsenic into the environment, which may explain the high levels in California. A large share of Connecticut’s population is also exposed to above-mean levels of arsenic. Interestingly, Florida shows a contrasting profile, with low exposure to above-mean levels of arsenic and uranium, and high exposure to haloacetic acids and trihalomethanes.

1.3 Exposure to industry

According to the industry occupation data, education and health is the largest industry in terms of employment. The three states of concern follow the national trend on this and the other industries. Real estate, scientific and waste management, and arts and recreation are the remaining three industries that pass the 10% threshold in the national average, and the three states follow a similar distribution. Notably, retail trade accounts for a relatively high proportion of employment in Florida, and manufacturing for a significantly low proportion. Agriculture plays a small role in California, but its share is large compared with the other states and the national average.

1.4 Exposure to earnings

Connecticut seems to have higher wages than the national average, and much higher wages in some industries, including finance and management; it outperforms California in nearly every industry. Florida has the lowest wages of the three states in all industries and is similar to the national average. Industries with significant environmental impact, e.g. construction, manufacturing and transport, have a similar inequality structure across states: Connecticut's wages are higher, sometimes much higher, than California's, whilst Florida's are only 60 to 80 percent of Connecticut's.

2. Modelling

2.1 Values of arsenic and nitrates based on size of industry

Linear Models:

We now focus our attention on evaluating whether the size of an industry measured in headcount is explanatory for the value of pollutants in the local water system. Focusing our attention on the previously identified most polluted states, we model the value of each chemical in a county as the response variable, with headcount employed in relevant industries as our explanatory factors.

In particular, we model

\[ y_{(i,j)} = \alpha_{0} + \beta_{1} x_{1}^{j} + \beta_{2} x_{2}^{j} +...+ \beta_{n} x_{n}^{j} + \epsilon_{j} \] where \(y_{(i,j)}\) = value of pollutant \(i\) in county \(j\), \(x_{k}^{j}\) = size of industry \(k\) in county \(j\), and \(n\) = number of relevant industries out of 14 in total. Example results from an instance of this model with nitrates as the response variable are as follows:

Cautious of the results above, we check our assumptions using a Q-Q plot and a plot of the residuals against the fitted values. In these graphs we see clear evidence that the assumptions required for such a model are not met: the tails deviate from a Gaussian distribution, and the residuals-versus-fitted plot shows non-random structure. As a result, we deem this model unreliable and unsuitable for this data.
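
To illustrate the Q-Q check described above, the following sketch computes the points of a normal Q-Q plot from a list of residuals using only the Python standard library. The residuals and the plotting positions \((k - 0.5)/n\) are illustrative assumptions; this is not the code that produced our plots.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    # Standardize and sort the residuals to obtain the sample quantiles.
    sample = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (k - 0.5) / n avoid infinite quantiles at 0 and 1.
    theoretical = [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, sample))


# Made-up residuals with one large positive value (a heavy right tail).
residuals = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.7, 1.0, 6.0]
points = qq_points(residuals)
for theo, samp in points:
    print(f"{theo:6.2f}  {samp:6.2f}")
```

If the residuals were Gaussian, the pairs would lie close to the line \(y = x\); in this toy example the largest sample quantile lies well above its theoretical counterpart, which is the kind of tail deviation we describe.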

To improve, we could attempt to use gamma distributions to model the data, rather than a standard linear model. The intercept could also be treated as a function of water usage. We could also include droughts as an explanatory factor, as intuition would tell us that it would impact concentrations of pollutant.
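
One possible way to write down the gamma alternative is as a generalized linear model. The log link below is an assumption made for illustration, since no link function has been chosen here, and the formulation requires strictly positive pollutant values:

\[ y_{(i,j)} \sim \mathrm{Gamma}(\mu_{j}, \nu), \qquad \log \mu_{j} = \alpha_{0} + \beta_{1} x_{1}^{j} + \beta_{2} x_{2}^{j} +...+ \beta_{n} x_{n}^{j}, \]

where \(\mu_{j}\) is the mean of \(y_{(i,j)}\) and \(\nu\) is a shape parameter, so that the variance \(\mu_{j}^{2}/\nu\) grows with the mean instead of being constant as in the standard linear model.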

2.2 Pollution by arsenic based on size of industry

Suppose \(Y_{i}^{j}\in \{0,1\}\) is the response for county \(i\), equal to 1 if we observe \(\frac{M_{county}^{j}}{N_{county}^{j}} > 0.01\) for contaminant \(j\) in that county, and 0 otherwise. Using the same design matrix as in our previous linear regression, we can formulate a logistic regression model

\[ Y_{i} \sim Bernoulli(p_{i}), \] where \(logit\{p_{i}\} = \pmb{x}_{i}^{T}\beta\), with features \(\pmb{x}_{i}\in\mathbb{R}^{p}\) and coefficients \(\beta\). However, we obtain extremely low \(R^{2}\) values and the model fit appears to be poor. We suspect that this is because iteratively reweighted least squares (IWLS) failed to converge.
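
For readers unfamiliar with IWLS, the sketch below fits a logistic regression with an intercept and a single feature by Newton's method, which is what IWLS amounts to for this model, and reports whether the iteration converged. It is a simplified illustration under our own assumptions (one feature, made-up data, hypothetical function names), not the fitting routine used for the results above.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_irls(x, y, max_iter=25, tol=1e-8):
    """Fit logit(p) = b0 + b1 * x by IWLS; return (b0, b1, converged)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]  # IWLS weights
        # Score vector X^T (y - p).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix X^T W X (2 x 2, symmetric).
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            return b0, b1, False  # weights have collapsed: no stable update
        # Newton step, solving the 2 x 2 system by Cramer's rule.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1, True
    return b0, b1, False


# Overlapping classes: a finite maximum-likelihood estimate exists.
b0, b1, converged = fit_logistic_irls([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])
# Perfectly separated classes: the estimates diverge and IWLS does not converge.
_, _, sep_converged = fit_logistic_irls([0, 1, 2, 3], [0, 0, 1, 1])
print(b0, b1, converged, sep_converged)
```

Perfect or near-perfect separation of the two classes is one common reason for IWLS to fail in this way, so it may be worth checking for in the county data.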

We can also find which variables are important using a random forest. Looking at the importance plots, we see that the following variables are particularly important:

Using the well-known algorithm XGBoost, we can approximate a classifier \(f^{*}:\mathbb{R}^{p}\rightarrow \{0,1 \}\) by using the concept of boosting. With 774 observations from different counties, we split the data into 600 observations for training and the rest for testing. We obtain the following importance plot and test results:

We obtain an AUC of 0.6, which is better than random guessing (0.5) but indicates only modest predictive power. Moreover, the most important explanatory variables differ from those obtained from the random forest, suggesting that the model fit could be improved.
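
As a reminder of what the AUC measures, the sketch below computes it directly as the probability that a randomly chosen contaminated county receives a higher score than a randomly chosen uncontaminated one. The labels and scores are made up, and the random shuffle with a fixed seed is an assumption about how a 600/174 split could be drawn; neither reflects our actual data or code.

```python
import random


def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as one half.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical split of 774 county indices into 600 training and 174 test
# observations; the seed and the shuffle are assumptions for illustration.
indices = list(range(774))
random.Random(0).shuffle(indices)
train_idx, test_idx = indices[:600], indices[600:]

# Made-up labels and classifier scores for four test counties.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(len(train_idx), len(test_idx), auc)
```

A classifier that assigns the same score to every county obtains an AUC of 0.5 under this definition, which is the baseline against which 0.6 should be read.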

3. Effects of politics

We identify the political alignment of a state or county as a potential explanatory variable for water contamination because we would like to think about how government policies influence pollution levels. Given the current political narratives, we briefly investigated whether a Republican affiliation will explain rises in contamination levels.

The plot below shows the evolution of the political situation in several states. For California, we found a heuristic pattern: a strong Democrat presence appears to have enabled a steady drop in contamination levels from 2000 to 2016.

When looking at the pollution index that we had defined, we found years in which pollution jumped up or down after having stayed stable for quite some time. The linear regression and random forest analyses we were using to try to explain these trends were not giving good enough results, so we concluded that something else might be causing the trends. A candidate we focussed on was the political party in charge of the state at the time the water quality was recorded. The Democratic and Republican parties have wildly different ideas of how to approach environmental regulation, a fact which has come to a head with the Trump administration and the withdrawal from the Paris climate agreement. A quick plot of the parties on a map, checked against the pollution data, seemed to confirm our suspicions.

We obtained a dataset from Kaggle, which gave us the governors of the states in the years we were interested in. We then focussed on arsenic, a key contaminant that is measured across many states. On average, in years when a Republican was governor, the concentration of arsenic in water decreased by 0.6±4.5 micrograms/L per year, while in years when a Democrat was governor, it decreased by 0.0±0.3 micrograms/L per year. The large errors make these values hard to interpret, but if we look at the data in more depth, we can make a little more sense of it. First, states that are often controlled by the Democrats have consistently lower arsenic levels, making it harder to lower the concentration in the first place (2.9±1.5 micrograms/L for Democrat states vs 3.2±4.1 micrograms/L for Republican states). Further, the large variance in year-on-year decreases in arsenic concentration under Republican governors suggests that, perhaps due to their laissez-faire approach to protecting the environment, the effects are far less controlled.
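
The "mean ± standard deviation" figures above are summaries of year-on-year changes grouped by the governor's party. A minimal sketch of that calculation follows; the rows are invented, and attributing each change to the party in office in the later of the two years is an assumption made for this illustration.

```python
from statistics import mean, stdev

# Hypothetical (year, governor party, arsenic in micrograms/L) rows for one state.
series = [
    (2000, "R", 5.0),
    (2001, "R", 4.0),
    (2002, "R", 6.0),
    (2003, "D", 3.0),
    (2004, "D", 3.1),
    (2005, "D", 2.9),
]


def yearly_changes_by_party(rows):
    """Group year-on-year changes by the party governing in the later year."""
    rows = sorted(rows)  # sort by year
    changes = {}
    for (_, _, previous), (_, party, current) in zip(rows, rows[1:]):
        changes.setdefault(party, []).append(current - previous)
    return changes


changes = yearly_changes_by_party(series)
# Mean and sample standard deviation of the yearly change for each party.
summary = {party: (mean(v), stdev(v)) for party, v in changes.items()}
for party, (avg, sd) in summary.items():
    print(f"{party}: {avg:+.2f} ± {sd:.2f} micrograms/L per year")
```

With so few observations per group the standard deviation can easily exceed the mean, as it does in the figures reported above.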

The data was also analysed from another angle using linear regression. We first explored the national dataset, restricting it to the counties where we have observations of both contamination levels and state political party affiliation (Republican or Democratic). Political affiliation was only defined at state level, as political structures can vary wildly across US states, making it difficult to have a uniform measure at county level. Define the dummy variable \(D_{i,t}\) such that \(D_{i,t}=1\) if the governor of the state containing county \(i\) is a Democrat in year \(t\), and 0 if Republican. Then consider the regression \[ y_{i,t}= \alpha + \sum_{k=1}^{15} \beta_k year_{k,t} + \theta_0D_{i,t} + \sum_{k=1}^{15} \theta_k year_{k,t}D_{i,t}, \] where \(y_{i,t}\) is the logarithm of the contamination index, \(i\) indexes counties, \(t\) indexes time (here, years), and \(year_{k,t}\) is a dummy equal to 1 if year \(t\) is the \(k\)-th year after the base year and 0 otherwise.
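
To make the structure of this regression concrete, the sketch below builds one row of the design matrix for a single county-year. It assumes that the year terms are 15 dummy variables for the years after a base year of 2000 (16 observation years in total); the function name and this encoding are illustrative assumptions rather than a description of our actual code.

```python
def design_row(year, is_democrat, base_year=2000, n_years=16):
    """Regressors for one county-year, ordered as in the regression:
    [intercept, 15 year dummies, D, 15 year-by-D interactions].

    The base year has no dummy, so that the year dummies are not
    collinear with the intercept.
    """
    d = 1 if is_democrat else 0
    dummies = [1 if year == base_year + k else 0 for k in range(1, n_years)]
    interactions = [v * d for v in dummies]
    return [1] + dummies + [d] + interactions


row = design_row(2003, is_democrat=True)
print(len(row), row)
```

The columns line up with \(\alpha\), \(\beta_1, \dots, \beta_{15}\), \(\theta_0\) and \(\theta_1, \dots, \theta_{15}\), so each \(\theta_k\) measures how the Democrat effect in year \(k\) differs from the base-year effect \(\theta_0\).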

We observed the following results:

  1. \(R^2\) is small at 0.0091, but this is potentially due to low numerical variation between contaminated and non-contaminated statuses.

  2. \(\{\theta_k\}_{k=1}^{15}\) are all significantly negative, whereas \(\{\beta_k\}_{k=1}^{15}\) are around zero, meaning that under Democrat governorship, contamination is significantly lower.

We also consider the regression at the state level. Florida was consistently Republican in our observation years (2000-2015), so no regression could be run. For California and Connecticut, we restrict the dataset to each state and consider the regression \[ y_{i,t} = \alpha + \beta D_{i,t}. \] We obtained the following outcome:

The estimated \(\beta\) for California is negative but not statistically significant, whereas for Connecticut it is positive and significant. Hence, we conclude that the political impact in Connecticut is much stronger than in California. Arsenic contributes significantly to water pollution levels in Californian water supplies, and a large amount of arsenic is released during wildfires, a phenomenon common in California. We can therefore deduce that politics has little effect on Californian water quality because wildfires stay mostly constant whatever the political party in power, giving administrations little power to improve an already high-quality water supply. In conclusion, any political effect that we could measure seems to be dwarfed by mother nature herself.