Predicting Income Bracket Based on US Census Data

Takuma Fujiwara
10 min read · May 14, 2021


Authored by: Julian Fritz, Takuma Fujiwara, Chris Karouta, Samuel Rizzo

Introduction & Problem Statement

In the United States today, income information is hard to obtain, since the culture values privacy around such details. However, knowing the income statistics of an area can be very useful to the government and the private sector, so we tasked ourselves with inferring income information from demographic data. Demographic data about individuals is much more readily available than income data, so we wanted to create the best model possible that takes demographic information as input and predicts whether an individual’s income is above or below a threshold.

This model would be useful to the government and the private sector for a number of reasons. When a disaster or pandemic strikes, the government is tasked with providing relief, and it would want to deliver relief fastest to low-income areas. But incomes can change quickly during a pandemic or disaster, and with reported income lagging behind, a model that uses demographic data could be the government’s best bet for identifying low-income areas. The model is also potentially useful to private businesses, for example for targeted advertising or for deciding where to build the next store or office.

Our model will be trained on demographic data containing features such as age, education level, marital status, and hours worked per week, drawn from a well-known, commonly used dataset of adult US Census data from 1994. The dataset also labels each individual by whether their income is above $50,000, which makes it very convenient to train on. However, the dataset is very outdated, so we tasked ourselves with creating an additional 2020 dataset to train on. This way we would have an up-to-date model, along with the opportunity to compare 1994 and 2020 in terms of how well demographics predict income.

Data Collection

We collected two sets of census data: one set from 1994 and another from 2020. For the 1994 data set, we used the “Adult Data Set” originally donated by Ronny Kohavi and Barry Becker to UCI. The “Adult Data Set” is a data set commonly used by researchers, tutorials, and university curriculums due to its cleanliness and ease of use for binary classification models. We were able to find this data set on Kaggle and downloaded it from here: https://www.kaggle.com/wenruliu/adult-income-dataset.

The 2020 data set proved to be harder to find. Any work that we found regarding prediction of income based on demographics was referencing the “Adult Data Set” from 1994, and we could not find any updated data set that was cleaned like this set. Much time was spent trying to extract data from the decennial census conducted by the Census Bureau, but there were a few problems:

  • The decennial census stores data by county and not by individual, so it was almost impossible to generate a data set like the “Adult Data Set” where each data point is an individual.
  • At the time of the project, the 2010 decennial census was the most recent decennial census, and the 2020 decennial census results were still being processed.
  • Decennial census data were stored state by state (plus US territories), so there were more than 50 different .csv files to parse through.

We gave up on using the decennial census data, and searched for other sources. We were able to find another set of data collected by the Census Bureau called the ASEC survey. The features that were available matched the features available in the 1994 “Adult Data Set”, so we believe that the 1994 data set was also derived from ASEC or a similar survey, since 1994 is not a year when a decennial census is conducted. Using the ASEC data, we did our own preprocessing to create a new data set that resembles the 1994 data set. The original ASEC census data can be found here: https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html.

Data Pre-Processing

First, we pre-processed the raw 2020 ASEC data to look like the 1994 “Adult Data Set” data. The resulting data set can be found uploaded here: https://www.kaggle.com/takumafujiwara/2020-census-data.

The following are the steps taken to pre-process the 2020 data to look like the 1994 data:

  1. The original ASEC data had 840 features, so we extracted the 14 features that appear in the 1994 data set. We referenced this data dictionary: https://www2.census.gov/programs-surveys/cps/datasets/2020/march/ASEC2020ddl_pub_full.pdf to determine the meaning of the 840 features and identify the ones we needed.
  2. Rows were extracted using the same conditions as the 1994 data set: ((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0)). This greatly reduced the data from 157,959 rows to 38,576 rows, and it also cleaned the data.
  3. Column names were renamed to match the names in the 1994 data set.
  4. Categorical features were re-coded to match those in the 1994 data set.
  5. Two binary label features were derived from the income features: one indicating whether an individual makes more than $50k, and another indicating whether they make more than $90k (a sketch of this step and step 2 follows below).

By following the steps above, we were able to obtain a data set that is decently sized, cleaned, and as easy to work with as the “Adult Data Set”.
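
As a rough illustration, here is a minimal pandas sketch of steps 2 and 5. The filter column names (AAGE, AGI, AFNLWGT, HRSWK) come from the condition in step 2, while the file path and the TOTAL_INCOME column are assumptions for illustration; the real field names come from the ASEC data dictionary linked in step 1.

```python
import pandas as pd

# Load the raw ASEC person-level file (path is illustrative).
asec = pd.read_csv("asec_2020.csv")

# Step 2: apply the same row filters as the 1994 "Adult Data Set":
# (AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0)
mask = (
    (asec["AAGE"] > 16)
    & (asec["AGI"] > 100)
    & (asec["AFNLWGT"] > 1)
    & (asec["HRSWK"] > 0)
)
adult_2020 = asec.loc[mask].copy()

# Step 5: derive the two binary labels from a total-income column
# (the column name TOTAL_INCOME is an assumption).
adult_2020["income_over_50k"] = (adult_2020["TOTAL_INCOME"] > 50_000).astype(int)
adult_2020["income_over_90k"] = (adult_2020["TOTAL_INCOME"] > 90_000).astype(int)
```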

Once the 1994 and 2020 datasets contained the same features, we moved on to pre-processing techniques. For numerical features, we tested the effect of min-max scaling. However, after using 5-fold cross-validation to evaluate the data with and without min-max scaling, the unscaled data outperformed the scaled data.
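
The comparison looked roughly like the following sketch, assuming X and y hold the feature matrix and the binary >$50k labels; the Random Forest is just a stand-in for the models we tested.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# 5-fold cross-validated ROC AUC on the raw features...
score_raw = cross_val_score(
    RandomForestClassifier(random_state=0), X, y, cv=5, scoring="roc_auc"
).mean()

# ...and on min-max scaled features (the scaler is fit inside each fold
# via a pipeline so test data never leaks into the scaling).
score_scaled = cross_val_score(
    make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=0)),
    X, y, cv=5, scoring="roc_auc",
).mean()

print(f"without scaling: {score_raw:.3f}, with scaling: {score_scaled:.3f}")
```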

For the categorical features, we had to decide between one-hot encoding and label encoding. One-hot encoding struggles when a feature has many categories, because each category becomes its own column in the encoded data. For instance, the occupation feature alone would have added hundreds of columns, since there are hundreds of distinct answers to the occupation question. Because of that blow-up, and because the tree-based models we planned to use handle label-encoded nominal features well, we went with label encoding for our categorical features.
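
A sketch of the encoding step, assuming the features live in a pandas DataFrame df with categorical columns stored as strings:

```python
# Label-encode every categorical (string) column: each distinct category
# is mapped to a small integer code.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes
```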

Results/Model Comparison

After the preprocessing, we created multiple models and tuned their hyperparameters to find the settings that fit the data best. Not only were the hyperparameters tuned, but each model’s ROC AUC score is the average over 20 runs with 20 different random seeds, giving a more reliable measure of performance that is less sensitive to outlier runs.
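
The seed-averaging looked roughly like this sketch, with X and y as before; XGBoost stands in for any one of the models, and its tuned hyperparameters are omitted.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Average the test ROC AUC over 20 random train/test splits to smooth
# out seed-to-seed variation.
scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = XGBClassifier(random_state=seed)  # tuned hyperparameters omitted
    model.fit(X_tr, y_tr)
    scores.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

print(f"mean ROC AUC over 20 seeds: {np.mean(scores):.3f}")
```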

The bar chart above compares the models against each other as well as between 1994 and 2020. Most of the models, besides MLP, fit the data rather well, with ROC AUC scores above 0.85. CatBoost and XGBoost performed the best for both years. The fact that all of the boosting models achieved the highest AUC suggests our base models may have suffered from high bias, since boosting is particularly effective at reducing bias.

However, all models underperformed on the 2020 data compared to the 1994 data. This might be because income became less dependent on demographics between 1994 and 2020, or because of differences in how the data was collected in the two years.

Feature Importance

Here, we look at the average feature importance from Random Forest, XGBoost, Boosted Trees, and CatBoost to get an overall picture across models. In both years, education level and relationship status were very important in determining a person’s income, while race, country of birth, gender, and work sector were all assigned low importance by the models.
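
A sketch of how that average can be computed, assuming models is a list of the four fitted estimators (each exposes a feature_importances_ attribute) and feature_names lists the input columns:

```python
import pandas as pd

# Normalize each model's importances to sum to 1, then average across
# models so no single model dominates the comparison.
importances = pd.DataFrame(
    {
        type(m).__name__: m.feature_importances_ / m.feature_importances_.sum()
        for m in models
    },
    index=feature_names,
)
importances["mean"] = importances.mean(axis=1)
print(importances.sort_values("mean", ascending=False))
```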

However, compared to 1994, relationship status, reported capital gains, and marital status all declined in importance, meaning the models trained on 2020 data assigned less weight to those features. In their place, education level, age, hours worked per week, and occupation all increased in importance in 2020.

We must emphasize that these features were important only for determining the binary income bracket (above or below $50k in 1994 dollars); the feature importances may change when predicting a numerical income value.

Here we look at just the Random Forest model, comparing 1994 feature importance on the left with 2020 on the right, to see the same pattern on a smaller scale.

Race, gender, and country of birth were not influential in either year, whereas the model focused more on education, age, hours worked per week, capital gains, and relationship status. Looking at the changes between 1994 and 2020, education, hours worked per week, and age all increased in importance, while capital gains and relationship status decreased.

Stacking Models

After tuning our six different models, we stacked them to see if that would improve our score. Logistic Regression, Random Forest, Boosted Trees, MLP, CatBoost, and XGBoost were stacked together, with Logistic Regression as the final estimator. The ROC AUC score improved for both the 1994 and 2020 data sets compared to CatBoost, our best-performing single model. For 1994, the score improved by 0.0138 to reach our best score of 0.941. For 2020, the score improved by 0.0459 to reach our best score of 0.910.
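
A minimal sketch of the stack using scikit-learn’s StackingClassifier; the base models’ tuned hyperparameters are omitted here, so the settings shown are placeholders.

```python
from catboost import CatBoostClassifier
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Stack the six base models; a Logistic Regression combines their
# out-of-fold predictions as the final estimator.
stack = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier()),
        ("gbt", GradientBoostingClassifier()),
        ("mlp", MLPClassifier(max_iter=500)),
        ("catboost", CatBoostClassifier(verbose=0)),
        ("xgb", XGBClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)  # X_train, y_train: assumed training split
```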

Applying Outdated Model

Part of the reason we chose this project is the lack of current demographic-income datasets. The 1994 dataset is commonly used by universities, tutorials, and machine learning competitions, yet it is so outdated that it is likely not useful for modern applications. We wanted to test that by applying the 1994-trained model to the 2020 dataset, to see how well an outdated model could predict whether an individual in 2020 makes more than $90,000 (roughly $50,000 in 1994 dollars). We took the 1994 stacked model, which scored 0.94 on its own data, ran its predictions on the 2020 dataset, and found that it scored just under 0.75.
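
The cross-era evaluation itself amounts to a couple of lines; stack_1994, X_2020, and y_2020_over_90k are assumed names for the fitted 1994 stacked model and the 2020 features and >$90k labels.

```python
from sklearn.metrics import roc_auc_score

# Score the 1994-trained stacked model on the 2020 data against the
# >$90k label, the rough inflation-adjusted counterpart of 1994's $50k cutoff.
probs = stack_1994.predict_proba(X_2020)[:, 1]
print(f"1994 model on 2020 data: {roc_auc_score(y_2020_over_90k, probs):.3f}")
```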

The red bar on the right of the chart shows how much worse that 0.75 score is compared to our other models. The outdated model’s poor performance reinforces our motivation for creating the 2020 dataset, so that we can have a model that works on new data. Additionally, we ran the 2020-trained stacked model on the 1994 dataset and got a score of 0.85, shown in the chart as the blue bar on the right. That score is significantly better than the 1994 model achieved on 2020 data, but still worse than the 1994 model on its own data. This is consistent with the idea that the 2020 dataset has less predictive power than the 1994 dataset, which leads every model to score higher on the 1994 data than on the 2020 data.

Conclusion

We took a popular but outdated dataset for predicting an income threshold from demographic data, built a much more current 2020 version, and trained accurate models on both. Our results showed that our best model came from stacking Logistic Regression, MLP, Boosted Trees, Random Forest, XGBoost, and CatBoost together. With this stacked model we achieved an average ROC AUC test score of 0.91 on the 2020 dataset, which met our goal of at least 0.9 for a well-performing model. The 1994 stacked model scored higher at 0.94, but that data is so outdated that its model is mainly useful as a point of comparison with a current one.

Along with the stacked model, all of our base models performed significantly better on the 1994 dataset than on the 2020 dataset. This indicates that the 2020 dataset does not have as much predictive power as the 1994 dataset. We believe this could be because people’s income in 2020 is not as correlated with demographics as it was in 1994, but we are careful not to over-interpret, since we observe only correlation, not causation. The lower predictive power of the 2020 dataset could simply be due to how and which samples were collected.

To make an additional comparison, we applied the 1994 stacked model to the 2020 data, where it scored 0.75, far worse than it did on 1994 data. This shows that an outdated model is a very poor estimator on current data, and underscores the value of creating a 2020 dataset so that our model can have modern applications. We also ran the 2020 model on the 1994 data and got a significantly higher score of 0.85, consistent with the idea that the 1994 dataset has more predictive power.

We also compared the feature importances across the two years. While education level was important in both years, it was significantly more important in 2020, which is consistent with the growing share of college-educated individuals in the US. In addition, relationship status was hugely important in 1994, but its importance, along with that of marital status, declined in 2020. This could be because the younger, unmarried generation was more likely to have a high income in 2020 than in 1994, due in part to increased college education rates.

Overall, we are satisfied with the performance of our stacked 2020 model and glad that it enabled so many comparisons with the outdated model. The 2020 dataset that we created is posted on Kaggle, and we hope our work will be useful to the Kaggle community in the future.

Link to our 2020 dataset: https://www.kaggle.com/takumafujiwara/2020-census-data

Link to our GitHub: https://github.com/TakumaFujiwara/DSLabFinalProject

References

https://www2.census.gov/programs-surveys/cps/datasets/2020/march/ASEC2020ddl_pub_full.pdf
