We present steps to generate public data sets for benchmarking and illustration purposes for air pollution and health studies. In most of these studies, the health care data cannot be shared with the public; as a result, there are no public data sets to be used as benchmark data set for testing the packages or illustrating their functionalities. CMS has generated synthetic data for the 2008-2010 range for Medicare data. This report uses part of these data, census, and exposure data to compile the study data set.
There are three major categories for air pollution and health studies data sets: Exposure, Confounders, and Outcome data. We collect these data from different resources and compile them into the study data set. The compilation and aggregation can be done based on spatial and temporal features. This report is based on the annual average for 2010, aggregated at the county level for the Contiguous United States. Figure 1 shows the schematic data pipeline.
Each of these sources includes numerous fields. The research question will determine which fields should be used in the study. We go through each resource and briefly talk about the fields of interest in the following. We group these fields by Federal Information Processing Standards FIPS code (Environmental Systems Research Institute (2021)). Each county in the United States has its dedicated FIPS code. We use the shapefile provided by the United States Census Bureau (United States Census Bureau (2021)). Also, in this report, we limit the study region to the Contiguous United States. The states’ FIPS code are mentioned on Natural Resources Conservation Service (2021). As a result, we drop the following states:
02
: Alaska,15
: Hawaii,60
: American Samoa,66
: Guam,69
: Northern Mariana Islands,72
: Puerto Rico, and78
: Virgin Islands.# Load libraries
# Geospatial
library(sf)
library(rgdal)
library(dplyr)
library(ncdf4)
# SAS Data
library(haven)
# Census
library(tidycensus)
# General
library(fst)
library(PCICt)
library(memoise)
library(tidyverse)
library(lubridate)
library(data.table)
# Setting up environment and helper functions
source("setup_env_vals.R")
source("r0_utility_functions.R")
The following code loads the shapefiles data and plots counties in New England.
# create the file path
fpath_c <- file.path(get_options("input_dir"),
"public/Geospatial", "gz_2010_us_050_00_500k/")
# read shapefiles
county_shape_file <- rgdal::readOGR(fpath_c)
#> OGR data source with driver: ESRI Shapefile
#> Source: "/Users/nak443/Documents/Naeem_folder_mac_h/Research_projects/F2021_001_Harvard/experiment_001_20211205_harvard_project/input_data/public/Geospatial/gz_2010_us_050_00_500k", layer: "gz_2010_us_050_00_500k"
#> with 3221 features
#> It has 6 fields
# transform shapefiles into sp object
county_shape_file <- spTransform(county_shape_file,
CRS("+proj=longlat +datum=WGS84"))
non_c_states <- c("02","15","60","66","69","72","78")
cs_inland <- county_shape_file[!county_shape_file$STATE %in% non_c_states, ]
# Generate FIPS code
cs_inland$FIPS <- paste(cs_inland$STATE,cs_inland$COUNTY,sep = "")
# select states in New England
NE_c <- cs_inland[cs_inland$STATE %in% c("9","23","25","33","44","50"),]
plot(NE_c)
The following table shows the list of the states and the number of counties per state.
cs_inland_val <- data.frame(cs_inland[, c("STATE","COUNTY", "NAME","FIPS")])
state_county <- cs_inland_val %>% group_by(STATE) %>% count()
state_fips <- read.csv(file.path(get_options("input_dir"),
"public/Geospatial", "FIPS_state_crosswalk.csv"))
state_fips$STATE <- sprintf("%02d", state_fips$FIPS)
state_fips$FIPS <- NULL
joined_data <- inner_join(state_county, state_fips, by="STATE")
state_name_county <- data.frame(joined_data$NAME, joined_data$POSTAL.CODE,
joined_data$STATE, joined_data$n)
names(state_name_county) <- c("Name", "Code", "FIPS", "Num of Counties")
state_name_county
In summary, in 2010, there are 49 contiguous states (48 + DC), and 3109 counties. The crosswalk file to convert FIPS code into State name is available from different resources including Natural Resources Conservation Service (2021).
In this report, we are interested in the causal effect of air pollution on the mortality rate. The exposure parameter is \(PM_{2.5}\). Di et al. (2019) provided daily, and annual \(PM_{2.5}\) estimates at \(1\ km \times 1 \ km\) grid cells in the entire United States. The data can be downloaded from Di et al. (2021).
# Set up memoization
cd <- cachem::cache_disk(pr_cache)
m_match_exposure_to_sitecode <- memoise(match_exposure_to_sitecode, cache = cd)
m_map_point_shape <- memoise(map_point_shape, cache = cd)
# Create file path for each rds file and read data.
fpath <- file.path(get_options("input_dir"),"public/Di_2019","2010.rds")
usgpath <- file.path(get_options("input_dir"),"public/Di_2019","USGridSite.rds")
di_pm25_2010 <- readRDS(fpath)
us_grid <- readRDS(usgpath)
# match exposure to sitecode
pm25_2010 <- m_match_exposure_to_sitecode(site_code = us_grid,
exp_data = di_pm25_2010,
exp_name = "pm25")
# Generate FIPS code for the shapefile
cs_inland$FIPS <- paste(cs_inland$STATE,cs_inland$COUNTY,sep = "")
# Convert PM2.5 data frame to SpatialPoints data frame
coordinates(pm25_2010) <- ~Lon+Lat
# Join PM2.5 points with shapefile polygons
cs_inland_pm_2010 <- m_map_point_shape(shape_object = cs_inland,
point_object = pm25_2010,
value_name = "pm25",
extra_fields_name = c("STATE","COUNTY",
"NAME","FIPS"),
group_field_name = "FIPS",
field_na_drop = "STATE")
# Group and aggregate data for FIPS code level
pm_data <- cs_inland_pm_2010 %>%
group_by(FIPS) %>%
summarise(qd_mean_pm25 = mean(pm25))
# merge the new values with the shape file.
merged_obj <- merge(cs_inland, pm_data, by=c("FIPS"))
spplot(merged_obj, zcol = "qd_mean_pm25",
col.regions=heat.colors(51, rev = TRUE),
xlab="Longitude", ylab="Latitude",
main="Mean PM2.5 in the Contiguous United States (2010)")
There are 3109 counties with exposure data. Number of counties with missing data: 0 counties.
The main reference for getting the census data is the United States Census Bureau. There are numerous studies and surveys for different geographical resolutions. Reviewing these details is beyond the scope of this report; however, it is strongly recommended to get familiar with tables and their labels. The census bureau has a convenient API to download data. tidycensus is an R package that allows users to interface with a select number of the US Census Bureau data APIs and return tidyverse-ready data frames (Walker and Herman (2021)). The census data comes for different geographical resolutions. We download data for counties from acs5 source.
Here is the list of variables that are available in 2010 for acs5:
census_api_key(Sys.getenv("CENSUS_API_KEY"))
#> To install your API key for use in future sessions, run this function with `install = TRUE`.
census_vars = c("B01001_020", "B01001_021", "B01001_022", "B01001_023",
"B01001_024", "B01001_025",
"B01001_044", "B01001_045", "B01001_046", "B01001_047",
"B01001_048", "B01001_049",
"B17001_015", "B17001_016",
"B17001_029", "B17001_030",
"B01001I_014", "B01001I_015", "B01001I_016",
"B01001I_029", "B01001I_030", "B01001I_031",
"B01001B_014", "B01001B_015", "B01001B_016",
"B01001B_029", "B01001B_030", "B01001B_031",
"B01001H_014", "B01001H_015", "B01001H_016",
"B01001H_029", "B01001H_030", "B01001H_031",
"B01001C_014", "B01001C_015", "B01001C_016",
"B01001C_029", "B01001C_030", "B01001C_031",
"B01001D_014", "B01001D_015", "B01001D_016",
"B01001D_029", "B01001D_030", "B01001D_031",
"B15001_035", "B15001_036", "B15001_037",
"B15001_076", "B15001_077", "B15001_078",
"B19049_005",
"B25077_001",
"B01003_001")
v10 <- load_variables(2010, "acs5", cache = TRUE)
var_table <- v10[v10$name %in% census_vars,]
var_table$label <- sub("Estimate!!","",var_table$label)
var_table$label <- sub("Total!!","",var_table$label)
var_table
The mortality rate (CMS data) is based on Medicare data, mainly reported for patients over 65 years old. In the following we download the data and create the variables.
census_2010 <- get_acs(geography = "county",
variables = census_vars,
year = 2010,
survey = "acs5",
output = "wide")
#> Getting data from the 2006-2010 5-year ACS
# Drop Margin of error column.
census_2010_e <- census_2010 %>% select(-ends_with("M"))
census_2010_e$STATE <- substr(census_2010_e$GEOID, 1, 2)
census_2010_e$COUNTY <- substr(census_2010_e$GEOID, 3, 5)
inland_states <- state_county$STATE
census_2010_e_inland <- census_2010_e[census_2010_e$STATE %in% inland_states, ]
# add new variables
# add new variables
census_2010_e_m <- census_2010_e_inland %>%
add_column(tmp_total_pop = (.$B01001_020E + .$B01001_021E + .$B01001_022E +
.$B01001_023E + .$B01001_024E + .$B01001_025E +
.$B01001_044E + .$B01001_045E + .$B01001_046E +
.$B01001_047E + .$B01001_048E + .$B01001_049E))
census_2010_e_m <- census_2010_e_m %>%
add_column(cs_poverty = (.$B17001_015E + .$B17001_016E +
.$B17001_029E + .$B17001_030E)/
(.$tmp_total_pop),
cs_hispanic = (.$B01001I_014E + .$B01001I_015E +
.$B01001I_016E + .$B01001I_029E +
.$B01001I_030E + .$B01001I_031E)/
(.$tmp_total_pop),
cs_black = (.$B01001B_014E + .$B01001B_015E +
.$B01001B_016E + .$B01001B_029E +
.$B01001B_030E + .$B01001B_031E)/
(.$tmp_total_pop),
cs_white = (.$B01001H_014E + .$B01001H_015E +
.$B01001H_016E + .$B01001H_029E +
.$B01001H_030E + .$B01001H_031E)/
(.$tmp_total_pop),
cs_native = (.$B01001C_014E + .$B01001C_015E +
.$B01001C_016E + .$B01001C_029E +
.$B01001C_030E + .$B01001C_031E)/
(.$tmp_total_pop),
cs_asian = (.$B01001D_014E + .$B01001D_015E +
.$B01001D_016E + .$B01001D_029E +
.$B01001D_030E + .$B01001D_031E)/
(.$tmp_total_pop),
cs_ed_below_highschool = (
.$B15001_036E + .$B15001_037E +
.$B15001_077E + .$B15001_078E)/
(.$B15001_035E + .$B15001_076E),
cs_household_income = .$B19049_005E,
cs_median_house_value = .$B25077_001E,
cs_total_population = .$B01003_001E
)
## Extract area of each county
tmp_census <- census_2010_e_m
colnames(tmp_census)[which(names(tmp_census) == "GEOID")] <- "FIPS"
tmp_obj <- merge(cs_inland, tmp_census, by=c("FIPS"))
fips_area <- data.frame(tmp_obj[, c("FIPS","CENSUSAREA")])
tmp_census <- tmp_obj <- NULL
colnames(fips_area)[which(names(fips_area) == "CENSUSAREA")] <- "cs_area"
# One county is missing data (FIPS: 48301).
census_2010_e_m <- census_2010_e_m %>%
mutate_all(~replace(., is.na(.), 0))
tmp <- rowSums(census_2010_e_m[, c("cs_hispanic", "cs_black", "cs_white",
"cs_native", "cs_asian")])
census_2010_e_m$cs_other <- ifelse((1-tmp)<0, 0, 1-tmp)
# Drop initial columns
census_2010_processed <- census_2010_e_m %>%
select(!matches("B[0-9]+[A-Z]*_[0-9]+E"))
census_2010_processed$tmp_total_pop <- NULL
colnames(census_2010_processed)[which(names(census_2010_processed) == "GEOID")] <- "FIPS"
census_2010_processed$STATE <- NULL
census_2010_processed$COUNTY <- NULL
# Adding area of county to the data.
census_2010_processed <- merge(census_2010_processed, fips_area, by="FIPS")
census_2010_processed[census_2010_processed$FIPS == "48301",]
There is a problem with Loving County data in Texas. We use average of surrounding counties to impute missing values.
compute_cols <- c("cs_poverty", "cs_hispanic", "cs_black",
"cs_white",
"cs_native", "cs_asian",
"cs_household_income",
"cs_other", "cs_ed_below_highschool")
tmp_four <- census_2010_processed[census_2010_processed$FIPS %in% c("48495",
"48475",
"48389",
"35025"),
compute_cols]
tmp_mean <- colMeans(tmp_four)
tmp_mean[["cs_household_income"]] <- floor(tmp_mean[["cs_household_income"]])
for (item in compute_cols){
census_2010_processed[census_2010_processed$FIPS=="48301", item] <- getElement(tmp_mean, item)
}
# compute population density
# https://www.socialexplorer.com/data/C2000/metadata/?ds=SE&var=T003_001
census_2010_processed["cs_population_density"] <- (
census_2010_processed$cs_total_population/census_2010_processed$cs_area)
census_2010_processed["cs_log_population_density"] <- log10(census_2010_processed["cs_population_density"])
census_2010_processed["cs_log_total_population"] <- log10(census_2010_processed["cs_total_population"])
census_data <- census_2010_processed
# Take a look at data
# merge the new values with the shape file.
merged_obj <- merge(cs_inland, census_data, by=c("FIPS"))
spplot(merged_obj, zcol = "cs_median_house_value",
col.regions=terrain.colors(51, rev = FALSE),
xlab="Longitude", ylab="Latitude",
main="Median House Value in the Contiguous United States (2010)")
spplot(merged_obj, zcol = "cs_area",
col.regions=terrain.colors(51, rev = FALSE),
xlab="Longitude", ylab="Latitude",
main="Area of counties in the Contiguous US (2010) - mi^2")
spplot(merged_obj, zcol = "cs_log_total_population",
col.regions=terrain.colors(51, rev = FALSE),
xlab="Longitude", ylab="Latitude",
main="Log10 of total population of each counties in the Contiguous US (2010)")
spplot(merged_obj, zcol = "cs_log_population_density",
col.regions=terrain.colors(51, rev = FALSE),
xlab="Longitude", ylab="Latitude",
main="Log10 of population density in the Contiguous US (2010)")
There are 3109 counties’ data in the census data. The following shows the available column names and number of missing data per column in the census data.
sapply(census_data, function(x) sum(is.na(x)))
#> FIPS NAME cs_poverty
#> 0 0 0
#> cs_hispanic cs_black cs_white
#> 0 0 0
#> cs_native cs_asian cs_ed_below_highschool
#> 0 0 0
#> cs_household_income cs_median_house_value cs_total_population
#> 0 0 0
#> cs_other cs_area cs_population_density
#> 0 0 0
The Centers for Disease Control and Prevention (CDC), provides the Behavioral Risk Factor Surveillance System (Centers for Disease Control and Prevention (2021)), which is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors. In this report, we are interested in the participants’ body mass index (BMI) and smoking status. The BRFSS data also is provided at the county level. We download the data in SAS format from this webpage and load it into R using the haven R package (Wickham and Miller (2021)). The definition of variables is mentioned in this codebook.
brfss_2010 <- read_xpt(file.path(get_options("input_dir"),
"public/CDC_data","CDBRFS10.XPT"))
varlist <- c("_STATE", "CTYCODE", "_BMI4","_SMOKER3")
brfss_2010_subset <- extract_brfss_vars(brfss_2010, varlist = varlist)
brfss_2010 <- NULL
# Modify column names
names(brfss_2010_subset) <- c("state","county","bmi","smoker")
# Polish vars
brfss_data_2010 <- polish_bfrss_vars(brfss_2010_subset)
The BRFSS data is a phone survey, as a result, there are many missing values. From entire CDC data, 7.49% of data has missing county values, and there are 696 counties without any reported values.
merged_obj <- merge(cs_inland, brfss_data_2010, by=c("FIPS"))
merged_obj$missing <- as.factor(ifelse(is.na(merged_obj$cdc_mean_bmi), 1, 0))
spplot(merged_obj, zcol = "missing",
col.regions=c("snow","red"),
col = "grey61",
xlab="Longitude", ylab="Latitude",
main="Counties with Missing CDC Data")
Discussing methods for imputing missing values is beyond the scope of this report. In this report, we generate a normal distribution of each parameter for each state and choose a value at random for counties with missing values.
brfss_data <- data.frame(
merged_obj[, c("FIPS", "STATE", "COUNTY", "cdc_mean_bmi",
"cdc_pct_cusmoker", "cdc_pct_sdsmoker",
"cdc_pct_fmsmoker", "cdc_pct_nvsmoker",
"cdc_pct_nnsmoker")])
# 9999 is assigned for those who answered the survey but refused to answer
# specific questions, we treat them as missing values
brfss_data$cdc_mean_bmi <- ifelse(brfss_data$cdc_mean_bmi == 9999, NA,
brfss_data$cdc_mean_bmi)
brfss_data <- impute_cdc(
data = brfss_data,
param_list = c("cdc_mean_bmi", "cdc_pct_cusmoker",
"cdc_pct_sdsmoker","cdc_pct_fmsmoker",
"cdc_pct_nvsmoker",
"cdc_pct_nnsmoker"))
merged_obj <- merge(cs_inland, brfss_data, by=c("FIPS"))
spplot(merged_obj, zcol = "cdc_mean_bmi",
col.regions=heat.colors(51, rev = TRUE),
xlab="Longitude", ylab="Latitude",
main="Mean Body Mass Index in the Contiguous United States (2010)")
Climatology Lab at the University of California, Merced, provides the GridMET data (Abatzoglou (2013)). The data set is daily surface meteorological data covering the contiguous United States. This report is interested in average annual and seasonal maximum temperature, average annual and seasonal maximum humidity, and annual and seasonal specific humidity. The Winter and Summer time line is according to the following:
Data comes in NetCDF4 format. We use ncdf4 to load the data (Pierce (2019)).
cd <- cachem::cache_disk(pr_cache)
m_aggregate_netcdf_data <- memoise(aggregate_netcdf_data, cache = cd)
m_map_point_shape <- memoise(map_point_shape, cache = cd)
year <- 2010
# Temperature
tmmx_path <- file.path(get_options("input_dir"), "public/gridmet_data",
paste0("tmmx_",year, ".nc"))
# Last year (ly)
tmmx_path_ly <- file.path(get_options("input_dir"),
"public/gridmet_data",
paste0("tmmx_",year-1,".nc"))
# Humidity
rmax_path <- file.path(get_options("input_dir"), "public/gridmet_data",
paste0("rmax_",year, ".nc"))
rmax_path_ly <- file.path(get_options("input_dir"),
"public/gridmet_data",
paste0("rmax_",year-1,".nc"))
sph_path <- file.path(get_options("input_dir"), "public/gridmet_data",
paste0("sph_",year,".nc"))
sph_path_ly <- file.path(get_options("input_dir"),
"public/gridmet_data",
paste0("sph_",year-1,".nc"))
tmmx <- compile_gridmet_data(nc_path = tmmx_path,
param_name = "air_temperature",
start_date = paste0(year,"-01-01"),
end_date = paste0(year,"-12-30"),
agg_fun = mean,
shape_obj = cs_inland,
extra_fields_name = c("STATE","COUNTY",
"NAME","FIPS"),
group_field_name = "FIPS",
field_na_drop = "STATE",
agg_field_name = "mean_tmmx")
tmmx_summer <- compile_gridmet_data(nc_path = tmmx_path,
param_name = "air_temperature",
start_date = paste0(year,"-06-01"),
end_date = paste0(year, "-08-31"),
agg_fun = mean,
shape_obj = cs_inland,
extra_fields_name = c("STATE","COUNTY",
"NAME","FIPS"),
group_field_name = "FIPS",
field_na_drop = "STATE",
agg_field_name = "mean_summer_tmmx")
tmmx_winter <- compute_winter(path_cy = tmmx_path,
path_ly = tmmx_path_ly,
year_cy = year,
param_name = "air_temperature",
field_name = "winter_tmmx")
rmx <- compile_gridmet_data(nc_path = rmax_path,
param_name = "relative_humidity",
start_date = paste0(year,"-01-01"),
end_date = paste0(year,"-12-30"),
agg_fun = mean,
shape_obj = cs_inland,
extra_fields_name = c("STATE","COUNTY",
"NAME","FIPS"),
group_field_name = "FIPS",
field_na_drop = "STATE",
agg_field_name = "mean_rmx")
rmx_summer <- compile_gridmet_data(nc_path = rmax_path,
param_name = "relative_humidity",
start_date = paste0(year,"-06-01"),
end_date = paste0(year, "-08-31"),
agg_fun = mean,
shape_obj = cs_inland,
extra_fields_name = c("STATE","COUNTY",
"NAME","FIPS"),
group_field_name = "FIPS",
field_na_drop = "STATE",
agg_field_name = "mean_summer_rmx")
rmx_winter <- compute_winter(path_cy = rmax_path,
path_ly = rmax_path_ly,
year_cy = year,
param_name = "relative_humidity",
field_name = "winter_rmx")
sph <- compile_gridmet_data(nc_path = sph_path,
param_name = "specific_humidity",
start_date = paste0(year,"-01-01"),
end_date = paste0(year,"-12-30"),
agg_fun = mean,
shape_obj = cs_inland,
extra_fields_name = c("STATE","COUNTY",
"NAME","FIPS"),
group_field_name = "FIPS",
field_na_drop = "STATE",
agg_field_name = "mean_sph")
sph_summer <- compile_gridmet_data(nc_path = sph_path,
param_name = "specific_humidity",
start_date = paste0(year,"-06-01"),
end_date = paste0(year, "-08-31"),
agg_fun = mean,
shape_obj = cs_inland,
extra_fields_name = c("STATE","COUNTY",
"NAME","FIPS"),
group_field_name = "FIPS",
field_na_drop = "STATE",
agg_field_name = "mean_summer_sph")
sph_winter <- compute_winter(path_cy = sph_path,
path_ly = sph_path_ly,
year_cy = year,
param_name = "specific_humidity",
field_name = "winter_sph")
## Merge data
multi_merge <- function(x, y){
df <- left_join(x, y, by= "FIPS")
return(df)
}
df_gridmet <- Reduce(multi_merge, list(tmmx,
tmmx_summer,
tmmx_winter,
rmx,
rmx_summer,
rmx_winter,
sph,
sph_summer,
sph_winter))
cn <- colnames(df_gridmet)
colnames(df_gridmet) <- c(cn[1], paste("gmet_", cn[2:10], sep = ""))
There are 3105 counties with GridMET data. Number of missing counties: 4 counties.
If we aggregate data and take a look at the difference between GridMET data and Initial Counties:
df_gridmet$STATE <- substr(df_gridmet$FIPS, 1, 2)
df_gridmet$COUNTY <- substr(df_gridmet$FIPS, 3, 5)
county_per_state_gridmet <- df_gridmet %>%
group_by(STATE) %>% count()
setdiff(county_per_state_gridmet, state_county)
According to the results, number of counties for state of Virginia (FIPS = 51
) is different with original shape files.
Number of counties for state of Virginia: 134
Number of counties for state of Virginia in the GridMET data: 130
Let’s print out the counties that are not in the GridMET data.
gr_missing_counties <- setdiff(cs_inland_val$COUNTY, df_gridmet$COUNTY)
cs_inland_val[cs_inland_val$STATE == 51 & cs_inland_val$COUNTY %in% gr_missing_counties, ]
We can take a look at these counties on the map:
vr_inland <- cs_inland[cs_inland$STATE==51,]
vr_inland$missing <- ifelse(vr_inland$COUNTY %in% gr_missing_counties, 1, 0)
vr_inland$missing <- as.factor(vr_inland$missing)
spplot(vr_inland, zcol = "missing",
col.regions=c("snow","red"),
col = "grey61",
xlab="Longitude", ylab="Latitude",
main="Counties in the State of Virginia with Missing GridMET Values")
As one can see, it seems there are counties inside other counties, and because of small sizes none of the GridMET grid points was located inside those counties. With further research, we learn that these are independent cities located in the Commonwealth of Virginia. For the purpose of this report we assign the surrounding counties values for this counties.
assign_data <- function(data, from_fips, to_fips){
tmp <- data[data$FIPS == from_fips,]
tmp$FIPS <- to_fips
tmp$STATE <- substr(to_fips, 1, 2)
tmp$COUNTY <- substr(to_fips, 3, 5)
return(tmp)
}
gridmet_data <- rbind(df_gridmet,
assign_data(df_gridmet, 51005, 51580),
assign_data(df_gridmet, 51163, 51678),
assign_data(df_gridmet, 51059, 51610),
assign_data(df_gridmet, 51019, 51515))
gridmet_data <- gridmet_data[, !colnames(gridmet_data) %in% c("STATE","COUNTY")]
# merge the new values with the shape file.
merged_obj <- merge(cs_inland, gridmet_data, by=c("FIPS"))
There are 3109 counties with GridMET data. Number of missing counties: 0 counties.
spplot(merged_obj, zcol = "gmet_mean_winter_tmmx",
col.regions=topo.colors(51, rev = FALSE),
xlab="Longitude", ylab="Latitude",
main="Mean max temp. during winter in the Contiguous US (2009-10)")
spplot(merged_obj, zcol = "gmet_mean_summer_tmmx",
col.regions=topo.colors(51, rev = FALSE),
xlab="Longitude", ylab="Latitude",
main="Mean max temp. during summer in the Contiguous US (2010)")
Centers for Medicare and Medicaid Services(CMS) provides synthetic data at the county level for 2008-2010 (Centers for Medicare & Medicaid Services (2021)). CMS data comes in SSA code that is different than FIPS code. We use SSA to Federal Information Processing Series (FIPS) State and County Crosswalk file to convert SSA code to FIPS code (National Bureau of Economic Research (2021)).
# Add functions to memoization
cd <- cachem::cache_disk(pr_cache)
m_extract_CMS_data <- memoise(extract_CMS_data, cache = cd)
m_aggregate_cms_data <- memoise(aggregate_cms_data, cache = cd)
# Data directory
# Combine fst data.
CMS_DATA_DIR <- file.path(get_options("input_dir"),"public/cms_data")
ssa2fips_data <- read.csv(file.path(get_options("input_dir"),"public","cms_data","ssa_fips_state_county2011.csv"))
ssa2fips_data$cbsa <- NULL
ssa2fips_data$cbsaname <- NULL
# remove empty rows
ssa2fips_data <- ssa2fips_data[!is.na(ssa2fips_data$ssacounty),]
ssa2fips_data$FIPS <- sprintf("%05d", ssa2fips_data$fipscounty)
ssa2fips_data$SSA <- sprintf("%05d", ssa2fips_data$ssacounty)
ssa2fips_cross <- ssa2fips_data[, c("SSA","FIPS")]
cms_2010 <- m_extract_CMS_data(CMS_DATA_DIR, 2010)
cms_2010_fips <- merge(cms_2010, ssa2fips_cross, by = "SSA")
agg_data_cms_2010 <- m_aggregate_cms_data(cms_2010_fips)
# select inland data
agg_data_cms_2010$STATE <- substr(agg_data_cms_2010$FIPS, 1, 2)
agg_data_cms_2010$COUNTY <- substr(agg_data_cms_2010$FIPS, 3, 5)
inland_states <- state_county$STATE
agg_data_cms_2010 <- agg_data_cms_2010[agg_data_cms_2010$STATE %in% inland_states, ]
agg_data_cms_2010 <- agg_data_cms_2010[, !colnames(agg_data_cms_2010) %in% c("STATE","COUNTY")]
There are 3054 counties with CMS data. Number of missing counties: 55 counties.
The following figure represents counties with missing values.
merged_obj <- merge(cs_inland, agg_data_cms_2010, by=c("FIPS"))
merged_obj$missing <- ifelse(is.na(merged_obj$cms_mortality_pct), 1, 0)
merged_obj$missing <- as.factor(merged_obj$missing)
spplot(merged_obj, zcol = "missing",
col.regions=c("snow","red"),
col = "grey61",
xlab="Longitude", ylab="Latitude",
main="Counties with Missing CMS Data")
# Number of missing values by state
cms_attr <- c("cms_mortality_pct", "cms_white_pct", "cms_black_pct",
"cms_others_pct", "cms_hispanic_pct", "cms_female_pct")
We use median value of state for imputing missing values.
cms_data <- data.frame(merged_obj[, c("STATE","COUNTY", "FIPS", "missing",
cms_attr)])
for (item in cms_attr){
cms_data <- cms_data %>%
group_by(STATE) %>%
mutate(new_val=ifelse(is.na(.data[[item]]),
median(.data[[item]], na.rm=TRUE),
.data[[item]]))
cnames <- colnames(cms_data)
cms_data <- cms_data[ , !(names(cms_data) %in% c(item))]
colnames(cms_data)[which(cnames=="new_val")-1] <- item
}
cms_data <- cms_data[, !colnames(cms_data) %in% c("STATE","COUNTY","missing")]
merged_obj <- merge(cs_inland, cms_data, by=c("FIPS"))
spplot(merged_obj, zcol = "cms_mortality_pct",
col.regions=heat.colors(51, rev = TRUE),
xlab="Longitude", ylab="Latitude",
main="Mortality rate in the Contiguous United States (2010)")
We generated county-level data for each resources for 2010. Now we need to join all data based on FIPS code.
study_dataset <- Reduce(multi_merge, list(pm_data,
census_data,
brfss_data,
gridmet_data,
cms_data))
Number of data samples: 3109.
Number of missing value: 0.
sapply(study_dataset, function(x) sum(is.na(x)))
#> FIPS qd_mean_pm25 NAME
#> 0 0 0
#> cs_poverty cs_hispanic cs_black
#> 0 0 0
#> cs_white cs_native cs_asian
#> 0 0 0
#> cs_ed_below_highschool cs_household_income cs_median_house_value
#> 0 0 0
#> cs_total_population cs_other cs_area
#> 0 0 0
#> cs_population_density STATE COUNTY
#> 0 0 0
#> cdc_mean_bmi cdc_pct_cusmoker cdc_pct_sdsmoker
#> 0 0 0
#> cdc_pct_fmsmoker cdc_pct_nvsmoker cdc_pct_nnsmoker
#> 0 0 0
#> gmet_mean_tmmx gmet_mean_summer_tmmx gmet_mean_winter_tmmx
#> 0 0 0
#> gmet_mean_rmx gmet_mean_summer_rmx gmet_mean_winter_rmx
#> 0 0 0
#> gmet_mean_sph gmet_mean_summer_sph gmet_mean_winter_sph
#> 0 0 0
#> cms_mortality_pct cms_white_pct cms_black_pct
#> 0 0 0
#> cms_others_pct cms_hispanic_pct cms_female_pct
#> 0 0 0
We also categorize the states into four different regions:
state_fips <- state_fips[, !colnames(state_fips) %in% c("FIPS","NAME")]
state_fips <- rename(state_fips, STATE_CODE = POSTAL.CODE)
setDT(state_fips)
# define regions
NORTHEAST=c("NY","MA","PA","RI","NH","ME","VT","CT","NJ")
SOUTH=c("DC","VA","NC","WV","KY","SC","GA","FL","AL","TN","MS","AR","MD","DE",
"OK","TX","LA")
MIDWEST=c("OH","IN","MI","IA","MO","WI","MN","SD","ND","IL","KS","NE")
WEST=c("MT","CO","WY","ID","UT","NV","CA","OR","WA","AZ","NM")
state_fips[STATE_CODE %in% NORTHEAST, region := "NORTHEAST"]
state_fips[STATE_CODE %in% SOUTH, region := "SOUTH"]
state_fips[STATE_CODE %in% MIDWEST, region := "MIDWEST"]
state_fips[STATE_CODE %in% WEST, region := "WEST"]
study_dataset <- full_join(study_dataset, state_fips, by="STATE")
merged_obj <- merge(cs_inland, study_dataset, by=c("FIPS"))
merged_obj$region <- as.factor(merged_obj$region)
spplot(merged_obj, zcol = "region",
col.regions=c("lightgoldenrodyellow","lightcoral", "yellowgreen",
"lavender"),
xlab="Longitude", ylab="Latitude",
main="Study Regions")