Summary

We present steps to generate public data sets for benchmarking and illustration purposes for air pollution and health studies. In most of these studies, the health care data cannot be shared with the public; as a result, there are no public data sets to be used as benchmark data set for testing the packages or illustrating their functionalities. CMS has generated synthetic data for the 2008-2010 range for Medicare data. This report uses part of these data, census, and exposure data to compile the study data set.

Data

There are three major categories for air pollution and health studies data sets: Exposure, Confounders, and Outcome data. We collect these data from different resources and compile them into the study data set. The compilation and aggregation can be done based on spatial and temporal features. This report is based on the annual average for 2010, aggregated at the county level for the Contiguous United States. Figure 1 shows the schematic data pipeline.

Figure 1: Overview of data pipeline.

Each of these sources includes numerous fields. The research question will determine which fields should be used in the study. We go through each resource and briefly talk about the fields of interest in the following. We group these fields by Federal Information Processing Standards FIPS code (Environmental Systems Research Institute (2021)). Each county in the United States has its dedicated FIPS code. We use the shapefile provided by the United States Census Bureau (United States Census Bureau (2021)). Also, in this report, we limit the study region to the Contiguous United States. The states’ FIPS code are mentioned on Natural Resources Conservation Service (2021). As a result, we drop the following states:

02: Alaska,
15: Hawaii,
60: American Samoa,
66: Guam,
69: Northern Mariana Islands,
72: Puerto Rico, and
78: Virgin Islands.

# Load libraries

# Geospatial 
library(sf)
library(rgdal)
library(dplyr)
library(ncdf4)

# SAS Data
library(haven)

# Census
library(tidycensus)

# General
library(fst)
library(PCICt)
library(memoise)
library(tidyverse)
library(lubridate)
library(data.table)

# Setting up environment and helper functions
source("setup_env_vals.R")
source("r0_utility_functions.R")

The following code loads the shapefiles data and plots counties in New England.

# create the file path
fpath_c <- file.path(get_options("input_dir"),
                   "public/Geospatial", "gz_2010_us_050_00_500k/")

# read shapefiles
county_shape_file <-  rgdal::readOGR(fpath_c)
#> OGR data source with driver: ESRI Shapefile 
#> Source: "/Users/nak443/Documents/Naeem_folder_mac_h/Research_projects/F2021_001_Harvard/experiment_001_20211205_harvard_project/input_data/public/Geospatial/gz_2010_us_050_00_500k", layer: "gz_2010_us_050_00_500k"
#> with 3221 features
#> It has 6 fields

# transform shapefiles into sp object
county_shape_file <- spTransform(county_shape_file,
                                 CRS("+proj=longlat +datum=WGS84"))
non_c_states <- c("02","15","60","66","69","72","78")
cs_inland <- county_shape_file[!county_shape_file$STATE %in% non_c_states, ]

# Generate FIPS code 
cs_inland$FIPS <- paste(cs_inland$STATE,cs_inland$COUNTY,sep = "")

# select states in New England
NE_c <- cs_inland[cs_inland$STATE %in% c("9","23","25","33","44","50"),]
plot(NE_c)

The following table shows the list of the states and the number of counties per state.

cs_inland_val <- data.frame(cs_inland[, c("STATE","COUNTY", "NAME","FIPS")])
state_county <- cs_inland_val %>% group_by(STATE) %>% count()
state_fips <- read.csv(file.path(get_options("input_dir"),
                   "public/Geospatial", "FIPS_state_crosswalk.csv"))
state_fips$STATE <- sprintf("%02d", state_fips$FIPS)
state_fips$FIPS <- NULL
joined_data <- inner_join(state_county, state_fips, by="STATE")
state_name_county <- data.frame(joined_data$NAME, joined_data$POSTAL.CODE,
                                joined_data$STATE, joined_data$n)
names(state_name_county) <- c("Name", "Code", "FIPS", "Num of Counties")
state_name_county

In summary, in 2010, there are 49 contiguous states (48 + DC), and 3109 counties. The crosswalk file to convert FIPS code into State name is available from different resources including Natural Resources Conservation Service (2021).

Exposure Data

In this report, we are interested in the causal effect of air pollution on the mortality rate. The exposure parameter is \(PM_{2.5}\). Di et al. (2019) provided daily, and annual \(PM_{2.5}\) estimates at \(1\ km \times 1 \ km\) grid cells in the entire United States. The data can be downloaded from Di et al. (2021).


# Set up memoization
cd <- cachem::cache_disk(pr_cache)
m_match_exposure_to_sitecode <- memoise(match_exposure_to_sitecode, cache = cd)
m_map_point_shape <- memoise(map_point_shape, cache = cd)

# Create file path for each rds file and read data.
fpath <- file.path(get_options("input_dir"),"public/Di_2019","2010.rds")
usgpath <- file.path(get_options("input_dir"),"public/Di_2019","USGridSite.rds")

di_pm25_2010 <- readRDS(fpath)
us_grid <- readRDS(usgpath)

# match exposure to sitecode
pm25_2010 <- m_match_exposure_to_sitecode(site_code = us_grid,
                                          exp_data = di_pm25_2010,
                                          exp_name = "pm25")

# Generate FIPS code for the shapefile
cs_inland$FIPS <- paste(cs_inland$STATE,cs_inland$COUNTY,sep = "")

# Convert PM2.5 data frame to SpatialPoints data frame
coordinates(pm25_2010) <- ~Lon+Lat

# Join PM2.5 points with shapefile polygons
cs_inland_pm_2010 <- m_map_point_shape(shape_object = cs_inland, 
                                       point_object = pm25_2010, 
                                       value_name = "pm25",
                                       extra_fields_name = c("STATE","COUNTY",
                                                              "NAME","FIPS"), 
                                       group_field_name = "FIPS",
                                       field_na_drop = "STATE")

# Group and aggregate data for FIPS code level
pm_data <- cs_inland_pm_2010 %>%
           group_by(FIPS) %>%
           summarise(qd_mean_pm25 = mean(pm25))


# merge the new values with the shape file.
merged_obj <- merge(cs_inland, pm_data, by=c("FIPS"))

spplot(merged_obj, zcol = "qd_mean_pm25",
       col.regions=heat.colors(51, rev = TRUE),
       xlab="Longitude", ylab="Latitude",
       main="Mean PM2.5 in the Contiguous United States (2010)")

There are 3109 counties with exposure data. Number of counties with missing data: 0 counties.

Census Data

The main reference for getting the census data is the United States Census Bureau. There are numerous studies and surveys for different geographical resolutions. Reviewing these details is beyond the scope of this report; however, it is strongly recommended to get familiar with tables and their labels. The census bureau has a convenient API to download data. tidycensus is an R package that allows users to interface with a select number of the US Census Bureau data APIs and return tidyverse-ready data frames (Walker and Herman (2021)). The census data comes for different geographical resolutions. We download data for counties from acs5 source.

Here is the list of variables that are available in 2010 for acs5:

census_api_key(Sys.getenv("CENSUS_API_KEY"))
#> To install your API key for use in future sessions, run this function with `install = TRUE`.
census_vars = c("B01001_020", "B01001_021", "B01001_022", "B01001_023",
                "B01001_024", "B01001_025",
                "B01001_044", "B01001_045", "B01001_046", "B01001_047",
                "B01001_048", "B01001_049",
                "B17001_015", "B17001_016",
                "B17001_029", "B17001_030",
                "B01001I_014", "B01001I_015", "B01001I_016",
                "B01001I_029", "B01001I_030", "B01001I_031",
                "B01001B_014", "B01001B_015", "B01001B_016",
                "B01001B_029", "B01001B_030", "B01001B_031",
                "B01001H_014", "B01001H_015", "B01001H_016",
                "B01001H_029", "B01001H_030", "B01001H_031",
                "B01001C_014", "B01001C_015", "B01001C_016",
                "B01001C_029", "B01001C_030", "B01001C_031",
                "B01001D_014", "B01001D_015", "B01001D_016",
                "B01001D_029", "B01001D_030", "B01001D_031",
                "B15001_035", "B15001_036", "B15001_037",
                "B15001_076", "B15001_077", "B15001_078",
                "B19049_005",
                "B25077_001",
                "B01003_001")

v10 <- load_variables(2010, "acs5", cache = TRUE)
var_table <- v10[v10$name %in% census_vars,]
var_table$label <- sub("Estimate!!","",var_table$label)
var_table$label <- sub("Total!!","",var_table$label)
var_table

The mortality rate (CMS data) is based on Medicare data, mainly reported for patients over 65 years old. In the following we download the data and create the variables.


census_2010 <- get_acs(geography = "county", 
                       variables = census_vars,
                       year = 2010,
                       survey = "acs5",
                       output = "wide")
#> Getting data from the 2006-2010 5-year ACS

# Drop Margin of error column. 
census_2010_e <- census_2010 %>% select(-ends_with("M"))

census_2010_e$STATE <- substr(census_2010_e$GEOID, 1, 2)
census_2010_e$COUNTY <- substr(census_2010_e$GEOID, 3, 5)

inland_states <- state_county$STATE
census_2010_e_inland <- census_2010_e[census_2010_e$STATE %in% inland_states, ]

# add new variables
# add new variables
census_2010_e_m <- census_2010_e_inland %>%
  add_column(tmp_total_pop = (.$B01001_020E + .$B01001_021E + .$B01001_022E +
                              .$B01001_023E + .$B01001_024E + .$B01001_025E +
                              .$B01001_044E + .$B01001_045E + .$B01001_046E +
                              .$B01001_047E + .$B01001_048E + .$B01001_049E))

census_2010_e_m <- census_2010_e_m %>%
                   add_column(cs_poverty = (.$B17001_015E + .$B17001_016E +
                                            .$B17001_029E + .$B17001_030E)/
                                            (.$tmp_total_pop),
                              cs_hispanic = (.$B01001I_014E + .$B01001I_015E +
                                             .$B01001I_016E + .$B01001I_029E +
                                             .$B01001I_030E + .$B01001I_031E)/
                                            (.$tmp_total_pop),
                              cs_black = (.$B01001B_014E + .$B01001B_015E + 
                                          .$B01001B_016E + .$B01001B_029E + 
                                          .$B01001B_030E + .$B01001B_031E)/
                                         (.$tmp_total_pop),
                              cs_white = (.$B01001H_014E + .$B01001H_015E + 
                                          .$B01001H_016E + .$B01001H_029E + 
                                          .$B01001H_030E + .$B01001H_031E)/
                                         (.$tmp_total_pop),
                              cs_native = (.$B01001C_014E + .$B01001C_015E +
                                           .$B01001C_016E + .$B01001C_029E + 
                                           .$B01001C_030E + .$B01001C_031E)/
                                         (.$tmp_total_pop),
                              cs_asian = (.$B01001D_014E + .$B01001D_015E + 
                                          .$B01001D_016E + .$B01001D_029E + 
                                          .$B01001D_030E + .$B01001D_031E)/
                                         (.$tmp_total_pop),
                              cs_ed_below_highschool = (
                                .$B15001_036E + .$B15001_037E +
                                .$B15001_077E + .$B15001_078E)/ 
                                (.$B15001_035E + .$B15001_076E),
                              cs_household_income = .$B19049_005E,
                              cs_median_house_value = .$B25077_001E,
                              cs_total_population = .$B01003_001E
)

## Extract area of each county
tmp_census <- census_2010_e_m
colnames(tmp_census)[which(names(tmp_census) == "GEOID")] <- "FIPS"
tmp_obj <- merge(cs_inland, tmp_census, by=c("FIPS"))
fips_area <- data.frame(tmp_obj[, c("FIPS","CENSUSAREA")])
tmp_census <- tmp_obj <- NULL
colnames(fips_area)[which(names(fips_area) == "CENSUSAREA")] <- "cs_area"

# One county is missing data (FIPS: 48301). 
census_2010_e_m <- census_2010_e_m %>%
 mutate_all(~replace(., is.na(.), 0))

tmp <- rowSums(census_2010_e_m[, c("cs_hispanic", "cs_black", "cs_white",
                                  "cs_native", "cs_asian")])

census_2010_e_m$cs_other <- ifelse((1-tmp)<0, 0, 1-tmp)

# Drop initial columns
census_2010_processed <- census_2010_e_m %>%
                             select(!matches("B[0-9]+[A-Z]*_[0-9]+E"))

census_2010_processed$tmp_total_pop <- NULL

colnames(census_2010_processed)[which(names(census_2010_processed) == "GEOID")] <- "FIPS"

census_2010_processed$STATE <- NULL
census_2010_processed$COUNTY <- NULL

# Adding area of county to the data.
census_2010_processed <- merge(census_2010_processed, fips_area, by="FIPS")

census_2010_processed[census_2010_processed$FIPS == "48301",]

There is a problem with Loving County data in Texas. We use average of surrounding counties to impute missing values.


compute_cols <- c("cs_poverty", "cs_hispanic", "cs_black",
                  "cs_white",
                  "cs_native", "cs_asian",
                  "cs_household_income",
                  "cs_other", "cs_ed_below_highschool")

tmp_four <- census_2010_processed[census_2010_processed$FIPS %in% c("48495",
                                                                    "48475",
                                                                    "48389",
                                                                    "35025"),
                                  compute_cols]

tmp_mean <- colMeans(tmp_four)
tmp_mean[["cs_household_income"]] <- floor(tmp_mean[["cs_household_income"]])

for (item in compute_cols){
  census_2010_processed[census_2010_processed$FIPS=="48301", item] <- getElement(tmp_mean, item)
}

# compute population density
# https://www.socialexplorer.com/data/C2000/metadata/?ds=SE&var=T003_001
census_2010_processed["cs_population_density"] <- (
  census_2010_processed$cs_total_population/census_2010_processed$cs_area)

census_2010_processed["cs_log_population_density"] <- log10(census_2010_processed["cs_population_density"])
census_2010_processed["cs_log_total_population"] <- log10(census_2010_processed["cs_total_population"])

census_data <- census_2010_processed

# Take a look at data
# merge the new values with the shape file.
merged_obj <- merge(cs_inland, census_data, by=c("FIPS"))

spplot(merged_obj, zcol = "cs_median_house_value",
       col.regions=terrain.colors(51, rev = FALSE),
       xlab="Longitude", ylab="Latitude",
       main="Median House Value in the Contiguous United States (2010)")

spplot(merged_obj, zcol = "cs_area",
       col.regions=terrain.colors(51, rev = FALSE),
       xlab="Longitude", ylab="Latitude",
       main="Area of counties in the Contiguous US (2010) - mi^2")

spplot(merged_obj, zcol = "cs_log_total_population",
       col.regions=terrain.colors(51, rev = FALSE),
       xlab="Longitude", ylab="Latitude",
       main="Log10 of total population of each counties in the Contiguous US (2010)")

spplot(merged_obj, zcol = "cs_log_population_density",
       col.regions=terrain.colors(51, rev = FALSE),
       xlab="Longitude", ylab="Latitude",
       main="Log10 of population density in the Contiguous US (2010)")

There are 3109 counties’ data in the census data. The following shows the available column names and number of missing data per column in the census data.

sapply(census_data, function(x) sum(is.na(x)))
#>                   FIPS                   NAME             cs_poverty 
#>                      0                      0                      0 
#>            cs_hispanic               cs_black               cs_white 
#>                      0                      0                      0 
#>              cs_native               cs_asian cs_ed_below_highschool 
#>                      0                      0                      0 
#>    cs_household_income  cs_median_house_value    cs_total_population 
#>                      0                      0                      0 
#>               cs_other                cs_area  cs_population_density 
#>                      0                      0                      0

CDC Data

The Centers for Disease Control and Prevention (CDC), provides the Behavioral Risk Factor Surveillance System (Centers for Disease Control and Prevention (2021)), which is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors. In this report, we are interested in the participants’ body mass index (BMI) and smoking status. The BRFSS data also is provided at the county level. We download the data in SAS format from this webpage and load it into R using the haven R package (Wickham and Miller (2021)). The definition of variables is mentioned in this codebook.


brfss_2010 <- read_xpt(file.path(get_options("input_dir"),
                                 "public/CDC_data","CDBRFS10.XPT"))

varlist <- c("_STATE", "CTYCODE", "_BMI4","_SMOKER3")
brfss_2010_subset <- extract_brfss_vars(brfss_2010, varlist = varlist)
brfss_2010 <- NULL

# Modify column names
names(brfss_2010_subset) <- c("state","county","bmi","smoker")

# Polish vars
brfss_data_2010 <- polish_bfrss_vars(brfss_2010_subset)

Missing Values

The BRFSS data is a phone survey, as a result, there are many missing values. From entire CDC data, 7.49% of data has missing county values, and there are 696 counties without any reported values.

merged_obj <- merge(cs_inland, brfss_data_2010, by=c("FIPS"))
merged_obj$missing <- as.factor(ifelse(is.na(merged_obj$cdc_mean_bmi), 1, 0))

spplot(merged_obj, zcol = "missing",
       col.regions=c("snow","red"),
       col = "grey61",
       xlab="Longitude", ylab="Latitude",
       main="Counties with Missing CDC Data")

Discussing methods for imputing missing values is beyond the scope of this report. In this report, we generate a normal distribution of each parameter for each state and choose a value at random for counties with missing values.

brfss_data <- data.frame(
           merged_obj[, c("FIPS", "STATE", "COUNTY", "cdc_mean_bmi", 
                          "cdc_pct_cusmoker", "cdc_pct_sdsmoker", 
                          "cdc_pct_fmsmoker", "cdc_pct_nvsmoker", 
                          "cdc_pct_nnsmoker")])

# 9999 is assigned for those who answered the survey but refused to answer 
# specific questions, we treat them as missing values
brfss_data$cdc_mean_bmi <- ifelse(brfss_data$cdc_mean_bmi == 9999, NA,
                                  brfss_data$cdc_mean_bmi)

brfss_data <- impute_cdc(
  data = brfss_data,
  param_list = c("cdc_mean_bmi", "cdc_pct_cusmoker", 
                 "cdc_pct_sdsmoker","cdc_pct_fmsmoker",
                 "cdc_pct_nvsmoker",
                 "cdc_pct_nnsmoker"))

merged_obj <- merge(cs_inland, brfss_data, by=c("FIPS"))
spplot(merged_obj, zcol = "cdc_mean_bmi",
       col.regions=heat.colors(51, rev = TRUE),
       xlab="Longitude", ylab="Latitude",
       main="Mean Body Mass Index in the Contiguous United States (2010)")

GridMET data

Climatology Lab at the University of California, Merced, provides the GridMET data (Abatzoglou (2013)). The data set is daily surface meteorological data covering the contiguous United States. This report is interested in average annual and seasonal maximum temperature, average annual and seasonal maximum humidity, and annual and seasonal specific humidity. The Winter and Summer time line is according to the following:

Summer: June 1 - August 31
Winter: December 1 - February 28 (Or 29 for Leap years)

Data comes in NetCDF4 format. We use ncdf4 to load the data (Pierce (2019)).


cd <- cachem::cache_disk(pr_cache)
m_aggregate_netcdf_data <- memoise(aggregate_netcdf_data, cache = cd)
m_map_point_shape <- memoise(map_point_shape, cache = cd)

year <- 2010

# Temperature 
tmmx_path <- file.path(get_options("input_dir"), "public/gridmet_data",
                       paste0("tmmx_",year, ".nc"))
# Last year (ly)
tmmx_path_ly <- file.path(get_options("input_dir"),
                          "public/gridmet_data",
                          paste0("tmmx_",year-1,".nc"))

# Humidity 
rmax_path <- file.path(get_options("input_dir"), "public/gridmet_data",
                       paste0("rmax_",year, ".nc"))

rmax_path_ly <- file.path(get_options("input_dir"),
                          "public/gridmet_data",
                          paste0("rmax_",year-1,".nc"))

sph_path <- file.path(get_options("input_dir"), "public/gridmet_data",
                      paste0("sph_",year,".nc"))

sph_path_ly <- file.path(get_options("input_dir"),
                         "public/gridmet_data",
                         paste0("sph_",year-1,".nc"))

tmmx <- compile_gridmet_data(nc_path = tmmx_path, 
                             param_name = "air_temperature",
                             start_date = paste0(year,"-01-01"),
                             end_date = paste0(year,"-12-30"),
                             agg_fun = mean,
                             shape_obj = cs_inland,
                             extra_fields_name = c("STATE","COUNTY",
                                                   "NAME","FIPS"),
                             group_field_name = "FIPS",
                             field_na_drop = "STATE",
                             agg_field_name = "mean_tmmx")

tmmx_summer <- compile_gridmet_data(nc_path = tmmx_path,
                                    param_name = "air_temperature",
                                    start_date = paste0(year,"-06-01"),
                                    end_date = paste0(year, "-08-31"),
                                    agg_fun = mean,
                                    shape_obj = cs_inland,
                                    extra_fields_name = c("STATE","COUNTY",
                                                          "NAME","FIPS"),
                                    group_field_name = "FIPS",
                                    field_na_drop = "STATE",
                                    agg_field_name = "mean_summer_tmmx")

tmmx_winter <- compute_winter(path_cy = tmmx_path,
                              path_ly = tmmx_path_ly,
                              year_cy = year,
                              param_name = "air_temperature",
                              field_name = "winter_tmmx")

rmx <- compile_gridmet_data(nc_path = rmax_path, 
                            param_name = "relative_humidity",
                            start_date = paste0(year,"-01-01"),
                            end_date = paste0(year,"-12-30"),
                            agg_fun = mean,
                            shape_obj = cs_inland,
                            extra_fields_name = c("STATE","COUNTY",
                                                  "NAME","FIPS"),
                            group_field_name = "FIPS",
                            field_na_drop = "STATE",
                            agg_field_name = "mean_rmx")

rmx_summer <- compile_gridmet_data(nc_path = rmax_path,
                                   param_name = "relative_humidity",
                                   start_date = paste0(year,"-06-01"),
                                   end_date = paste0(year, "-08-31"),
                                   agg_fun = mean,
                                   shape_obj = cs_inland,
                                   extra_fields_name = c("STATE","COUNTY",
                                                         "NAME","FIPS"),
                                   group_field_name = "FIPS",
                                   field_na_drop = "STATE",
                                   agg_field_name = "mean_summer_rmx")

rmx_winter <- compute_winter(path_cy = rmax_path,
                             path_ly = rmax_path_ly,
                             year_cy = year,
                             param_name = "relative_humidity",
                             field_name = "winter_rmx")

sph <- compile_gridmet_data(nc_path = sph_path, 
                            param_name = "specific_humidity",
                            start_date = paste0(year,"-01-01"),
                            end_date = paste0(year,"-12-30"),
                            agg_fun = mean,
                            shape_obj = cs_inland,
                            extra_fields_name = c("STATE","COUNTY",
                                                  "NAME","FIPS"),
                            group_field_name = "FIPS",
                            field_na_drop = "STATE",
                            agg_field_name = "mean_sph")

sph_summer <- compile_gridmet_data(nc_path = sph_path,
                                   param_name = "specific_humidity",
                                   start_date = paste0(year,"-06-01"),
                                   end_date = paste0(year, "-08-31"),
                                   agg_fun = mean,
                                   shape_obj = cs_inland,
                                   extra_fields_name = c("STATE","COUNTY",
                                                         "NAME","FIPS"),
                                   group_field_name = "FIPS",
                                   field_na_drop = "STATE",
                                   agg_field_name = "mean_summer_sph")

sph_winter <- compute_winter(path_cy = sph_path,
                             path_ly = sph_path_ly,
                             year_cy = year,
                             param_name = "specific_humidity",
                             field_name = "winter_sph")

## Merge data
multi_merge <- function(x, y){
  df <- left_join(x, y, by= "FIPS")
  return(df)
}

df_gridmet <- Reduce(multi_merge, list(tmmx,
                                       tmmx_summer,
                                       tmmx_winter,
                                       rmx,
                                       rmx_summer,
                                       rmx_winter,
                                       sph,
                                       sph_summer,
                                       sph_winter))

cn <- colnames(df_gridmet)
colnames(df_gridmet) <- c(cn[1], paste("gmet_", cn[2:10], sep = ""))

There are 3105 counties with GridMET data. Number of missing counties: 4 counties.

Missing Values

If we aggregate data and take a look at the difference between GridMET data and Initial Counties:

df_gridmet$STATE <- substr(df_gridmet$FIPS, 1, 2)
df_gridmet$COUNTY <- substr(df_gridmet$FIPS, 3, 5)

county_per_state_gridmet <- df_gridmet %>%
                            group_by(STATE) %>% count()

setdiff(county_per_state_gridmet, state_county)

According to the results, number of counties for state of Virginia (FIPS = 51) is different with original shape files.

Number of counties for state of Virginia: 134
Number of counties for state of Virginia in the GridMET data: 130

Let’s print out the counties that are not in the GridMET data.

gr_missing_counties <- setdiff(cs_inland_val$COUNTY, df_gridmet$COUNTY)
cs_inland_val[cs_inland_val$STATE == 51 & cs_inland_val$COUNTY %in% gr_missing_counties, ]

We can take a look at these counties on the map:

vr_inland <- cs_inland[cs_inland$STATE==51,]
vr_inland$missing <- ifelse(vr_inland$COUNTY %in% gr_missing_counties, 1, 0)
vr_inland$missing <- as.factor(vr_inland$missing)
spplot(vr_inland, zcol = "missing",
       col.regions=c("snow","red"),
       col = "grey61",
       xlab="Longitude", ylab="Latitude",
       main="Counties in the State of Virginia with Missing GridMET Values")

As one can see, it seems there are counties inside other counties, and because of small sizes none of the GridMET grid points was located inside those counties. With further research, we learn that these are independent cities located in the Commonwealth of Virginia. For the purpose of this report we assign the surrounding counties values for this counties.

Covington (580) –> Alleghany (005)
Lexington (678) –> Rockbridge (163)
Falls Church (610) –> Fairfax (059)
Bedford (515) –> Bedford (019)


assign_data <- function(data, from_fips, to_fips){
  tmp <- data[data$FIPS == from_fips,]
  tmp$FIPS <- to_fips
  tmp$STATE <- substr(to_fips, 1, 2)
  tmp$COUNTY <- substr(to_fips, 3, 5)
  return(tmp)
}

gridmet_data <- rbind(df_gridmet,
                      assign_data(df_gridmet, 51005, 51580),
                      assign_data(df_gridmet, 51163, 51678),
                      assign_data(df_gridmet, 51059, 51610),
                      assign_data(df_gridmet, 51019, 51515))

gridmet_data <- gridmet_data[, !colnames(gridmet_data) %in% c("STATE","COUNTY")]

# merge the new values with the shape file.
merged_obj <- merge(cs_inland, gridmet_data, by=c("FIPS"))

There are 3109 counties with GridMET data. Number of missing counties: 0 counties.

spplot(merged_obj, zcol = "gmet_mean_winter_tmmx",
       col.regions=topo.colors(51, rev = FALSE),
       xlab="Longitude", ylab="Latitude",
       main="Mean max temp. during winter in the Contiguous US (2009-10)")

spplot(merged_obj, zcol = "gmet_mean_summer_tmmx",
       col.regions=topo.colors(51, rev = FALSE),
       xlab="Longitude", ylab="Latitude",
       main="Mean max temp. during summer in the Contiguous US (2010)")

CMS Data

Centers for Medicare and Medicaid Services(CMS) provides synthetic data at the county level for 2008-2010 (Centers for Medicare & Medicaid Services (2021)). CMS data comes in SSA code that is different than FIPS code. We use SSA to Federal Information Processing Series (FIPS) State and County Crosswalk file to convert SSA code to FIPS code (National Bureau of Economic Research (2021)).


# Add functions to memoization
cd <- cachem::cache_disk(pr_cache)
m_extract_CMS_data <- memoise(extract_CMS_data, cache = cd)
m_aggregate_cms_data <- memoise(aggregate_cms_data, cache = cd)

# Data directory
# Combine fst data.
CMS_DATA_DIR <- file.path(get_options("input_dir"),"public/cms_data")

ssa2fips_data <- read.csv(file.path(get_options("input_dir"),"public","cms_data","ssa_fips_state_county2011.csv"))

ssa2fips_data$cbsa <- NULL
ssa2fips_data$cbsaname <- NULL

# remove empty rows
ssa2fips_data <- ssa2fips_data[!is.na(ssa2fips_data$ssacounty),]

ssa2fips_data$FIPS <- sprintf("%05d", ssa2fips_data$fipscounty)
ssa2fips_data$SSA <- sprintf("%05d", ssa2fips_data$ssacounty)

ssa2fips_cross <- ssa2fips_data[, c("SSA","FIPS")]
cms_2010 <- m_extract_CMS_data(CMS_DATA_DIR, 2010)
cms_2010_fips <- merge(cms_2010, ssa2fips_cross, by = "SSA")
agg_data_cms_2010 <- m_aggregate_cms_data(cms_2010_fips)

# select inland data
agg_data_cms_2010$STATE <- substr(agg_data_cms_2010$FIPS, 1, 2)
agg_data_cms_2010$COUNTY <- substr(agg_data_cms_2010$FIPS, 3, 5)
inland_states <- state_county$STATE
agg_data_cms_2010 <- agg_data_cms_2010[agg_data_cms_2010$STATE %in% inland_states, ]

agg_data_cms_2010 <- agg_data_cms_2010[, !colnames(agg_data_cms_2010) %in% c("STATE","COUNTY")]

There are 3054 counties with CMS data. Number of missing counties: 55 counties.

Missing Values

The following figure represents counties with missing values.

merged_obj <- merge(cs_inland, agg_data_cms_2010, by=c("FIPS"))
merged_obj$missing <- ifelse(is.na(merged_obj$cms_mortality_pct), 1, 0)
merged_obj$missing <- as.factor(merged_obj$missing)

spplot(merged_obj, zcol = "missing",
       col.regions=c("snow","red"),
       col = "grey61",
       xlab="Longitude", ylab="Latitude",
       main="Counties with Missing CMS Data")


# Number of missing values by state
cms_attr <- c("cms_mortality_pct", "cms_white_pct", "cms_black_pct",
              "cms_others_pct", "cms_hispanic_pct", "cms_female_pct")

We use median value of state for imputing missing values.

cms_data <- data.frame(merged_obj[, c("STATE","COUNTY", "FIPS", "missing",
                                         cms_attr)])

for (item in cms_attr){
cms_data <- cms_data %>% 
               group_by(STATE) %>% 
               mutate(new_val=ifelse(is.na(.data[[item]]),
                      median(.data[[item]], na.rm=TRUE),
                      .data[[item]]))

cnames <- colnames(cms_data)
cms_data <- cms_data[ , !(names(cms_data) %in% c(item))]
colnames(cms_data)[which(cnames=="new_val")-1] <- item
}

cms_data <- cms_data[, !colnames(cms_data) %in% c("STATE","COUNTY","missing")]

merged_obj <- merge(cs_inland, cms_data, by=c("FIPS"))
spplot(merged_obj, zcol = "cms_mortality_pct",
       col.regions=heat.colors(51, rev = TRUE),
       xlab="Longitude", ylab="Latitude",
       main="Mortality rate in the Contiguous United States (2010)")

Compile Data

We generated county-level data for each resources for 2010. Now we need to join all data based on FIPS code.


study_dataset <- Reduce(multi_merge, list(pm_data,
                                          census_data,
                                          brfss_data,
                                          gridmet_data,
                                          cms_data))

Number of data samples: 3109.
Number of missing value: 0.

sapply(study_dataset, function(x) sum(is.na(x)))
#>                   FIPS           qd_mean_pm25                   NAME 
#>                      0                      0                      0 
#>             cs_poverty            cs_hispanic               cs_black 
#>                      0                      0                      0 
#>               cs_white              cs_native               cs_asian 
#>                      0                      0                      0 
#> cs_ed_below_highschool    cs_household_income  cs_median_house_value 
#>                      0                      0                      0 
#>    cs_total_population               cs_other                cs_area 
#>                      0                      0                      0 
#>  cs_population_density                  STATE                 COUNTY 
#>                      0                      0                      0 
#>           cdc_mean_bmi       cdc_pct_cusmoker       cdc_pct_sdsmoker 
#>                      0                      0                      0 
#>       cdc_pct_fmsmoker       cdc_pct_nvsmoker       cdc_pct_nnsmoker 
#>                      0                      0                      0 
#>         gmet_mean_tmmx  gmet_mean_summer_tmmx  gmet_mean_winter_tmmx 
#>                      0                      0                      0 
#>          gmet_mean_rmx   gmet_mean_summer_rmx   gmet_mean_winter_rmx 
#>                      0                      0                      0 
#>          gmet_mean_sph   gmet_mean_summer_sph   gmet_mean_winter_sph 
#>                      0                      0                      0 
#>      cms_mortality_pct          cms_white_pct          cms_black_pct 
#>                      0                      0                      0 
#>         cms_others_pct       cms_hispanic_pct         cms_female_pct 
#>                      0                      0                      0

Add Regions

We also categorize the states into four different regions:

North East
South
Midwest
West


state_fips <- state_fips[, !colnames(state_fips) %in% c("FIPS","NAME")]
state_fips <- rename(state_fips, STATE_CODE = POSTAL.CODE)
setDT(state_fips)

# define regions
NORTHEAST=c("NY","MA","PA","RI","NH","ME","VT","CT","NJ")  
SOUTH=c("DC","VA","NC","WV","KY","SC","GA","FL","AL","TN","MS","AR","MD","DE",
        "OK","TX","LA")
MIDWEST=c("OH","IN","MI","IA","MO","WI","MN","SD","ND","IL","KS","NE")
WEST=c("MT","CO","WY","ID","UT","NV","CA","OR","WA","AZ","NM")

state_fips[STATE_CODE %in% NORTHEAST, region := "NORTHEAST"]
state_fips[STATE_CODE %in% SOUTH, region := "SOUTH"]
state_fips[STATE_CODE %in% MIDWEST, region := "MIDWEST"]
state_fips[STATE_CODE %in% WEST, region := "WEST"]

study_dataset <- full_join(study_dataset, state_fips, by="STATE")

merged_obj <- merge(cs_inland, study_dataset, by=c("FIPS"))
merged_obj$region <- as.factor(merged_obj$region)
spplot(merged_obj, zcol = "region",
       col.regions=c("lightgoldenrodyellow","lightcoral", "yellowgreen",
                     "lavender"),
       xlab="Longitude", ylab="Latitude",
       main="Study Regions")

References

Abatzoglou, John T. 2013. “Development of Gridded Surface Meteorological Data for Ecological Applications and Modelling.” International Journal of Climatology 33 (1): 121–31. https://doi.org/10.1002/joc.3413.

Centers for Disease Control and Prevention. 2021. “Behavioral Risk Factor Surveillance System.” https://www.cdc.gov/brfss/annual_data/annual_2010.htm.

Centers for Medicare & Medicaid Services. 2021. “CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF).” https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.

Di, Qian, Heresh Amini, Liuhua Shi, Itai Kloog, Rachel Silvern, James Kelly, M Benjamin Sabath, et al. 2019. “An Ensemble-Based Model of Pm2. 5 Concentration Across the Contiguous United States with High Spatiotemporal Resolution.” Environment International 130: 104909. https://doi.org/10.1016/j.envint.2019.104909.

Di, Qian, Yaguang Wei, Alexandra Shtein, Carolynne Hultquist, Xiaoshi Xing, Heresh Amini, Liuhua Shi, et al. 2021. “Daily and Annual Pm2.5 Concentrations for the Contiguous United States, 1-Km Grids, V1 (2000 - 2016).” NASA Socioeconomic Data; Applications Center (SEDAC). https://doi.org/10.7927/0rvr-4538.

Environmental Systems Research Institute. 2021. “What Are FIPS Codes?” https://support.esri.com/en/technical-article/000002594.

National Bureau of Economic Research. 2021. “SSA to Federal Information Processing Series (FIPS) State and County Crosswalk).” https://www.nber.org/research/data/ssa-federal-information-processing-series-fips-state-and-county-crosswalk.

Natural Resources Conservation Service. 2021. “State FIPS Codes.” https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696.

Pierce, David. 2019. Ncdf4: Interface to Unidata netCDF (Version 4 or Earlier) Format Data Files. https://CRAN.R-project.org/package=ncdf4.

United States Census Bureau. 2021. “Cartographic Boundary Files - Shapefile.” https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.2010.html.

Walker, Kyle, and Matt Herman. 2021. Tidycensus: Load US Census Boundary and Attribute Data as ’Tidyverse’ and ’Sf’-Ready Data Frames. https://CRAN.R-project.org/package=tidycensus.

Wickham, Hadley, and Evan Miller. 2021. Haven: Import and Export ’SPSS’, ’Stata’ and ’SAS’ Files. https://CRAN.R-project.org/package=haven.

Synthetic Medicare Data for Environmental Health Studies

Naeem Khoshnevis, Xiao Wu, Danielle Braun

2022-01-31

Summary

Data

Exposure Data

Census Data

CDC Data

Missing Values

GridMET data

Missing Values

CMS Data

Missing Values

Compile Data

Add Regions

References