Chapter 3 Data transformation

The transformation and pre-processing of our data is performed in python. Detailed code can be access here: data transformation notebook

3.1 Epidemiology Table

For the epidemiology data, the raw data is consist of both country level and US state level datapoints. So we divided it into two tables, one for global country level and the other for the US state level, and we keep only the most relevant columns. The preprocessing steps is as following:

  • Filter column location_key. If it is a key corresponding to a country, we add the row into our country-level table; it it is a US state key, we add the row into our state-level table.
  • Select columns: date, location_key, new_confirmed, new_deceased, cumulative_confirmed, cumulative_deceased

After cleaning, the country-level table has 163609 rows and 6 columns while the state-level table has 38537 rows and 6 columns.

Country-level table:

## # A tibble: 163,609 × 7
##    date       location_key country_name new_confirmed new_deceased
##    <date>     <chr>        <chr>                <dbl>        <dbl>
##  1 2020-01-01 AD           Andorra                  0            0
##  2 2020-01-02 AD           Andorra                  0            0
##  3 2020-01-03 AD           Andorra                  0            0
##  4 2020-01-04 AD           Andorra                  0            0
##  5 2020-01-05 AD           Andorra                  0            0
##  6 2020-01-06 AD           Andorra                  0            0
##  7 2020-01-07 AD           Andorra                  0            0
##  8 2020-01-08 AD           Andorra                  0            0
##  9 2020-01-09 AD           Andorra                  0            0
## 10 2020-01-10 AD           Andorra                  0            0
## # … with 163,599 more rows, and 2 more variables: cumulative_confirmed <dbl>,
## #   cumulative_deceased <dbl>

State-level table:

## # A tibble: 38,537 × 7
##    date       location_key state_name new_confirmed new_deceased
##    <date>     <chr>        <chr>              <dbl>        <dbl>
##  1 2020-01-22 US_AK        Alaska                 0            0
##  2 2020-01-23 US_AK        Alaska                 0            0
##  3 2020-01-24 US_AK        Alaska                 0            0
##  4 2020-01-25 US_AK        Alaska                 0            0
##  5 2020-01-26 US_AK        Alaska                 0            0
##  6 2020-01-27 US_AK        Alaska                 0            0
##  7 2020-01-28 US_AK        Alaska                 0            0
##  8 2020-01-29 US_AK        Alaska                 0            0
##  9 2020-01-30 US_AK        Alaska                 0            0
## 10 2020-01-31 US_AK        Alaska                 0            0
## # … with 38,527 more rows, and 2 more variables: cumulative_confirmed <dbl>,
## #   cumulative_deceased <dbl>

Description of each column is as following (The country-level and the state-level table has the same structure):

Column Description
date The date of the record
location_key The code corresponding to the country/state
new_confirmed # of new Covid confirmed cases of the date in the country/state
new_deceased # of new Covid deceased cases of the date in the country/state
cumulative_confirmed # of cumulative Covid confirmed cases of the date in the country/state
cumulative_deceased # of cumulative Covid deceased cases of the date in the country/state

3.2 Economy Table

The raw data of the economy table are country-level economic statistics. We extracted only the same set of countries as in other tables to keep consistency across tables. The resulting table has 203 rows and three columns.

## # A tibble: 202 × 4
##    location_key country_name              gdp_usd gdp_per_capita_usd
##    <chr>        <chr>                       <dbl>              <dbl>
##  1 AD           Andorra                3154057987              40886
##  2 AE           United Arab Emirates 421142267937              43103
##  3 AF           Afghanistan           19101353832                502
##  4 AG           Antigua and Barbuda    1727759259              17790
##  5 AL           Albania               15278077446               5352
##  6 AM           Armenia               13672802157               4622
##  7 AO           Angola                94635415869               2973
##  8 AR           Argentina            449663446954              10006
##  9 AS           American Samoa          636000000              11466
## 10 AT           Austria              455140042000              51448
## # … with 192 more rows

Description of each column is as following (The country-level and the state-level table has the same structure):

Column Description
location_key The code corresponding to the country
gdp_usd Gross domestic product of the country in US dollars
gdp_per_capita_usd Gross domestic product of the country in US dollars divided by its total population

3.3 First Mitigation Date Table

The raw data are a table recording the mitigation policies and emergency declarations status of each state in the US. We derive the date that the first mitigation policy is published in each state from the raw data and formulate a new table of 51 rows and 2 columns.

## # A tibble: 51 × 3
##    location_key state_name           first_mitigate_date
##    <chr>        <chr>                <date>             
##  1 US_AK        Alaska               2020-03-09         
##  2 US_AL        Alabama              2020-03-13         
##  3 US_AR        Arkansas             2020-03-11         
##  4 US_AZ        Arizona              2020-03-11         
##  5 US_CA        California           2020-03-04         
##  6 US_CO        Colorado             2020-03-11         
##  7 US_CT        Connecticut          2020-03-10         
##  8 US_DC        District of Columbia 2020-03-11         
##  9 US_DE        Delaware             2020-03-13         
## 10 US_FL        Florida              2020-03-01         
## # … with 41 more rows

Description of each column is as following:

Column Description
state_code The code corresponding to the state (e.g. US_CA)
first_mitigate_date the date that the first mitigation policy is published in the state

3.4 Health Table

The raw data give several health indicators such as smoke prevalence in different regions. As country-level and state-level information are mixed together in the raw data, we need to separate them into two tables.

Preprocessing steps:

  • Filter column location_key. If it is a key corresponding to a country, we add the row into our country-level table; if it is a US state key, we add the row into our state-level table.
  • As the state-level information only contains one health indicator, life_expectancy, and all other columns are all NAs.

Country-level table:

## # A tibble: 210 × 15
##    location_key country_name   life_expectancy smoking_prevale… diabetes_preval…
##    <chr>        <chr>                    <dbl>            <dbl>            <dbl>
##  1 AD           Andorra                   NA               33.5              7.7
##  2 AE           United Arab E…            77.8             28.9             16.3
##  3 AF           Afghanistan               64.5             NA                9.2
##  4 AG           Antigua and B…            76.9             NA               13.1
##  5 AL           Albania                   78.9             28.7              9  
##  6 AM           Armenia                   74.9             24.1              6.1
##  7 AO           Angola                    60.8             NA                4.5
##  8 AR           Argentina                 76.5             21.8              5.9
##  9 AT           Austria                   81.8             29.6              6.6
## 10 AU           Australia                 82.7             14.7              5.6
## # … with 200 more rows, and 10 more variables: infant_mortality_rate <dbl>,
## #   adult_male_mortality_rate <dbl>, adult_female_mortality_rate <dbl>,
## #   pollution_mortality_rate <dbl>, comorbidity_mortality_rate <dbl>,
## #   hospital_beds_per_1000 <dbl>, nurses_per_1000 <dbl>,
## #   physicians_per_1000 <dbl>, health_expenditure_usd <dbl>,
## #   out_of_pocket_health_expenditure_usd <dbl>

State-level table:

## # A tibble: 53 × 2
##    location_key life_expectancy
##    <chr>                  <dbl>
##  1 US_AK                   78  
##  2 US_AL                   75.1
##  3 US_AR                   75.6
##  4 US_AZ                   78.7
##  5 US_CA                   80.8
##  6 US_CO                   80  
##  7 US_CT                   80.4
##  8 US_DE                   77.8
##  9 US_FL                   78.9
## 10 US_GA                   77.2
## # … with 43 more rows

3.5 Mobility Table

Only state-level information are available in the raw data. After cleaning the data and select only the valid state code, we generate a table of 33762 rows and 8 columns.

Description of each column is as following:

Column Description
date The date of the data being recorded
location_key The code corresponding to the country/state
mobility_retail_and_recreation percentage change in visits to retail and recreation
mobility_grocery_and_pharmacy percentage change in visits to grocery and pharmacy
mobility_parks percentage change in visits to parks
mobility_transit_stations percentage change in visits to transit stations
mobility_workplaces percentage change in visits to workplaces
mobility_residential percentage change in visits to residential areas

3.6 Vaccination Table

The raw data of vaccination records also have country-level and state-level mixed together. So we separated them into two tables. Moreover, we found out that country-level records do no have information about vaccine of different providers (e.g. Pfizer, Moderna, Janssen), so we remove these columns from the country-level table.

## # A tibble: 34,202 × 7
##    date       location_key country_name new_persons_vaccinated cumulative_perso…
##    <date>     <chr>        <chr>                         <dbl>             <dbl>
##  1 2021-01-25 AD           Andorra                          NA               576
##  2 2021-02-01 AD           Andorra                         460              1036
##  3 2021-02-10 AD           Andorra                         255              1291
##  4 2021-02-12 AD           Andorra                         331              1622
##  5 2021-02-19 AD           Andorra                         519              2141
##  6 2021-02-24 AD           Andorra                         249              2390
##  7 2021-02-26 AD           Andorra                           0                NA
##  8 2021-03-08 AD           Andorra                          49              2439
##  9 2021-03-10 AD           Andorra                        1211              3650
## 10 2021-03-15 AD           Andorra                        3448              7098
## # … with 34,192 more rows, and 2 more variables:
## #   new_persons_fully_vaccinated <dbl>,
## #   cumulative_persons_fully_vaccinated <dbl>
## # A tibble: 18,471 × 13
##    date       location_key state_name new_persons_vaccinated cumulative_persons…
##    <date>     <chr>        <chr>                       <dbl>               <dbl>
##  1 2021-01-12 US_AK        Alaska                         NA               22486
##  2 2021-01-13 US_AK        Alaska                       2055               24541
##  3 2021-01-14 US_AK        Alaska                      10967               35508
##  4 2021-01-15 US_AK        Alaska                       4038               39546
##  5 2021-01-19 US_AK        Alaska                      15888               55434
##  6 2021-01-20 US_AK        Alaska                       1477               56911
##  7 2021-01-21 US_AK        Alaska                       2538               59449
##  8 2021-01-22 US_AK        Alaska                       7813               67262
##  9 2021-01-23 US_AK        Alaska                       5752               73014
## 10 2021-01-24 US_AK        Alaska                       4911               77925
## # … with 18,461 more rows, and 8 more variables:
## #   new_persons_fully_vaccinated <dbl>,
## #   cumulative_persons_fully_vaccinated <dbl>,
## #   new_persons_fully_vaccinated_pfizer <dbl>,
## #   cumulative_persons_fully_vaccinated_pfizer <dbl>,
## #   new_persons_fully_vaccinated_moderna <dbl>,
## #   cumulative_persons_fully_vaccinated_moderna <dbl>, …