Chapter 3 Data transformation
The transformation and pre-processing of our data is performed in python. Detailed code can be access here: data transformation notebook
3.1 Epidemiology Table
For the epidemiology data, the raw data is consist of both country level and US state level datapoints. So we divided it into two tables, one for global country level and the other for the US state level, and we keep only the most relevant columns. The preprocessing steps is as following:
- Filter column
location_key. If it is a key corresponding to a country, we add the row into our country-level table; it it is a US state key, we add the row into our state-level table. - Select columns:
date,location_key,new_confirmed,new_deceased,cumulative_confirmed,cumulative_deceased
After cleaning, the country-level table has 163609 rows and 6 columns while the state-level table has 38537 rows and 6 columns.
Country-level table:
## # A tibble: 163,609 × 7
## date location_key country_name new_confirmed new_deceased
## <date> <chr> <chr> <dbl> <dbl>
## 1 2020-01-01 AD Andorra 0 0
## 2 2020-01-02 AD Andorra 0 0
## 3 2020-01-03 AD Andorra 0 0
## 4 2020-01-04 AD Andorra 0 0
## 5 2020-01-05 AD Andorra 0 0
## 6 2020-01-06 AD Andorra 0 0
## 7 2020-01-07 AD Andorra 0 0
## 8 2020-01-08 AD Andorra 0 0
## 9 2020-01-09 AD Andorra 0 0
## 10 2020-01-10 AD Andorra 0 0
## # … with 163,599 more rows, and 2 more variables: cumulative_confirmed <dbl>,
## # cumulative_deceased <dbl>
State-level table:
## # A tibble: 38,537 × 7
## date location_key state_name new_confirmed new_deceased
## <date> <chr> <chr> <dbl> <dbl>
## 1 2020-01-22 US_AK Alaska 0 0
## 2 2020-01-23 US_AK Alaska 0 0
## 3 2020-01-24 US_AK Alaska 0 0
## 4 2020-01-25 US_AK Alaska 0 0
## 5 2020-01-26 US_AK Alaska 0 0
## 6 2020-01-27 US_AK Alaska 0 0
## 7 2020-01-28 US_AK Alaska 0 0
## 8 2020-01-29 US_AK Alaska 0 0
## 9 2020-01-30 US_AK Alaska 0 0
## 10 2020-01-31 US_AK Alaska 0 0
## # … with 38,527 more rows, and 2 more variables: cumulative_confirmed <dbl>,
## # cumulative_deceased <dbl>
Description of each column is as following (The country-level and the state-level table has the same structure):
| Column | Description |
|---|---|
| date | The date of the record |
| location_key | The code corresponding to the country/state |
| new_confirmed | # of new Covid confirmed cases of the date in the country/state |
| new_deceased | # of new Covid deceased cases of the date in the country/state |
| cumulative_confirmed | # of cumulative Covid confirmed cases of the date in the country/state |
| cumulative_deceased | # of cumulative Covid deceased cases of the date in the country/state |
3.2 Economy Table
The raw data of the economy table are country-level economic statistics. We extracted only the same set of countries as in other tables to keep consistency across tables. The resulting table has 203 rows and three columns.
## # A tibble: 202 × 4
## location_key country_name gdp_usd gdp_per_capita_usd
## <chr> <chr> <dbl> <dbl>
## 1 AD Andorra 3154057987 40886
## 2 AE United Arab Emirates 421142267937 43103
## 3 AF Afghanistan 19101353832 502
## 4 AG Antigua and Barbuda 1727759259 17790
## 5 AL Albania 15278077446 5352
## 6 AM Armenia 13672802157 4622
## 7 AO Angola 94635415869 2973
## 8 AR Argentina 449663446954 10006
## 9 AS American Samoa 636000000 11466
## 10 AT Austria 455140042000 51448
## # … with 192 more rows
Description of each column is as following (The country-level and the state-level table has the same structure):
| Column | Description |
|---|---|
| location_key | The code corresponding to the country |
| gdp_usd | Gross domestic product of the country in US dollars |
| gdp_per_capita_usd | Gross domestic product of the country in US dollars divided by its total population |
3.3 First Mitigation Date Table
The raw data are a table recording the mitigation policies and emergency declarations status of each state in the US. We derive the date that the first mitigation policy is published in each state from the raw data and formulate a new table of 51 rows and 2 columns.
## # A tibble: 51 × 3
## location_key state_name first_mitigate_date
## <chr> <chr> <date>
## 1 US_AK Alaska 2020-03-09
## 2 US_AL Alabama 2020-03-13
## 3 US_AR Arkansas 2020-03-11
## 4 US_AZ Arizona 2020-03-11
## 5 US_CA California 2020-03-04
## 6 US_CO Colorado 2020-03-11
## 7 US_CT Connecticut 2020-03-10
## 8 US_DC District of Columbia 2020-03-11
## 9 US_DE Delaware 2020-03-13
## 10 US_FL Florida 2020-03-01
## # … with 41 more rows
Description of each column is as following:
| Column | Description |
|---|---|
| state_code | The code corresponding to the state (e.g. US_CA) |
| first_mitigate_date | the date that the first mitigation policy is published in the state |
3.4 Health Table
The raw data give several health indicators such as smoke prevalence in different regions. As country-level and state-level information are mixed together in the raw data, we need to separate them into two tables.
Preprocessing steps:
- Filter column
location_key. If it is a key corresponding to a country, we add the row into our country-level table; if it is a US state key, we add the row into our state-level table. - As the state-level information only contains one health indicator,
life_expectancy, and all other columns are all NAs.
Country-level table:
## # A tibble: 210 × 15
## location_key country_name life_expectancy smoking_prevale… diabetes_preval…
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD Andorra NA 33.5 7.7
## 2 AE United Arab E… 77.8 28.9 16.3
## 3 AF Afghanistan 64.5 NA 9.2
## 4 AG Antigua and B… 76.9 NA 13.1
## 5 AL Albania 78.9 28.7 9
## 6 AM Armenia 74.9 24.1 6.1
## 7 AO Angola 60.8 NA 4.5
## 8 AR Argentina 76.5 21.8 5.9
## 9 AT Austria 81.8 29.6 6.6
## 10 AU Australia 82.7 14.7 5.6
## # … with 200 more rows, and 10 more variables: infant_mortality_rate <dbl>,
## # adult_male_mortality_rate <dbl>, adult_female_mortality_rate <dbl>,
## # pollution_mortality_rate <dbl>, comorbidity_mortality_rate <dbl>,
## # hospital_beds_per_1000 <dbl>, nurses_per_1000 <dbl>,
## # physicians_per_1000 <dbl>, health_expenditure_usd <dbl>,
## # out_of_pocket_health_expenditure_usd <dbl>
State-level table:
## # A tibble: 53 × 2
## location_key life_expectancy
## <chr> <dbl>
## 1 US_AK 78
## 2 US_AL 75.1
## 3 US_AR 75.6
## 4 US_AZ 78.7
## 5 US_CA 80.8
## 6 US_CO 80
## 7 US_CT 80.4
## 8 US_DE 77.8
## 9 US_FL 78.9
## 10 US_GA 77.2
## # … with 43 more rows
3.5 Mobility Table
Only state-level information are available in the raw data. After cleaning the data and select only the valid state code, we generate a table of 33762 rows and 8 columns.
Description of each column is as following:
| Column | Description |
|---|---|
| date | The date of the data being recorded |
| location_key | The code corresponding to the country/state |
| mobility_retail_and_recreation | percentage change in visits to retail and recreation |
| mobility_grocery_and_pharmacy | percentage change in visits to grocery and pharmacy |
| mobility_parks | percentage change in visits to parks |
| mobility_transit_stations | percentage change in visits to transit stations |
| mobility_workplaces | percentage change in visits to workplaces |
| mobility_residential | percentage change in visits to residential areas |
3.6 Vaccination Table
The raw data of vaccination records also have country-level and state-level mixed together. So we separated them into two tables. Moreover, we found out that country-level records do no have information about vaccine of different providers (e.g. Pfizer, Moderna, Janssen), so we remove these columns from the country-level table.
## # A tibble: 34,202 × 7
## date location_key country_name new_persons_vaccinated cumulative_perso…
## <date> <chr> <chr> <dbl> <dbl>
## 1 2021-01-25 AD Andorra NA 576
## 2 2021-02-01 AD Andorra 460 1036
## 3 2021-02-10 AD Andorra 255 1291
## 4 2021-02-12 AD Andorra 331 1622
## 5 2021-02-19 AD Andorra 519 2141
## 6 2021-02-24 AD Andorra 249 2390
## 7 2021-02-26 AD Andorra 0 NA
## 8 2021-03-08 AD Andorra 49 2439
## 9 2021-03-10 AD Andorra 1211 3650
## 10 2021-03-15 AD Andorra 3448 7098
## # … with 34,192 more rows, and 2 more variables:
## # new_persons_fully_vaccinated <dbl>,
## # cumulative_persons_fully_vaccinated <dbl>
## # A tibble: 18,471 × 13
## date location_key state_name new_persons_vaccinated cumulative_persons…
## <date> <chr> <chr> <dbl> <dbl>
## 1 2021-01-12 US_AK Alaska NA 22486
## 2 2021-01-13 US_AK Alaska 2055 24541
## 3 2021-01-14 US_AK Alaska 10967 35508
## 4 2021-01-15 US_AK Alaska 4038 39546
## 5 2021-01-19 US_AK Alaska 15888 55434
## 6 2021-01-20 US_AK Alaska 1477 56911
## 7 2021-01-21 US_AK Alaska 2538 59449
## 8 2021-01-22 US_AK Alaska 7813 67262
## 9 2021-01-23 US_AK Alaska 5752 73014
## 10 2021-01-24 US_AK Alaska 4911 77925
## # … with 18,461 more rows, and 8 more variables:
## # new_persons_fully_vaccinated <dbl>,
## # cumulative_persons_fully_vaccinated <dbl>,
## # new_persons_fully_vaccinated_pfizer <dbl>,
## # cumulative_persons_fully_vaccinated_pfizer <dbl>,
## # new_persons_fully_vaccinated_moderna <dbl>,
## # cumulative_persons_fully_vaccinated_moderna <dbl>, …