Chapter 2 Data sources
The primary data sources for this project include Wikidata, DataCommons, Eurostat, WorldBank, WorldPop, the LawAtlas Project, the University of Oxford, and Google LLC, among others. Because this project involves large amounts of data across countries around the world as well as all states within the USA, we used the aggregated COVID-19 Open Dataset. We verified that this aggregated dataset does not clean or modify the data from the original sources.
From the above data source, we collected and used six datasets: economic indicators by country, COVID-19 case records, government emergency declarations, health indicators by region, movement of people, and population vaccination records. In addition, we used an index table that contains references to the ids/keys used in the other tables.
2.1 Economic Indicators by countries
Dataset: economy.csv
The economy dataset consists of 365 rows and 4 columns, as described below. Each row corresponds to a country. This dataset describes important economic indicators associated with each country.
Schema
| Name | Type | Description | Example |
|---|---|---|---|
| key | string | Unique string identifying the region | US |
| gdp | integer [USD] | Gross domestic product; monetary value of all finished goods and services | 24450604878 |
| gdp_per_capita | integer [USD] | Gross domestic product divided by total population | 1148 |
| human_capital_index | double [0-1] | Mobilization of the economic and professional potential of citizens | 0.765 |
Sources of data
| Data | Source | License and Terms of Use |
|---|---|---|
| Metadata | Wikipedia | Terms of Use |
| Metadata | Eurostat | CC BY |
| Economy | Wikidata | CC0 |
| Economy | DataCommons | Attribution required |
| Economy | WorldBank | CC BY |
2.2 Covid-19 cases records
Dataset: epidemiology.csv
The epidemiology dataset is fairly large: it consists of 9517268 rows and 10 columns, as described below. It is a time series, with a date associated with each country/region. It records the daily number of newly confirmed cases, deaths, recoveries, and tests for COVID-19, as well as the cumulative count for each category.
Schema
| Name | Type | Description | Example |
|---|---|---|---|
| date | string | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
| key | string | Unique string identifying the region | CN_HB |
| new_confirmed¹ | integer | Count of new cases confirmed after positive test on this date | 34 |
| new_deceased¹ | integer | Count of new deaths from a positive COVID-19 case on this date | 2 |
| new_recovered¹ | integer | Count of new recoveries from a positive COVID-19 case on this date | 13 |
| new_tested² | integer | Count of new COVID-19 tests performed on this date | 13 |
| cumulative_confirmed³ | integer | Cumulative sum of cases confirmed after positive test to date | 6447 |
| cumulative_deceased³ | integer | Cumulative sum of deaths from a positive COVID-19 case to date | 133 |
| cumulative_recovered³ | integer | Cumulative sum of recoveries from a positive COVID-19 case to date | 133 |
| cumulative_tested²,³ | integer | Cumulative sum of COVID-19 tests performed to date | 133 |

¹ Values can be negative, typically indicating a correction or an adjustment in the way they were measured. For example, a case might have been incorrectly flagged as recovered on one date, so it will be subtracted on the following date.

² Some health authorities only report PCR testing. This variable usually refers to the cumulative number of tests and not tested persons, but some health authorities only report tested persons.

³ The cumulative count will not always equal the sum of daily counts, because many authorities change their criteria for counting cases but do not always adjust the data already reported. Data may also be missing. As a result, the cumulative counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
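The notes on negative daily values and cumulative drift can be illustrated with a small check. The following is a minimal sketch, not part of the original pipeline: the rows are invented for illustration, the function name `drift` is hypothetical, and missing daily counts are treated as zero purely as a simplifying assumption.

```python
from itertools import accumulate

# Hypothetical daily rows for one region:
# (date, new_confirmed, cumulative_confirmed). All values are invented.
rows = [
    ("2020-03-28", 30, 6400),
    ("2020-03-29", 34, 6434),
    ("2020-03-30", -5, 6447),    # negative value: a correction by the authority
    ("2020-03-31", None, 6460),  # missing daily count
]


def drift(rows):
    """Return (date, reported cumulative minus running sum of daily counts)."""
    # Assumption: a missing daily count contributes nothing to the running sum.
    daily = [new if new is not None else 0 for _, new, _ in rows]
    running = list(accumulate(daily))
    # Shift the running sum so both series agree on the first date.
    offset = rows[0][2] - running[0]
    return [(date, cum - (r + offset)) for (date, _, cum), r in zip(rows, running)]


result = drift(rows)
```

A non-zero difference on a given date flags a point where the reported cumulative value and the summed daily values disagree, which is one way to decide which of the two columns to rely on for a region.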
2.3 Government emergency declarations
Dataset: lawatlas-emergency-declarations.csv
The emergency declarations dataset contains emergency declarations and mitigation policies for each US state starting on January 20, 2020. The data are aggregated by the Policy Surveillance Program at the Temple University Center for Public Health Law Research, and are published and maintained at LawAtlas.org. This dataset contains 8364 rows and 104 columns. It is a time-series data frame with a date for each location, along with boolean variables indicating whether a given state action is in effect.
Schema
Some major columns include:
| Name | Type | Description | Example |
|---|---|---|---|
| date | string | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
| key | string | Unique string identifying the region | US_CA |
| lawatlas_mitigation_policy | integer [0-1] | Has the state instituted legal action aimed at mitigating the spread of COVID-19? | 0 |
| lawatlas_state_emergency | integer [0-1] | Is there an emergency declaration in effect in the state? | 0 |
| lawatlas_emerg_statewide | integer [0-1] | Does the emergency declaration apply statewide? | 0 |
| lawatlas_travel_requirement | integer [0-1] | Is there a restriction on travelers? | 0 |
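Because each flag is a 0/1 value recorded per date and state, a common first step is to find when a policy first took effect. The sketch below assumes the column names from the schema above; the sample rows and the helper name `first_emergency_date` are invented for illustration.

```python
import csv
import io

# Hypothetical excerpt of lawatlas-emergency-declarations.csv; values are invented.
SAMPLE = """date,key,lawatlas_state_emergency
2020-03-01,US_CA,0
2020-03-04,US_CA,1
2020-03-05,US_CA,1
2020-03-01,US_NY,0
2020-03-07,US_NY,1
"""


def first_emergency_date(csv_text):
    """Map each region key to the earliest date with lawatlas_state_emergency == 1."""
    first = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["lawatlas_state_emergency"] == "1":
            key, date = row["key"], row["date"]
            # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
            if key not in first or date < first[key]:
                first[key] = date
    return first


first_dates = first_emergency_date(SAMPLE)
```

The same pattern applies to any of the other 0/1 columns by changing the column name.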
2.4 Health indicators by regions
Dataset: health.csv
The health dataset contains health-related indicators for each region. It contains 3503 rows and 14 columns. Each row corresponds to one region.
Schema
| Name | Type | Description | Example |
|---|---|---|---|
| key | string | Unique string identifying the region | BN |
| life_expectancy | double [years] | Average years that an individual is expected to live | 75.722 |
| smoking_prevalence | double [%] | Percentage of smokers in population | 16.9 |
| diabetes_prevalence | double [%] | Percentage of persons with diabetes in population | 13.3 |
| infant_mortality_rate | double | Infant mortality rate (per 1,000 live births) | 9.8 |
| adult_male_mortality_rate | double | Mortality rate, adult, male (per 1,000 male adults) | 143.719 |
| adult_female_mortality_rate | double | Mortality rate, adult, female (per 1,000 female adults) | 98.803 |
| pollution_mortality_rate | double | Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population) | 13.3 |
| comorbidity_mortality_rate | double [%] | Mortality from cardiovascular disease, cancer, diabetes or cardiorespiratory disease between exact ages 30 and 70 | 16.6 |
| hospital_beds | double | Hospital beds (per 1,000 people) | 2.7 |
| nurses | double | Nurses and midwives (per 1,000 people) | 5.8974 |
| physicians | double | Physicians (per 1,000 people) | 1.609 |
| health_expenditure | double [USD] | Health expenditure per capita | 671.4115 |
| out_of_pocket_health_expenditure | double [USD] | Out-of-pocket health expenditure per capita | 34.756348 |
Note that the majority of the health indicators are only available at the country level.
2.5 Movement of people
Google’s Mobility Reports are joined with the location keys of the COVID-19 Open Dataset and are provided in the following file:
Dataset: mobility.csv
This dataset contains various metrics related to the movement of people. It has 4196096 rows and 8 columns. The dataset is a time series, and each row corresponds to a date and the associated location. The columns are described below.
Google COVID-19 Community Mobility Reports Terms of use
In order to download or use the data or reports, you must agree to the Google Terms of Service.
Schema
| Name | Type | Description | Example |
|---|---|---|---|
| date | string | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
| key | string | Unique string identifying the region | US_CA |
| mobility_grocery_and_pharmacy | double [%] | Percentage change in visits to places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies compared to baseline | -15 |
| mobility_parks | double [%] | Percentage change in visits to places like local parks, national parks, public beaches, marinas, dog parks, plazas, and public gardens compared to baseline | -15 |
| mobility_transit_stations | double [%] | Percentage change in visits to places like public transport hubs such as subway, bus, and train stations compared to baseline | -15 |
| mobility_retail_and_recreation | double [%] | Percentage change in visits to restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters compared to baseline | -15 |
| mobility_residential | double [%] | Percentage change in visits to places of residence compared to baseline | -15 |
| mobility_workplaces | double [%] | Percentage change in visits to places of work compared to baseline | -15 |
Changes for each day are compared to a baseline value for that day of the week:
- The baseline is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020.
- The datasets show trends over several months with the most recent data representing approximately 2-3 days ago—this is how long it takes to produce the datasets.
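To make the baseline definition concrete, the sketch below computes a percentage change against the weekday median over Jan 3 to Feb 6, 2020. It is an illustration only: the raw visit counts are invented (the dataset itself contains only the resulting percentages), and the function names are hypothetical.

```python
from datetime import date, timedelta
from statistics import median

# Baseline window stated in the text: the 5-week period Jan 3 to Feb 6, 2020.
BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baselines(visits):
    """Median visit count per weekday (0 = Monday) within the baseline window."""
    by_weekday = {}
    for day, count in visits.items():
        if BASELINE_START <= day <= BASELINE_END:
            by_weekday.setdefault(day.weekday(), []).append(count)
    return {wd: median(counts) for wd, counts in by_weekday.items()}


def percent_change(visits, day):
    """Percentage change of visits on `day` relative to its weekday baseline."""
    base = weekday_baselines(visits)[day.weekday()]
    return 100.0 * (visits[day] - base) / base


# Hypothetical counts: 100 visits on every baseline day, 85 on 2020-03-30.
visits = {BASELINE_START + timedelta(days=i): 100 for i in range(35)}
visits[date(2020, 3, 30)] = 85
change = percent_change(visits, date(2020, 3, 30))
```

With these invented counts, `change` is -15.0, which matches the form of the example values in the schema.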
Sources of data
| Data | Source | License and Terms of Use |
|---|---|---|
| Google Mobility data | https://www.google.com/covid19/mobility/ | Google Terms of Service |
2.6 Population vaccination records
Dataset: vaccinations.csv
This dataset contains information related to the deployment and administration of COVID-19 vaccines. It has a total of 1377366 rows and 32 columns. The dataset is a time series, and each row corresponds to a date and the associated location. The columns are described below.
Schema
| Name | Type | Description | Example |
|---|---|---|---|
| date | string | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2021-02-07 |
| key | string | Unique string identifying the region | ID |
| new_persons_vaccinated* | integer | Count of new persons who have received one or more doses | 7222 |
| cumulative_persons_vaccinated** | integer | Cumulative sum of persons who have received one or more doses | 784318 |
| new_persons_fully_vaccinated* | integer | Count of new persons who have received all doses required for maximum immunity | 1924 |
| cumulative_persons_fully_vaccinated** | integer | Cumulative sum of persons who have received all doses required for maximum immunity | 139131 |
| new_vaccine_doses_administered* | integer | Count of new vaccine doses administered to persons | 9146 |
| cumulative_vaccine_doses_administered** | integer | Cumulative sum of vaccine doses administered to persons | 923449 |
| ${statistic}_${vaccine} | integer | Statistic value corresponding to a specific vaccine, such as new_persons_vaccinated_moderna | 1035 |
*Values can be negative, typically indicating a correction or an adjustment in the way they were measured.

**The cumulative count will not always equal the sum of daily counts, because many authorities change their criteria for counting cases but do not always adjust the data already reported. Data may also be missing. As a result, the cumulative counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
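The per-vaccine columns follow the `${statistic}_${vaccine}` pattern, so a column name can be split back into its two parts. The sketch below is one possible way to do this, assuming the six base statistics listed in the schema; the helper name `split_vaccine_column` is hypothetical.

```python
# Base statistic names taken from the schema above.
STATISTICS = [
    "new_persons_vaccinated",
    "cumulative_persons_vaccinated",
    "new_persons_fully_vaccinated",
    "cumulative_persons_fully_vaccinated",
    "new_vaccine_doses_administered",
    "cumulative_vaccine_doses_administered",
]


def split_vaccine_column(column):
    """Return (statistic, vaccine) for a per-vaccine column, else (column, None)."""
    for stat in STATISTICS:
        prefix = stat + "_"
        if column.startswith(prefix):
            return stat, column[len(prefix):]
    return column, None


parsed = split_vaccine_column("new_persons_vaccinated_moderna")
```

Columns without a vaccine suffix, such as `date` or `new_persons_vaccinated`, are returned unchanged with `None` as the vaccine, which makes it easy to separate aggregate columns from vaccine-specific ones.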
Sources of data
| Data | Source | License and Terms of Use | Notes |
|---|---|---|---|
| Country-level data | Our World in Data | CC BY | |
| Argentina | Datos Argentina | Public domain | |
| Australia | COVID LIVE | CC BY | Country level data is not the sum of the states/territories as there is a portion of vaccinations managed by the Federal government that is delivered directly to aged and disability care and not counted as part of the states/territories. As of 2021-03-14, only doses administered are reported for country-level data but NSW, VIC and WA continue to report the count of persons fully and partially vaccinated. |
| Austria | Open Data Österreich | CC BY | |
| Belgium | Covid Vaccinations Belgium | CC BY | Regional data only available for Brussels, since the regions reported by the data source do not match our indexed subregions |
| Bolivia | Ministry of Health (via FinMango) | CC BY | |
| Brazil | coronavirusbra1.github.io via @wcota/covid19br | CC BY | |
| Brazil | Brazil Ministério da Saúde | Creative Commons Atribuição | |
| Bulgaria | Ministry of Health (via FinMango) | CC BY | |
| Canada | Department of Health Canada | Attribution required | |
| Colombia | Ministry of Health (via FinMango) | CC BY | |
| Czech Republic | Ministry of Health of the Czech Republic | Open Data | |
| France | data.gouv.fr | Open License 2.0 | |
| Germany | Robert Koch Institute (via FinMango) | Attribution Required | |
| India | COVID19-India | CC BY | |
| Israel | Israel Government Data Portal | Attribution Required | Admin level 2 regions are provided by the source and are aggregated to admin level 1. The total vaccination dose numbers provided by the source for admin level 2 do not match the country-wide total. This also impacts the aggregated level 1 totals. |
| Italy | Commissario straordinario per l’emergenza Covid-19 | CC BY | |
| Spain | Ministry of Health | Attribution required | |
| Slovenia | Ministry of Health (via FinMango) | CC BY | |
| Slovakia | https://korona.gov.sk, operated by Ministry of Investments, Regional Development and Informatization of the Slovak Republic | Attribution required | |
| Sweden | Public Health Agency of Sweden | Fair Use | |
| Switzerland | Federal Office of Public Health | Fair Use | |
| United Kingdom (nations) | NHS | OGL | |
| United Kingdom (England) | NHS (via FinMango) | OGL | |
| United States | CDC | Public Domain | |
2.7 Index reference table
Dataset: index.csv
This table contains keys, codes, and names for each region and country. It also contains the aggregation level (0: country, 1: state, etc.). With 22958 rows and 15 columns, it is especially helpful for selecting and filtering the previous datasets during data cleaning.
Schema
| Name | Type | Description | Example |
|---|---|---|---|
| key | string | Unique string identifying the region | US_CA_06001 |
| place_id | string | A textual identifier that uniquely identifies a place in the Google Places database and on Google Maps | ChIJd_Y0eVIvkIARuQyDN0F1LBA |
| wikidata | string | Wikidata ID corresponding to this key | Q107146 |
| datacommons | string | DataCommons ID corresponding to this key | geoId/06001 |
| country_code | string | ISO 3166-1 alpha-2 (2-letter) code of the country | US |
| country_name | string | American English name of the country, subject to change | United States of America |
| subregion1_code | string | (Optional) ISO 3166-2 or NUTS 2/3 code of the subregion | CA |
| subregion1_name | string | (Optional) American English name of the subregion, subject to change | California |
| subregion2_code | string | (Optional) FIPS code of the county (or local equivalent) | 06001 |
| subregion2_name | string | (Optional) American English name of the county (or local equivalent), subject to change | Alameda County |
| 3166-1-alpha-2 | string | ISO 3166-1 alpha-2 (2-letter) code of the country | US |
| 3166-1-alpha-3 | string | ISO 3166-1 alpha-3 (3-letter) code of the country | USA |
| aggregation_level | integer [0-2] | Level at which data is aggregated, i.e. country, state/province or county level | 2 |
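As an illustration of how the index table can drive filtering, the sketch below selects the keys at a given aggregation level and uses them to filter rows from another table. The sample rows are built from the example values in the schemas, and the helper names `keys_at_level` and `filter_rows` are hypothetical rather than part of the project code.

```python
import csv
import io

# Hypothetical excerpt of index.csv built from the schema's example values.
INDEX_SAMPLE = """key,country_code,subregion1_code,subregion2_code,aggregation_level
US,US,,,0
US_CA,US,CA,,1
US_CA_06001,US,CA,06001,2
"""


def keys_at_level(csv_text, level, country_code=None):
    """Return the set of keys at an aggregation level, optionally for one country."""
    keys = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["aggregation_level"]) != level:
            continue
        if country_code is not None and row["country_code"] != country_code:
            continue
        keys.add(row["key"])
    return keys


def filter_rows(rows, allowed_keys):
    """Keep only the rows whose `key` is in `allowed_keys`."""
    return [row for row in rows if row["key"] in allowed_keys]


# Keep only US state-level rows from a (hypothetical) mobility excerpt.
state_keys = keys_at_level(INDEX_SAMPLE, level=1, country_code="US")
mobility_rows = [
    {"date": "2020-03-30", "key": "US_CA", "mobility_parks": "-15"},
    {"date": "2020-03-30", "key": "US_CA_06001", "mobility_parks": "-15"},
]
state_rows = filter_rows(mobility_rows, state_keys)
```

Filtering through the index table avoids parsing the structure of `key` directly, so the same approach works for any of the datasets that share the `key` column.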