Chapter 2 Data sources

The primary data sources for this project include Wikidata, DataCommons, Eurostat, WorldBank, WorldPop, the LawAtlas Project, the University of Oxford, and Google LLC, among others. Since this project involves large amounts of data across countries around the world as well as all states within the USA, we used the aggregated COVID-19 Open Dataset. We verified that this aggregated dataset does not clean or modify the data from the original sources.

From this aggregated source, we collected and used six datasets: economic indicators by country, COVID-19 case records, government emergency declarations, health indicators by region, movement of people, and population vaccination records. In addition, we used an index table that contains the keys and identifiers referenced by the other tables.
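
As a rough illustration of how the index table can be used together with the other tables, the sketch below joins two tables on their shared key column and keeps only country-level rows. It is not the code used in this project: the function name country_level_rows and all sample values are hypothetical, and only the column names (key, country_name, aggregation_level, gdp) are taken from the schemas in Sections 2.1 and 2.7.

```python
import csv
import io

# Hypothetical miniature versions of index.csv and economy.csv (made-up values).
INDEX_CSV = """key,country_name,aggregation_level
US,United States of America,0
US_CA,United States of America,1
"""
ECONOMY_CSV = """key,gdp
US,100
US_CA,50
"""


def country_level_rows(index_rows, data_rows):
    """Keep data rows whose key is a country (aggregation_level 0) in the index."""
    country_names = {
        row["key"]: row["country_name"]
        for row in index_rows
        if row["aggregation_level"] == "0"
    }
    # Attach the country name from the index to every matching data row.
    return [
        dict(row, country_name=country_names[row["key"]])
        for row in data_rows
        if row["key"] in country_names
    ]


index_rows = list(csv.DictReader(io.StringIO(INDEX_CSV)))
economy_rows = list(csv.DictReader(io.StringIO(ECONOMY_CSV)))
result = country_level_rows(index_rows, economy_rows)
```

Because every table shares the key column, the same pattern would apply to the other datasets described in this chapter.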

2.1 Economic indicators by country

Dataset: economy.csv

The economy dataset consists of 365 rows and 4 columns, as described below. Each row corresponds to a country or region. This dataset describes important economic indicators associated with each of them.

Schema

Name Type Description Example
key string Unique string identifying the region US
gdp integer [USD] Gross domestic product; monetary value of all finished goods and services 24450604878
gdp_per_capita integer [USD] Gross domestic product divided by total population 1148
human_capital_index double [0-1] Mobilization of the economic and professional potential of citizens 0.765

Sources of data

Data Source License and Terms of Use
Metadata Wikipedia Terms of Use
Metadata Eurostat CC BY
Economy Wikidata CC0
Economy DataCommons Attribution required
Economy WorldBank CC BY

2.2 COVID-19 case records

Dataset: epidemiology.csv

The epidemiology dataset is fairly large: it consists of 9517268 rows and 10 columns, as described below. It is a time series, with one row per date and country/region. It records the numbers of newly confirmed cases, deaths, recoveries, and tests for COVID-19, as well as the cumulative totals for each category.

Schema

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2020-03-30
key string Unique string identifying the region CN_HB
new_confirmed¹ integer Count of new cases confirmed after positive test on this date 34
new_deceased¹ integer Count of new deaths from a positive COVID-19 case on this date 2
new_recovered¹ integer Count of new recoveries from a positive COVID-19 case on this date 13
new_tested² integer Count of new COVID-19 tests performed on this date 13
cumulative_confirmed³ integer Cumulative sum of cases confirmed after positive test to date 6447
cumulative_deceased³ integer Cumulative sum of deaths from a positive COVID-19 case to date 133
cumulative_recovered³ integer Cumulative sum of recoveries from a positive COVID-19 case to date 133
cumulative_tested²,³ integer Cumulative sum of COVID-19 tests performed to date 133

¹ Values can be negative, typically indicating a correction or an adjustment in the way they were measured. For example, a case might have been incorrectly flagged as recovered on one date, so it is subtracted on the following date.
² Some health authorities only report PCR testing. This variable usually refers to the cumulative number of tests rather than tested persons, but some health authorities only report tested persons.
³ The cumulative count will not always equal the sum of the daily counts, because many authorities change their criteria for counting cases without always adjusting the earlier data. Some data may also be missing. As a result, the cumulative counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
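
To make the difference between the reported cumulative values and the sum of daily counts concrete, the following sketch computes the drift for one region. It is only an illustration under simplifying assumptions: the sample rows are made up, the function name cumulative_drift is hypothetical, and every row is assumed to have both new_confirmed and cumulative_confirmed filled in, which is not guaranteed in the real file.

```python
import csv
import io
from itertools import accumulate

# Hypothetical rows in the epidemiology.csv layout (values are made up).
# The negative daily value mimics a correction by the reporting authority.
SAMPLE = """date,key,new_confirmed,cumulative_confirmed
2020-03-28,XX,10,10
2020-03-29,XX,5,15
2020-03-30,XX,-2,13
2020-03-31,XX,4,18
"""


def cumulative_drift(rows):
    """Return (date, reported cumulative minus running sum of daily counts)."""
    rows = sorted(rows, key=lambda r: r["date"])  # ISO dates sort as strings
    running = accumulate(int(r["new_confirmed"]) for r in rows)
    return [
        (r["date"], int(r["cumulative_confirmed"]) - total)
        for r, total in zip(rows, running)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
drift = cumulative_drift(rows)
```

In this made-up sample the drift is zero for the first three dates and 1 on the last date, where the reported cumulative value (18) exceeds the running sum of daily counts (17).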

Sources of data

Data Source License and Terms of Use
Country-level data ECDC Attribution required
Country-level data Our World in Data CC BY
Country-level data WHO Attribution required
Afghanistan HDX CC BY
Argentina Datos Argentina Public domain
Australia COVID LIVE CC BY
Austria Open Data Österreich CC BY
Bangladesh http://covid19tracker.gov.bd Public Domain
Belgium Belgian institute for health Attribution required
Brazil Brazil Ministério da Saúde Creative Commons Atribuição
Brazil (Rio de Janeiro) http://www.data.rio/ Dados abertos
Brazil (Ceará) https://saude.ce.gov.br Dados abertos
Canada Department of Health Canada Attribution required
Canada COVID-19 Canada Open Data Working Group CC BY
Chile Ministerio de Ciencia de Chile Terms of use
China DXY COVID-19 dataset MIT
Colombia Datos Abiertos Colombia Attribution required
Czech Republic Ministry of Health of the Czech Republic Open Data
Democratic Republic of Congo HDX CC BY
Estonia Health Board of Estonia Open Data
Finland Finnish institute for health and welfare CC BY
France data.gouv.fr Open License 2.0
Germany Robert Koch Institute Attribution Required
Haiti HDX CC BY
Hong Kong Hong Kong Department of Health Attribution Required
Israel Israel Government Data Portal Attribution Required
India Wikipedia Attribution Required
India Covid 19 India Organisation CC BY
Indonesia https://covid19.go.id/peta-sebaran Public Domain
Italy Italy’s Department of Civil Protection CC BY
Iraq HDX CC BY
Japan https://github.com/swsoyee/2019-ncov-japan MIT
Japan https://github.com/kaz-ogiwara/covid19 MIT
Libya HDX CC BY
Luxembourg data.public.lu CC0
Malaysia Wikipedia Attribution Required
Mexico Secretaría de Salud Mexico Attribution Required
Netherlands RIVM Public Domain
New Zealand Ministry of Health CC BY
Norway COVID19 EU Data MIT
Pakistan Wikipedia Attribution Required
Peru Datos Abiertos Peru ODC BY
Philippines Philippines Department of Health Attribution required
Poland COVID19 EU Data MIT
Portugal COVID-19: Portugal MIT
Romania https://github.com/adrianp/covid19romania CC0
Romania https://datelazi.ro/ Terms of Service
Russia https://стопкоронавирус.рф (via @jeetiss, https://github.com/jeetiss/covid19-russia) CC BY
Slovenia https://www.gov.si Attribution Required
South Africa FinMango COVID-19 Data CC BY
South Korea Wikipedia Attribution Required
Spain Ministry of Health Attribution required
Spain (Canary Islands) Gobierno de Canarias Attribution required
Spain (Catalonia) Dades Obertes Catalunya CC0
Spain (Madrid) Datos Abiertos Madrid Attribution required
Sudan HDX CC BY
Sweden Public Health Agency of Sweden Fair Use
Switzerland OpenZH data CC BY
Taiwan Ministry of Health and Welfare Attribution Required
Thailand Ministry of Public Health Fair Use
Ukraine National Security and Defense Council of Ukraine CC BY
United Kingdom https://github.com/tomwhite/covid-19-uk-data The Unlicense
United Kingdom https://coronavirus.data.gov.uk/ Attribution required, Open Government Licence v3.0
USA NYT COVID Dataset Attribution required, non-commercial use
USA COVID Tracking Project CC BY
USA (Alaska) Alaska Department of Health and Social Services
USA (D.C.) Government of the District of Columbia Public Domain
USA (Delaware) Delaware Health and Social Services Public Domain
USA (Florida) Florida Health Public Domain
USA (Indiana) Indiana Department of Health CC BY
USA (Massachusetts) MCAD COVID-19 Information & Resource Center Public Domain
USA (New York) New York City Health Department Public Domain
USA (San Francisco) SF Open Data Public Domain Dedication and License
USA (Texas) Texas Department of State Health Services Attribution required
USA (Washington) Washington State Department of Health Public Domain
Venezuela HDX CC BY

2.3 Government emergency declarations

Dataset: lawatlas-emergency-declarations.csv

The emergency declarations dataset contains emergency declarations and mitigation policies for each US state starting on January 20, 2020. The data are aggregated by the Policy Surveillance Program at the Temple University Center for Public Health Law Research and are published and maintained at LawAtlas.org. The dataset contains 8364 rows and 104 columns. It is a time series with one row per date and location, along with binary (0/1) variables indicating whether a given declaration or policy is in effect.

Schema

Some major columns include:

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2020-03-30
key string Unique string identifying the region US_CA
lawatlas_mitigation_policy integer [0-1] Has the state instituted legal action aimed at mitigating the spread of COVID-19? 0
lawatlas_state_emergency integer [0-1] Is there an emergency declaration in effect in the state? 0
lawatlas_emerg_statewide integer [0-1] Does the emergency declaration apply statewide? 0
lawatlas_travel_requirement integer [0-1] Is there a restriction on travelers? 0
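
As an example of how the 0/1 indicator columns can be read, the sketch below finds, for each state, the first date on which lawatlas_state_emergency equals 1. It is an illustration only: the sample rows and their dates are made up, and the function name first_emergency_dates is hypothetical.

```python
import csv
import io

# Hypothetical rows in the lawatlas-emergency-declarations.csv layout.
# The dates and values are made up for illustration.
SAMPLE = """date,key,lawatlas_state_emergency
2020-03-01,US_CA,0
2020-03-04,US_CA,1
2020-03-05,US_CA,1
2020-03-01,US_NY,0
2020-03-07,US_NY,1
"""


def first_emergency_dates(rows):
    """Map each key to the earliest date with an emergency declaration in effect."""
    first = {}
    for row in rows:
        if row["lawatlas_state_emergency"] != "1":
            continue
        key, day = row["key"], row["date"]
        # ISO 8601 dates compare correctly as plain strings.
        if key not in first or day < first[key]:
            first[key] = day
    return first


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
first_dates = first_emergency_dates(rows)
```

The same loop would work for any of the other indicator columns by changing the column name.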

Sources of data

Data Source License and Terms of Use
Emergency declarations and mitigation policies LawAtlas CC BY

2.4 Health indicators by regions

Dataset: health.csv

The health dataset contains health-related indicators for each region. It has 3503 rows and 14 columns. Each row corresponds to one region.

Schema

Name Type Description Example
key string Unique string identifying the region BN
life_expectancy double [years] Average years that an individual is expected to live 75.722
smoking_prevalence double [%] Percentage of smokers in population 16.9
diabetes_prevalence double [%] Percentage of persons with diabetes in population 13.3
infant_mortality_rate double Infant mortality rate (per 1,000 live births) 9.8
adult_male_mortality_rate double Mortality rate, adult, male (per 1,000 male adults) 143.719
adult_female_mortality_rate double Mortality rate, adult, female (per 1,000 female adults) 98.803
pollution_mortality_rate double Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population) 13.3
comorbidity_mortality_rate double [%] Mortality from cardiovascular disease, cancer, diabetes or cardiorespiratory disease between exact ages 30 and 70 16.6
hospital_beds double Hospital beds (per 1,000 people) 2.7
nurses double Nurses and midwives (per 1,000 people) 5.8974
physicians double Physicians (per 1,000 people) 1.609
health_expenditure double [USD] Health expenditure per capita 671.4115
out_of_pocket_health_expenditure double [USD] Out-of-pocket health expenditure per capita 34.756348

Note that the majority of the health indicators are only available at the country level.

Sources of data

Data Source License and Terms of Use
Health Eurostat CC BY
Health Wikidata CC0
Health WorldBank CC BY

2.5 Movement of people

Google’s Mobility Reports are joined with the location keys of the COVID-19 Open Dataset and are provided in the following file:

Dataset: mobility.csv

This dataset contains various metrics related to the movement of people. It has 4196096 rows and 8 columns. The dataset is a time series, and each row corresponds to a date and the associated location. The columns are described in the schema below.

Google COVID-19 Community Mobility Reports Terms of use

In order to download or use the data or reports, you must agree to the Google Terms of Service.

Schema

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2020-03-30
key string Unique string identifying the region US_CA
mobility_grocery_and_pharmacy double [%] Percentage change in visits to places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies compared to baseline -15
mobility_parks double [%] Percentage change in visits to places like local parks, national parks, public beaches, marinas, dog parks, plazas, and public gardens compared to baseline -15
mobility_transit_stations double [%] Percentage change in visits to places like public transport hubs such as subway, bus, and train stations compared to baseline -15
mobility_retail_and_recreation double [%] Percentage change in visits to restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters compared to baseline -15
mobility_residential double [%] Percentage change in visits to places of residence compared to baseline -15
mobility_workplaces double [%] Percentage change in visits to places of work compared to baseline -15

Changes for each day are compared to a baseline value for that day of the week:

  • The baseline is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020.
  • The datasets show trends over several months. The most recent data are approximately 2-3 days old, which is how long it takes to produce the datasets.
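
The baseline definition above can be sketched as follows. This is only an illustration of the calculation as described: the mobility dataset contains the resulting percentage changes, not the underlying visit counts, so the visits dictionary, its values, and the function names weekday_baselines and percent_change are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median

# Baseline window stated in the text: the 5 weeks from Jan 3 to Feb 6, 2020.
BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baselines(visits):
    """Median visit count per weekday (0 = Monday) within the baseline window."""
    by_weekday = defaultdict(list)
    for day, count in visits.items():
        if BASELINE_START <= day <= BASELINE_END:
            by_weekday[day.weekday()].append(count)
    return {weekday: median(counts) for weekday, counts in by_weekday.items()}


def percent_change(day, count, baselines):
    """Percentage change of a day's count relative to its weekday baseline."""
    base = baselines[day.weekday()]
    return 100.0 * (count - base) / base


# Made-up data: 100 visits on every baseline day, 85 visits on 2020-03-30.
visits = {BASELINE_START + timedelta(days=i): 100 for i in range(35)}
visits[date(2020, 3, 30)] = 85
baselines = weekday_baselines(visits)
change = percent_change(date(2020, 3, 30), 85, baselines)
```

With these made-up counts, the change for 2020-03-30 is -15.0, which is the kind of value shown in the Example column of the schema.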

Sources of data

Data Source License and Terms of Use
Google Mobility data https://www.google.com/covid19/mobility/ Google Terms of Service

2.6 Population vaccination records

Dataset: vaccinations.csv

This dataset contains information related to the deployment and administration of COVID-19 vaccines. It has a total of 1377366 rows and 32 columns. The dataset is a time series, and each row corresponds to a date and the associated location. The columns are described below.

Schema

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2021-02-07
key string Unique string identifying the region ID
new_persons_vaccinated* integer Count of new persons which have received one or more doses 7222
cumulative_persons_vaccinated** integer Cumulative sum of persons which have received one or more doses 784318
new_persons_fully_vaccinated* integer Count of new persons which have received all doses required for maximum immunity 1924
cumulative_persons_fully_vaccinated** integer Cumulative sum of persons which have received all doses required for maximum immunity 139131
new_vaccine_doses_administered* integer Count of new vaccine doses administered to persons 9146
cumulative_vaccine_doses_administered** integer Cumulative sum of vaccine doses administered to persons 923449
${statistic}_${vaccine} integer Statistic value corresponding to a specific vaccine such as new_persons_vaccinated_moderna 1035

*Values can be negative, typically indicating a correction or an adjustment in the way they were measured.

**The cumulative count will not always equal the sum of the daily counts, because many authorities change their criteria for counting cases without always adjusting the earlier data. Some data may also be missing. As a result, the cumulative counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
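
The vaccine-specific columns follow the naming pattern ${statistic}_${vaccine} shown in the schema. The sketch below splits such a column name back into its two parts, using the six statistic names listed above. It is an illustration only: the function name split_vaccine_column is hypothetical, and the only vaccine suffix taken from the source is moderna.

```python
# The six statistic names listed in the schema above.
STATISTICS = (
    "new_persons_vaccinated",
    "cumulative_persons_vaccinated",
    "new_persons_fully_vaccinated",
    "cumulative_persons_fully_vaccinated",
    "new_vaccine_doses_administered",
    "cumulative_vaccine_doses_administered",
)


def split_vaccine_column(column):
    """Return (statistic, vaccine) for a vaccine-specific column, else None."""
    for statistic in STATISTICS:
        prefix = statistic + "_"
        # Require a non-empty suffix after the statistic name.
        if column.startswith(prefix) and len(column) > len(prefix):
            return statistic, column[len(prefix):]
    return None


parsed = split_vaccine_column("new_persons_vaccinated_moderna")
```

Columns that are not vaccine-specific, such as date or new_persons_vaccinated, yield None.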

Sources of data

Data Source License and Terms of Use Notes
Country-level data Our World in Data CC BY
Argentina Datos Argentina Public domain
Australia COVID LIVE CC BY Country-level data is not the sum of the states/territories, as a portion of vaccinations is managed by the Federal government, delivered directly to aged and disability care, and not counted as part of the states/territories. As of 2021-03-14, only doses administered are reported for country-level data, but NSW, VIC and WA continue to report the count of persons fully and partially vaccinated.
Austria Open Data Österreich CC BY
Belgium Covid Vaccinations Belgium CC BY Regional data only available for Brussels, since the regions reported by the data source do not match our indexed subregions
Bolivia Ministry of Health (via FinMango) CC BY
Brazil coronavirusbra1.github.io (via @wcota/covid19br) CC BY
Brazil Brazil Ministério da Saúde Creative Commons Atribuição
Bulgaria Ministry of Health (via FinMango) CC BY
Canada Department of Health Canada Attribution required
Colombia Ministry of Health (via FinMango) CC BY
Czech Republic Ministry of Health of the Czech Republic Open Data
France data.gouv.fr Open License 2.0
Germany Robert Koch Institute (via FinMango) Attribution Required
India COVID19-India CC BY
Israel Israel Government Data Portal Attribution Required Admin level 2 regions are provided by the source and are aggregated to admin level 1. The total vaccination dose numbers provided by the source for admin level 2 do not match the country-wide total. This also impacts the aggregated level 1 totals.
Italy Commissario straordinario per l’emergenza Covid-19 CC BY
Spain Ministry of Health Attribution required
Slovenia Ministry of Health (via FinMango) CC BY
Slovakia https://korona.gov.sk, operated by the Ministry of Investments, Regional Development and Informatization of the Slovak Republic Attribution required
Sweden Public Health Agency of Sweden Fair Use
Switzerland Federal Office of Public Health Fair Use
United Kingdom (nations) NHS OGL
United Kingdom (England) NHS (via FinMango) OGL
United States CDC Public Domain

2.7 Index reference table

Dataset: index.csv

This table contains keys, codes, and names for each region and country. It also contains the aggregation level (0: country, 1: state, etc.). With 22958 rows and 15 columns, it is especially helpful for selecting and filtering the previous datasets during data cleaning.

Schema

Name Type Description Example
key string Unique string identifying the region US_CA_06001
place_id string A textual identifier that uniquely identifies a place in the Google Places database and on Google Maps ChIJd_Y0eVIvkIARuQyDN0F1LBA
wikidata string Wikidata ID corresponding to this key Q107146
datacommons string DataCommons ID corresponding to this key geoId/06001
country_code string ISO 3166-1 alphanumeric 2-letter code of the country US
country_name string American English name of the country, subject to change United States of America
subregion1_code string (Optional) ISO 3166-2 or NUTS 2/3 code of the subregion CA
subregion1_name string (Optional) American English name of the subregion, subject to change California
subregion2_code string (Optional) FIPS code of the county (or local equivalent) 06001
subregion2_name string (Optional) American English name of the county (or local equivalent), subject to change Alameda County
3166-1-alpha-2 string ISO 3166-1 alphanumeric 2-letter code of the country US
3166-1-alpha-3 string ISO 3166-1 alphanumeric 3-letter code of the country USA
aggregation_level integer [0-2] Level at which data is aggregated, i.e. country, state/province or county level 2
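
The example keys in this chapter (US, US_CA, US_CA_06001) suggest that a key is built by joining the country, subregion1, and subregion2 codes with underscores. Under that assumption, which is inferred from the examples rather than stated by the source, the sketch below splits a key into its parts and derives the aggregation level. The function name parse_key is hypothetical.

```python
def parse_key(key):
    """Split a region key into its codes, assuming underscore-joined parts."""
    parts = key.split("_")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"unexpected key format: {key!r}")
    names = ("country_code", "subregion1_code", "subregion2_code")
    parsed = dict(zip(names, parts))
    # Assumed relation: 0 = country, 1 = state/province, 2 = county.
    parsed["aggregation_level"] = len(parts) - 1
    return parsed


county = parse_key("US_CA_06001")
country = parse_key("US")
```

For the schema's example key US_CA_06001 this gives country code US, subregion codes CA and 06001, and aggregation level 2, matching the example values in the table. In practice the index table itself should be treated as the reference, since it stores these fields explicitly.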