Chapter 2 Data sources

The primary data sources for this project include Wikidata, DataCommons, Eurostat, WorldBank, WorldPop, the LawAtlas Project, the University of Oxford, and Google LLC, among others. Because this project involves large amounts of data across countries around the world as well as all states within the USA, we used the aggregated COVID-19 Open Dataset. We verified that this dataset does not clean or modify the data from the original sources.

From this aggregated source, we collected and used six datasets: economic indicators by country, COVID-19 case records, government emergency declarations, health indicators by region, movement of people, and population vaccination records. We also used an index table that contains the keys referenced by the other tables.

2.1 Economic Indicators by countries

Dataset: economy.csv

The economy dataset consists of 365 rows and 4 columns, as described below. Each row corresponds to a country. This dataset describes important economic indicators associated with each country.

Schema

Name	Type	Description	Example
key	`string`	Unique string identifying the region	US
gdp	`integer` `[USD]`	Gross domestic product; monetary value of all finished goods and services	24450604878
gdp_per_capita	`integer` `[USD]`	Gross domestic product divided by total population	1148
human_capital_index	`double` `[0-1]`	Mobilization of the economic and professional potential of citizens	0.765

Sources of data

Data	Source	License and Terms of Use
Metadata	Wikipedia	Terms of Use
Metadata	Eurostat	CC BY
Economy	Wikidata	CC0
Economy	DataCommons	Attribution required
Economy	WorldBank	CC BY

2.2 Covid-19 cases records

Dataset: epidemiology.csv

The epidemiology dataset is fairly large: it consists of 9517268 rows and 10 columns, as described below. It is a time series in which each row associates a date with a country or region. It records the daily counts of newly confirmed, deceased, recovered, and tested individuals for COVID-19, as well as the cumulative total for each category.

Schema

Name	Type	Description	Example
date	`string`	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-30
key	`string`	Unique string identifying the region	CN_HB
new_confirmed¹	`integer`	Count of new cases confirmed after positive test on this date	34
new_deceased¹	`integer`	Count of new deaths from a positive COVID-19 case on this date	2
new_recovered¹	`integer`	Count of new recoveries from a positive COVID-19 case on this date	13
new_tested²	`integer`	Count of new COVID-19 tests performed on this date	13
cumulative_confirmed³	`integer`	Cumulative sum of cases confirmed after positive test to date	6447
cumulative_deceased³	`integer`	Cumulative sum of deaths from a positive COVID-19 case to date	133
cumulative_recovered³	`integer`	Cumulative sum of recoveries from a positive COVID-19 case to date	133
cumulative_tested²,³	`integer`	Cumulative sum of COVID-19 tests performed to date	133

¹Values can be negative, typically indicating a correction or an adjustment in the way they were measured. For example, a case might have been incorrectly flagged as recovered on one date, so it is subtracted on the following date.
²Some health authorities only report PCR testing. This variable usually refers to the number of tests rather than the number of tested persons, but some health authorities only report tested persons.
³Cumulative counts do not always equal the sum of daily counts, because many authorities change their criteria for counting cases without always adjusting previously reported data. Data may also be missing. As a result, the cumulative counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
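The drift described in footnote ³ can be checked directly. The following is a minimal sketch, using only the Python standard library, of how the running sum of `new_confirmed` might be compared with the reported `cumulative_confirmed` for a single key. The sample rows are made up for illustration and are not taken from epidemiology.csv, and the helper name `cumulative_drift` is hypothetical.

```python
import csv
import io

# Illustrative rows only; real data would be read from epidemiology.csv.
SAMPLE = """date,key,new_confirmed,cumulative_confirmed
2020-03-28,XX,10,10
2020-03-29,XX,5,15
2020-03-30,XX,-2,13
2020-03-31,XX,4,20
"""


def cumulative_drift(rows, daily_col, cumulative_col):
    """Return (date, reported - running_sum) for each row of one region.

    Rows must already be sorted by date and belong to a single key.
    Empty daily cells are skipped; an empty cumulative cell yields None.
    """
    running = 0
    drift = []
    for row in rows:
        if row[daily_col] != "":
            running += int(row[daily_col])  # may be negative (footnote 1)
        reported = row[cumulative_col]
        diff = int(reported) - running if reported != "" else None
        drift.append((row["date"], diff))
    return drift


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
drift = cumulative_drift(rows, "new_confirmed", "cumulative_confirmed")
```

A non-zero difference on a given date suggests that the reported cumulative value was revised independently of the daily counts, which is one reason to treat the two kinds of columns as separate measurements.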

Sources of data

Data	Source	License and Terms of Use
Country-level data	ECDC	Attribution required
Country-level data	Our World in Data	CC BY
Country-level data	WHO	Attribution required
Afghanistan	HDX	CC BY
Argentina	Datos Argentina	Public domain
Australia	COVID LIVE	CC BY
Austria	Open Data Österreich	CC BY
Bangladesh	http://covid19tracker.gov.bd	Public Domain
Belgium	Belgian institute for health	Attribution required
Brazil	Brazil Ministério da Saúde	Creative Commons Atribuição
Brazil (Rio de Janeiro)	http://www.data.rio/	Dados abertos
Brazil (Ceará)	https://saude.ce.gov.br	Dados abertos
Canada	Department of Health Canada	Attribution required
Canada	COVID-19 Canada Open Data Working Group	CC BY
Chile	Ministerio de Ciencia de Chile	Terms of use
China	DXY COVID-19 dataset	MIT
Colombia	Datos Abiertos Colombia	Attribution required
Czech Republic	Ministry of Health of the Czech Republic	Open Data
Democratic Republic of Congo	HDX	CC BY
Estonia	Health Board of Estonia	Open Data
Finland	Finnish institute for health and welfare	CC BY
France	data.gouv.fr	Open License 2.0
Germany	Robert Koch Institute	Attribution Required
Haiti	HDX	CC-BY
Hong Kong	Hong Kong Department of Health	Attribution Required
Israel	Israel Government Data Portal	Attribution Required
India	Wikipedia	Attribution Required
India	Covid 19 India Organisation	CC BY
Indonesia	https://covid19.go.id/peta-sebaran	Public Domain
Italy	Italy’s Department of Civil Protection	CC BY
Iraq	HDX	CC BY
Japan	https://github.com/swsoyee/2019-ncov-japan	MIT
Japan	https://github.com/kaz-ogiwara/covid19	MIT
Libya	HDX	CC BY
Luxembourg	data.public.lu	CC0
Malaysia	Wikipedia	Attribution Required
Mexico	Secretaría de Salud Mexico	Attribution Required
Netherlands	RIVM	Public Domain
New Zealand	Ministry of Health	CC-BY
Norway	COVID19 EU Data	MIT
Pakistan	Wikipedia	Attribution Required
Peru	Datos Abiertos Peru	ODC BY
Philippines	Philippines Department of Health	Attribution required
Poland	COVID19 EU Data	MIT
Portugal	COVID-19: Portugal	MIT
Romania	https://github.com/adrianp/covid19romania	CC0
Romania	https://datelazi.ro/	Terms of Service
Russia	https://стопкоронавирус.рф (via @jeetiss, https://github.com/jeetiss/covid19-russia)	CC BY
Slovenia	https://www.gov.si	Attribution Required
South Africa	FinMango COVID-19 Data	CC BY
South Korea	Wikipedia	Attribution Required
Spain	Ministry of Health	Attribution required
Spain (Canary Islands)	Gobierno de Canarias	Attribution required
Spain (Catalonia)	Dades Obertes Catalunya	CC0
Spain (Madrid)	Datos Abiertos Madrid	Attribution required
Sudan	HDX	CC BY
Sweden	Public Health Agency of Sweden	Fair Use
Switzerland	OpenZH data	CC BY
Taiwan	Ministry of Health and Welfare	Attribution Required
Thailand	Ministry of Public Health	Fair Use
Ukraine	National Security and Defense Council of Ukraine	CC BY
United Kingdom	https://github.com/tomwhite/covid-19-uk-data	The Unlicense
United Kingdom	https://coronavirus.data.gov.uk/	Attribution required, Open Government Licence v3.0
USA	NYT COVID Dataset	Attribution required, non-commercial use
USA	COVID Tracking Project	CC BY
USA (Alaska)	Alaska Department of Health and Social Services
USA (D.C.)	Government of the District of Columbia	Public Domain
USA (Delaware)	Delaware Health and Social Services	Public Domain
USA (Florida)	Florida Health	Public Domain
USA (Indiana)	Indiana Department of Health	CC BY
USA (Massachusetts)	MCAD COVID-19 Information & Resource Center	Public Domain
USA (New York)	New York City Health Department	Public Domain
USA (San Francisco)	SF Open Data	Public Domain Dedication and License
USA (Texas)	Texas Department of State Health Services	Attribution required
USA (Washington)	Washington State Department of Health	Public Domain
Venezuela	HDX	CC BY

2.3 Government emergency declarations

Dataset: lawatlas-emergency-declarations.csv

The emergency declarations dataset contains emergency declarations and mitigation policies for each US state starting on January 20, 2020. The data are aggregated by the Policy Surveillance Program at the Temple University Center for Public Health Law Research, and are published and maintained at LawAtlas.org. This dataset contains 8364 rows and 104 columns. It is a time series in which each row pairs a date with a location, along with boolean variables indicating whether a given declaration or policy is in effect.

Schema

Some major columns include:

Name	Type	Description	Example
date	`string`	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-30
key	`string`	Unique string identifying the region	US_CA
lawatlas_mitigation_policy	`integer` `[0-1]`	Has the state instituted legal action aimed at mitigating the spread of COVID-19?	0
lawatlas_state_emergency	`integer` `[0-1]`	Is there an emergency declaration in effect in the state?	0
lawatlas_emerg_statewide	`integer` `[0-1]`	Does the emergency declaration apply statewide?	0
lawatlas_travel_requirement	`integer` `[0-1]`	Is there a restriction on travelers?	0
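Because each policy column is a 0/1 flag recorded per date and key, a common first step is to find when a flag first switches on. The following is a minimal sketch under stated assumptions: the sample rows are invented for illustration, the flag is assumed to be stored as the literal text `0` or `1`, and the helper name `first_effective_date` is hypothetical.

```python
import csv
import io

# Illustrative rows only; real data would come from
# lawatlas-emergency-declarations.csv.
SAMPLE = """date,key,lawatlas_state_emergency
2020-03-01,US_CA,0
2020-03-04,US_CA,1
2020-03-05,US_CA,1
2020-03-01,US_NY,0
2020-03-07,US_NY,1
"""


def first_effective_date(rows, flag_col):
    """Map each key to the earliest date on which flag_col equals 1."""
    first = {}
    for row in rows:
        if row[flag_col] == "1":
            key, date = row["key"], row["date"]
            # ISO 8601 dates sort correctly as plain strings.
            if key not in first or date < first[key]:
                first[key] = date
    return first


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
first_emergency = first_effective_date(rows, "lawatlas_state_emergency")
```

Comparing dates as strings works here only because the `date` column uses the ISO 8601 `YYYY-MM-DD` format described in the schema.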

Sources of data

Data	Source	License and Terms of Use
Emergency declarations and mitigation policies	LawAtlas	CC BY

2.4 Health indicators by regions

Dataset: health.csv

The health dataset contains health-related indicators for each region. It contains 3503 rows and 14 columns. Each row corresponds to one region, identified by its key.

Schema

Name	Type	Description	Example
key	`string`	Unique string identifying the region	BN
life_expectancy	`double` `[years]`	Average years that an individual is expected to live	75.722
smoking_prevalence	`double` `[%]`	Percentage of smokers in population	16.9
diabetes_prevalence	`double` `[%]`	Percentage of persons with diabetes in population	13.3
infant_mortality_rate	`double`	Infant mortality rate (per 1,000 live births)	9.8
adult_male_mortality_rate	`double`	Mortality rate, adult, male (per 1,000 male adults)	143.719
adult_female_mortality_rate	`double`	Mortality rate, adult, female (per 1,000 female adults)	98.803
pollution_mortality_rate	`double`	Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)	13.3
comorbidity_mortality_rate	`double` `[%]`	Mortality from cardiovascular disease, cancer, diabetes or cardiorespiratory disease between exact ages 30 and 70	16.6
hospital_beds	`double`	Hospital beds (per 1,000 people)	2.7
nurses	`double`	Nurses and midwives (per 1,000 people)	5.8974
physicians	`double`	Physicians (per 1,000 people)	1.609
health_expenditure	`double` `[USD]`	Health expenditure per capita	671.4115
out_of_pocket_health_expenditure	`double` `[USD]`	Out-of-pocket health expenditure per capita	34.756348

Note that the majority of the health indicators are only available at the country level.

Sources of data

Data	Source	License and Terms of Use
Health	Eurostat	CC BY
Health	Wikidata	CC0
Health	WorldBank	CC BY

2.5 Movement of people

Google’s COVID-19 Community Mobility Reports are joined with the location keys of the COVID-19 Open Dataset and are provided in the following file:

Dataset: mobility.csv

This dataset contains various metrics related to the movement of people. It has 4196096 rows and 8 columns. The dataset is a time series in which each row corresponds to a date and its associated location. The columns are described in detail below.

Google COVID-19 Community Mobility Reports Terms of use

In order to download or use the data or reports, you must agree to the Google Terms of Service.

Schema

Name	Type	Description	Example
date	`string`	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-30
key	`string`	Unique string identifying the region	US_CA
mobility_grocery_and_pharmacy	`double` `[%]`	Percentage change in visits to places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies compared to baseline	-15
mobility_parks	`double` `[%]`	Percentage change in visits to places like local parks, national parks, public beaches, marinas, dog parks, plazas, and public gardens compared to baseline	-15
mobility_transit_stations	`double` `[%]`	Percentage change in visits to places like public transport hubs such as subway, bus, and train stations compared to baseline	-15
mobility_retail_and_recreation	`double` `[%]`	Percentage change in visits to restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters compared to baseline	-15
mobility_residential	`double` `[%]`	Percentage change in visits to places of residence compared to baseline	-15
mobility_workplaces	`double` `[%]`	Percentage change in visits to places of work compared to baseline	-15

Changes for each day are compared to a baseline value for that day of the week:

The baseline is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020.
The datasets show trends over several months with the most recent data representing approximately 2-3 days ago—this is how long it takes to produce the datasets.
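The baseline rule above can be sketched in a few lines. This is only an illustration of the stated definition (median per day of the week over Jan 3 to Feb 6, 2020), not Google's actual pipeline: the raw visit counts are not part of mobility.csv, so the `visits` values below are hypothetical, as are the function names.

```python
from datetime import date, timedelta
from statistics import median

BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baselines(visits):
    """Median visit count per weekday (0 = Monday) over the baseline window."""
    by_weekday = {}
    for day, count in visits.items():
        if BASELINE_START <= day <= BASELINE_END:
            by_weekday.setdefault(day.weekday(), []).append(count)
    return {wd: median(counts) for wd, counts in by_weekday.items()}


def percent_change(day, count, baselines):
    """Percentage change of count relative to that weekday's baseline."""
    base = baselines[day.weekday()]
    return 100.0 * (count - base) / base


# Hypothetical raw counts: 100 visits on each of the 35 baseline days.
visits = {BASELINE_START + timedelta(days=i): 100 for i in range(35)}
baselines = weekday_baselines(visits)
change = percent_change(date(2020, 3, 30), 85, baselines)
```

With these made-up counts, 85 visits on a Monday against a Monday baseline of 100 gives a change of -15, which matches the form of the example values in the schema.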

Sources of data

Data	Source	License and Terms of Use
Google Mobility data	https://www.google.com/covid19/mobility/	Google Terms of Service

2.6 Population vaccination records

Dataset: vaccinations.csv

This dataset contains information related to the deployment and administration of COVID-19 vaccines. It has a total of 1377366 rows and 32 columns. The dataset is a time series in which each row corresponds to a date and its associated location. The columns are described in detail below.

Schema

Name	Type	Description	Example
date	`string`	ISO 8601 date (YYYY-MM-DD) of the datapoint	2021-02-07
key	`string`	Unique string identifying the region	ID
new_persons_vaccinated*	`integer`	Count of new persons which have received one or more doses	7222
cumulative_persons_vaccinated**	`integer`	Cumulative sum of persons which have received one or more doses	784318
new_persons_fully_vaccinated*	`integer`	Count of new persons which have received all doses required for maximum immunity	1924
cumulative_persons_fully_vaccinated**	`integer`	Cumulative sum of persons which have received all doses required for maximum immunity	139131
new_vaccine_doses_administered*	`integer`	Count of new vaccine doses administered to persons	9146
cumulative_vaccine_doses_administered**	`integer`	Cumulative sum of vaccine doses administered to persons	923449
`${statistic}`_`${vaccine}`	`integer`	Statistic value corresponding to a specific vaccine such as `new_persons_vaccinated_moderna`	1035

*Values can be negative, typically indicating a correction or an adjustment in the way they were measured.

**Cumulative counts do not always equal the sum of daily counts, because many authorities change their criteria for counting cases without always adjusting previously reported data. Data may also be missing. As a result, the cumulative counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
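The `${statistic}_${vaccine}` naming pattern in the schema means that vaccine-specific columns can be recognized from their names alone. The following is a minimal sketch, assuming the six statistic names listed in the schema are the only prefixes; the helper name `split_vaccine_column` is hypothetical, and the only vaccine suffix used is the `moderna` example given in the table.

```python
# The six aggregate statistics listed in the schema above.
STATISTICS = (
    "new_persons_vaccinated",
    "cumulative_persons_vaccinated",
    "new_persons_fully_vaccinated",
    "cumulative_persons_fully_vaccinated",
    "new_vaccine_doses_administered",
    "cumulative_vaccine_doses_administered",
)


def split_vaccine_column(column):
    """Split a column name into (statistic, vaccine).

    vaccine is None for the aggregate columns. Returns None when the
    name does not start with a known statistic (for example date or key).
    """
    # Try longer prefixes first so a shorter name never shadows a longer one.
    for stat in sorted(STATISTICS, key=len, reverse=True):
        if column == stat:
            return stat, None
        if column.startswith(stat + "_"):
            return stat, column[len(stat) + 1:]
    return None


parsed = split_vaccine_column("new_persons_vaccinated_moderna")
```

Splitting the names this way makes it possible to reshape the per-vaccine columns into a long table with one row per date, key, statistic, and vaccine, should that layout be more convenient for analysis.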

Sources of data

Data	Source	License and Terms of Use	Notes
Country-level data	Our World in Data	CC BY
Argentina	Datos Argentina	Public domain
Australia	COVID LIVE	CC BY	Country level data is not the sum of the states/territories as there is a portion of vaccinations managed by the Federal government that is delivered directly to aged and disability care and not counted as part of the states/territories. As of 2021-03-14, only doses administered are reported for country-level data but NSW, VIC and WA continue to report the count of persons fully and partially vaccinated.
Austria	Open Data Österreich	CC BY
Belgium	Covid Vaccinations Belgium	CC BY	Regional data only available for Brussels, since the regions reported by the data source do not match our indexed subregions
Bolivia	Ministry of Health (via FinMango)	CC BY
Brazil	coronavirusbra1.github.io (via @wcota/covid19br)	CC BY
Brazil	Brazil Ministério da Saúde	Creative Commons Atribuição
Bulgaria	Ministry of Health (via FinMango)	CC BY
Canada	Department of Health Canada	Attribution required
Colombia	Ministry of Health (via FinMango)	CC BY
Czech Republic	Ministry of Health of the Czech Republic	Open Data
France	data.gouv.fr	Open License 2.0
Germany	Robert Koch Institute (via FinMango)	Attribution Required
India	COVID19-India	CC BY
Israel	Israel Government Data Portal	Attribution Required	Admin level 2 regions are provided by the source and are aggregated to admin level 1. The total vaccination dose numbers provided by the source for admin level 2 do not match the country-wide total. This also impacts the aggregated level 1 totals.
Italy	Commissario straordinario per l’emergenza Covid-19	CC BY
Spain	Ministry of Health	Attribution required
Slovenia	Ministry of Health (via FinMango)	CC BY
Slovakia	https://korona.gov.sk, operated by the Ministry of Investments, Regional Development and Informatization of the Slovak Republic	Attribution required
Sweden	Public Health Agency of Sweden	Fair Use
Switzerland	Federal Office of Public Health	Fair Use
United Kingdom (nations)	NHS	OGL
United Kingdom (England)	NHS (via FinMango)	OGL
United States	CDC	Public Domain

2.7 Index reference table

Dataset: index.csv

This data table contains keys, codes, and names for each region and country. It also contains the aggregation level (0: country, 1: state or province, 2: county). With 22958 rows and 15 columns, this table is especially helpful for selecting and filtering the previous datasets during data cleaning.

Schema

Name	Type	Description	Example
key	`string`	Unique string identifying the region	US_CA_06001
place_id	`string`	A textual identifier that uniquely identifies a place in the Google Places database and on Google Maps	ChIJd_Y0eVIvkIARuQyDN0F1LBA
wikidata	`string`	Wikidata ID corresponding to this key	Q107146
datacommons	`string`	DataCommons ID corresponding to this key	geoId/06001
country_code	`string`	ISO 3166-1 alpha-2 (2-letter) code of the country	US
country_name	`string`	American English name of the country, subject to change	United States of America
subregion1_code	`string`	(Optional) ISO 3166-2 or NUTS 2/3 code of the subregion	CA
subregion1_name	`string`	(Optional) American English name of the subregion, subject to change	California
subregion2_code	`string`	(Optional) FIPS code of the county (or local equivalent)	06001
subregion2_name	`string`	(Optional) American English name of the county (or local equivalent), subject to change	Alameda County
3166-1-alpha-2	`string`	ISO 3166-1 alpha-2 (2-letter) code of the country	US
3166-1-alpha-3	`string`	ISO 3166-1 alpha-3 (3-letter) code of the country	USA
aggregation_level	`integer` `[0-2]`	Level at which data is aggregated, i.e. country, state/province or county level	2
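To illustrate how the index table can be used to select rows from the other datasets, the following is a minimal sketch that picks the state-level keys of one country and filters a data table by them. It uses only the Python standard library; the sample rows are made up (with only a few of the index columns), and the helper name `keys_at_level` is hypothetical.

```python
import csv
import io

# Illustrative rows only; real data would come from index.csv and,
# for example, epidemiology.csv.
INDEX_SAMPLE = """key,country_code,subregion1_code,aggregation_level
US,US,,0
US_CA,US,CA,1
US_CA_06001,US,CA,2
"""
DATA_SAMPLE = """date,key,new_confirmed
2020-03-30,US,100
2020-03-30,US_CA,40
2020-03-30,US_CA_06001,3
"""


def keys_at_level(index_rows, country_code, level):
    """Return the set of keys of one country at a given aggregation level."""
    return {
        row["key"]
        for row in index_rows
        if row["country_code"] == country_code
        and int(row["aggregation_level"]) == level
    }


index_rows = list(csv.DictReader(io.StringIO(INDEX_SAMPLE)))
data_rows = list(csv.DictReader(io.StringIO(DATA_SAMPLE)))

# Keep only state-level (aggregation_level 1) rows for the USA.
state_keys = keys_at_level(index_rows, "US", 1)
state_rows = [row for row in data_rows if row["key"] in state_keys]
```

Filtering through the index rather than parsing the key string keeps the selection independent of how many underscore-separated parts a key happens to contain.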