The Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) is collecting data on COVID-19 cases. They use this data to run their well known COVID-19 Dashboard.
They are publicly sharing the data they are compiling. This data is posted to GitHub so that anyone can download it and work with it.
I downloaded this data using git, a popular version control program that programmers use to synchronize files. If you know some git, you can do the same. Use a terminal to go to a directory where you'd like to store the data, and run the command:
git clone https://github.com/CSSEGISandData/COVID-19
This will create a new directory titled COVID-19 that contains all the files. If you want to update your directory, you can enter that directory and execute the command:
git pull
You can accomplish the same thing without using git. Go to the CSSEGISandData/COVID-19 GitHub website. You'll see a green Clone or download button. Click it and then select "Download ZIP". You will then be prompted to download a ZIP file of all the data. You can unzip that file to get the COVID-19 data directory. Later, when you want to update the data, you can download the ZIP file again and replace your directory.
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import math as m
from mpmath import mp, iv
# For working with files:
import os
# For working with CSV files:
import csv
We first define a string jhu_data_directory that points to the data we downloaded. In my case, to get to the data from the directory containing the notebook, I have to go up one level and then into the COVID-19 directory.
jhu_data_directory = "../COVID-19"
Below we run some basic tests on the data. You'll get errors if the data is not present at this location.
# Tests for the data directory
assert os.path.isdir(jhu_data_directory), \
"jhu_data_directory is not a directory"
assert os.path.isdir(jhu_data_directory+"/csse_covid_19_data"), \
'jhu_data_directory+"/csse_covid_19_data" is not a directory'
time_series_directory = jhu_data_directory+"/csse_covid_19_data/csse_covid_19_time_series"
assert os.path.isdir(time_series_directory), \
'jhu_data_directory+"/csse_covid_19_data/csse_covid_19_time_series" is not a directory'
We will search the following files for data:
confirmed_data_file = time_series_directory + "/time_series_19-covid-Confirmed.csv"
deaths_data_file = time_series_directory + "/time_series_19-covid-Deaths.csv"
recovered_data_file = time_series_directory + "/time_series_19-covid-Recovered.csv"
data_files = [confirmed_data_file, deaths_data_file, recovered_data_file]
for file in data_files:
    print(file)
    assert os.path.isfile(file), file + " is not a file."
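The paths above are built by joining strings with "/". As a possible alternative, os.path.join assembles the pieces using the separator of the operating system the notebook runs on. This is only a sketch: demo_root and demo_series_dir are hypothetical names, and the directory names are the ones used above.

```python
import os

# Same relative location as jhu_data_directory above.
demo_root = "../COVID-19"

# os.path.join inserts the path separator appropriate for the current OS.
demo_series_dir = os.path.join(
    demo_root, "csse_covid_19_data", "csse_covid_19_time_series"
)
print(demo_series_dir)
```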
The data files listed above are comma-separated values (CSV) files. You can open them with a text editor to see the data. Alternatively, you can open them with a spreadsheet program.
Python has a module for working with CSV files, the csv module, which is documented in the Python Standard Library reference.
We can load the library with:
import csv
We can open a file for reading with the following command:
# Open a file for reading:
csvfile = open(confirmed_data_file)
# Create a CSV reader to process the open file
reader = csv.reader(csvfile)
Calling next(reader) returns a single list containing the first row of the data. In this case, the first row is header data describing the columns of the rows that follow. See below.
row = next(reader)
for entry in row:
    print(entry, end="; ")
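To see how csv.reader and next behave without touching the real files, here is a small self-contained sketch that reads from an in-memory string. The sample rows are made up for illustration and are not taken from the JHU data. Note that a quoted field containing a comma comes back as a single entry.

```python
import csv
import io

# Made-up sample laid out like the JHU files (not real data).
sample_text = (
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20\n"
    ',"Korea, South",36,128,1,1\n'
)

sample_reader = csv.reader(io.StringIO(sample_text))
sample_header = next(sample_reader)   # first call returns the header row
sample_row = next(sample_reader)      # second call returns the first data row

print(sample_header)
print(sample_row)
```

Every entry is returned as a string, including the numeric ones, and an empty field becomes an empty string.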
You can see from the above that each subsequent row will consist of row[0], a State (or a Province); row[1], a Country or a Region; row[2], a latitude; row[3], a longitude; and data corresponding to each day starting on January 22, 2020 and ending at some point. Since this is the confirmed_data_file, this lists the total number of confirmed cases by day. We'll store the list of days in a variable:
day_list = row[4:]
for day in day_list:
    print(day, end="; ")
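If you would rather treat the day labels as dates than as strings, they can be parsed with the standard datetime module. This is only a sketch: it assumes the labels look like 1/22/20 (month/day/two-digit year), and sample_days is a made-up stand-in for day_list.

```python
from datetime import datetime

# Made-up labels in the assumed month/day/two-digit-year format.
sample_days = ["1/22/20", "1/23/20", "3/10/20"]

# strptime parses each label; .date() drops the (unused) time of day.
parsed_days = [datetime.strptime(d, "%m/%d/%y").date() for d in sample_days]

print(parsed_days[0].isoformat())
print((parsed_days[2] - parsed_days[0]).days)
```

Subtracting two dates gives the number of days between them, which can be handy for labeling plot axes.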
Now let's look at the next row:
row = next(reader)
for entry in row:
    print(entry, end="; ")
int(row[-1]) - int(row[-2])
We see that the second row in the file corresponds to Thailand. The entries 15 and 101 are the latitude and longitude for Thailand, and the other entries are the number of confirmed cases for each day as above. So the number of confirmed cases on the first day is given by:
row[4]
Observe that this is a string, so we'd be better off converting it to an integer with int(row[4]). To get all the confirmed counts from the row as integers, we can do the following:
thailand_confirmed_cases = []
for count in row[4:]:
    thailand_confirmed_cases.append(int(count))
plt.plot(thailand_confirmed_cases)
plt.show()
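The same conversion can also be written as a list comprehension. The sketch below uses a made-up list of strings (sample_counts is a hypothetical stand-in for row[4:]). Keep in mind that int raises a ValueError on an empty string, so a blank cell would need separate handling.

```python
# Made-up strings, as csv.reader would return them.
sample_counts = ["2", "3", "5", "7"]

# Convert every entry to an integer in one expression.
as_ints = [int(count) for count in sample_counts]
print(as_ints)
```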
We can continue reading rows in this way. When there is no further row, next(reader) will raise a StopIteration exception. Let us close the file:
csvfile.close()
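An alternative to calling close by hand is to open the file in a with statement, which closes it automatically when the block ends. Looping over the reader with for also stops cleanly when the rows run out, so there is no StopIteration to catch. The sketch below writes a tiny made-up CSV file to a temporary directory so that it runs on its own. The file name and contents are illustrative only.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny made-up CSV file so the example is self-contained.
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", newline="") as f:
        f.write("State,Country,Lat,Long,1/22/20\n,Thailand,15,101,2\n")

    # The file is closed automatically when the with block ends,
    # even if an exception is raised inside it.
    sample_rows = []
    with open(sample_path, newline="") as sample_file:
        for sample_row in csv.reader(sample_file):
            sample_rows.append(sample_row)

print(sample_rows)
print(sample_file.closed)
```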
I want to give a printout of each country and the corresponding states. Here I form a list of pairs (state, country):
state_country_pairs = []
# Open a file for reading:
csvfile = open(confirmed_data_file)
# Create a CSV reader to process the open file
reader = csv.reader(csvfile)
# Skip the header line:
next(reader)
try:
    while True:
        row = next(reader)
        state_country_pairs.append((row[0], row[1]))
except StopIteration:
    pass
csvfile.close()
Below, I form a dictionary country_dict mapping countries to a list of states. (We haven't discussed dictionaries yet, but you can learn about them in the Python 3 Tutorial.) I also form a sorted list of countries.
country_dict = {}
for state, country in state_country_pairs:
    if country in country_dict:
        country_dict[country].append(state)
    else:
        country_dict[country] = [state]
countries = []
for country in country_dict:
    countries.append(country)
    country_dict[country].sort()
countries.sort()
for country in countries:
    print(country, end=": ")
    for state in country_dict[country]:
        print(state, end="; ")
    print()
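The grouping step above can also be sketched with collections.defaultdict, which creates the empty list for a new country automatically and so removes the need for the if/else. The pairs below are made up for illustration, and the names sample_pairs and states_by_country are hypothetical.

```python
from collections import defaultdict

# Made-up (state, country) pairs in the same shape as state_country_pairs.
sample_pairs = [("Hubei", "China"), ("", "Italy"), ("Beijing", "China")]

# Missing keys start out as an empty list.
states_by_country = defaultdict(list)
for state_name, country_name in sample_pairs:
    states_by_country[country_name].append(state_name)

# sorted() on a dictionary returns its keys in sorted order.
for country_name in sorted(states_by_country):
    print(country_name, sorted(states_by_country[country_name]))
```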
It seems that every row has a country (or region), while some don't have a state (or province).
def confirmed_cases(state, country):
    # Open a file for reading:
    csvfile = open(confirmed_data_file)
    # Create a CSV reader to process the open file
    reader = csv.reader(csvfile)
    try:
        while True:
            row = next(reader)
            if row[1]==country and row[0]==state:
                # Just return data for each day.
                cases = []
                for num in row[4:]:
                    cases.append(int(num))
                csvfile.close()
                return np.array(cases)
    except StopIteration:
        csvfile.close()
        raise ValueError("Didn't find row with state='{}' and country='{}'" \
                         .format(state, country))
confirmed = confirmed_cases("", "Korea, South")
print(confirmed)
plt.plot(confirmed)
plt.title("Confirmed cases: South Korea")
plt.show()
confirmed = confirmed_cases("New York", "US")
print(confirmed)
plt.plot(confirmed)
plt.title("Confirmed cases: New York")
plt.show()
confirmed = confirmed_cases("", "Italy")
print(confirmed)
plt.plot(confirmed)
plt.title("Confirmed cases: Italy")
plt.show()
We can create similar functions for handling deaths and recoveries. This data is stored in the files whose locations we saved in deaths_data_file and recovered_data_file.
def deaths_cases(state, country):
    # Open a file for reading:
    csvfile = open(deaths_data_file)
    # Create a CSV reader to process the open file
    reader = csv.reader(csvfile)
    try:
        while True:
            row = next(reader)
            if row[1]==country and row[0]==state:
                # Just return data for each day.
                cases = []
                for num in row[4:]:
                    cases.append(int(num))
                csvfile.close()
                return np.array(cases)
    except StopIteration:
        csvfile.close()
        raise ValueError("Didn't find row with state='{}' and country='{}'" \
                         .format(state, country))
    return None
def recovered_cases(state, country):
    # Open a file for reading:
    csvfile = open(recovered_data_file)
    # Create a CSV reader to process the open file
    reader = csv.reader(csvfile)
    try:
        while True:
            row = next(reader)
            if row[1]==country and row[0]==state:
                # Just return data for each day.
                cases = []
                for num in row[4:]:
                    cases.append(int(num))
                csvfile.close()
                return np.array(cases)
    except StopIteration:
        csvfile.close()
        raise ValueError("Didn't find row with state='{}' and country='{}'" \
                         .format(state, country))
    return None
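The three functions differ only in which file they open, so one way to avoid the repetition is a single helper that takes the file as a parameter. The sketch below is an illustration rather than part of the original notebook: cases_from_file is a hypothetical name, and the check at the end uses a made-up file instead of the JHU data.

```python
import csv
import os
import tempfile

import numpy as np


def cases_from_file(data_file, state, country):
    """Return the daily counts for (state, country) in data_file as a NumPy array."""
    with open(data_file, newline="") as csvfile:
        for row in csv.reader(csvfile):
            if row[0] == state and row[1] == country:
                # Columns 4 onward hold one count per day.
                return np.array([int(num) for num in row[4:]])
    raise ValueError("Didn't find row with state='{}' and country='{}'"
                     .format(state, country))


# Small check against a made-up file (not the JHU data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("State,Country,Lat,Long,d1,d2,d3\n")
        f.write(",Italy,43,12,0,2,5\n")
    demo_counts = cases_from_file(demo_path, "", "Italy")

print(demo_counts)
```

With such a helper, a call like cases_from_file(deaths_data_file, state, country) would play the role of deaths_cases(state, country).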
state = ""
country = "Korea, South"
confirmed = confirmed_cases(state, country)
deaths = deaths_cases(state, country)
recovered = recovered_cases(state, country)
plt.plot(confirmed,"b")
plt.plot(deaths,"r")
plt.plot(recovered,"g")
plt.title("South Korea")
plt.show()
active = confirmed - deaths - recovered
plt.plot(active,"b")
plt.title("South Korea: Active Cases")
plt.show()
state = "Hubei"
country = "China"
confirmed = confirmed_cases(state, country)
deaths = deaths_cases(state, country)
recovered = recovered_cases(state, country)
plt.plot(confirmed,"b")
plt.plot(deaths,"r")
plt.plot(recovered,"g")
plt.title("Hubei, China")
plt.show()
active = confirmed - deaths - recovered
plt.plot(active,"b")
plt.title("Hubei, China: Active Cases")
plt.show()
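The active-case computation relies on NumPy subtracting arrays entry by entry: for each day, active = confirmed - deaths - recovered. Here is a small sketch with made-up numbers (the demo_ arrays are hypothetical and are not JHU data). The three arrays need to have the same length for this to work day by day.

```python
import numpy as np

# Made-up cumulative counts for three consecutive days.
demo_confirmed = np.array([10, 20, 35])
demo_deaths = np.array([0, 1, 2])
demo_recovered = np.array([0, 4, 10])

# Subtraction is applied to each day's entries separately.
demo_active = demo_confirmed - demo_deaths - demo_recovered
print(demo_active)
```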
We'd like to compare the number of new COVID-19 cases by day in New York and Maryland.
# Get the number of confirmed cases in NY and MD.
state = "New York"
country = "US"
confirmed_NY = confirmed_cases(state, country)
state = "Maryland"
confirmed_MD = confirmed_cases(state, country)
confirmed_MD
# We set start to the index of the last zero entry.
# (The number of confirmed cases is nondecreasing over time.)
for i in range(len(confirmed_NY)):
    if confirmed_NY[i] > 0:
        break
start = i - 1
start
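The loop above can also be expressed with NumPy. In the sketch below, np.argmax applied to a boolean array returns the index of its first True entry. The array demo_cases is made up for illustration. Two caveats apply: if no entry is positive, np.argmax returns 0, and if the very first entry is already positive, the computed start is -1, so this sketch assumes the data begins with at least one zero followed by a positive count.

```python
import numpy as np

# Made-up cumulative counts that begin with zeros.
demo_cases = np.array([0, 0, 0, 1, 3, 8])

# Index of the first entry greater than zero.
first_positive = int(np.argmax(demo_cases > 0))

# Index of the last zero entry before it.
demo_start = first_positive - 1
print(first_positive, demo_start)
```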
# Here we get the days indexed beginning with `start`.
day_list[start:]
# Here we throw out the initial zeros in the NY data.
confirmed_NY = confirmed_NY[start:]
# We throw out the same days for MD.
confirmed_MD = confirmed_MD[start:]
# Here is the number of new cases each day in NY
delta_NY = confirmed_NY[1:] - confirmed_NY[:-1]
# Here is the number of new cases each day in MD
delta_MD = confirmed_MD[1:] - confirmed_MD[:-1]
delta_MD
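Subtracting the shifted slices gives the day-over-day change in the cumulative counts. NumPy's np.diff computes the same thing, as this small sketch with made-up numbers shows (demo_cumulative is a hypothetical array, not JHU data).

```python
import numpy as np

# Made-up cumulative counts.
demo_cumulative = np.array([0, 1, 3, 8, 8, 12])

# Each entry minus the one before it, written two ways.
by_slicing = demo_cumulative[1:] - demo_cumulative[:-1]
by_diff = np.diff(demo_cumulative)

print(by_slicing)
print(by_diff)
```

Either way, the result has one entry fewer than the input, since the first day has no previous day to compare against.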
Now we can plot the data.
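# The x-values range(10, 18) assume that delta_NY and delta_MD each hold
# 8 entries (March 10 through March 17); adjust the range for newer data.
plot_NY, = plt.plot(range(10,18), delta_NY,"ob", label="NY")
plot_MD, = plt.plot(range(10,18), delta_MD,"or", label = "MD")
plt.title("New cases per day, March 2020")
# Create a legend for both series.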
legend = plt.legend(handles=[plot_NY,plot_MD], loc='upper left')
plt.xlabel("Day in March")
plt.ylabel("Number of new cases")
plt.show()