This blog post explains how to do data analysis in Python using SQL. The focus is on connecting to (or creating) an external SQL database and constructing Pandas data frames of the relevant information for cleaning, transforming, and otherwise manipulating the external database. Ultimately we visualize the different employment categories for NYC and the USA during COVID-19 as curves using matplotlib. A final report analyzing these graphs visually was generated with LaTeX as a Research Colloquium and is viewable in this blog post.
I have chosen to use SQLite 3 since it is free; it needs to be installed if your system does not already have it (the sqlite3 Python module itself ships with the standard library). The libraries required for this analysis are sqlite3 for accessing an SQL database with SQLite 3, datetime for converting date numbers into calendar dates, pandas for the DataFrame and Series arrays that hold the relevant data, numpy for numerical analysis, and matplotlib for generating graphs via its pyplot module. We will generate the graphs in the PGF vector format for the best LaTeX rendering, so matplotlib's rc parameters for PGF rendering need to be set for LaTeX.
import sqlite3
import datetime
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
import matplotlib as mpl
mpl.use("pgf")
mpl.rcParams.update({"pgf.texsystem": "pdflatex",
                     "font.family": "serif",
                     "text.usetex": True,
                     "pgf.rcfonts": False})
import matplotlib.pyplot as plt
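(Optional) To confirm which SQLite engine your Python installation is linked against, the sqlite3 module exposes the version of the underlying C library; a quick check, not part of the pipeline:
print(sqlite3.sqlite_version)  # version of the SQLite C library bundled with Python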
Next we need to connect to (or create) the external SQL database, here named 'Analysis.db' and located in the same directory as this Python script, by creating a sqlite3 Connection object with the connect function and then obtaining a Cursor from it with the cursor method.
conn = sqlite3.connect('Analysis.db')
c = conn.cursor()
We will create new tables in the external SQL database to store the relevant information. We use the execute method on the cursor to run SQL commands. Since the actual data may change each time you run this analysis code, it is good practice to drop each table first, in case it was already populated with old data. Also, when error-testing your code, it is helpful to start fresh each time to avoid duplicates in the external SQL database. One table, EMPL_CITY, is created for the employment data from NYC, and a second, EMPL_NATIONAL, is created for the national employment data. These tables will be populated from the raw .csv data and then combined into the EMPL_COMP employment-comparison table. At the end of a series of SQL commands we need to commit the changes to the external database.
c.execute('''DROP TABLE IF EXISTS EMPL_CITY''')
c.execute('''CREATE TABLE EMPL_CITY
([generated_id] INTEGER PRIMARY KEY,[year] integer, [month] integer, [day] integer,
[cityid] integer, [emp_combined] integer, [emp_combined_inclow] integer,
[emp_combined_incmiddle] integer, [emp_combined_inchigh] integer,
[emp_combined_ss40] integer, [emp_combined_ss60] integer,
[emp_combined_ss65] integer, [emp_combined_ss70] integer)''')
c.execute('''DROP TABLE IF EXISTS EMPL_NATIONAL''')
c.execute('''CREATE TABLE EMPL_NATIONAL
([generated_id] INTEGER PRIMARY KEY,[year] integer, [month] integer, [day] integer,
[emp_combined] integer, [emp_combined_inclow] integer, [emp_combined_incmiddle] integer,
[emp_combined_inchigh] integer, [emp_combined_ss40] integer, [emp_combined_ss60] integer,
[emp_combined_ss65] integer, [emp_combined_ss70] integer)''')
c.execute('''DROP TABLE IF EXISTS EMPL_COMP''')
c.execute('''CREATE TABLE EMPL_COMP
([year] integer, [month] integer, [day] integer,
[city_all] integer, [city_low] integer, [city_mid] integer, [city_high] integer,
[national_all] integer, [national_low] integer, [national_mid] integer, [national_high] integer,
[city_ss40] integer, [city_ss60] integer, [city_ss65] integer, [city_ss70] integer,
[national_ss40] integer, [national_ss60] integer, [national_ss65] integer, [national_ss70] integer)''')
conn.commit()
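As a quick optional check (my addition, not part of the original pipeline), you can confirm the three tables now exist by querying SQLite's built-in sqlite_master catalog:
for row in c.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(row[0])  # should print EMPL_CITY, EMPL_NATIONAL, EMPL_COMP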
Using Python we will read the raw data from the .csv files into the created tables of the SQL database; then we will extract the relevant information and combine it.
# Load the raw CSVs and append their rows into the tables created above.
read_empl_city = pd.read_csv('data/Employment Combined - City - Daily.csv')
read_empl_city.to_sql('EMPL_CITY', conn, if_exists='append', index=False)
read_empl_national = pd.read_csv('data/Employment Combined - National - Daily.csv')
# 'append' keeps the schema defined above; 'replace' would drop and re-create the table.
read_empl_national.to_sql('EMPL_NATIONAL', conn, if_exists='append', index=False)
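To sanity-check that the rows actually landed in the database, pandas can read a count straight back from the connection (an optional check, not part of the pipeline):
print(pd.read_sql_query('SELECT COUNT(*) AS n FROM EMPL_CITY', conn))
print(pd.read_sql_query('SELECT COUNT(*) AS n FROM EMPL_NATIONAL', conn))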
To perform this extraction and combination, we use SQL to SELECT the relevant information from EMPL_CITY and EMPL_NATIONAL, the raw data tables, and INSERT INTO EMPL_COMP, the combined and reduced table for manipulation. We use the WHERE clause to ensure the years, months, and days match up and that only data from NYC (cityid=2) is selected. These changes to the external database must again be committed. We then SELECT the data from this newly populated table and use the fetchall method to dump it into a new Pandas DataFrame for easier use in memory within the Python script.
c.execute('''INSERT INTO EMPL_COMP (year, month, day, city_all, city_low, city_mid, city_high,
national_all, national_low, national_mid, national_high, city_ss40, city_ss60, city_ss65, city_ss70,
national_ss40, national_ss60, national_ss65, national_ss70)
SELECT EMPL_CITY.year, EMPL_CITY.month, EMPL_CITY.day, EMPL_CITY.emp_combined, EMPL_CITY.emp_combined_inclow,
EMPL_CITY.emp_combined_incmiddle, EMPL_CITY.emp_combined_inchigh, EMPL_NATIONAL.emp_combined,
EMPL_NATIONAL.emp_combined_inclow, EMPL_NATIONAL.emp_combined_incmiddle, EMPL_NATIONAL.emp_combined_inchigh,
EMPL_CITY.emp_combined_ss40, EMPL_CITY.emp_combined_ss60, EMPL_CITY.emp_combined_ss65, EMPL_CITY.emp_combined_ss70,
EMPL_NATIONAL.emp_combined_ss40, EMPL_NATIONAL.emp_combined_ss60, EMPL_NATIONAL.emp_combined_ss65, EMPL_NATIONAL.emp_combined_ss70
FROM EMPL_CITY, EMPL_NATIONAL
WHERE EMPL_CITY.year = EMPL_NATIONAL.year AND EMPL_CITY.month = EMPL_NATIONAL.month AND EMPL_CITY.day = EMPL_NATIONAL.day
AND EMPL_CITY.cityid = 2''')
conn.commit()
c.execute('''
SELECT DISTINCT *
FROM EMPL_COMP''')
df_empl_comp = DataFrame(c.fetchall(), columns=['year', 'month', 'day',
    'city_all', 'city_low', 'city_mid', 'city_high',
    'national_all', 'national_low', 'national_mid', 'national_high',
    'city_ss40', 'city_ss60', 'city_ss65', 'city_ss70',
    'national_ss40', 'national_ss60', 'national_ss65', 'national_ss70'])
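Equivalently, pandas can build this frame directly from the query and take the column names from the table itself, avoiding the re-typed column list (an alternative shortcut, shown for comparison):
df_empl_comp = pd.read_sql_query('SELECT DISTINCT * FROM EMPL_COMP', conn)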
We then modify this DataFrame by combining the year, month, and day columns into a single calendar column, building a Series of datetime objects that will eventually be set as the index of the frame. The old working SQL table for combined data (EMPL_COMP) is then dropped and re-created to hold this reorganized, condensed data, which is also exported as a .csv for external use.
callist = []
for i in range(len(df_empl_comp)):
    callist.append(datetime.datetime(df_empl_comp['year'][i], df_empl_comp['month'][i], df_empl_comp['day'][i]))
cal=Series(callist)
df_empl_comp.insert(0,'calendar',cal)
del df_empl_comp['year']
del df_empl_comp['month']
del df_empl_comp['day']
c.execute('''DROP TABLE EMPL_COMP''')
c.execute('''CREATE TABLE EMPL_COMP
([calendar] date, [city_all] integer,
[city_low] integer, [city_mid] integer, [city_high] integer,
[national_all] integer, [national_low] integer, [national_mid] integer, [national_high] integer,
[city_ss40] integer, [city_ss60] integer, [city_ss65] integer, [city_ss70] integer,
[national_ss40] integer, [national_ss60] integer, [national_ss65] integer, [national_ss70] integer)''')
conn.commit()
df_empl_comp.to_sql('EMPL_COMP', conn, if_exists='append', index=False)
df_empl_comp.to_csv('data/employment_comparison.csv', index=False, header=True)  # writes the file and returns None
df_empl_comp=df_empl_comp.set_index('calendar')
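As an aside, the row-by-row datetime loop above could be replaced by pandas' vectorized pd.to_datetime, which accepts a frame of year/month/day columns directly (an equivalent alternative, valid before those columns are deleted):
cal = pd.to_datetime(df_empl_comp[['year', 'month', 'day']])  # one vectorized call instead of the callist loop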
Finally we are ready to plot the data and export to PGF. First, we split the single DataFrame into the two parts of the analysis using iloc (integer-location indexing) and then rename the columns for better legibility in the graph legends. Since both of the resulting data frames have the calendar variable as their index, they can be plotted as-is using matplotlib's pyplot: set up a figure instance, call the plot method on each data frame, adjust the labels, and then save the PGF. Using this figure method, we generate and save one graph at a time.
df_empl_comp_ineq=df_empl_comp.iloc[:,0:8]
df_empl_comp_ineq = df_empl_comp_ineq.rename(columns={
    'city_all': 'NYC: all', 'city_low': 'NYC: low(Q1)',
    'city_mid': 'NYC: mid(Q2+Q3)', 'city_high': 'NYC: high(Q4)',
    'national_all': 'National: all', 'national_low': 'National: low(Q1)',
    'national_mid': 'National: mid(Q2+Q3)', 'national_high': 'National: high(Q4)'})
df_empl_comp_indus=df_empl_comp.iloc[:,8:16]
df_empl_comp_indus = df_empl_comp_indus.rename(columns={
    'city_ss40': "NYC: Proximate Functional (ss:40/`infrastructural')",
    'city_ss60': "NYC: Remote Communicative (ss:60/`remote')",
    'city_ss65': "NYC: Remote Functional (ss:65/`essential')",
    'city_ss70': "NYC: Proximate Communicative (ss:70/`face-to-face')",
    'national_ss40': "National: Proximate Functional (ss:40/`infrastructural')",
    'national_ss60': "National: Remote Communicative (ss:60/`remote')",
    'national_ss65': "National: Remote Functional (ss:65/`essential')",
    'national_ss70': "National: Proximate Communicative (ss:70/`face-to-face')"})
plt.figure()
df_empl_comp_ineq.plot()
plt.xlabel('Calendar: Daily')
plt.ylabel('Employment Level Change-Rate Relative to Jan 2020 Avg')
plt.title(r"Employment Flux by Income-Level Categories for NYC \& Nation")
plt.legend(loc='best')
plt.savefig('empl-flux-inc.pgf')
plt.figure()
df_empl_comp_indus.plot()
plt.xlabel('Calendar: Daily')
plt.ylabel('Employment Level Change-Rate Relative to Jan 2020 Avg')
plt.title(r"Employment Flux by Industry Categories for NYC \& Nation")
plt.legend(loc='best')
plt.savefig('empl-flux-indus.pgf')
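One housekeeping step the script above omits: once the database work and plotting are finished, the cursor and connection should be closed to release the database file.
c.close()
conn.close()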
Now that we have these PGF graphs, they can be embedded in a LaTeX file using these packages and spacing adjustments:
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{float}
\usepackage{wrapfig}
\setlength{\belowcaptionskip}{-10pt}
\setlength{\intextsep}{0pt}
\pgfplotsset{compat=1.8}
The code to display the PGF with the float location specifier ('H') and optional scaling is:
\begin{figure}[H]
\centerline{\scalebox{1.0}{\input{empl-flux-inc.pgf}}}
\caption{\cite{ET} This graph compares the number of active employees in the three income-level job categories during the pandemic relative to an average baseline index calculated from 1/4/2020 to 1/31/2020. Low-income jobs are the bottom quartile of income (<\$27k), middle-income jobs are the middle two quartiles of income (\$28k:\$60k), and high-income jobs are the top quartile of income (>\$60k).}
\end{figure}
The generated LaTeX file can be found here: Download