- August 21, 2020
- Posted by: Srinivas Paturu
- Category: Data & Analytics
In this article, I would like to share the code for a simple and effective transformation to tackle the time series data files published by Johns Hopkins University.
I will eventually port all the scripts to the cloud on AWS and set up a truly serverless architecture.
The time series data file format is as below with the header:
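For reference, the header of the global confirmed cases file looks like this (abridged; the date columns continue, one per day, through the most recent date in the file):

```
Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,...
```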
This format is repeated for the confirmed, deaths, and recovered case counts for the global and US geographies.
The data recorded for each new day is appended to the file as a new column. Naturally, this horizontal layout of the time series is not conducive to analysis in most business intelligence tools. The data is far easier to work with if the date columns in the file above are transformed into rows.
The operation is a breeze in Python using Pandas data frames and pivot functions. Here is the code:
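Below is a minimal sketch of that script, shown for the global confirmed cases file; column and variable names such as `New_Cases` and `df_long` are my own choices for illustration, not necessarily those of the original code.

```python
import pandas as pd

# Wide-format time series file (global confirmed cases shown here)
URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")
df = pd.read_csv(URL)

# Pivot the date columns into rows: one row per location per date
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
df_long = df.melt(id_vars=id_cols, var_name="Date", value_name="Confirmed")
df_long["Date"] = pd.to_datetime(df_long["Date"], format="%m/%d/%y")

# Sort so that diff() compares consecutive days within each location
df_long = df_long.sort_values(["Country/Region", "Province/State", "Date"])

# groupby() + diff() emulates SQL's LAG(): daily new cases from the
# cumulative totals. dropna=False (pandas >= 1.1) keeps rows whose
# Province/State is blank; the first day falls back to its own total.
df_long["New_Cases"] = (
    df_long.groupby(["Country/Region", "Province/State"], dropna=False)["Confirmed"]
    .diff()
    .fillna(df_long["Confirmed"])
)

df_long.to_csv("time_series_confirmed_long.csv", index=False)
```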
Two steps in the script above deserve a closer look:
a. The melt() call pivots the dates from columns to rows.
b. Pairing groupby() with diff() is a smart way to get the equivalent of a LAG() window function in SQL, turning each day's cumulative total into a daily count.
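A tiny, self-contained demonstration of that groupby() + diff() pattern (the numbers here are made up purely for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "Country/Region": ["Albania", "Albania", "Albania",
                       "Belgium", "Belgium", "Belgium"],
    "Confirmed": [0, 2, 5, 1, 1, 4],
})
demo["New_Cases"] = demo.groupby("Country/Region")["Confirmed"].diff()
print(demo["New_Cases"].tolist())
# [nan, 2.0, 3.0, nan, 0.0, 3.0] -- each row minus the previous
# row within its own group, just like LAG() over a partition
```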
The next part of this article is about integrating this script with an AWS Lambda function to truly go serverless.
For this to work, I borrowed the knowledge from another awesome Medium article on how to enable the Pandas library on AWS Lambda. Without Pandas, there would be a very tedious process of operating on NumPy arrays instead of data frames.
The plan comes together in the following steps:
a. Created an AWS Lambda function with the code for the recovered cases file (a sketch follows this list), which you can replicate for the other files.
b. Created an event rule on AWS CloudWatch that triggers the above Lambda function on a schedule of every X minutes.
c. The Lambda function writes the transformed file to an S3 bucket.
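Here is a minimal sketch of what such a Lambda function could look like. The bucket name and object key are hypothetical placeholders, and it assumes a Lambda layer that provides Pandas, as described in the Medium article mentioned above.

```python
import io

import boto3
import pandas as pd

# Hypothetical names -- replace with your own bucket and key
OUTPUT_BUCKET = "my-covid19-transformed"
OUTPUT_KEY = "covid19/recovered_long.csv"

URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_recovered_global.csv")

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Pull the latest wide-format file straight from GitHub
    df = pd.read_csv(URL)

    # Same melt + groupby/diff transformation as the local script
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    df_long = df.melt(id_vars=id_cols, var_name="Date", value_name="Recovered")
    df_long["Date"] = pd.to_datetime(df_long["Date"], format="%m/%d/%y")
    df_long = df_long.sort_values(["Country/Region", "Province/State", "Date"])
    df_long["New_Recovered"] = (
        df_long.groupby(["Country/Region", "Province/State"],
                        dropna=False)["Recovered"]
        .diff()
        .fillna(df_long["Recovered"])
    )

    # Write the long-format CSV to S3 for QuickSight to pick up
    buf = io.StringIO()
    df_long.to_csv(buf, index=False)
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=OUTPUT_KEY, Body=buf.getvalue())
    return {"status": "ok", "rows": len(df_long)}
```

For step b, a CloudWatch Events (now EventBridge) rule with a schedule expression such as rate(30 minutes) will invoke the function on the desired cadence.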
Architecture Note
I have chosen the route of AWS Lambda writing to AWS S3 to stay within the limits of the AWS free tier. Lambda functions are primarily intended for fast-executing microservices rather than heavy data lifting, but they handle a modest transformation like this one just fine. The more suitable choice for heavy data transformations would be an AWS Glue job on PySpark, but that would fall outside the free tier limits.
The last part of the guide is to enable Amazon QuickSight to generate a quick visualization layer for the data.
From the QuickSight console, click on the New dataset option to import the S3 output files into SPICE (QuickSight's in-memory calculation engine for fast business intelligence queries). Then add the other data sets and join them on the common keys (Date, Country/Region, and Province/State).
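QuickSight reads S3 data through a manifest file that lists the object locations. Here is a minimal sketch of generating one in Python; the bucket and prefix are hypothetical and should match wherever the Lambda writes its output:

```python
import json

# Minimal QuickSight S3 manifest -- point the URI prefix at the
# folder where the transformed CSV files land
manifest = {
    "fileLocations": [
        {"URIPrefixes": ["s3://my-covid19-transformed/covid19/"]}
    ],
    "globalUploadSettings": {"format": "CSV", "containsHeader": "true"},
}

with open("covid19-manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```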
Then you can click on the Save & Visualize option to start preparing your dashboards. Here are a few I have made:
Hopefully this helps you on your journey in creating a COVID-19 dashboard.