Use Python and Google Cloud To Schedule A File Download And Import Into BigQuery

Recently, a friend asked for a simple way to set up a data pipeline for analyzing the daily results from the New York Times’s COVID-19 dataset. “How do I download this data set every day, and load it into BigQuery so my friends can do analysis on it? Preferably in a serverless and cost-effective manner. Oh, and I’m also not that good at programming in Python.”

No big deal. Let’s figure this out.

Here’s a sample of the New York Times’s COVID-19 dataset.

date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,
2020-01-22,Snohomish,Washington,53061,1,
2020-01-23,Snohomish,Washington,53061,1,
2020-01-24,Cook,Illinois,17031,1,

Of course, a simple option is to use bash and cron. In this case, you’d simply download the data each day using curl or wget, then import it into BigQuery using the bq command-line tool. However, this approach requires a machine to run the code, and he didn’t feel like running it from a laptop or spinning up a cloud server.

So, serverless it is. Thankfully, Google Cloud (GCP) offers some awesome serverless tools where you can run a workflow like this for next to no cost.

In this tutorial, I’m going to show you how to set up a serverless data pipeline in GCP that will do the following.

  1. Schedule the download of a csv file from the internet

  2. Import the data into BigQuery

Note - This tutorial generalizes to any similar workflow where you need to download a file and import it into BigQuery.

Here’s the workflow. There are a few moving parts, but it’s not too complicated.

[Workflow diagram: BQ-download-cf-tutorial.png]

Also, the NY Times COVID dataset is now available in BigQuery’s public datasets. So, you can take the easy way and view the data directly in BigQuery’s public dataset, or continue following along with this example.

The Workflow

We will need to create a few things to get started. You’ll need a Google Cloud account and a project, as well as API access to Cloud Storage, Cloud Functions, Cloud Scheduler, Pub/Sub, and BigQuery.

Here’s a summary of what we’re going to build.

  1. One Cloud Storage Bucket

  2. One BigQuery dataset and table

  3. Two Cloud Functions

  4. One Cloud Scheduler job

All of the code examples are located in this GitHub repo.

Even though this example is for the NY Times COVID dataset, it can be used for any situation where you download a file and import it into BigQuery.

Let’s get started!

1 - Create A Cloud Storage Bucket

Setting up a Cloud Storage bucket is pretty straightforward, so straightforward that I’ll just give you a link to the official GCP documentation that gives an example.
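If you’d rather script the bucket creation than click through the console, here’s a minimal sketch using the google-cloud-storage Python client. The bucket name and location are placeholders; pick your own (bucket names must be globally unique).

from google.cloud import storage

# Create a Cloud Storage bucket to hold the downloaded CSV
client = storage.Client()
bucket = client.create_bucket("my-covid-data-bucket", location="US")
print(f"Created bucket {bucket.name}")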

2 - Create A BigQuery Dataset and Table

Just like the Cloud Storage bucket, creating a BigQuery dataset and table is very simple. Just remember that you first create a dataset, then create a table.

When you create your BigQuery table, you’ll need to create a schema with the following fields. These BigQuery fields match the fields in the NY Times COVID csv file’s header.

date,county,state,fips,cases,deaths

BigQuery table fields:

  • date - DATE

  • county - STRING

  • state - STRING

  • fips - STRING

  • cases - INTEGER

  • deaths - INTEGER

Create the BigQuery table using that schema.
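If you prefer to script this step, here’s a minimal sketch using the google-cloud-bigquery client. The dataset and table names (covid19, us_counties) are placeholders; the schema matches the fields listed above.

from google.cloud import bigquery

client = bigquery.Client()

# Create the dataset first, then the table
client.create_dataset("covid19", exists_ok=True)

schema = [
    bigquery.SchemaField("date", "DATE"),
    bigquery.SchemaField("county", "STRING"),
    bigquery.SchemaField("state", "STRING"),
    bigquery.SchemaField("fips", "STRING"),
    bigquery.SchemaField("cases", "INTEGER"),
    bigquery.SchemaField("deaths", "INTEGER"),
]

table = bigquery.Table(f"{client.project}.covid19.us_counties", schema=schema)
client.create_table(table, exists_ok=True)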

3 - Create Two Cloud Functions

Cloud Functions allow you to run discrete parts of code in an event-driven and serverless manner. They are perfect for a project like this where you need a lightweight and cheap workflow. If you’re familiar with AWS Lambda, Cloud Functions are the equivalent in GCP. Cloud Functions are triggered by events, such as Pub/Sub or objects landing in Cloud Storage. You can learn more about GCP Cloud Functions here.

You’ll need to create two Cloud Functions. One Cloud Function will download the data from a url and store it in Google Cloud Storage. Another Cloud Function will load this data from Google Cloud Storage into BigQuery.

Why not just combine the two Cloud Functions and do this all in one go? You can certainly do it this way if you want. I’m a fan of keeping Cloud Functions as discrete and single-purpose as possible. Also, each function gives you a module that can be reused in other settings with a little modification. That said, there’s nothing to prevent you from combining the two if that’s what you’re into.

Cloud Function 1 - Download data from a url, then store it in Google Cloud Storage

Cloud Functions are triggered from events - HTTP, Pub/Sub, objects landing in Cloud Storage, etc. This Cloud Function will be triggered by Pub/Sub. You’ll need to create a Pub/Sub topic as you set up the Cloud Function. You can learn more about triggering Cloud Functions via Pub/Sub here.

Create a Pub/Sub Topic, as illustrated below. Give it a meaningful name. Remember the name of the Pub/Sub topic you’ve created, as Cloud Scheduler will need this topic in order to schedule the download of the file. In this case, we’ll name our Pub/Sub topic get-file-from-url.
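The console is the easiest place to create the topic, but if you’d rather script it, a quick sketch with the google-cloud-pubsub client looks like this (the project ID is a placeholder).

from google.cloud import pubsub_v1

# Create the Pub/Sub topic that Cloud Scheduler will publish to
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project-id", "get-file-from-url")
publisher.create_topic(request={"name": topic_path})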

Next, let’s fill in the settings for this Cloud Function.

  • Name - give it a name of your choosing

  • Memory allocated - 256 MiB should be fine *

  • Trigger Type - Pub/Sub (see image above)

  • Source Editor - inline (you can also package your code in a Zip file if you want)

  • Runtime: Python 3.7

  • MAIN.PY and REQUIREMENTS.TXT: See code examples in GitHub.

Insert your Python code in MAIN.PY, and include your dependencies in REQUIREMENTS.TXT. The code in MAIN.PY uses wget to download the data *, then uploads it to Google Cloud Storage.

*A note on files in Cloud Functions

“The only writeable part of the filesystem is the /tmp directory, which you can use to store temporary files in a function instance. This is a local disk mount point known as a "tmpfs" volume in which data written to the volume is stored in memory. Note that it will consume memory resources provisioned for the function.

The rest of the file system is read-only and accessible to the function.”

Read more here.

Here’s what MAIN.PY and REQUIREMENTS.TXT should look like in your Cloud Function.
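If you’re not looking at the repo, here’s a minimal sketch of what MAIN.PY might look like. The entry-point name is a placeholder and the actual repo code may differ slightly; REQUIREMENTS.TXT would list google-cloud-storage and wget.

import os

import wget
from google.cloud import storage


def get_file_from_url(event, context):
    """Pub/Sub-triggered entry point: download the CSV and upload it to Cloud Storage."""
    url = os.environ["URL"]
    bucket_name = os.environ["BUCKET"]
    file_name = os.environ["FILE_NAME"]

    # /tmp is the only writable path in a Cloud Function (it's backed by memory)
    tmp_path = os.path.join("/tmp", file_name)
    wget.download(url, tmp_path)

    # Upload the temporary file to the Cloud Storage bucket
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(file_name).upload_from_filename(tmp_path)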

Next, populate the environment variables. Note that the Python code reads values from os.environ; each of those values is an environment variable. It’s good practice to put things like secrets, filenames, URLs, and paths in environment variables instead of hard-coding them. This makes your code reusable for other projects, since all you need to do is change the environment variables, and it also prevents you from inadvertently uploading secrets to a public git repo.

Fill in your environment variables the following way. Avoid using quotes.

  • URL: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

  • BUCKET: The Cloud Storage bucket you created in Step #1.

  • FILE_NAME: The name of the file you want to save in Cloud Storage.

[Screenshot: cf1-env.png - the environment variables for Cloud Function 1]

And that’s it. Feel free to test your Cloud Function in the Testing section of the GCP Cloud Functions console. If it runs successfully, check your Cloud Storage bucket to make sure the downloaded file landed correctly.

Cloud Function 2 - Trigger an import from Cloud Storage into BigQuery

Now that data is in Cloud Storage, we need to get it into BigQuery. Remember how I said that Cloud Functions can be triggered from objects landing in Cloud Storage? We’re going to create a new Cloud Function that is triggered once Cloud Function #1 loads data into Cloud Storage. This function will load data into the BigQuery table you created in step #2.

Since you have some practice from building the Cloud Function above, we’ll keep this section brief.

Settings:

  • Name - give it a name

  • Memory allocated - 256 MiB should be fine *

  • Trigger Type: Cloud Storage

  • Bucket: the name of the Cloud Storage bucket you created in Step #1

  • Source Editor - inline (you can also package your code in a Zip file if you want)

  • Runtime: Python 3.7

  • MAIN.PY and REQUIREMENTS.TXT: See code examples in GitHub.

Environment variables

  • DATASET: The BigQuery dataset you created in Step #2

  • TABLE: The BigQuery table you created in Step #2

Your Python code should look like this.
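Here’s a minimal sketch of what that MAIN.PY might look like, assuming a CSV load that replaces the table on each run (the entry-point name and the write disposition are my assumptions, not necessarily what the repo does). REQUIREMENTS.TXT would list google-cloud-bigquery.

import os

from google.cloud import bigquery


def gcs_to_bigquery(event, context):
    """Cloud Storage-triggered entry point: load the new object into BigQuery."""
    dataset = os.environ["DATASET"]
    table = os.environ["TABLE"]

    client = bigquery.Client()
    table_id = f"{client.project}.{dataset}.{table}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace the table each run
    )

    # The Cloud Storage event payload includes the bucket and object name
    uri = f"gs://{event['bucket']}/{event['name']}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish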

4 - Create A Cloud Scheduler job

Cloud Scheduler is a GCP utility that allows you to - surprise, surprise - schedule tasks. If you’ve used cron, then you’ll have a good idea of how this works. If not, no worries; cron is simple to figure out. A cron expression has five fields - minute, hour, day of month, month, and day of week - that you can populate however you like. In our example, we’ll set a Cloud Scheduler frequency of every ten minutes, which is the cron expression */10 * * * * (this is obviously overkill for a dataset that updates once a day, but since we’re in example-land, the point stands).

Cloud Scheduler needs to trigger something to execute. For this tutorial, we’ll target the Pub/Sub topic (get-file-from-url) that we created when setting up the first Cloud Function.

Once you’ve populated the frequency and target, hit Create (or Update). Here’s what your Cloud Scheduler job should look like.
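The console works fine here, but for completeness, here’s a hedged sketch of creating the same job with the google-cloud-scheduler client. The project ID, region, and job name are placeholders.

from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path("my-project-id", "us-central1")

job = scheduler_v1.Job(
    name=f"{parent}/jobs/trigger-covid-download",
    schedule="*/10 * * * *",  # every ten minutes
    time_zone="America/New_York",
    pubsub_target=scheduler_v1.PubsubTarget(
        topic_name="projects/my-project-id/topics/get-file-from-url",
        data=b"go",  # the payload is arbitrary; the function ignores it
    ),
)

client.create_job(parent=parent, job=job)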

Then in the Cloud Scheduler console, hit “Run” to test the workflow. If everything works, you should see your BigQuery table populated with data.

Let’s See The Data!

Now that we’ve got the workflow pieced together and working, let’s have a look at the NY Times COVID data in BigQuery. Here’s what you should see in BigQuery.

The correct NY Times COVID dataset output

Wrapping Up

You’ve learned to create a scheduled, serverless workflow in Python that delivers analytics in Google Cloud. That’s pretty cool!

From here, you can analyze your data in raw SQL, or visualize in a BI tool like Looker or Data Studio.
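For example, here’s a quick way to run raw SQL against the table from Python (the project, dataset, and table names are placeholders; substitute your own).

from google.cloud import bigquery

client = bigquery.Client()

# Daily national totals rolled up from the county-level data
query = """
    SELECT date, SUM(cases) AS total_cases, SUM(deaths) AS total_deaths
    FROM `my-project-id.covid19.us_counties`
    GROUP BY date
    ORDER BY date
"""

for row in client.query(query).result():
    print(row.date, row.total_cases, row.total_deaths)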

Feel free to extend this example to anything else that you can think of.

Joseph Reis