Wrangling Survey Data with Google Cloud Dataprep

Anyone who handles data on a daily basis knows the pain and tedium of cleaning data - it's a time-consuming, mundane, and thankless task. Trifacta comes to the rescue of bleary-eyed data professionals with its all-in-one data wrangling tool. In this post, I'll walk through a demo of cleaning up survey data with Google Cloud Dataprep, Google's fully managed Trifacta service.

I was initially skeptical of a proprietary tool with a GUI, but Trifacta is wonderful for dealing with the kind of dirty, poorly typed data that we encounter in the real world. When I've prepared data with Python, I've had to spend most of my time writing code to deal with numerous edge cases in each column. Trifacta's recommended transforms make quick work of these issues, and you can also write custom transforms. You can always combine Dataprep with other jobs if you need the full power of a programming language.

Processing the Data

survey_data.csv

For this post, I've created some fake survey data inspired by real survey data that I needed to munge last week. What appears to be personal information (SSN, age) was randomly generated.

Open Dataprep in your GCP account. If this is your first time using Dataprep within your project, you'll need to go through an authorization process - Trifacta is hosted outside your account and requires access to do its magic. Once you've completed this, you'll be able to create a flow.

[Screenshot: the Create Flow button in Dataprep]

Create a new flow. Click Add Datasets, then Import Datasets. Drag and drop the survey data into the window. Click Import, Add to Flow.

[Screenshot: importing survey_data.csv into the flow]

We now see survey_data.csv in the flow. Click the recipe button to add a new recipe. In the panel that appears to the right, click Edit Recipe.

[Screenshot: the recipe editor showing survey_data.csv]

Note the horizontal bars at the top of each column. The turquoise section indicates valid values, the red portion mismatched values, and the gray portion missing values. The first thing we notice is that several columns have no valid values: first name, last name, address, and city have all been redacted. Shift + click to select all four columns. In the right panel, you'll see several suggested transforms. Choose Delete columns and click Add. This step is now part of your recipe.
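
As a point of comparison, here's a rough pandas sketch of the same step. The file name and the general shape of the data come from this post, but the exact names of the redacted columns are my assumption:

```python
import pandas as pd

# Load the survey data.
df = pd.read_csv("survey_data.csv")

# Drop the fully redacted columns, mirroring the Delete columns transform.
# These column names are assumed, not taken from the actual file.
df = df.drop(columns=["first_name", "last_name", "address", "city"])
```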

Next, look at the first column, finished. This is a boolean indicating whether the participant finished the survey. Let's discard unfinished surveys. At the top of the column, you'll see vertical bars indicating the proportion of each data category.

[Screenshot: category bars at the top of the finished column]

The short bar indicates false. If you click on it, you'll be given options for this category in the right panel. Select Delete rows and add the step.
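
Continuing the pandas sketch above, this filter might look like the following; note that how the CSV encodes the boolean is my assumption:

```python
# Normalize the finished flag to boolean in case it was read as text
# (the exact encoding in the CSV is assumed here), then keep only
# finished surveys, mirroring the Delete rows transform on false.
df["finished"] = df["finished"].astype(str).str.lower().isin(["true", "1"])
df = df[df["finished"]]
```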

Next, look at the SSN4 column. In a future post, I'll use this as a join key to combine pre/post data for statistical analysis, so we need valid values here. Let's get rid of rows that are missing a value: hover over the gray portion of the horizontal bar for the column and click. Choose the Delete rows transform and add it. Next, select the red portion of the bar. If you hover over the Keep rows transform in the right panel, you'll see a pop-up with some of the mismatched values.

[Screenshot: pop-up showing mismatched SSN4 values]

Click Delete rows, Add. We're not quite done with this column - if we look at the distribution at the top, we notice some suspiciously large values.

[Screenshot: the SSN4 value distribution]

Scrolling through the data, you'll notice numbers with too many digits. Click the bar on the far right and then click the Delete rows transform, but instead of adding it, click Edit. You should see a range condition like (65000000 <= SSN4) && (SSN4 < 70000000). Change this to 10000 <= SSN4 to catch all the bad values. Click Add.
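
The whole SSN4 cleanup could be sketched in pandas like this; the condition from the recipe implies that anything of five or more digits is invalid:

```python
# Coerce SSN4 to numeric; mismatched values become NaN.
df["SSN4"] = pd.to_numeric(df["SSN4"], errors="coerce")

# Drop missing and mismatched rows, then drop out-of-range values -
# the recipe's 10000 <= SSN4 condition flags numbers with too many
# digits, so we keep only values below 10000.
df = df.dropna(subset=["SSN4"])
df = df[df["SSN4"] < 10000]
```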

[Screenshot: the SSN4 distribution after cleanup]

You can scroll down to make sure the data match what you expect.

Let's move on to the age column. We have mismatched data; if you scroll down, you'll see a field with the value 37 years. Right click on the column header and select Extract, Numbers. Adding the transform creates a new column called age1. Delete the age column and rename the new column to age. We still have missing values, but in this case we'll keep the corresponding rows.
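
A comparable pandas move, sketched under the assumption that the mismatches all look like "37 years", is to extract the digits in place rather than creating and renaming a new column:

```python
# Pull the first run of digits out of entries like "37 years".
# Rows with no digits (missing values) stay NaN, which we keep.
df["age"] = (
    df["age"].astype(str).str.extract(r"(\d+)", expand=False).astype(float)
)
```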

The remaining columns are Q1 through Q22. Each contains one of seven responses ranging from strongly disagree to strongly agree. We're going to convert these to integers for statistical analysis. This step is a bit tedious, but not bad. I suspect it would be easier with Trifacta custom types, but these aren't supported in Dataprep; hopefully, this feature will be added in the future.

To start the transform process, select all the question columns with shift + click. Right click on a column header and select Replace, Text or patterns. In the Find box, type 'Strongly agree'. In the Replace with box, type 3. Add the transform. We have to repeat these steps six more times with the following mapping: agree to 2, somewhat agree to 1, neither agree nor disagree to 0, somewhat disagree to -1, disagree to -2, and strongly disagree to -3. After you've completed this, the only remaining mismatched values are spaces.
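
In pandas, the whole mapping collapses into one replace call. The exact response strings below are my assumption about how they're capitalized in the file:

```python
# Map the seven response options to integers across Q1-Q22.
likert = {
    "Strongly agree": 3,
    "Agree": 2,
    "Somewhat agree": 1,
    "Neither agree nor disagree": 0,
    "Somewhat disagree": -1,
    "Disagree": -2,
    "Strongly disagree": -3,
}
question_cols = [f"Q{i}" for i in range(1, 23)]
df[question_cols] = df[question_cols].replace(likert)
```

The stray spaces mentioned above would survive this mapping as unmatched strings, so you'd still need to decide how to handle them.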

One last transform: we don't need the finished column anymore. Go ahead and delete it.

Now, we want to save out the data.

[Screenshot: the full dataset in Dataprep]

Because our data is small, Trifacta has processed the full set without sampling, but GCP still requires us to run a job to write the output. Click the Run Job button and then run the job on the next page. The job takes a couple of minutes because Dataprep must spin up processing resources. Note that you will incur costs of a few cents.

Conclusion

If you handle data on a regular basis - whether you're an analyst, data scientist, engineer, or ETL developer - Cloud Dataprep could cut years off your data preparation time. Try working through this example by downloading the survey data above. I've also attached my final output and recipe in the zip file below.

files

Matt Housley