How to Create a Data Engineering Project Portfolio That Will Land You a Job

Data engineering is the hot new job in data, and arguably the new “sexiest job of the 21st century” (sorry, data science). With more and more people interested in becoming data engineers, we’re often asked, “How do I build a portfolio of data engineering projects?” For people transitioning to data engineering from a related field, or those just starting their careers, a portfolio is a great way to showcase your skills and, hopefully, get hired. So how do you create a data engineering portfolio?

First, let’s look at the elements of a good portfolio. The goal of a portfolio is to showcase your knowledge and skills, typically with a hiring manager as the audience. The projects in a portfolio should demonstrate competence and capability in a given area. For example, if you claim knowledge of AWS, what specifically do you mean by that? Showcase concrete examples of how you work in AWS, such as creating an auto-scaling group of EC2 instances, or creating a table in Redshift using the correct type of sort key. Be clear in your communication - list out the steps, provide technical details where necessary, and use diagrams and images to summarize complicated workflows. Especially in data engineering, data pipelines can become complex. Being able to summarize and clearly articulate complex technical material is the hallmark of a true expert. Always strive to communicate complexity in the simplest way possible.

Here are some project ideas for a data engineering project portfolio. We’re assuming you’ll use one of the big 3 clouds - AWS, Azure, or Google Cloud (GCP) - since that’s what you’re increasingly going to find on the job as a data engineer. This list is non-exhaustive, so feel free to adapt this to what seems interesting or relevant to your journey in data engineering.

  • Create an auto-scaling group of servers. You can do this with AWS EC2, GCP Compute Engine, and Azure VMs. The lesson is learning how to take advantage of the cloud and build redundant architecture that scales with demand.

  • Build a data lake using a cloud’s object storage. Understand how to secure the data lake and capture metadata about the assets within the data lake.

  • Build an end-to-end serverless data pipeline using cloud managed services. Pick a cloud (AWS, Azure, or GCP), then build an end-to-end streaming data pipeline. In AWS, consider services such as Kinesis, S3, Glue, and Lambda.

  • Spin up an open source data framework such as Spark (batch and stream processing) or Kafka (event streaming).

  • Learn cloud fundamentals like networking, security, and IAM permissions. 

  • Master a popular data-centric programming language such as Python. Demonstrate competency in using Python libraries such as Pandas. Learn how to make cloud API calls using Python and experiment with orchestration.

  • Orchestration is key to most modern data workflows. Check out Airflow, Dagster, and Prefect. Write some sample DAGs that take data from ingestion to consumption, including a sojourn in a cloud data warehouse or data lake.

  • Contribute to open source projects, such as the ones mentioned in this post. If those seem too challenging to contribute to, consider getting involved in a smaller open source project.
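To make the serverless pipeline idea concrete, here’s a minimal sketch of the kind of Lambda handler that sits between Kinesis and S3 in AWS. The event shape follows AWS’s documented Kinesis-to-Lambda format; the `user_id` filtering rule and the helper for building test events are hypothetical, and a real handler would write its output to a sink like S3 rather than returning it.

```python
import base64
import json

def handler(event, context):
    """A Lambda-style handler: decode a batch of Kinesis records,
    filter them, and return the cleaned events.

    In a real pipeline the results would be written to S3 or another
    sink; returning them keeps the logic easy to test locally.
    """
    cleaned = []
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)
        # Hypothetical rule: keep only events that carry a user_id.
        if data.get("user_id") is not None:
            cleaned.append(data)
    return cleaned

def fake_kinesis_event(*payloads):
    """Build an event shaped like the AWS Kinesis-to-Lambda format."""
    return {"Records": [
        {"kinesis": {"data": base64.b64encode(json.dumps(p).encode()).decode()}}
        for p in payloads
    ]}

event = fake_kinesis_event({"user_id": 1, "action": "click"}, {"action": "noise"})
print(handler(event, None))  # [{'user_id': 1, 'action': 'click'}]
```

Being able to run and test the transformation logic locally, before wiring it to real Kinesis streams, is exactly the kind of workflow worth documenting in a portfolio write-up.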
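For the Python competency idea, it helps to show you understand the transformation itself, not just the library. As a sketch (the row fields are made up), here is a per-user event count written with only the standard library - the same operation you would later express as a Pandas `groupby` in a portfolio notebook:

```python
from collections import Counter

def events_per_user(rows):
    """Count events per user from an iterable of dicts.

    Conceptually the same as df.groupby("user_id").size() in Pandas,
    but with no dependencies beyond the standard library.
    """
    return Counter(row["user_id"] for row in rows)

rows = [
    {"user_id": "a", "action": "click"},
    {"user_id": "b", "action": "view"},
    {"user_id": "a", "action": "view"},
]
print(events_per_user(rows))  # Counter({'a': 2, 'b': 1})
```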
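Airflow, Dagster, and Prefect all model a pipeline as a DAG of dependent tasks. Before committing to one tool, you can demonstrate the core idea in plain Python with the standard library’s `graphlib`: a topological sort that guarantees ingestion runs before loading, and loading before consumption. The task names below are illustrative, not any tool’s API.

```python
from graphlib import TopologicalSorter

# Each key depends on the tasks in its set, mirroring an
# ingestion -> data lake -> warehouse -> consumption flow.
deps = {
    "ingest": set(),
    "land_in_lake": {"ingest"},
    "load_warehouse": {"land_in_lake"},
    "build_report": {"load_warehouse"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['ingest', 'land_in_lake', 'load_warehouse', 'build_report']
```

Orchestrators add scheduling, retries, and observability on top of this ordering, which is where the real learning (and the portfolio material) lies.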

Data engineering portfolios are tricky because data engineering is “invisible”. Data science projects are fairly self-contained - often, you can download a Jupyter notebook and run the commands inline. In data engineering, by contrast, you’ll typically use a variety of tools that largely operate behind the scenes and aren’t as visible as a public notebook on GitHub.

Next question - how do you make your data engineering portfolio visible? Because these examples are “behind the scenes” and largely invisible - you can’t really invite people into your cloud account to check out your data pipeline - we suggest that you document your process and workflow. Here are some ideas for showcasing your work in a public way.

  • Create a blog. You can host your own (WordPress, etc.) or use GitHub Pages.

  • Create tutorials and walkthroughs for various topics.

  • Record videos and put them on YouTube and your blog. 

  • Share the content on social media and Slack groups.

  • Speak at meetups about your content. Convert the content to something presentation-friendly.

In summary, anybody looking to enter the data engineering field should create a data engineering project portfolio. Because data engineering is largely “invisible”, you need to find projects that both highlight concrete examples of your data engineering knowledge and lend themselves to a visible format, such as blog posts, videos, and public talks. Best of luck with your data engineering portfolio, and we hope you land the job you’re looking for!

For more information, check out our video on Data Engineering Project Portfolios.




Joseph Reis · career advice