TODAYDOUGLEARNED is a place where I can dump my brain, muse, complain, and otherwise contribute to the greater project of aggregating human knowledge in a world wide web.

Learn Data Engineering with Google Sheets

Written in

by

Four years ago, I had no idea what data engineering was. I wasn’t familiar with data stacks, APIs, ETL/ELT, warehouses, or SQL. I was a freelance web developer using HTML/CSS/JavaScript, WordPress, and Webflow. But when a client asked me to automate some reporting processes, I unknowingly took the first steps on my data engineering journey. And the tool that provided me with a sandbox for learning and prototyping? Google Sheets.

There are plenty of jokes about spreadsheets being mistaken for databases.

But then again…spreadsheets can be a great way to learn about databases.

If you’re just starting out on your data engineering journey, accept this advice: it’s fine to start in spreadsheets, but recognize that you’ll probably outgrow them quickly. Use the knowledge and skills you pick up in spreadsheets as a foundation for the next steps into true database administration.

Why Google Sheets?

Before I go any further, I just want to share my perspective on the old Excel vs. Google Sheets debate. Ask any accountant which they prefer, and they’ll probably tell you that they’d rather die than use Google Sheets. There are a lot of features in Excel that just haven’t been duplicated in Sheets yet. But ask a data engineer which they prefer, and my guess is that they’d lean toward Sheets. Why? Extensibility.

If you want to extend Excel, you’ve got to learn VBA. VBA is a relic – a proprietary scripting language that you’ll only ever need if you plan to remain stuck in Microsoft Office products. If you want to extend Sheets, however, you just need some basic JavaScript. JavaScript is the language of the web. “Well,” someone might say, “It’s actually Google Apps Script.” Sure – Google has adapted vanilla JavaScript to create GAS, but it’s still vanilla JavaScript.

You’re going to need that extensibility if you want to start connecting data sources to Sheets.

Getting Started

Data engineering is a set of systems and processes for extracting, loading, and transforming data – moving it from point A to point B. You have a source, a destination, and sometimes an intermediary step or two in between. There are two predominant methods for this – ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). This infographic from dbt presents the two options.

The first thing you need is a source system. If you need some inspiration, check out this list of public APIs. You can utilize these APIs to extract data from all sorts of public data sources, often at no cost.

The next thing you need is some education on APIs. APIs (Application Programming Interfaces) allow you to interact with a system, sometimes by requesting data and other times by sending data. As I was learning about APIs, this tutorial by Ben Collins was my go-to reference.

Work on requesting data from a public URL and then writing it into your spreadsheet. Use parameters in your HTTP requests to only retrieve certain records. Learn how to work with paginated responses, states/cursors, and identifying records based on the time of their last update. Master these tasks, and you’ll be on your way to extracting data like a pro.

Next up, learn how to parse JSON and filter out records from the response. Try using regular expressions (regex) to modify values and clean the data. Master these tasks, and you’ll be on your way to transforming data like a pro.

Finally, see if on each new request you can clear the contents of the spreadsheet before loading records from the response. See if you can upsert records into your spreadsheet, eliminating the need to load every record every time. Master these tasks, and you’ll be on your way to loading data like a pro.

A Simple Example

Here’s a great example that I used while I was learning data engineering in Google Sheets:

The Pennsylvania Department of Transportation (PennDOT) offers a public API. I decided it would be a fun learning opportunity to write traffic event data into a Google Sheet.

The first thing I had to do was request access to the API. I created an account and identified myself as a hobbyist developer with an interest in their dataset. Not long thereafter, I was granted access and received a unique username and password.

I read through their API documentation and found the correct URL to send my requests to. Using Google Apps Script, I prepared my request as follows:

options.headers = {"Authorization": "Basic " + Utilities.base64Encode(username + ":" + password)};

var response = UrlFetchApp.fetch("https://www.dot511.state.pa.us/RCRS_Event_Data/api/RCRS/liveEvents?dataType=json", options);

This returned a JSON response containing data from the liveEvents endpoint. Extraction complete!

I then transformed the data. I was only interested in traffic events in my county, so I looked up my county in the API reference and found the value. I also looked up the event class I was interested in and found its value. I iterated through the response and filtered out only the records that met my county and event class criteria. I also filtered out everything but the four fields that I was interested in – DateTimeVerified, EventType, IncidentMuniName, and Description.

values.Values.forEach(function(elem,i) {
    if (elem['County'] == '66' && elem['EventClass'] == '1') {
      output.push([elem['DateTimeVerified'],elem['EventType'],elem['IncidentMuniName'],elem['Description']]);
    }
  });

Transformation – done!

It was time to load data into the spreadsheet. In Google Sheets, it’s as easy as defining the output range and then setting the values. In addition, I thought it would be nice to summarize the results with a count of events – I wrote that value into cells F3:F5.

sheet.getRange(8,1,len,4).setValues(output);
  sheet.getRange("Data!F3:F5").setValue("=countif(A8:A,\"*\")&\" results\"");

Loading – complete!

Each time I request new data from PennDOT’s API, I clear the sheet to prepare it for the next bath of results.

sheet.getRange("F3:F5").clearContent();
var lastRow = sheet.getLastRow();
 sheet.getRange(8,1,lastRow,23).clearContent();

I left out some of the necessary code from this example for simplicity’s sake – there are a thousand Github repos and Stackoverflow posts out there where you can copy & paste their working examples right into Google Apps Script.

Learn QUERY()

Once you have data in your Google Sheet, another fork in the road you might take is learning the QUERY() function. QUERY() is one of the most powerful functions I’ve come across, giving you a SQL-like ability to query a dataset. Use it to drill even further into your dataset to create dynamic tables, charts, and graphs. This can give you a preview of what it will be like when you are querying your database and creating dynamic, constantly-updating visualizations using a business intelligence solution!

Iterate and Extend

Once you’ve successfully extracted, transformed, and loaded data into a Google Sheet you can start to iterate and extend your work. Consider ways that you can optimize each step of the process. Can you build a button that executes your script? A custom menu? An HTML sidebar? The possibilities are endless.

These are the building blocks of a modern data stack! As you move on to larger and more complex projects, you will predictably run into the limitations of spreadsheets as databases. Even though Google Sheets now has a limit of 10 million rows, we found that the ETL process became unwieldy about 50,000 rows in. We also found that large datasets and multiple QUERY() functions could take a long time to process.

In Conclusion

My goal in writing this post wasn’t to go into great detail about data engineering, but instead to provide a high-level overview, some basic terminology/concepts, and to offer a free and accessible solution in Google Sheets for learners who seek hands-on experience without enterprise-scale resources at their fingertips.

Spreadsheets may not be as powerful as PostgreSQL, Redshift, Snowflake, and other true database solutions. But it is extensible, scalable to tens of thousands of records, and offers features that mimic the extract/transform/load/query/visualize features of more powerful data engineering products. It’s a great place to get started – as I found out over the last four years.

Advertisement

Tags

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: