DoiT Wins Google Cloud Global Sales Partner of the Year Award – Learn more

Production-Scale IoT Best Practices: Implementation with Google Cloud (Part 2/3)

1 qdxbyd9a tq0jdgpznhwrg

This post continues from Part One, where we walked through how to securely onboard a production-scale fleet of IoT devices streaming telemetry data to your Google Cloud environment via IoT Core and Pub/Sub.

Congrats! You’ve successfully registered several IoT devices — now what?

Your next goal should be to design a system that enables large-scale storage, analytics, and visualization/dashboarding capabilities on your data.

In order to do so, you need to design a data flow architecture well in advance that supports such large-scale data operations. This article provides a hands-on walkthrough for doing just that.

Overview

This discussion is broken up into the following sections:

  1. Batching loading into data sinks
  2. Storage and analysis of data
  3. Visualization of warehoused data

Unlike part one, what will be discussed here can be performed entirely through the GCP web console. Only basic SQL experience is required.

The following fully-managed and auto-scaling Google Cloud services will be discussed:

  • Pub/Sub — A serverless message queue
  • Dataflow — A stream and batch data processing engine
  • BigQuery — A serverless data warehouse
  • Data Studio — A data visualization/dashboard creation service

Batch loading into data sinks

Verify that messages are arriving

If you have successfully onboarded devices into the IoT registry and begun streaming data to the IoT core, you should see a steady stream of messages arriving from the GCP IoT main dashboard:

1 t9g47ybwaf4zp6z0hg0mda
Three successfully connected devices streaming temperature data every five seconds

As shown in Part One, these messages are also arriving in your ‘temperature’ Pub/Sub topic:

1 upnn8cafhxe0coyobfu7qq
Pub/Sub messages arriving into the ‘temperature’ topic

Streaming into BigQuery

Nice — We see messages arriving into Google Cloud. Next, we need to move Pub/Sub messages into a data warehouse where the data can reside for the purposes of cost-effective long-term retention, as well as readily-scalable analytics. Enter BigQuery.

BigQuery, Google Cloud’s fully-managed, serverless, and auto-scaling data warehouse, allows you to pay for both compute and storage with an on-demand pricing model, making it a great data sink for storing and analyzing our IoT data.

But how do we stream Pub/Sub messages into BigQuery? With Dataflow.

Dataflow, Google Cloud’s fully-managed and auto-scaling version of Apache Beam, is designed to shuttle data from one service to another. You have the ability to optionally filter and transform data, as well as optimally batch load into load operation-limited services such as databases and data warehousing solutions.

Dataflow contains several Google Cloud-created default templates, including a Pub/Sub-to-BigQuery template, so no coding effort is required to link data ingestion and data storage/analytics services.

Given that Pub/Sub, Dataflow, and BigQuery are all fully-managed and auto-scaling services, and (with the exception of Dataflow) serverless as well, it is possible to build an end-to-end IoT data management system that easily scales from development testing to petabyte-scale operations — with virtually no infrastructure management required as scaling occurs.

Let’s see all these services linked up together in action!

Pub/Sub Subscription Setup

Before we begin moving data from Pub/Sub to Dataflow, we should create a Pub/Sub subscription that is subscribed to the Pub/Sub topic.

Why? Messages hitting a Pub/Sub topic are sent immediately to Pub/Sub topic subscribers (via a Push strategy), then deleted from the topic. By contrast, subscribers can hold on to messages until a process requests messages (via a Pull strategy). It is possible to connect Dataflow to a topic rather than a subscription, but if such a Dataflow job were to experience downtime, messages hitting the topic while Dataflow is down would be lost.

Instead, by connecting Dataflow to a Pub/Sub subscription that is subscribed to the topic, you prevent messages from being lost during downtime. If a Dataflow job were temporarily interrupted, all IoT messages not yet processed by Dataflow would remain in the Pub/Sub subscription, waiting for the Dataflow job to resume pulling messages.

A Pub/Sub subscription to a Pub/Sub topic creates a data architecture that is resilient to downstream data ingestion service interruptions.

To create a subscription within Pub/Sub:

  1. Navigate to Subscriptions,
  2. Click “Create Subscription” and name your subscription “temperature_sub”
  3. Subscribe this to Pub/Sub topic “temperature”
  4. Leave the remaining options at their defaults
1 mnpqehntjcbrgimry8iqjg
Creation of Pub/Sub subscription ‘temperature_sub’ to Pub/Sub topic ‘temperature’

Once created, if you click on the subscription and click “Pull”, you should see messages begin to stream in:

1 z9bh5sdv9yhvrh811fg8tg
Example messages streaming into the Pub/Sub subscription

Storage and analysis of data

Now that we have a Pub/Sub subscription receiving messages, we are almost ready to create a Dataflow job to move Pub/Sub messages into BigQuery. Before setting that up, we need to create a table in BigQuery where the data from Dataflow will arrive.

BigQuery Table Setup

Navigate to BigQuery, click “Create Dataset”, and name your dataset ‘sensordata’, with other options left at their default settings:

1 azhydcr1okshachcn7wrla
BigQuery Dataset creation prompt

Once the dataset has been created, select it, click “Create table”, and name your new table “temperature”. Make sure you include the schema, partitioning, and clustering options shown in the screenshots below, as these options support common query patterns:

1 rbi8kas5ehn9ekqmcjvhvq
New ‘temperature’ BigQuery table schema
1 oezxtbxb8pjsn yovr4m4a
Partitioning and clustering options for the ‘temperature’ table

If created correctly, your new, empty table will look like the following:

1 fxjm0idqe mnj09tqk5q5g
An empty ‘temperature’ BigQuery table within the ‘sensordata’ dataset

After moving data into the table, we will showcase a common IoT query pattern: Performing analytics on data matching a specific time frame (e.g. a one-hour window for the present day), and for a specific device.

The table design shown above is ideal for such queries because:

  • Partitioning on the UTC timestamp field enables date-specific queries to avoid scanning over DateTime partitions for non-matching days
  • Within a partition, clustering (sorting) on deviceId and the epoch timestamp allow for more optimal retrieval of data for a specific device and time-frame within that particular date partition.

To write these queries, we need data to work within the table. Let’s get that Dataflow job going!

Dataflow Setup

We currently have messages sitting in a Pub/Sub subscription, waiting to be moved elsewhere, and a BigQuery table ready to accept those messages. What we need now is the ETL glue that links the two together. Since Pub/Sub and BigQuery are both fully-managed, auto-scaling, and serverless services, we ideally want an ETL tool that possesses these qualities as well.

Dataflow (mostly) matches these requirements. The marketing around Dataflow claims it is all three, but in truth, it is not fully serverless. You do need to specify the instance types and sizes that will be used, the minimum and maximum instance counts auto-scaling can bounce between, as well as how much disk temporary disk space each instance will need. You never manage these instances and their ability to decide when to scale, but you do have to provide these specifications. This is in contrast to Pub/Sub and BigQuery, which will auto-scale with no infrastructure configuration.

Despite not being completely serverless, Dataflow is a perfect fit for our Pub/Sub-to-BigQuery ETL requirement. It is also easy to use, particularly given that GCP offers many default Dataflow job templates, including one that supports a Pub/Sub-to-BigQuery workflow. Outside of having to increase the auto-scaling maximum permissible instance count as your IoT data throughput increases over time, you will in theory never have to worry about managing the infrastructure powering Dataflow.

With the basics now understood, let’s implement a Dataflow job. Navigate to Dataflow, click “Create Job from Template”, and follow these steps:

  • Name the job ‘pubsub-temp-to-bq’
  • Use the default streaming template ‘Pub/Sub Subscription to BigQuery’
  • Enter the full Pub/Sub Subscription name
  • Enter the full BigQuery table ID
  • Enter a Cloud Storage bucket location where temporary data may be stored as part of Dataflow’s process for batch loading into BigQuery
  • Leave the remaining options at their defaults. You typically would expand the Advanced Options and specify parameters such as a particular machine type and size to use, the min/max machine count auto-scaling values, and disk size per machine. However, for testing purposes, these can be left at their defaults.

Your Dataflow job creation screen should look like the following:

1 20dqqxh14yyfgydfplrbag

After hitting “Create” and waiting a few minutes for the underlying infrastructure to spin up and start running, you will see data flowing from the Pub/Sub subscription and into the destination BigQuery table.

The Python temperature streaming script provided in Part One streams at a rate of one record per second. Thus, in the Dataflow Directed Acyclic Graph (DAG) shown below, you should see x number of elements streaming per second, where x is the number of devices you are testing with. In my case, there are three devices streaming:

1 1fztxtwfpaifzvvo1tg 8w
Messages being successfully streamed from Pub/Sub to BigQuery via a Dataflow job

Once you see that the Dataflow job is active and successfully streaming Pub/Sub subscription data to BigQuery, you can run a query with the following format in BigQuery and see real-time data hitting the table:

SELECT *
FROM `iottempstreaming.sensordata.temperature`
WHERE DATE(timestamp_utc) = "2020-12-18"
ORDER BY timestamp_epoch DESC
LIMIT 10
1 0c7ihxauk0kitwmbaa6q2g

We see that partition filtering is taking place by observing that more total data is scanned when the day-filtering WHERE clause is removed.

With my example dataset, 1.1 MBs of filtered data are scanned (as seen above) and 1.7 MBs of unfiltered data are scanned (shown below):

SELECT *
FROM `iottempstreaming.sensordata.temperature`
ORDER BY timestamp_epoch DESC
LIMIT 10
1 ujaumr t 0usgppkouyow

Let’s see what the average, minimum, and maximum temperature values are from each sensor within the past hour:

SELECT
  device_id,
  ROUND(AVG(temp_f), 1) AS temp_f_avg,
  MIN(temp_f) AS temp_f_min,
  MAX(temp_f) AS temp_f_max
FROM `iottempstreaming.sensordata.temperature`
WHERE timestamp_utc > DATETIME_ADD(CURRENT_DATETIME(), INTERVAL -60 MINUTE)
GROUP BY device_id
1 h3bllrb6dyw9a6ispmouoa
Various statistics for each temperature streaming device

Congratulations! You have just set up a data workflow that is fully-managed from end-to-end, from data ingestion to the analytics backend. Before we wrap this walkthrough up, let’s quickly check out how easily this data can be visualized with Data Studio.

Visualization of warehoused data

Start by running a query similar to the following in BigQuery which grabs all rows of data from a particular day:

SELECT *
FROM `iottempstreaming.sensordata.temperature`
WHERE DATE(timestamp_utc) = "2020-12-18"
ORDER BY timestamp_epoch DESC

To the right of “Query Results”, click on “Explore Data”, then “Explore with Data Studio”:

1 fv 8nhmkvuvqevqeugyyla

This will load a table summarizing the data we just queried. However, by default, it will show a rather uninteresting table summarizing the total number of records streamed per second.

Let’s change a few values under the Data section on the right to make this more interesting:

  • Select “Line Chart” as the visualization type rather than “Table”
  • Remove “Record Count” as the Metric viewed, and replace this with “temp_f”. Make sure to change the default “SUM” metric to “AVG”.
  • Add “device_id” as a breakout dimension

Your choices should give dashboard layout settings similar to the following:

1 bd0mkfo n mf8ubowugbpg

The chart that this produces will show temperature values for each device over time, but it may not be auto-scaled well as the default minimum y-axis value will be zero. To fix this, click on the “Style” tab, scroll down until you get to option “Left Y-Axis”, and change these to reasonable values:

1 qxgkntmri n2s0thebcwcg

You may also want to increase the number of data points that can be present on the chart:

1 ia2qwpqmklnfogg2nqwuma

With these updates, you should have a beautiful, interactive chart that allows you to scroll through device temperature values as they fluctuate over time:

1 s9bofxwtbm6b7fisdfgokg

Next up: Machine Learning

Stay tuned for part three where we will build a functional machine learning model on this BigQuery dataset and use it to generate real-time predictions.

Subscribe to updates, news and more.