#15 on Fast Company’s Best Workplaces for Innovators List – Learn more

Airflow and superQuery

1 xmk dcdi2qeckymmzlzkda

Monitoring your BigQuery costs in real-time with the SuperQueryOperator

1 xmk dcdi2qeckymmzlzkda
Photo by Soheb Zaidi on Unsplash

“What is the cost?” A question asked so frequently in the tech world that every person at a small start-up shudders slightly when it is asked. The answer of which invariably is: “We’re not sure”.

One of the best tools for scheduling workflows in the data engineering world is Apache Airflow. This tool has taken many a business out of the inflexible cron scheduling doldrums into riding the big data waves on the high seas of Directed Acyclic Graphs (DAGs).

Of course this means that large globs of data are being moved into and out of databases and with this glorious movement often comes unavoidable costs.

One such database, a supercomputer if you will, is called Google BigQuery. It is the flagship of the Google Cloud offering and allows data processing at the Petabyte scale. It is very good at making you worry less about the power of the database infrastructure and more about the quality of your analysis and data flow problems in need of solving.

One key factor to consider with BigQuery is how open any individual or organisation is to driving up costs spent on scanning data on this platform. Even the most savvy of data engineers will tell you in angst about their errors in scanning across data they didn’t really want to and pushing their business’ monthly analysis bill over the budget.

In steps superQuery. The philosophy offered by superQuery is that you don’t have to worry about your costs because you have all the information you need and some safeguards that give you the opportunity to make an informed decision.


Knowing your costs in Airflow

1 wnx 9yasyzdavzemvxxveg

While your Airflow DAGs are churning away happily and pushing data to your processing systems of choice, a heap of logging happens in the background. Airflows logs are easily accessible, simple to read and give you a good overview of what your DAG is doing. Wouldn’t it be great if the log files could also show you information on your query execution plan and specifically what the cost and total data scanned was? For sure it would! Something like so:

--------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------

[2019-03-11 21:12:02,129] {models.py:1593} INFO - Executing <Task(SuperQueryOperator): connect_to_superquery_proxy> on 2019-03-01T00:00:00+00:00
[2019-03-11 21:12:03,836] {superq_operators.py:54} INFO - Executing: #standardSQL
SELECT COUNT(testField) FROM `mydata.PROD.myTable`;
[2019-03-11 21:12:03,844] {logging_mixin.py:95} INFO - [2019-03-11 21:12:03,843] {base_hook.py:83} INFO - Using connection to: id: mysql_default. Host: superproxy.system.io, Port: 3306, Schema: None, Login: XXXXXX, Password: XXXXXXXX, extra: {}
[2019-03-11 21:12:15,172] {superq_operators.py:68} INFO - ((
'{
   "startTime":1552331525642,
   "endTime":1552331534624,
   "executionTime":"8988",
   "bigQueryTotalBytesProcessed":26388279066,
   "bigQueryTotalCost":"0.12",
   "superQueryTotalBytesProcessed":0,
   "superQueryTotalCost":"0.00",
   "saving":0,
   "totalRows":"1",
}', '', '1', 'true'),)
[2019-03-11 21:12:17,121] {logging_mixin.py:95} INFO - [2019-03-11 21:12:17,119] {jobs.py:2527} INFO - Task exited with return code 0

This chunk of log tells you that your BigQuery operator in Airflow scanned across 24Gb of data and it cost you $0.12. Simple. You can also go ahead and parse the log files in a 3rd party tool or a bash script and create a summary of the costs originating from DAGs that scan data from BigQuery.


1 hyjrfqs1qbbawfsevvqm0q

How does this all work?

SuperQuery uses a MySql proxy in order to make connectivity universal and offer a SQL interface to get information.

What you need next: The SuperQueryOperator

There are the steps involved in order to get the same functionality described above:

  1. Add the superquery plugin to Airflow in order to use the SuperQueryOperator.
  2. Subscribe to superQuery for a trial and get login details for the superQuery MySql proxy
  3. Test your connection to the proxy with the DAG provided below.
  4. Replace your BigQuery operators with the SuperQuery operator in your own DAGs when you want to use this functionality.

Using the SuperQueryOperator

This is the interface for the SuperQuery operator:

TEST_SQL = """#standardSQL
SELECT COUNT(*) FROM `mydata.PROD.myTable`;"""
SuperQueryOperator(
  task_id="connect_to_superquery_proxy",
  sql=TEST_SQL,
  database="",
  explain=True,  # False if you don't want information
  dag=dag
)

Here is the code for the operator which you should copy into the plugins folder in Airflow:

Here is some code to test your connection to SuperQuery with:

In summary

This post describes how you can get a view of your Airflow costs when you connect and implement Airflow tasks to and from BigQuery. The use of the SuperQuery proxy can be extended to include a wide variety of detailed execution plan information and draw on the benefits the system provides.

Happy cost monitoring (and saving)!

Subscribe to updates, news and more.