By Vadim Solovey, Google Developer Expert.

Recently, I came across an interesting benchmark of BigData systems based on "A Comparison of Approaches to Large-Scale Data Analysis" by Pavlo et al. (SIGMOD 2009). Based on the benchmark methodology, the APMLab guys from Berkeley University developed an open-source software that allows anyone to run this benchmark using public cloud (AWS in this case).

This benchmark measures response time on a handful of relational queries: scans, aggregations, and joins across different data sizes. They have an impressive website with benchmark results, comparing Amazon Redshift, Hive, Shark, Impala and Stinger/Tez.

Since a lot of my time is devoted to working with Google BigQuery, I was intrigued how BigQuery would stand being compared to other solutions based on exactly the same dataset. Clearly, BigQuery has a very different nature since it's shared service and not dedicated, customer-deployed (and maintained) solution. 

Despite this nature of Google BigQuery, the results are extremely surprising (in my humble opinion).

The Dataset

The benchmark allows 3 different sizes of datasets - tiny, 1node and 5nodes. The largest dataset is 5nodes which has ‘rankings’ table with 90 million records and ‘uservisits’ table with 775 million records. The data is generated using Intel's Hadoop benchmark tools. The data itself are available at s3://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]. I used text version of the data which is in CSV format, thus making its very easy to load into BigQuery.

In general, there are two tables with the following schemas:

Ranking table schema:

(Lists websites and their page rank)

  • pageURL (STRING)
  • pageRank (INTEGER)
  • avgDuration (INTEGER)

Uservisits table schema:

(Stores server logs for each web page)

  • sourceIP (STRING)
  • destURL (STRING)
  • visitDate (STRING)
  • adRevenue FLOAT
  • userAgent (STRING)
  • countryCode (STRING)
  • languageCode (STRING)
  • searchWord (STRING)
  • duration (INTEGER)

The Workload

So, once again, the benchmark measures response time on a handful of relational queries: scans, aggregations and joins. Some of the query results are larger than 128MB and violate the quota policy. To overcome this, I was using a destination table to store the query results with setting allowLargeResults to true.

Another note would be on use of EACH modifier for GROUP BY and JOIN operators. This is required when the dataset is big and there is a high cardinality aggregation or join.


1. Scan Query

This query has 3 permutations, when X is 1000, X is 100 and then X is 10:

SELECT pageURL, pageRank FROM [benchmark.rankings] WHERE pageRank > X


2. Aggregation Query

This query has 3 permutations, when X is 8, X is 10 and then X is 12:

SELECT SUBSTR(sourceIP, 1, X) AS srcIP, SUM(adRevenue)
FROM [benchmark.uservisits]
GROUP EACH BY srcIP


3. Join Query

Like other queries earlier, this one also has 3 permutations: X is ’1980-04-01’, X is ’1983-01-01’ and X is ’2010-01-01’:

SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank) AS pageRank
FROM [benchmark.rankings] R
JOIN EACH(SELECT sourceIP, destURL, adRevenue
FROM [benchmark.uservisits] UV WHERE UV.visitDate > "1980-01-01" AND UV.visitDate < X)
NUV ON (R.pageURL = NUV.destURL)
GROUP EACH BY sourceIP
ORDER BY totalRevenue DESC LIMIT 1


4. External Script Query

The original benchmark also measuring performance of UDF (user defined functions). As of today, BigQuery doesn't support UDF but that can change soon and when it does, I am going to update the benchmark with the UDF performance results.


The Results

Occasional query latency spikes are expected with BigQuery due to its shared nature, so I executed each query 10 times (once per hour) and the results are median response time (in seconds) 

Query 1AQuery 1BQuery 1C
Redshift2.492.619.46
Impala (Disk)12.01512.01537.085
Impala (Mem)2.173.0136.04
Shark (Disk)6.6722.4
Shark (Mem)1.71.83.6
Hive50.4959.9343.34
Tez28.2236.3526.44
BigQuery4.614.611.4
Query 3AQuery 3BQuery 3C
Redshift33.2946.08168.25
Impala (Disk)108.68129.815431.26
Impala (Mem)41.2176.005386.6
Shark (Disk)111.7135.6382.6
Shark (Mem)44.767.3318
Hive561.14717.562374.17
Tez323.06402.331361.9
BigQuery9.39.111.2
Query 2AQuery 2BQuery 2C
Redshift25.4656.5179.15
Impala (Disk)113.72155.31277.53
Impala (Mem)84.35134.82261.015
Shark (Disk)151.4164.3196.5
Shark (Mem)83.7100.1132.6
Hive730.62764.95833.3
Tez377.48438.03427.56
BigQuery15.124.411.4
Query 4AQuery 4BQuery 4C
RedshiftN/AN/AN/A
Impala Diskuntesteduntesteduntested
Impala Memuntesteduntesteduntested
Shark Disk232.247.2279.4
Shark Mem162.928.1191.4
Hive896.47150.481047.45
Tez894.1662.6966.18
BigQueryTBDTBDTBD

Looking on the results, it's clear that BigQuery really shines with aggregations (queries 2A, 2B and 2C) and joins (queries 3A, 3B and 3C). In these two queries BigQuery is taking the 1st place among its rivals. 

The first query is pretty fast as well, however due the result size limitation and thus the need to store the query results in temporary table, the performance is not as good as it can be.

Nevertheless, BigQuery is no-maintenance, no-setup and pay-as-you-go BigData solution with awesome performance. I recommend you take a look at that and consider it for your next big data challenge.  

Posted
AuthorVadim Solovey
Categorieshomepage