JESWIN GEORGE


DATA ANALYST

How to optimize queries in GCP BigQuery?

How does query processing work in BigQuery?

Computation in BigQuery is done by Dremel. Workers, called slots, read data from storage and perform aggregations. To increase performance, these slots work independently of one another: instead of communicating directly, they exchange intermediate results through a remote, distributed in-memory shuffle tier.

In BigQuery, each shuffled row (i.e., each aggregated row) can be consumed by a slot (worker) as soon as it is created, which makes it possible to run distributed operations as a pipeline.

All of the data movement mentioned above is carried out over Jupiter, Google's super-fast, petabit-scale network.

Optimizing BigQuery Queries

Do not SELECT unnecessary columns.

Because BigQuery storage is columnar, it is well optimized for running aggregates over a few columns, but not for returning large amounts of data across many rows.

Selecting only the columns you need improves performance by reducing both the bytes scanned and the bytes sent to subsequent stages.
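For example (using a hypothetical `project.dataset.orders` table), select only the columns you need rather than everything:

```sql
-- Scans every column in the table:
SELECT * FROM `project.dataset.orders`;

-- Scans only two columns, reducing bytes processed:
SELECT order_id, order_total
FROM `project.dataset.orders`;

-- If you need most columns but want to drop a few wide ones,
-- BigQuery's SELECT * EXCEPT syntax avoids listing them all:
SELECT * EXCEPT (raw_payload, debug_info)
FROM `project.dataset.orders`;
```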

Partitioning and Clustering

Partitioning and clustering let BigQuery scan only those blocks of data that are actually needed, so by leveraging them we can prune out the data we don't need. This means putting WHERE clauses on the partitioning and clustering columns. Make sure the columns are used in the correct order: filter on the partitioning column first, then on the clustering columns in the order in which they were defined.
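As a sketch (the table and column names here are hypothetical), a partitioned and clustered table and a query that prunes against it might look like:

```sql
-- Table partitioned by day and clustered on two columns:
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, event_type AS
SELECT * FROM `project.dataset.events_raw`;

-- Filtering on the partition column prunes whole partitions;
-- filtering on the clustering columns (in their defined order)
-- prunes blocks within each partition:
SELECT event_type, COUNT(*) AS num_events
FROM `project.dataset.events`
WHERE DATE(event_timestamp) BETWEEN '2023-01-01' AND '2023-01-31'
  AND country = 'US'
GROUP BY event_type;
```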

You can check which columns a table is partitioned and clustered on in the table's Details panel in the BigQuery console.

Filter early, filter often

Filter your data as early as possible to reduce what needs to be passed on to subsequent steps.
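For instance (again with hypothetical table names), filtering inside a subquery reduces the rows that have to be shuffled into a join. BigQuery's optimizer can often push predicates down on its own, but writing the filter early makes the intent explicit:

```sql
-- The WHERE inside the subquery shrinks the orders table
-- before it is joined, rather than after:
SELECT c.customer_name, SUM(o.order_total) AS total_spent
FROM `project.dataset.customers` AS c
JOIN (
  SELECT customer_id, order_total
  FROM `project.dataset.orders`
  WHERE order_date >= '2023-01-01'   -- filter early
) AS o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
```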

Aggregations

Let's look at how BigQuery executes a GROUP BY clause. Each slot computes partial aggregates for its chunk of the data, then rows are shuffled by a hash of the grouping key so that all rows for a given key land on the same slot, which produces the final aggregate.
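A minimal aggregation that follows this pattern (using a hypothetical `project.dataset.events` table):

```sql
-- Each slot reads a chunk of the table and computes partial counts;
-- rows are then shuffled by a hash of the grouping key (country)
-- so one slot can finalize the count for each country:
SELECT country, COUNT(*) AS num_events
FROM `project.dataset.events`
GROUP BY country;
```

The fewer distinct values the grouping key has, the less data crosses the shuffle tier, which is why aggregating on low-cardinality columns tends to be cheap.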

JOINs

If you join two large tables, BigQuery favours a hash/shuffle join.

Similar to the aggregation case we discussed above, BigQuery uses a hash function to shuffle the left and right tables so that rows with matching keys end up on the same slot (worker), where the data can then be joined locally.

But if one of your tables is small enough to fit in memory, there is no need to shuffle large amounts of data.

In that case BigQuery uses a broadcast join: it sends the small table to every slot that is processing the larger table.
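Both strategies are chosen automatically by BigQuery; the table names below are hypothetical, but the shape of each join illustrates the difference:

```sql
-- Large-to-large join: both sides are shuffled on the join key
-- so matching rows meet on the same slot (hash/shuffle join):
SELECT o.order_id, c.customer_name
FROM `project.dataset.orders` AS o
JOIN `project.dataset.customers` AS c
  ON o.customer_id = c.customer_id;

-- Large-to-small join: the small lookup table is broadcast to
-- every slot, so the large table never needs to be shuffled:
SELECT o.order_id, s.status_label
FROM `project.dataset.orders` AS o
JOIN `project.dataset.status_codes` AS s   -- small dimension table
  ON o.status = s.status_code;
```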

Now, to optimize join queries:

Tweaking the way you do filtering and ordering

Using ORDER BY/RANK
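One pattern worth sketching here (with hypothetical table names): a global ORDER BY over a large result set must gather everything for the final sort, so pairing it with LIMIT, or ranking within groups instead, keeps the work distributed:

```sql
-- ORDER BY with LIMIT lets BigQuery keep only the top N rows
-- per slot instead of fully sorting the whole result:
SELECT order_id, order_total
FROM `project.dataset.orders`
ORDER BY order_total DESC
LIMIT 100;

-- To rank within groups, filter on a window function with
-- BigQuery's QUALIFY clause rather than ordering the full table:
SELECT customer_id, order_id, order_total
FROM `project.dataset.orders`
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id ORDER BY order_total DESC
) <= 3;
```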