Computation is done via Dremel. Workers, or slots, are used to extract data from storage and perform aggregations. To increase performance, each of these slots works independently of the others. Instead of passing data to each other directly, they use a remote memory shuffle, i.e., the distributed memory shuffle tier (as shown in the image below), to store intermediate results.
In BigQuery, each shuffled row (i.e., each aggregated row) can be consumed by a slot (worker) as soon as it is created, which makes it possible to run distributed operations as a pipeline.
All the data movement mentioned above is facilitated by Jupiter (the petabit network in the image), which is a super-fast, Google-scale network.
Because BigQuery storage is columnar, it's well optimized for running aggregates but not for returning large amounts of data across rows. Selecting only the columns you actually need improves performance by reducing both the bytes scanned and the bytes sent to subsequent stages, as in the sketch below.
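A minimal sketch of this idea; the table and column names are hypothetical:

```sql
-- Hypothetical table: `project.dataset.sales`.
-- SELECT * would read every column block; naming only the columns you
-- need means only those blocks are read from columnar storage.
SELECT store_id, SUM(amount) AS total_amount
FROM `project.dataset.sales`
GROUP BY store_id;
```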
Partitioning and clustering help us scan only those blocks of data that are needed, so by leveraging them we can prune out the data we don't need. This means using WHERE clauses on the partitioning and clustering columns: filter on the partitioning column first, and then on the clustering columns in the order they were defined (see the sketch below).
You can check a table's partitioning and clustering columns in its table info:
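A minimal sketch of pruning with partitioning and clustering; the table, column names, and filter values are hypothetical:

```sql
-- Hypothetical table partitioned by order date and clustered by customer_id.
CREATE TABLE `project.dataset.orders`
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id AS
SELECT * FROM `project.dataset.orders_raw`;

-- Filter on the partitioning column, then the clustering column,
-- so BigQuery can prune partitions and skip clustered blocks.
SELECT order_id, amount
FROM `project.dataset.orders`
WHERE DATE(order_ts) BETWEEN '2023-01-01' AND '2023-01-31'  -- partition pruning
  AND customer_id = 'C-1001';                               -- cluster pruning
```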
Filter your data as early as possible to reduce what needs to be passed on to subsequent steps. For example:
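A sketch of early filtering, again with hypothetical table and column names:

```sql
-- The filter lives inside the subquery, so the join and the aggregation
-- that follow receive far fewer rows.
WITH recent_orders AS (
  SELECT order_id, customer_id, amount
  FROM `project.dataset.orders`
  WHERE DATE(order_ts) >= '2023-01-01'   -- applied before the join
)
SELECT c.region, SUM(o.amount) AS revenue
FROM recent_orders AS o
JOIN `project.dataset.customers` AS c
  ON o.customer_id = c.customer_id
GROUP BY c.region;
```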
Let's look at how BigQuery executes a GROUP BY clause. BigQuery applies a hash function to the grouping key (a title column, say) so that all rows with the same key end up in the same slot, where they can be aggregated locally. We can see this second step happening in the 2nd stage of the execution details.
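A minimal query of this shape, using a hypothetical table with a title column:

```sql
-- Rows are hashed on the grouping key (title here), so every row with
-- the same title lands on the same slot before the counts are merged.
SELECT title, COUNT(*) AS row_count
FROM `project.dataset.page_views`   -- hypothetical table
GROUP BY title;
```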
If you try to join two large tables, BigQuery favours a HASH/SHUFFLE join. Similar to the aggregation we discussed above, BigQuery uses a hash function to shuffle the left and right tables so that rows with matching keys end up on the same slot (worker), where the data can then be joined locally.
But if one of your tables is small enough to fit in memory, it's not necessary to shuffle large amounts of data. In this case BigQuery uses a broadcast join: it sends the small table to every slot that is processing the larger table.
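A sketch of a join where one side is small enough to be broadcast; all names are hypothetical:

```sql
-- `dim_stores` is small enough to fit in memory, so it can be sent to
-- every slot scanning the large `sales` table instead of shuffling
-- both sides of the join.
SELECT d.store_name, SUM(s.amount) AS total_sales
FROM `project.dataset.sales` AS s        -- large table
JOIN `project.dataset.dim_stores` AS d   -- small table, broadcast to slots
  ON s.store_id = d.store_id
GROUP BY d.store_name;
```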
Now, to optimize join queries:
Place the largest table first in the join, followed by the smallest, and then the remaining tables in decreasing order of size.
Apply WHERE clauses as early as possible so that the slots performing the join work with the least amount of data. You also want to make sure you use filters on both tables, as shown below; this way you ensure that the data you don't need is eliminated from both sides of the join.
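A sketch of filtering both sides of a join, with hypothetical tables and dates:

```sql
-- The date filter is applied to both sides of the join, so each input
-- is reduced before any rows are joined.
SELECT a.user_id, COUNT(*) AS events
FROM `project.dataset.events` AS a        -- larger table first
JOIN `project.dataset.sessions` AS b      -- smaller table second
  ON a.session_id = b.session_id
WHERE DATE(a.event_ts) >= '2023-06-01'    -- filter on the left table
  AND DATE(b.started_at) >= '2023-06-01'  -- filter on the right table
GROUP BY a.user_id;
```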
Finally, you can cluster or partition your tables on common join keys so that less data is scanned and the query executes faster; clustering in particular greatly improves efficiency.
When you run an ORDER BY statement you might run into a "resources exceeded" error. This is because the final sort for your query must occur on a single slot, and if you are attempting to sort a very large result set, that final sort can overwhelm the slot.
Using ORDER BY or RANK increases computation because the data has to be sorted, then computed, then stored. It's best to do this at the end of the query, when the operation runs on a smaller subset of the data and consumes fewer slots.
We can also use LIMIT to restrict the number of rows taken from each set of slots, improving performance because the computation is done on fewer rows. A combined sketch:
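The example below puts both ideas together; table and column names are hypothetical:

```sql
-- Aggregate and filter first, then sort only the reduced result and cap
-- it with LIMIT so the final single-slot sort handles as few rows as
-- possible.
SELECT customer_id, SUM(amount) AS total_spend
FROM `project.dataset.orders`
WHERE DATE(order_ts) >= '2023-01-01'
GROUP BY customer_id
ORDER BY total_spend DESC   -- sorting happens last, on the smaller result
LIMIT 100;                  -- only the top 100 rows are kept
```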