JESWIN GEORGE

Logo

DATA ANALYST
Download my resume
View My LinkedIn Profile
View My Tableau Public Profile

View My GitHub Profile

Big Query Introduction for beginners

BigQuery Architecture

Partitions in BigQuery

Refer: Introduction to clustered tables - BigQuery docs

Types of partition

Partition cannot be created in any columns in BigQuery, There are some types of partition:

  1. Integer based partition: Here you define a range of integer values 0 to 10, 101 to 10000, etc. If your data comes out of range then all your data will come into unpartitioned-partition data.

Sample query for it is:

CREATE TABLE project-id.dataset_name.new_table
select col1, col2 
from tableA
PARTITION BY RANGE_BUCKET (SID, GENERATE_ARRAY(0,100,10))

RANGE_BUCKET is a keyword here.
0 to 100 is the record count. 0 to 10 one partition. 11 to 20 another partition and so on.

  1. Time-unit based partition
  2. Ingestion time based partition: Auto-generated by bigquery.

NULL data can come up in all three of the above partitions. Partiton cannot be created with string column on bigquery.

Storage Optimization

BigQuery has a storage optimizer that helps arrange data in optimal shape for querying by periodically re-writing files. Files may be written in format which is faster write too but then it will be formatted in a way to do faster querying.

Besides the optimization behind the scenes, there are a few things you can do to further enhance storage. Like partitioning your table when its created. Partitioning is dividing large tables into smaller chunks or partitions in a physically seperate locations. So when you use a where clause in your query, BigQuery only needs to look into the partitions.

Partitons are designed for places where there are large amounts of data and low number of distinct values. A good rule of thumb is partitions are more thab 1GB. If you overdo partitions you will create a lot of metadata, which in turn will slow down your queries.

If you need higher cardinality or you have a smaller table you should do clustering.

Cardinality refers to the number of unique values assigned to a dimension. Some dimensions have a fixed number of unique values. For example, the Device dimension can have up to 3 values — desktop, tablet, mobile. In this example, the cardinality of Device is 3.

High-cardinality dimensions are dimensions with more than 500 unique values in one day. High-cardinality dimensions increase the number of rows in a report, making it more likely that a report hits its row limit, causing any data past the limit to be condensed into the (other) row.

The image below gives difference between partitioning and clustering:

Clustering is like sorting your table on a particular set of columns. For example - clustering our table by CustomerID below will sort our the rows based on CustomerID.

In BigQuery, you are able to choose upto 4 columns to cluster on. This allows for efficient look-up as query engine needs to open files that have cluster keys. For example if we need to see the records of only customer 1, then we only need to open up a particular file.

BigQuery does automatic, free re-clustering to maintain the storage and query optimization.

Partitioning and clustering are often used together.

Refer: BigQuery CLustered tables with example