Solved

Data transformation in BigQuery


Userlevel 2

Hello All, 

I need to build a transformation to load a target table from 80+ source tables (all source tables are staged into BigQuery). Of the 80+ tables, 30+ will be refreshed every 1 hr from their source systems and the remaining ones once a day.

The target table should be refreshed every 1 hr, as soon as data is staged. It will have 350+ columns, the hourly source data volume will be 40M records (90% will be changes to existing data and 10% or less will be new data), and the overall table volume will be around 3B records.

Any best practices or suggestions on designing the transformation for this scenario?

 

Thanks,


Best answer by jasond 9 August 2022, 16:56


10 replies

Userlevel 7
Badge +65

Hi @ravi4r 

have you joined our Lounge section and introduced yourself using this template?

As I understand it, this will be batch data, correct?

Do you have the 80+ source tables on-prem or on Google Cloud?

Have you checked:

  • Eventarc & Workflows?
  • Cloud Composer? (a minimal hourly DAG sketch follows this list)
  • Dataflow?
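If Cloud Composer ends up being the orchestrator, a minimal sketch of the hourly refresh as an Airflow DAG might look like the following (the DAG id, project, and stored procedure are hypothetical placeholders, not something from this thread):

  from datetime import datetime

  from airflow import DAG
  from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

  # Hypothetical hourly DAG that kicks off the BigQuery transformation once the
  # staged data for the hour is expected to be available.
  with DAG(
      dag_id="hourly_target_refresh",
      schedule_interval="@hourly",
      start_date=datetime(2022, 8, 1),
      catchup=False,
  ) as dag:
      refresh_target = BigQueryInsertJobOperator(
          task_id="refresh_target",
          configuration={
              "query": {
                  # Placeholder: call whatever transformation SQL or stored procedure you build.
                  "query": "CALL `my_project.dwh.refresh_target`()",
                  "useLegacySql": False,
              }
          },
      )

A scheduled DAG fits the hourly batch pattern described above; Eventarc & Workflows would be the alternative if the staging jobs emit completion events instead.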
Userlevel 2

Hi @ilias, thank you for responding. I really appreciate it. I just submitted my intro using the template format. 

My 80+ tables are available in GCP BigQuery (brought over from on-prem through a custom Python framework). It's an NRT approach (data loads every 1 hr). I will mostly have at most a 1 hr window to finish transforming the current data into my final table before the next run. Each run may have a volume of 40M records to be processed. 

So, I wonder if there is any way to process them in multiple threads/batches in parallel so that the data volume per thread/batch is split into small chunks; also, running them in parallel would reduce the overall data processing time. I am exploring the options available in Dataflow for now. I will also go through Eventarc & Workflows (but I am not sure whether that will work, as it's not streaming data). 
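For reference, a minimal sketch of that chunking idea with the google-cloud-bigquery client, assuming hypothetical per-source delta tables and writing each chunk to its own intermediate table so the jobs genuinely run in parallel on the BigQuery side:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Hypothetical hourly delta tables; each job writes its own intermediate table,
  # so BigQuery can execute the chunks concurrently.
  delta_tables = ["orders_delta", "customers_delta", "payments_delta"]

  jobs = []
  for name in delta_tables:
      job_config = bigquery.QueryJobConfig(
          destination=f"my_project.staging.{name}_transformed",  # per-chunk output table
          write_disposition="WRITE_TRUNCATE",
      )
      # Placeholder transformation; the real SQL would shape each delta for the target.
      sql = f"SELECT * FROM `my_project.staging.{name}`"
      jobs.append(client.query(sql, job_config=job_config))  # submit without waiting

  for job in jobs:
      job.result()  # block only here, after every chunk job has been submitted

A single consolidating MERGE into the final table would then run once all chunk jobs finish, since BigQuery queues concurrent mutating DML against the same target table rather than running it fully in parallel.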

 

Thank you again. 

Userlevel 1

Hi @ravi4r,

A few considerations for you:

  1. Have you looked at Dataform (upcoming) or dbt (currently available) as a way to declaratively run your transformations? This would give you a way to specify materializations as code and then run them on an interval (say hourly) for any new data. It would also allow you to parallelize chunks of work. I imagine you are not running one query across all 80 tables at once but have incremental tables before you reach your end destination, so please let me know if that assumption is not accurate.
  2. You could also look into materialized views if you want a more automated approach. I would consider whether your 1-hour processing window is a strong SLA or a general time period you would like the data to be refreshed within.
  3. In terms of patterns, if you have this many updates and you do not want to track changes over time, I would consider a delete/insert or re-materialization of the objects. Updates are the most expensive database operations to perform.
  4. Using clustering and partition keys in BigQuery will help to reduce scan volumes and therefore improve performance. You may consider these as part of your staging table and destination table designs (a combined sketch of points 3 and 4 follows this list).
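A minimal combined sketch of points 3 and 4, assuming a hypothetical target `dwh.target` with illustrative columns (id, event_date, col_a, updated_at) and an hourly staging table `staging.hourly_delta`:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Hypothetical target: partitioned by event date and clustered by the merge key,
  # so the hourly refresh does not have to scan all ~3B rows.
  client.query("""
  CREATE TABLE IF NOT EXISTS `my_project.dwh.target`
  (
    id STRING,
    event_date DATE,
    col_a STRING,
    updated_at TIMESTAMP
  )
  PARTITION BY event_date
  CLUSTER BY id
  """).result()

  # Delete/insert-style refresh expressed as a MERGE; the date predicate on the
  # target side is there to limit how many partitions the statement touches
  # (it assumes changed rows only ever arrive for recent partitions).
  client.query("""
  MERGE `my_project.dwh.target` AS t
  USING `my_project.staging.hourly_delta` AS s
  ON t.id = s.id
     AND t.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
  WHEN MATCHED THEN
    UPDATE SET col_a = s.col_a, updated_at = s.updated_at
  WHEN NOT MATCHED THEN
    INSERT (id, event_date, col_a, updated_at)
    VALUES (s.id, s.event_date, s.col_a, s.updated_at)
  """).result()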
Userlevel 7
Badge +65

Hi @ravi4r,

have you checked @jasond’s answer? I think it will help you a lot.

Thanks, @jasond

Userlevel 4
Badge +4


@ravi4r hope you got great answers from our experts. Please also consider the latest Pub/Sub direct-to-BigQuery delivery for landing the data, so we can do the transformation after setting up the foundation in BigQuery. 
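For anyone exploring that route, a rough sketch of creating a BigQuery subscription with the Pub/Sub client library (the project, topic, subscription, and landing table names are made up for illustration, and the exact config fields are worth double-checking against the current client docs):

  from google.cloud import pubsub_v1

  subscriber = pubsub_v1.SubscriberClient()
  topic_path = subscriber.topic_path("my_project", "source-events")
  subscription_path = subscriber.subscription_path("my_project", "source-events-to-bq")

  # BigQuery subscription: Pub/Sub writes messages straight into a landing table,
  # which the hourly transformation can then pick up.
  subscriber.create_subscription(
      request={
          "name": subscription_path,
          "topic": topic_path,
          "bigquery_config": {
              "table": "my_project.landing.source_events",  # hypothetical landing table
              "write_metadata": True,  # keep publish time, message id, attributes
          },
      }
  )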

Userlevel 7
Badge +65

Hi @ravi4r,

have you checked @jasond's and @kumarchinnakali's answers? Are any of them helpful? Need something more?

Userlevel 2

Hi @jasond, Thank you for your inputs. I really appreciate it. 

We are working in the direction of using dbt. But it's at a very early stage as of now. We need to see whether we can get the performance we are expecting in terms of processing the data within the next cycle of data publish. 

Userlevel 1

Anytime! As you design your processes, consider that smaller chunks typically run faster than one big query. This will help with parallelization as well (and design your source tables to use partitions/clusters to reduce scans).
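A quick way to sanity-check the scan reduction is a dry run, e.g. (table and column names are illustrative):

  from google.cloud import bigquery

  client = bigquery.Client()

  # Compare estimated bytes scanned with and without a partition filter,
  # without actually running either query.
  queries = {
      "full scan": "SELECT id, col_a FROM `my_project.dwh.target`",
      "partition-pruned": """
          SELECT id, col_a
          FROM `my_project.dwh.target`
          WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      """,
  }

  for label, sql in queries.items():
      job = client.query(
          sql,
          job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
      )
      print(label, job.total_bytes_processed, "bytes")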

Userlevel 3
Badge +3


You should evaluate Dataform too, in addition to dbt. I hope there will be tighter integration with Google's services over time. Dataform will be GA in Q1 2023, I understand. If you have a Google Cloud contact for your company, they should be able to provide you access to dataform.co now. 

If you want a more generic tool, dbt is a good bet. 

40M does not seem to be a lot for one hour, but the complexity of the transformations might make it tight for the 1 hr SLA. Will 2K slots be enough for you to run the jobs in parallel? Something to evaluate as part of your POC.
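One way to gauge that during the POC is to look at the slot consumption of recent jobs via INFORMATION_SCHEMA (region and look-back window are illustrative):

  from google.cloud import bigquery

  client = bigquery.Client()

  # Average slots used per query job over the last day: total slot-milliseconds
  # divided by the job's elapsed milliseconds.
  rows = client.query("""
  SELECT
    job_id,
    SAFE_DIVIDE(total_slot_ms,
                TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    AND job_type = 'QUERY'
    AND state = 'DONE'
  ORDER BY avg_slots DESC
  LIMIT 20
  """).result()

  for row in rows:
      print(row.job_id, row.avg_slots)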

As @jasond says, updates are expensive. Design your transformations carefully to see if you can do deletes/inserts. 

 

Userlevel 7
Badge +65

Thank you for your helpful posts @jasond, @kumarchinnakali and @sathyaram_s 😀
