
BigQuery Omni as a migration assistant

16 August 2021

1 List of acronyms and terms 

AWS—Amazon Web Services

GCP—Google Cloud Platform

GCS—Google Cloud Storage

S3 bucket—public cloud storage resource available in Amazon Web Services

 

2 Introduction

BigQuery is a serverless and highly scalable data warehouse. It lets you query petabytes of data and gain insights within minutes, and it provides an ecosystem for different kinds of analytics, such as business intelligence, geospatial analysis, machine learning, and AI.


BigQuery Omni is a multi-cloud analytics solution that lets you query data residing on other cloud computing platforms such as AWS and Azure. At SoftServe, we have been working with BigQuery Omni since its launch and found that it can be used to simplify data migration challenges.
 

3 Data reconciliation

During the migration process, it may be difficult to avoid various kinds of data corruption. Business metrics may lose precision, records may be truncated, and some data may simply be lost. To be sure that no important information was lost during the migration, we need to perform data reconciliation. It usually consists of running checks against the source and target datasets and comparing the results. The checks may be based on aggregations, such as row counts, or on comparing hashes of columns, tables, and other properties.
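As a minimal sketch of an aggregation-based check (the table and column names here are hypothetical), a query like the one below can be run once against the source dataset and once against the target dataset, and the two result rows compared:

SELECT
  COUNT(*) AS row_count,
  COUNT(DISTINCT trip_id) AS distinct_trips,
  SUM(trip_duration) AS total_duration
FROM source_dataset.trips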

 

4 BigQuery Omni use case

Imagine a situation where you migrate data from Azure to BigQuery. Most of Microsoft's cloud database solutions use a SQL dialect that differs from BigQuery SQL in many ways. In this case, you would need to write different scripts for the same purpose to verify that the data in Azure matches the migrated data. You can avoid that with BigQuery Omni, because the same script can query data both in Azure and in GCP.


In most cases, a migration is not a simple activity. In the real world there may be tens, hundreds, or even thousands of processes using the source data, and they cannot all be moved at the same time as the data itself. Moving production data may take days or even weeks. During that period, data consistency between both environments is crucial to ensure consistent outputs for all systems and processes.


In some cases, moving the data just to run a simple analysis of log files or other datasets may not be reasonable. What if you only need to work on a few aggregates? What if you could simply run a SQL query and transfer only the information that is really required?


This is where BigQuery Omni makes a difference. Before BigQuery Omni, the data reconciliation process might look like the following:
1. Export data from the source system

2. Use Storage Transfer Service to copy the data from the source to Google Cloud Storage

3. Load the data into BigQuery

4. Run the same checks on both datasets and compare the results



It may look uncomplicated and easy to implement; however, there are a few hidden costs to consider:
1. All the data needs to be duplicated, which doubles the storage cost

2. The data needs to be transferred between different clouds and regions, which incurs egress transfer costs

3. The data transfer takes time, which adds to the overall duration of the operation

BigQuery Omni removes these additional costs and makes it possible to work in a truly multi-cloud environment.

 

5 General architecture of the solution

Architecture of the solution

Let's assume that the migration is done by extracting data from a data warehouse and other sources to an S3 bucket. The data is then transferred to GCS and loaded into BigQuery with some transformations. BigQuery Omni is connected to the S3 bucket that holds the extracted data and acts as a single point of access to those files, making them queryable with SQL. All validation scripts can therefore be executed in one place (on the GCP side, in BigQuery), and the validation metrics can be compared within a single environment. It is even possible to generate a validation report or implement a data validation dashboard if the migration process is long (a year or more).
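For illustration, the S3 files can be exposed to BigQuery Omni as an external table roughly like this (a sketch with placeholder connection, dataset, and bucket names; the dataset has to reside in the same AWS region as the Omni connection):

CREATE EXTERNAL TABLE aws_dataset.cycle_hire_s3
WITH CONNECTION `aws-us-east-1.omni_s3_connection`
OPTIONS (
  format = 'CSV',
  uris = ['s3://my-source-bucket/london_bicycles/*.csv']
);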

 

6 Data validation example

In our example, we are using the London Bicycles dataset. The data files are stored in our AWS S3 bucket.

Files in AWS S3 bucket

And there are files in a GCS bucket.

Files in GCS bucket

First, we compare the S3 file and the GCS file by record counts. This is the highest-level and simplest check, and it already gives an idea of the degree of data corruption (if the counts don't match, something has definitely gone wrong).
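Assuming the placeholder table names from the sketch above, the count check is simply two separate queries whose results are compared afterwards:

-- Row count of the S3-based data, executed through BigQuery Omni
SELECT COUNT(*) AS count_rows FROM aws_dataset.cycle_hire_s3;

-- Row count of the migrated data in native BigQuery
SELECT COUNT(*) AS count_rows FROM gcp_dataset.cycle_hire;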

Compared counts

It looks like the results are identical. Next, let's compare the tables by hashes, which shows whether the table contents match. The accuracy of this method depends on the hash functions used and, in theory, it may not always be exact because of possible collisions.

Compared hashes (not-matched case)

This check revealed that the hashes don't match, so the contents differ somewhere. The source of the mismatch can be located with more granular queries against individual columns. In our particular case, the problem was the rounding of the 'longitude' column. After fixing the issue, the hashes became the same:

Compared hashes (matched case)
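To narrow such a mismatch down to a single column (as we did for 'longitude'), per-column fingerprints can be compared in the same way. A sketch with placeholder names, run against both the Omni table and the native table:

SELECT
  BIT_XOR(FARM_FINGERPRINT(CAST(rental_id AS STRING))) AS rental_id_hash,
  BIT_XOR(FARM_FINGERPRINT(CAST(longitude AS STRING))) AS longitude_hash
FROM aws_dataset.cycle_hire_s3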

You can compare the results automatically by writing assert statements in your preferred programming language. You can't do it with BigQuery SQL alone because of service restrictions, which we'll discuss in the following section.

 

7 BigQuery Omni cross-cloud joins restriction

If you want to run data checks across different cloud platforms simultaneously within a single SQL query, you will face a challenge. You may try to combine the results with a join, or to store them in variables declared through the scripting functionality; either way, it won't work. At the time of writing, BigQuery Omni does not support cross-cloud joins, because the AWS data resides in a different region than the GCP data and cross-region (and therefore cross-dataset) joins are not available.
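For example, a query like the following (with placeholder dataset names) is rejected, because the Omni dataset is pinned to an AWS region while the native dataset lives in a GCP region:

-- Fails at the time of writing: the datasets are in different regions/clouds
SELECT a.count_rows, g.count_rows
FROM aws_dataset.validation_results AS a
JOIN gcp_dataset.validation_results AS g
  ON a.table_name = g.table_name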

This challenge can be solved with different techniques, but at the core of all of them is using the Data Transfer Service (DTS) to transfer the aggregated data from AWS to GCP.

 

8 Solution to remove the restriction: General overview

We created a workaround to mitigate the restriction. For the sake of simplicity, we made the solution as lightweight and cheap as possible, concentrating primarily on BigQuery Omni capabilities with a small set of supporting services around it. The design relies on the following GCP services (apart from BigQuery Omni itself):

  • Cloud Build and Terraform for deployment automation
  • Cloud Functions as a computational environment
  • Data Transfer Service as a bridge between AWS and Google Cloud Platform
  • Cloud Storage as a landing for incoming data

On the AWS side we can stick to S3 only, because within the scope of this article the original data source doesn't really matter. We assume that the data in S3 is provided in one of the formats supported by BigQuery external (federated) tables (CSV, JSON, Avro, Parquet, ORC).

 

9 Solution to remove the restriction: Technical details

We created a set of Terraform templates for infrastructure provisioning and orchestrated them with Cloud Build. Below is a code sample of the template fragment that provisions the S3 read policy.

resource "aws_iam_policy" "bigquery-omni-connection-read-policy" {
name = "bigquery-omni-connection-read-policy"

policy = <<-EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AmazonS3ReadOnlyAccess",
"Effect": "Allow",
"Action": [
"s3:Get*",
"s3:List*"
],
"Resource": [
"arn:aws:s3:::${var.bqo_bucket_name}",
"arn:aws:s3:::${var.bqo_bucket_name}/*",
"arn:aws:s3:::${var.source_bucket_name}",
"arn:aws:s3:::${var.source_bucket_name}/*"]
}
]
}
EOF
}

And here is a code sample from the part of the Cloud Build config file responsible for the AWS setup.

- id: 'tf init aws'
  name: 'hashicorp/terraform:0.14.10'
  entrypoint: 'sh'
  args:
    - '-c'
    - |
      cd terraform/aws
      echo "##### TERRAFORM INIT AWS PART 1 #####"
      terraform init
  secretEnv: ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']

- id: 'tf plan aws part 1'
  name: 'hashicorp/terraform:0.14.10'
  entrypoint: 'sh'
  args:
    - '-c'
    - |
      cd terraform/aws
      echo "##### TERRAFORM PLAN AWS PART 1 #####"
      terraform plan
  secretEnv: ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']

- id: 'tf apply aws part 1'
  name: 'hashicorp/terraform:0.14.10'
  entrypoint: 'sh'
  args:
    - '-c'
    - |
      if [ $BRANCH_NAME = "master" ]; then
        cd terraform/aws
        echo "##### TERRAFORM APPLY AWS PART 1 #####"
        terraform apply -auto-approve
      else
        echo "Skipping TERRAFORM APPLY AWS PART 1!"
        echo "Branch $BRANCH_NAME is not an official environment"
      fi
  secretEnv: ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']

As soon as the connections and access were provisioned, we proceeded with the computational environment setup. Our design includes three Cloud Functions, which are described below.

The function called validate_aws_data is responsible for executing the validation queries against the AWS data (using BigQuery Omni), and the function called validate_gcp_data does the same for the GCP (native BigQuery) data. As mentioned earlier, the simplest check compares row counts and table hashes across both the AWS and GCP data. It can be done with the following query:

SELECT
  COUNT(1) AS COUNT_ROWS,
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS CHECKSUM
FROM <source> AS t

The function that executes the query against the GCP data immediately saves the result as a native BigQuery table. In the function that executes the query against the AWS data, we need to wrap it in an additional "EXPORT DATA WITH CONNECTION" statement and provide the BigQuery Omni connection ID, so the query runs against the S3-based data and writes its output back to S3 as well. After that, we need to invoke the Data Transfer Service to move the data produced by the "EXPORT DATA" statement from S3 to Google Cloud Storage.
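A sketch of such an export, with placeholder connection, bucket, and table names (the destination uri must point to S3 and contain a wildcard):

EXPORT DATA WITH CONNECTION `aws-us-east-1.omni_s3_connection`
OPTIONS (
  uri = 's3://my-omni-bucket/validation_results/*',
  format = 'CSV',
  overwrite = true
) AS
SELECT
  COUNT(1) AS COUNT_ROWS,
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS CHECKSUM
FROM aws_dataset.cycle_hire_s3 AS t;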

The load_aws_validation_results Cloud Function has a Cloud Storage bucket-based trigger and is invoked each time the aforementioned Data Transfer job moves BigQuery Omni output from S3 to Cloud Storage. The function does nothing more than ingest the files from Cloud Storage into BigQuery, which puts the results of both comparisons in one place (in the form of native BigQuery tables) and thereby mitigates the absence of cross-region (and cross-cloud) joins in BigQuery.
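One way the ingestion could be expressed, assuming a placeholder landing bucket and target table (the function could equally use the BigQuery load API instead of SQL):

LOAD DATA INTO gcp_dataset.aws_validation_results
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-landing-bucket/validation_results/*']
);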

 

Authors:

Vitaliy Bashun, Senior Solutions Architect, SoftServe Inc

Artem Plyusch, Senior Data Engineer, SoftServe Inc

Mykola Borysov, Senior Data Engineer, SoftServe Inc

Marcin Siudzinski, Lead Data Engineer, SoftServe Inc

Roman Zolotar, Data Engineer, SoftServe Inc

