BigQuery Table design | C2C Community
Solved

BigQuery Table design

  • 20 August 2021
  • 10 replies
  • 171 views

Userlevel 2
Badge

Hello,

 

I am collecting some data for analytics and the intent is to store it in the BQ DB. Need some guidance on structuring the table so that I can utilize BQ ML for analytics.

I have reviewed the best practices document of do’s and don’ts. What I am interested is in guidance on table design to optimize some repetitive data and some of the data is in JSON. Please reach out if you are in a position to help!

https://www.linkedin.com/in/srinivasan-vanamali-2507004/

 

icon

Best answer by sayle_doit 6 May 2022, 16:17

View original

10 replies

Userlevel 7
Badge +26

@jaclyn_wood would someone on your team have some insight here?

Userlevel 2
Badge

@Marissa  can I get some assistance with the above question,, This community isn’t what I thought would be collaborate and help each other out. 

Userlevel 7
Badge +22

Hi @Mali , pumping this conversation to the top of the list to see if anybody in the community has the concrete expertise to answer you!.

I’ll have a look to see if I can bring someone with the knowledge to answer your question. Stay tuned!

Cheers,

Userlevel 1

Hi @Mali, I am in the same boat too. I have a mobile app which stores data in Firestore. The data is in JSON format but nested and very much complicated to have relational structure for reporting datawarehouse in bigquery. So we are working on a cloud function based ETL which extracts data from firestore and transforms/loads to bigquery in much readable, relational model. It is a lot of work and I am sure there must be a better way of doing it. I would be happy to be a part of any knowledge sharing and discussions. 

regards,

Manish

Userlevel 2
Badge

@parentiv Any progress with your efforts. I am still struggling to find some good resource to help me with the table design. I have designed a microservice that extracts data using REST APIs. Some of the data are nested and repeated. Glad to jump on a call. I am on US EST. Ping me at mali@olympus-infotech.com

Userlevel 1

One thing I would recommend is looking into the JSON native data type for BQ here: https://cloud.google.com/bigquery/docs/reference/standard-sql/json-data

It’s still in preview, but this is going to be the de facto way to do JSON in BQ going forward so might be a good way to get started and design for the future.

I would recommend storing the raw JSON in a column and also any fields you are pulling out pretty often into separate columns so you aren’t having to query the JSON directly for the most commonly used fields as the JSON querying is slower than standard querying.

Userlevel 2
Badge

I am already using CloudSQL Postgres to store data in JSON. Current table design meets my requirement. However, I want to leverage BQ ML for analytics. I can transform the data to normal columns if it makes sense. No real need to have it in JSON in BQ. But my data is nested with some dependencies between the parent row and a child row in the nested rows.

Userlevel 7
Badge +52

Hi @Mali,

When I was working as an engineer I was using Cloud SQL to (heavy) write and BQ to read fast our data. Cloud SQL is a DB while BQ is a warehouse after all. 

BQ is on Dremel. So, it’s working with columns not rows. If you read this white paper you can understand how BQ is working. In a few words BQ is storing data as nested relations. You can imagine the schema for a relation as a tree. That’s why I find logical what @sayle_doit is proposing.

I don’t think that there is a reason to change it from JSON to something else. Otherwise there is no reason for the Autodetect

When I was having a problem I was trying to think backwards. To do a “reverse-architecture”. To go from my target to my base.

  • You want to do ML on BQ.
  • BQ works with columns. Best results comes with denormalized data and when we use partitioned and clustered tables.
  • You have you data on CloudSQL (Postgres), but nested in rows
Userlevel 2
Badge

Hi @Mali,

When I was working as an engineer I was using Cloud SQL to (heavy) write and BQ to read fast our data. Cloud SQL is a DB while BQ is a warehouse after all. 

BQ is on Dremel. So, it’s working with columns not rows. If you read this white paper you can understand how BQ is working. In a few words BQ is storing data as nested relations. You can imagine the schema for a relation as a tree. That’s why I find logical what @sayle_doit is proposing.

I don’t think that there is a reason to change it from JSON to something else. Otherwise there is no reason for the Autodetect

When I was having a problem I was trying to think backwards. To do a “reverse-architecture”. To go from my target to my base.

  • You want to do ML on BQ.
  • BQ works with columns. Best results comes with denormalized data and when we use partitioned and clustered tables.
  • You have you data on CloudSQL (Postgres), but nested in rows

Thanks @ilias appreciate your feedback!

Userlevel 5
Badge +8

Hi, @Mali ! Did the above answers help you perhaps?! 🙂

Reply