BigQuery Omni is everywhere

Antoine Castex
Google Cloud - Community
3 min read · Mar 10, 2022


Every time you start to like a tool, you end up needing the one feature that is not available yet…

This time, Santa Claus arrived earlier than expected!

My company uses BigQuery heavily for enterprise data warehousing, but like other long-established companies we have legacy applications located far away from the current platform.

And trust me, that is like trying to get Spanish and Chinese speakers to have a conversation without a translator: almost impossible.

This is where Omni (announced one year ago at the Google Next conference) comes in to help us with this task. The promise is:

  • your data can be on-premises
  • your data can be on another Cloud Service Provider
  • your data will be available to use in BigQuery

Sounds good, because I have a lot of data hosted on Microsoft Azure in different places around the world, and in some of those regions BigQuery is not yet available.

Let’s see if Omni can solve this!

The example below uses AWS, but my Azure setup works the same way.

First I have to upload data to an Azure region, or use data already stored there, in JSON, Parquet, Avro, ORC or CSV format.

The first step, in the BigQuery console, is to set up a connection between the two worlds (a secured external connection), as sketched below.
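
For reference, here is a minimal sketch of what creating such a connection looks like with the bq CLI (AWS-flavored, as in the example; the role ARN, location and connection name are hypothetical placeholders):

```bash
# Create a BigQuery Omni connection to AWS.
# Role ARN, location and connection name are hypothetical placeholders;
# the Azure variant uses the same command with Azure-specific flags.
bq mk --connection \
    --connection_type='AWS' \
    --iam_role_id='arn:aws:iam::123456789012:role/bigquery-omni-role' \
    --location='aws-us-east-1' \
    'my_omni_connection'
```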

Now the magic can happen: the data sitting on the other cloud shows up as a table you can query from BigQuery.
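
With the connection in place, you define an external table over the files stored in the other cloud. A minimal sketch, with a hypothetical bucket, dataset and table name (for Azure, the connection location would be something like azure-eastus2 and the URI would use the azure:// scheme):

```sql
-- External table over Parquet files stored in the external cloud.
-- Connection, bucket, dataset and table names are hypothetical.
CREATE EXTERNAL TABLE `mydataset.chicago_taxi_trips`
WITH CONNECTION `aws-us-east-1.my_omni_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-omni-bucket/taxi/*.parquet']
);
```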

Now it’s benchmark time!

SQL is a standard, and not only in my company. I want to see what I can do with it. Just keep a few important facts in mind:

  • Under the hood there is an Anthos cluster (managed by Google) on the CSP you choose, hosting the BigQuery query engine
  • Processing is done on that CSP, and the result is sent back to the console
  • The goal of Omni is to help companies bring their data to BigQuery, not really to provide 100% of BigQuery’s performance outside GCP… because the place where the query runs is different, and the amount of data you send back to the console can be affected by different things, like networking.

First Benchmark (public data)

I used the Chicago Taxi public dataset to measure how many seconds my queries take to run:

  • 77 GB table
  • 195,856,374 rows
  • No partitioning
  • Full scan
  • The last test uses the AVG, EXTRACT(DAY), FORMAT, GROUP BY and ORDER BY operators (see the sketch after this list)
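
As an illustration, that last test looked roughly like this (a sketch, assuming the public Chicago Taxi data was copied into the hypothetical external table defined earlier; the column names come from the public bigquery-public-data.chicago_taxi_trips.taxi_trips schema):

```sql
-- Aggregation combining AVG, EXTRACT(DAY), FORMAT, GROUP BY and ORDER BY.
-- The table is the hypothetical external table from the setup above.
SELECT
  company,
  EXTRACT(DAY FROM trip_start_timestamp) AS day_of_month,
  FORMAT('%.2f', AVG(trip_total)) AS avg_trip_total,
  AVG(trip_seconds) AS avg_trip_seconds
FROM `mydataset.chicago_taxi_trips`
GROUP BY company, day_of_month
ORDER BY company, day_of_month
LIMIT 5000;
```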

I changed LIMIT from 1000 to 5000, just to get a bigger amount of data returned to the console (not too big, otherwise I get an OUTPUT TOO BIG error: data returned to the console is currently limited to 2 MB).

I’ve also tried returning the result not to the console but directly to the source (Azure): way too time-consuming, forget that idea…
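
For completeness, this is roughly what writing results back to the source looks like, using EXPORT DATA with the connection (a sketch with hypothetical names; the URI must contain a wildcard, and Azure would use an azure:// URI):

```sql
-- Export query results back to the source bucket instead of the console.
-- Connection, bucket and table names are hypothetical placeholders.
EXPORT DATA WITH CONNECTION `aws-us-east-1.my_omni_connection`
OPTIONS (
  uri = 's3://my-omni-bucket/results/*',
  format = 'CSV'
)
AS
SELECT company, AVG(trip_total) AS avg_trip_total
FROM `mydataset.chicago_taxi_trips`
GROUP BY company;
```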

Second Benchmark (company data)

  • 1 TB table as the data source + a 200 MB table
  • 9,172,577,750 rows
  • INNER JOIN between the two, with LIMIT 10000 (because of the current Omni limit on the size of the data returned to the console); see the sketch after this list
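
The shape of that query, with hypothetical table and column names:

```sql
-- Inner join between a ~1 TB fact table and a ~200 MB dimension table.
-- All table and column names are hypothetical placeholders.
SELECT f.*, d.label
FROM `mydataset.big_fact_table` AS f
INNER JOIN `mydataset.small_dim_table` AS d
  ON f.dim_id = d.dim_id
LIMIT 10000;
```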

Conclusion:

Performance is not bad at all; this can really solve problems!

Now I’m waiting for more options and features to become available.
