Please BigQuery, can you accelerate my answers (and be cheaper)?

Antoine Castex
Beauty Tomorrow
Published in
7 min readApr 15, 2021

--

When you work for the first cosmetics group in the world, you work with the data, you work for the data, you work because of the data and you almost live for the data! The Company now wants to leverage Cloud technology for their data, for their business, for their customers

But does that mean cost a lot of money?

Consuming a lot of resources?

Need a lot of people for that?

Let’s do some tests to answer that!

The right tool for the right use case.

This is where Google Cloud Platform is coming. Because Google is the biggest data company in the world, because many companies use their data solutions, and because there is BigQuery inside, the king of SQL Analytics Datawarehouses.

The optimisation of a query is part of my daily job, and during all this time i’m trying to pass key messages to everyone who listens to me.

  • use the partition as much as possible
  • never use SELECT *
  • enable the BigQuery default cache
  • don’t use external data sources
  • ….

Of course BigQuery team is doing a crazy job with impressive new features every year, like clustering, stored procedures, authorized views … that can help to reduce the execution time of the query but also improve the security and reduce the money spent.

But that’s not enough, we need more and Google has listened to us … Here is where BI Engine is born.

The famous Lak Lakshmanan already write an article on it, but here is my turn :

BI Engine is a fast, in-memory analysis service. By using BI Engine you can analyse data stored in BigQuery with sub-second query response time and with high concurrency.

Here is what Google promises :

the same experience and latency, even if you are requesting from DataStudio, Looker, Spreadsheet, an API …

If the data can be cached in BI Engine, your query speed will be over boosted, like if you add a turbo on your car. It sounds crazy, let’s go !

My test needs to answer one question, is the cost of BI Engine (Sorry it’s not free) + the configuration of BI Engine + the execution time equal to what Google promised me ? Let’s see how it fares for each of the ways of using BigQuery we use internally.

Because i have many data on my hands available on BigQuery i will use a very simple table where i have :

  • the list of the product available on the catalog of my company
  • daily partitioned
  • located in EU
  • size of table is 70 Gb
  • number of rows is 15 millions

Not so big but easy to use and to compare with the different features and solutions for my benchmark.

Before doing my test I have to put in place my reservation.

Because i’m not an expert with BI Engine, i will choose the maximum available size : 100 Gb, and i will check what’s happening.

Price for that is fixed at

Total capacity 100 GB

Rate $0.0416 per GB/hour ($30.36 per GB/month)

Reservation cost $3,036.00/month

#1- From the UI Interactive mode & BQ CLI Batch mode (without Cache)

Let’s do a full scan of the table, which currently contains 300 columns and 60 partitioned days.

Without anything, that means “SELECT * from mytable”.

WAIT ! I just said, that thing was clearly forbidden, and for sure the result will not be visible because it is too big, so you would have to send it to a destination table…

Let’s add at least a LIMIT 1000 , which doest not reduce the cost but the execution time.

Here is my first score 108 seconds !

#2- Same platform but last 30 days only query

Because I will never need a full table scan and my business team will most likely have to focus on the last 30 days only.

BigQuery UI

So I’m using the partitioned table feature to filter on that amount of time.

Second score is … 3.5 seconds ! This time i don’t parse 70Gb but only 30Gb, make sense and means less money. I like this =)

#3- BI Engine Enabled + last 30 days only query

BI Engine is now enabled let’s try again with the same query…

The result is 1 second, impressive !

You can see that BI Engine has been used in the log of your job with BQ CLI :

bq show --format=prettyjson -j btdp-sbx-c-antoine:EU.bquxjob_xxxxxx
"query": {
"biEngineStatistics": {
"biEngineMode": "FULL"
},

#4- DataStudio with the last 30days

Let’s try now with my first and loved Data Visualisation tool, because it’s free and because it’s so easy to use.

DataStudio UI

When i add my BigQuery datasource to build a simple chart, queries are automatically routed to BI Engine, and the score is 309ms, Wow ! That is super dope ! But one additional very interesting information is the amount of parsed data under my cache (remember i took 100GB as BI Engine reservation because my table is 70Gb and i was afraid to reach the limit) is finally only 1GB. Why ? because BI Engine also compresses data by a factor 6 or 7 and loads only the minimum amount of data necessary to my most frequent queries, crazy ! (good for my wallet too)

#5- Looker with the last 30 days

Because sometimes you need more than just a simple chart and your team is expert on the data, you should use tools like Looker.

Looker UI

We added the new datasource and after querying it we discovered the score which is this time around 1.9 seconds. Nice but not as crazy as DataStudio. This is however not so worrying since Looker has its own independent cache named “aggregated tables” and that will also reduce your next data usage (i will not test that in this article).

The good news here is the processed data which is smaller than the previous test, 700mb, nice !

#6- Connected Sheet with the last 30 days

Google introduced the Connected Sheet feature which let you access, analyse, visualise, and share billions of rows of BigQuery data from your Sheets spreadsheet.

Connected Sheet UI

Just after adding my data source with my custom query (because i need only the last 30 days), the score appears at 200ms, Wow !!!! And only 500Mb of data processed. That is simply super dope (again).

The summary and cost analysis

Here is the details of my benchmark (execution time of each query have been controlled through Cloud Operations Metrics fromerly known as Stackdriver)

Remember that i’m doing everything in EU, which cost on BigQuery $5 per TB (with 1TB Free per month)

For BI Engine i assume that finally i need only 2GB as a reservation which costs $60.72/month

I also assume that my query will run 2000 times per day because of the users that are using my application.

Keep in mind that :

  • BI Engine does not guarantee that all requests will be 100% taken care of
  • I didn’t have considerate the BigQuery cache which is enabled by default and can reduce the cost of your queries

Here is the projection :

Because the Full scan will never happen (remember BigQuery best practices), Ihave to remove it from the final benchmark chart here :

Also for the money the cost of the test #2 can be raised to 0$ if you use BigQuery default cache, but still more slow than BI Engine.

My conclusion

The fact that BI Engine takes your data and let you re-use it as much as you need without paying additionally is the key !

Finally BI Engine is really the fastest and the cheapest way to accelerate your answer , but don’t forget the BigQuery best practices , mandatory more than ever !

  • Performance boosted, latency reduced up to 18x
  • Price reduced up to 20x and controlled

Finally the only one thing i really don’t like is the black box side of BI Engine, because of course we love the fact that this is a fully managed service but sometime everyone need to be able to have more informations or possibilities to custom/configure…and here it’s totally impossible, keep that in mind, maybe that will change in the near future, stay tuned :)

** Don’t Forget that this is a private feature **

Requesting access to the preview

Access to the BI Engine SQL interface in this preview phase is provided through an enrollment process. For access, submit the BI Engine preview enrollment form with your project details. You will be notified by email once your project is enrolled.

Project ID: BI Engine reservations are managed by the billing project’s ID. In the preview enrollment form, ensure that you specify the billing project ID that you intend to use to query your data.

--

--