Data warehousing is getting increasingly adopted by businesses of all sizes. Though most teams who are new to it can waste a lot of money on not only hosting the data but also processing it in a data warehouse. This article will walk you through how you can ultimately save more than 50% of your monthly costs on BigQuery.
Technical optimization may save 10-20%
Avoid data duplications
Data duplication can hurt not only the performance of your query but also the economics of your analytics. There are many ways to deduplicate your table.
1. Set the insertId property when streaming inserts to your table
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT * FROM `fh-bigquery.tt.test_import_sheet`
WHERE id NOT IN ( SELECT id FROM `fh-bigquery.tt.test_import_native` )
2. Use DML to load data into an existing table (requires standard SQL)
3. Deduplicate tables by using Merge
WHERE id NOT IN (...)
ensures that only rows with new ids are loaded into the table.
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Use clustering and partitioning to reduce the data scanned
Clustered/partitioned table could be assigned a partitioning column. The column can help the bigquery engine split data into small tables at different intervals. For example, a time series table split by date of ingestion. At query time,
As partitioning ensures that BigQuery will scan only the data user has passed in the SQL as partition filter instead of the entire table. This means only a part of the table will be scanned, not the entire table. And this portion was usually just 3–5 percent of the table size. Which reduced the query size from TBs to a few GBs.
Optimize query usage
1. Use Preview instead of running SELECT *
When you first explore a table, it’s intuitive to use SELECT * to see the entire table. However, using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.
If you have to see the entire table, try to use the Preview feature instead of running SELECT * directly. The preview option allows you to see the full table without costs. It’s an economically efficient way when you are in the initial data exploration stage.
2. Only query the columns you need
This is very straightforward. By avoiding SELECT*, you should only query the columns you need when it comes to cost-saving. If you are not sure which columns are available, find the schema of the table. The schema displays the structure of the table and contains all column names and their data types. Look up the columns in the schema and only keep the necessary columns in your query.
3. Use the built-in validator to calculate the costs
Before running a query, you can check how much data processed it will generate below the console. You can use it as an indicator to adjust your query, so you won’t be afraid of generating a huge amount of data processed without any expectation.
Materialize results
If you create a large, multi-stage query, each time you run it, BigQuery reads all the data that is required by the query. You are billed for all the data that is read each time the query is run.
Instead, break your query into stages where each stage materializes the query results by writing them to a destination table. Querying the smaller destination table reduces the amount of data that is read and lowers costs. The cost of storing the materialized results is much less than the cost of processing large amounts of data.
Working with a fully-managed account partner can save you 50%+
If none of the above optimization methods meets your budget, you can talk to a fully-managed BigQuery account partner. You may save north of 50% for your costs on BigQuery by switching to a fully-managed account that has a long-term partnership with Google Cloud. This offers several key advantages beyond technical and behavioral optimization.
Flat rate pricing
Since the account partner processes a larger volume of data, you may negotiate a flat-rate pricing for your usage even at a smaller volume. For example, you may spend $300-$500 a month on BigQuery every month and wish to control the expenses within a certain limit. A flat rate pricing offers great flexibility and predictability for your data processing.
High availability and performances
Most cloud platforms would allocate “slots” for computing needs. For individual users, you may or may not get a lot of slots for each query. This largely depends on the traffic public cloud infrastructure experience in any time period. Working with a fully-managed account, however, you are guaranteed to have a certain number of computers standing by for your requests. This would drastically reduce your wait time and improve your processing speed.
Additional products & services
Another advantage of working with an account partner is so that you can take your use cases beyond just data warehouses. If you can build your entire data stack from data warehousing, data connector, transformation & modeling, automation & scheduling, to business intelligence in one place, why would you pay for 5 separate tools?
If you want to save more than 50% on your BigQuery costs today, schedule a free consultation call today.