This is the second in a series of posts on implementing time series analytics capabilities on top of BigQuery.
(Update 01/07/2021: The user defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the sliding time window function can be invoked with
In the first article, you performed time aggregation on fixed time windows using the now open sourced tumble function. Building on that prior work, we will now take a look implementing sliding (hopping) windows and session windows.
Creating the sliding windows function
(Update 12/01/2020: The user defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the tumble function can be invoked with
As BigQuery is increasingly used as a store for real time analytic data such as web events and IoT; many users are asking for time series functions, similar to those found in a special purpose product like OpenTSDB. While BigQuery does not currently have dedicated time series functions, there are a few simple functions and techniques you can use to perform time series analytics directly within BigQuery.
The BigQuery data warehouse has extensive flexibility in how an administrator sets up access control and column level security. However, BigQuery’s column level security, currently only allows for one of two outcomes when a sensitive column is accessed: the query succeeds showing the clear text of the column or the query fails.
Organizations frequently need more flexibility, for example, they often want to truncate a timestamp or show only the last four digits of a phone number. In complex scenarios, an organization my want to share Tink encrypted data while also controlling access to the symmetric key.
In this tutorial…
Following up on an earlier post about creating and assigning Flex Slots commitments through the client libraries. Users already taking advantage of flat rate pricing often ask how they can automate Flex Slots purchases for short-term capacity increases.
For organizations making use of flat rate pricing, slot scheduling is a simple way to add capacity while ensuring a well-understood cost ceiling, during critical business hours.
For other common use cases, like mitigating performance constraints during complex ELT jobs; organizations should instead consider invoking the Reservation API to purchase and delete commitments as a part of job orchestration, in systems like…
(Update 11/20/2020: Examples now use a generated a sequence table instead of using the wise_all_sky public dataset as a base table)
When creating demos and proofs of concept, I frequently run into the problem of finding suitable sample data to get started. BigQuery’s public datasets are a great resource, but rarely have everything that’s needed to build a proper environment.
Creating initial sample data has become a necessary and tedious process for many projects. Consequently fake data libraries, like faker.js or this one for python, have become popular in recent years. …
I recently worked on a BigQuery use case, which required a very large, skewed, many-to-many join resulting in a fan-out of over 180 Billion records. On-Demand projects, limited to 2000 slots, would frequently time-out before the query could be completed. The query plan identified this problem as bytes spilled to disk and pipelined repartition stages, where the output of a repartition stage is the input to another repartition stage.
BigQuery’s OpenStreetMap (OSM), public dataset…
(Update 5/13/2020: Code now uses the Reservations client libraries)
Google Cloud recently added Flex Slots as a new pricing option for BigQuery. Users on Flat Rate commitments no longer pay for queries by bytes scanned and instead pay for reserved compute resources; and using Flex Slots commitments, users can now cancel the reservation anytime after 60 seconds. At $20/500 slot-hours, billed per second, Flex Slots can offer significant cost savings for On-Demand customers whose query sizes exceed 1TiB.