This is the second in a series of posts on implementing time series analytics capabilities on top of BigQuery.
(Update 01/07/2021: The user-defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the sliding time window function can be invoked with bqutil.fn.ts_slide().)
In the first article, we performed time aggregation over fixed time windows using the now open-sourced tumble function. Building on that prior work, we will now take a look at implementing sliding (hopping) windows and session windows.
Creating the sliding windows function
Unlike fixed windows, sliding windows are designed to overlap; for example, you can compute the 15-minute rolling average of a stock price every minute. As a result, some data points may exist within multiple windows. …
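To give a rough sense of the technique before reading on, the sketch below assigns each row to every 15-minute window that contains it, with a new window starting every minute, using only standard SQL arithmetic. The reworked community UDF bqutil.fn.ts_slide() packages the same idea; the San Francisco 311 table path and the date filter here are illustrative assumptions.

-- Assign each 311 request to every 15-minute window that contains it, with a
-- new window starting every minute (period = 60s, duration = 900s).
WITH params AS (
  SELECT 60 AS period_secs, 900 AS duration_secs
),
events AS (
  SELECT created_date, supervisor_district
  FROM `bigquery-public-data.san_francisco_311.311_service_requests`
  WHERE created_date >= TIMESTAMP('2020-01-01')
)
SELECT
  TIMESTAMP_SECONDS(window_start_secs) AS window_start,
  supervisor_district,
  COUNT(*) AS request_count
FROM events, params,
  -- every period-aligned window start that still covers this event
  UNNEST(GENERATE_ARRAY(
    DIV(UNIX_SECONDS(created_date) - duration_secs + period_secs, period_secs) * period_secs,
    DIV(UNIX_SECONDS(created_date), period_secs) * period_secs,
    period_secs)) AS window_start_secs
GROUP BY window_start, supervisor_district
ORDER BY window_start, supervisor_district;

With a 900-second duration and 60-second period, each row lands in exactly 15 overlapping windows, which is the overlap described above.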
(Update 12/01/2020: The user-defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the tumble function can be invoked with bqutil.fn.ts_tumble().)
As BigQuery is increasingly used as a store for real-time analytical data such as web events and IoT data, many users are asking for time series functions similar to those found in a special-purpose product like OpenTSDB. While BigQuery does not currently have dedicated time series functions, there are a few simple functions and techniques you can use to perform time series analytics directly within BigQuery.
For the purposes of demonstration, the samples in this guide use the public San Francisco 311 dataset, performing time series analysis with the created_date and supervisor_district columns as the timestamp and series key, respectively. These examples can be applied to common time series problems like log analytics or IoT monitoring. …
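As a quick illustration of the fixed-window idea (a sketch only; the open-sourced tumble function, now bqutil.fn.ts_tumble(), wraps equivalent logic, and the table path below is an assumption), each timestamp is truncated to the start of its one-hour window and the rows are counted per window and district:

-- Count 311 requests per supervisor district in fixed, non-overlapping
-- one-hour windows by truncating each timestamp to its window start.
SELECT
  TIMESTAMP_SECONDS(DIV(UNIX_SECONDS(created_date), 3600) * 3600) AS window_start,
  supervisor_district,
  COUNT(*) AS request_count
FROM `bigquery-public-data.san_francisco_311.311_service_requests`
GROUP BY window_start, supervisor_district
ORDER BY window_start, supervisor_district;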
The BigQuery data warehouse gives administrators extensive flexibility in how they set up access control and column-level security. However, BigQuery's column-level security currently allows only one of two outcomes when a sensitive column is accessed: the query succeeds, showing the clear text of the column, or the query fails.
Organizations frequently need more flexibility. For example, they may want to truncate a timestamp or show only the last four digits of a phone number; in more complex scenarios, an organization may want to share Tink-encrypted data while also controlling access to the symmetric key.
In this tutorial I will show you how to embed BigQuery's AEAD encryption functions into Authorized UDFs so you can conditionally manipulate decrypted values without exposing the encryption key to the end user. …
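To make the shape of the approach concrete, here is a minimal sketch (all project, dataset, table, and column names are hypothetical): the keyset lives in a table end users cannot query directly, an authorized UDF decrypts and masks the value, and callers are granted access only to the function.

-- The keyset lives in a one-row table that end users cannot read directly,
-- e.g. my_project.keys.phone_keyset(keyset BYTES).
CREATE OR REPLACE FUNCTION `my_project.secure_udfs.phone_last_four`(ciphertext BYTES, aad STRING)
RETURNS STRING AS ((
  -- Decrypt inside the function, then return only a masked form of the value.
  SELECT CONCAT('***-***-', RIGHT(AEAD.DECRYPT_STRING(k.keyset, ciphertext, aad), 4))
  FROM `my_project.keys.phone_keyset` AS k
));

-- Once secure_udfs is authorized on the keys dataset, callers see only the mask:
SELECT `my_project.secure_udfs.phone_last_four`(phone_ciphertext, customer_id) AS phone_hint
FROM `my_project.crm.customers`;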
This post follows up on an earlier one about creating and assigning Flex Slots commitments through the client libraries. Users already taking advantage of flat-rate pricing often ask how they can automate Flex Slots purchases for short-term capacity increases.
For organizations making use of flat-rate pricing, slot scheduling is a simple way to add capacity during critical business hours while ensuring a well-understood cost ceiling.
For other common use cases, like mitigating performance constraints during complex ELT jobs, organizations should instead consider invoking the Reservation API to purchase and delete commitments as part of job orchestration in systems like Cloud Composer. …
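A rough sketch of that purchase-and-release lifecycle, expressed here in reservation DDL rather than the client libraries the post uses (the DDL statements, project, region, and commitment ID are all assumptions for illustration):

-- 1. Before the heavy ELT job: purchase 500 Flex slots (deletable after 60 seconds).
CREATE CAPACITY `admin-project.region-us.elt-flex-commitment`
OPTIONS (slot_count = 500, plan = 'FLEX');

-- 2. ...run the ELT job while the extra capacity is active...

-- 3. After the job completes: release the commitment so per-second billing stops.
DROP CAPACITY `admin-project.region-us.elt-flex-commitment`;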
(Update 11/20/2020: Examples now use a generated sequence table instead of the wise_all_sky public dataset as a base table)
When creating demos and proofs of concept, I frequently run into the problem of finding suitable sample data to get started. BigQuery’s public datasets are a great resource, but rarely have everything that’s needed to build a proper environment.
Creating initial sample data has become a necessary and tedious process for many projects. Consequently, fake data libraries like faker.js, or this one for Python, have become popular in recent years. …
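A minimal sketch of generating sample data entirely inside BigQuery (row count, column names, and value ranges are arbitrary choices for illustration): build a sequence table with GENERATE_ARRAY, then derive pseudo-random columns from it.

-- One million sequential ids as the base sequence table.
WITH seq AS (
  SELECT id
  FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS id
)
SELECT
  id,
  GENERATE_UUID() AS fake_key,  -- random string identifier
  TIMESTAMP_ADD(TIMESTAMP '2020-01-01 00:00:00 UTC',
                INTERVAL CAST(RAND() * 86400 * 365 AS INT64) SECOND) AS fake_ts,  -- random timestamp in 2020
  CAST(FLOOR(RAND() * 100) AS INT64) AS fake_metric  -- random integer metric, 0-99
FROM seq;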
I recently worked on a BigQuery use case that required a very large, skewed, many-to-many join resulting in a fan-out of over 180 billion records. On-demand projects, limited to 2,000 slots, would frequently time out before the query could complete. The query plan showed this problem as bytes spilled to disk and pipelined repartition stages, where the output of one repartition stage is the input to another.
Here is a quick tutorial on how to work around these issues, using BigQuery's scripting and integer range partitioning features to break down large queries efficiently.
BigQuery's OpenStreetMap (OSM) public dataset offers a convenient example for expensive, high fan-out joins, e.g. joining on geospatial distance. To make the dataset a little more manageable, I am filtering the data to nodes in Denmark (country code: DK), which has a high concentration of OSM node entries. …
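The shape of the workaround, in a heavily simplified sketch (table names, column names, and the bucketing scheme are placeholders, not the post's actual OSM query): land the fan-out into an integer range-partitioned table one bucket at a time with a scripting loop, so no single statement has to materialize the whole result.

DECLARE num_buckets INT64 DEFAULT 100;
DECLARE bucket INT64 DEFAULT 0;

-- Destination table partitioned on the bucket id (100 integer range partitions).
CREATE TABLE IF NOT EXISTS `my_project.scratch.join_output`
(
  bucket_id INT64,
  left_id   INT64,
  right_id  INT64
)
PARTITION BY RANGE_BUCKET(bucket_id, GENERATE_ARRAY(0, 100, 1));

-- Process one slice of the skewed join per iteration instead of the whole fan-out at once.
WHILE bucket < num_buckets DO
  INSERT INTO `my_project.scratch.join_output`
  SELECT
    MOD(l.id, num_buckets) AS bucket_id,
    l.id AS left_id,
    r.id AS right_id
  FROM `my_project.scratch.left_side` AS l
  JOIN `my_project.scratch.right_side` AS r
    ON l.join_key = r.join_key            -- the expensive many-to-many condition
  WHERE MOD(l.id, num_buckets) = bucket;  -- only this bucket's share of the fan-out
  SET bucket = bucket + 1;
END WHILE;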
(Update 5/13/2020: Code now uses the Reservations client libraries)
Google Cloud recently added Flex Slots as a new pricing option for BigQuery. Users on flat-rate commitments no longer pay for queries by bytes scanned; instead, they pay for reserved compute resources, and with Flex Slots commitments they can cancel that capacity any time after 60 seconds. At $20 per 500 slot-hours, billed per second, Flex Slots can offer significant cost savings for on-demand customers whose query sizes exceed 1 TiB.
In this article I will show you how your applications can use the Reservations APIs to run an hour's worth of queries on a 500-slot reservation for the same price as a single 4 TiB on-demand query (currently priced at $5/TiB). …
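For orientation: 500 slots held for one hour is 500 slot-hours, or roughly $20, the same as one 4 TiB on-demand query at $5/TiB. Below is a sketch of the setup the article builds with the Reservations client libraries, expressed here in reservation DDL under the assumption that those statements are available to you; the project, region, and resource IDs are placeholders.

-- Purchase the 500-slot Flex commitment (roughly $20 if kept for one hour).
CREATE CAPACITY `admin-project.region-us.flex-500`
OPTIONS (slot_count = 500, plan = 'FLEX');

-- Carve the committed slots into a reservation.
CREATE RESERVATION `admin-project.region-us.adhoc`
OPTIONS (slot_capacity = 500);

-- Route a project's query jobs to the reservation instead of on-demand billing.
CREATE ASSIGNMENT `admin-project.region-us.adhoc.all-queries`
OPTIONS (assignee = 'projects/query-project', job_type = 'QUERY');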