Defining sliding windows and session windows

This is the second in a series of posts on implementing time series analytics capabilities on top of BigQuery.

Time series analytics with BigQuery

(Update 01/07/2021: The user defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the sliding time window function can be invoked with bqutil.fn.ts_slide())

In the first article, you performed time aggregation on fixed time windows using the now open sourced tumble function. Building on that prior work, we will now take a look implementing sliding (hopping) windows and session windows.

Image for post
Image for post

Creating the sliding windows function


Techniques for tumbles, fills, and linear interpolation

(Update 12/01/2020: The user defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the tumble function can be invoked with bqutil.fn.ts_tumble())

As BigQuery is increasingly used as a store for real time analytic data such as web events and IoT; many users are asking for time series functions, similar to those found in a special purpose product like OpenTSDB. While BigQuery does not currently have dedicated time series functions, there are a few simple functions and techniques you can use to perform time series analytics directly within BigQuery.

For the…

The BigQuery data warehouse has extensive flexibility in how an administrator sets up access control and column level security. However, BigQuery’s column level security, currently only allows for one of two outcomes when a sensitive column is accessed: the query succeeds showing the clear text of the column or the query fails.

Organizations frequently need more flexibility, for example, they often want to truncate a timestamp or show only the last four digits of a phone number. In complex scenarios, an organization my want to share Tink encrypted data while also controlling access to the symmetric key.

In this tutorial…

Reliably adding and removing short term slot capacity on a schedule

Following up on an earlier post about creating and assigning Flex Slots commitments through the client libraries. Users already taking advantage of flat rate pricing often ask how they can automate Flex Slots purchases for short-term capacity increases.

For organizations making use of flat rate pricing, slot scheduling is a simple way to add capacity while ensuring a well-understood cost ceiling, during critical business hours.

For other common use cases, like mitigating performance constraints during complex ELT jobs; organizations should instead consider invoking the Reservation API to purchase and delete commitments as a part of job orchestration, in systems like…

Wrapping faker.js with a Javascript UDF

(Update 11/20/2020: Examples now use a generated a sequence table instead of using the wise_all_sky public dataset as a base table)

When creating demos and proofs of concept, I frequently run into the problem of finding suitable sample data to get started. BigQuery’s public datasets are a great resource, but rarely have everything that’s needed to build a proper environment.

Creating initial sample data has become a necessary and tedious process for many projects. Consequently fake data libraries, like faker.js or this one for python, have become popular in recent years. …

Using integer range partitions with BigQuery scripting

I recently worked on a BigQuery use case, which required a very large, skewed, many-to-many join resulting in a fan-out of over 180 Billion records. On-Demand projects, limited to 2000 slots, would frequently time-out before the query could be completed. The query plan identified this problem as bytes spilled to disk and pipelined repartition stages, where the output of a repartition stage is the input to another repartition stage.

Image for post
Image for post
A pipelined repartition: Stage 0A reads the output of Stage 09

Here is a quick tutorial on how to work around these issues; using BigQuery’s scripting and integer range partitioning features to efficiently break down large queries.

The Setup

BigQuery’s OpenStreetMap (OSM), public dataset…

(Update 5/13/2020: Code now uses the Reservations client libraries)

Google Cloud recently added Flex Slots as a new pricing option for BigQuery. Users on Flat Rate commitments no longer pay for queries by bytes scanned and instead pay for reserved compute resources; and using Flex Slots commitments, users can now cancel the reservation anytime after 60 seconds. At $20/500 slot-hours, billed per second, Flex Slots can offer significant cost savings for On-Demand customers whose query sizes exceed 1TiB.

In this article I will show you how your applications can run an hour’s worth of queries on a 500 slot reservation…

Patrick Dunn

Solutions Architect @ Google Cloud

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store