Flappy-Scaling: Snowflake Auto-Scaling Gone Awry

Robert · Published in Sundeck · 8 min read · Aug 10, 2023


Back to two of my favorite Snowflake topics: Snowflake warehouse sizing for heterogeneous workloads, and unexpected system dynamics!

Consider a typical data warehouse workload — a BI dashboard which generates a small number of complex queries (i.e. joining and aggregating large fact data across multiple dimensions, including time), and a larger number of fairly lightweight supporting queries which populate filters the user can manipulate to slice the data. These smaller queries are often against dimension tables, and take the general form of SELECT DISTINCT product_category FROM products.

It’s not uncommon for a single BI dashboard to generate 20–30 queries; however, the ratio of quick/cheap to expensive queries on a given dashboard is often 7:1 or higher. This means that for a dashboard generating 20 queries when first loaded, it’s common for only 2–3 of these to require significant processing by the data warehouse.

A dashboard is typically connected to Snowflake via ODBC or JDBC, with a single warehouse specified in the connection properties. When the dashboard is first loaded (or refreshed), it fires all of its queries at Snowflake; the ones which populate the filter controls take a second or less each to run (even on an extra-small warehouse), while the more complex queries joining and aggregating larger fact tables (the heart of the analysis) require considerably more resources and will run for 30–60 seconds or longer, even on a large (L) or extra-large (XL) warehouse.

To give a good interactive experience to our dashboard users, we size the warehouse appropriately for the complex queries, to ensure they complete in a reasonable amount of time (a minute or less). Then, to support concurrent use of the dashboard across many users within a department, we configure the warehouse to auto-scale with increasing concurrency.
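
As a concrete sketch of that setup (the warehouse name and limits here are illustrative, not taken from the original dashboard), the configuration might look something like this:

-- Illustrative only: a Large warehouse sized for the heavy dashboard queries,
-- allowed to auto-scale out to 4 clusters as concurrency grows.
create warehouse if not exists dashboard_wh
  warehouse_size    = 'LARGE'
  min_cluster_count = 1
  max_cluster_count = 4
  scaling_policy    = 'STANDARD'  -- the default; more on 'ECONOMY' later
  auto_suspend      = 60          -- seconds of inactivity before suspending
  auto_resume       = true;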

This is where things start to get interesting. Snowflake auto-scales horizontally, by temporarily adding “clusters” of the same size as our original warehouse. This is triggered once the concurrency limit for all currently-running clusters is reached, and queries start queueing. When our dashboard fires 20 simultaneous queries, the first 8 begin executing and the remainder are queued, triggering the spin-up of another cluster.
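
The “first 8” comes from the warehouse’s MAX_CONCURRENCY_LEVEL parameter, which defaults to 8 (actual concurrency also depends on the resources each query needs). You can inspect it per warehouse; the warehouse name below is illustrative:

-- show the effective concurrency target for a warehouse
show parameters like 'MAX_CONCURRENCY_LEVEL' in warehouse dashboard_wh;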

Given that in our hypothetical workload only ~1 query in 10 may take 30+ seconds to run, there is a good chance that by the time additional capacity comes online, most of our small queries have already executed. A timeline helps illustrate the dynamic:

When the dashboard submits its batch of queries, the already-running warehouse cluster (labeled “Baseline Cluster” in the diagram) goes from idle to full utilization, and query queue depth spikes. Another cluster starts spinning up (“Auto-scaled Cluster”); all the while the queue depth is rapidly falling as lightweight queries continue to be processed by the baseline cluster. Once the additional cluster is up and ready to go, the few remaining queries in the queue are distributed between the baseline cluster and the additional auto-scaled cluster.

This short window between the cluster spin-up request and the cluster becoming available is on the order of a half-second, which is quite fast. The dynamic we observe is caused by a combination of three factors: the “bursty” nature of the query traffic, the significant skew in processing time (and required resources) between the few complex queries and the many lightweight ones in the same batch, and the fact that we sized the warehouse on the larger side to give good response time for the heavier queries.

We can imagine this a different way, at the query level, taking into account the ratio of simple to complex queries in the query burst from the dashboard, and observe how events unfold:

In the multi-user case, this pattern is magnified further, potentially resulting in more than one additional cluster being provisioned during concurrent query “bursts”. Each burst spins up new resources which only perform a small amount of work, sit idle for a few minutes, then are shut down again until the next burst arrives. The net result is over-provisioning of compute, with little measurable benefit to the user.

This wasteful pattern isn’t easy to spot, but can be uncovered with some analysis of QUERY_HISTORY, WAREHOUSE_EVENTS_HISTORY, and WAREHOUSE_METERING_HISTORY; the basic approach is to sessionize warehouse cluster events, then intersect these cluster “sessions” with individual queries to find auto-scaled clusters which are created but not well utilized.
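
Before going that far, a rough first pass over ACCOUNT_USAGE.QUERY_HISTORY can already hint at sparsely used auto-scaled clusters. This sketch assumes the baseline cluster is numbered 1 in QUERY_HISTORY and uses an arbitrary 2-day window:

-- queries and total execution time per cluster over the last 2 days;
-- auto-scaled clusters (cluster_number > 1) with very few queries deserve a closer look
select warehouse_name,
       cluster_number,
       count(*) as query_count,
       round(sum(total_elapsed_time) / 1000, 1) as total_elapsed_s
from snowflake.account_usage.query_history
where start_time >= dateadd(day, -2, current_timestamp)
  and cluster_number is not null
group by warehouse_name, cluster_number
order by warehouse_name, cluster_number;

This won’t tell you how long each cluster was up (that’s what the sessionization is for), but it’s a quick way to spot warehouses where the higher-numbered clusters barely do any work.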

If you’re using Sundeck OpsCenter (it’s free and open), the analysis is made easier since the cluster sessionization is already done for you; you can write a query like the following against the OpsCenter reporting views to find underutilized auto-scaled clusters:

with cluster_sessions as (
    -- auto-scaled cluster sessions started in the past 2 days, expanded
    -- to give one row per second for each second the cluster was up
    select
        dateadd(second, index, date_trunc(second, st)) as cluster_second,
        st, et, warehouse_name, warehouse_id, cluster_number,
        -- for multi-cluster warehouses, the cluster numbers between query_history
        -- and warehouse_events_history are off by one ...
        cluster_number + 1 as join_cluster_number
    from opscenter.reporting.cluster_sessions,
        lateral flatten(
            -- nifty trick for doing the expansion
            split(repeat(',', datediff(second, st, et)), ',')
        )
    where cluster_number > 0
      and st >= dateadd(day, -2, current_timestamp)
    order by st, warehouse_id, cluster_number
),
queries_on_autoscaled_clusters as (
    -- same thing for queries from the past 2 days, expanded by the number of seconds they ran
    select
        dateadd(second, index, date_trunc(second, start_time)) as query_second,
        eqh.query_id, eqh.start_time, eqh.end_time, eqh.warehouse_id, eqh.cluster_number
    from opscenter.reporting.enriched_query_history eqh,
        lateral flatten(
            split(repeat(',', datediff(second, start_time, end_time)), ',')
        )
    where cluster_number > 1
      and start_time >= dateadd(day, -2, current_timestamp)
    order by query_id, index
),
cluster_activity_by_second as (
    -- for each second the cluster was up, count how many queries were running in that second
    select
        cs.cluster_second, cs.st, cs.et, cs.warehouse_id, cs.warehouse_name, cs.cluster_number,
        sum(case when query_id is null then 0 else 1 end) as queries
    from cluster_sessions cs
    left outer join queries_on_autoscaled_clusters q on (
        cs.warehouse_id = q.warehouse_id
        and cs.join_cluster_number = q.cluster_number
        and cs.cluster_second = q.query_second
    )
    group by
        cs.cluster_second, cs.st, cs.et, cs.warehouse_id, cs.warehouse_name, cs.cluster_number
)
-- lastly, roll up the total # of seconds the cluster was up, and how many
-- of those seconds queries were being processed on the cluster
select
    st as cluster_spinup, et as cluster_spindown,
    warehouse_id, warehouse_name, cluster_number,
    count(*) as uptime_seconds,
    sum(case when queries > 0 then 1 else 0 end) as query_seconds,
    round(query_seconds / uptime_seconds * 100.0, 2) as utilized_pct,
    max(queries) as max_concurrency
from cluster_activity_by_second
group by
    st, et, warehouse_id, warehouse_name, cluster_number
order by
    st desc, warehouse_name, cluster_number
;

Note: If you’d prefer to do the cluster sessionization yourself, you can refer to the OpsCenter code to get a jumpstart on the sessionization bits.

Running this query on my test system, where I have a PowerBI dashboard built on Snowflake’s TPC-H sample data, I observe the following:

This perfectly illustrates the pattern; here we see back-to-back auto-scale events which add a cluster, do very little processing on it, and then shut it down … then re-spin it again a minute later. The second spin-up in this example (top row) only performs work for 3 seconds out of 163 seconds of uptime.

OK, so now you’ve done the analysis in your own environment, and realize this scenario is happening to you too. Take heart! Awareness is the first step towards recovery. Let’s see what Snowflake features might help.

Economy Mode for Warehouse Autoscaling

The first option we can consider to address the auto-scale cluster “flapping” problem for our BI workload is to enable the Economy autoscaling policy on our warehouse. This tells Snowflake to “slow your roll” when queries start queuing, and not add another cluster until it’s estimated that the new cluster will have 6 minutes or more of work to do.
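
Switching policies is a one-line change (warehouse name illustrative):

-- be more conservative about adding clusters when queries start queueing
alter warehouse dashboard_wh set scaling_policy = 'ECONOMY';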

This will certainly help control costs, and in single-user or periodic-burst scenarios it will prevent the spin-up of the second cluster altogether, with little to no impact on the user experience. However, in the multi-user case, the results can be less than ideal. Recall our usage pattern: each dashboard fires ~20 cheap queries (taking 300–1000ms) and 2–3 expensive queries (taking 30–60s or more). Under the Economy policy, another cluster won’t be requested until roughly 48 of the large queries are in the queue (6 minutes of estimated work spread across 8 concurrent slots, at about a minute per query). That’s the heavy-query output of 15–20 dashboard refreshes, and for the users who didn’t get their requests in early, their dashboards will appear to “freeze” for stretches of time while waiting on resources.

Query Acceleration Service

A more promising approach is to use Snowflake’s Query Acceleration Service, which is designed to better accommodate heterogeneous workloads running on a given warehouse. This works by assigning serverless resources to particular queries, effectively allowing our large/complex queries to use temporary resources outside of the warehouse they are submitted to.

In this pattern, rather than sizing our warehouse up for the largest queries, we would instead size it down to more closely match the size of the smaller queries in the workload. Then, we enable the Query Acceleration Service on the warehouse, and set a “scale factor” for the serverless resources as a multiple of the current warehouse size.

Ideally for our example, we could size the warehouse to Small, then enable QAS with a scale factor of 8, which would allow the larger queries to consume up to 8 times the resources of our Small warehouse (i.e. an XL equivalent), on an as-needed basis.
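
A sketch of that configuration, again with a hypothetical warehouse name (the scale factor caps how much serverless compute any one query can lease):

-- shrink the warehouse to fit the lightweight queries ...
alter warehouse dashboard_wh set warehouse_size = 'SMALL';
-- ... and let eligible queries lease up to 8x its resources via QAS
alter warehouse dashboard_wh set
  enable_query_acceleration = true
  query_acceleration_max_scale_factor = 8;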

This looks like a good solution; however, there is a catch: Query Acceleration Service only works for specific types of queries. It can help considerably when queries need to scan many partitions using selective filters, and when performing certain aggregations (where the cardinality is not too high). However, there are numerous cases where QAS won’t kick in, such as queries with a LIMIT but no ORDER BY (note that PowerBI often applies LIMIT 1000001 OFFSET 0 to the queries it generates).

This means that we won’t know ahead of time which queries generated by our dashboard will qualify for QAS, so if we size our warehouse down expecting QAS to handle all of the really big stuff, our users can be surprised if their dashboard submits a complex query which is not accelerated, and is stuck running on a small warehouse.
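
One way to gauge the risk is to check eligibility after the fact. Snowflake provides SYSTEM$ESTIMATE_QUERY_ACCELERATION for a single query, and the ACCOUNT_USAGE.QUERY_ACCELERATION_ELIGIBLE view for a workload-level picture (the 2-day window below is arbitrary):

-- estimate whether one specific query would have benefited from QAS
select parse_json(system$estimate_query_acceleration('<query_id>'));

-- recent queries ranked by how much of their execution time was QAS-eligible
select query_id,
       warehouse_name,
       eligible_query_acceleration_time,
       upper_limit_scale_factor
from snowflake.account_usage.query_acceleration_eligible
where start_time >= dateadd(day, -2, current_timestamp)
order by eligible_query_acceleration_time desc;

If only a small fraction of the dashboard’s heavy queries show up as eligible, sizing down and relying on QAS is probably not the right trade.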

Then, there’s Sundeck

At Sundeck, we’ve developed a unique SaaS platform that sits in the query flow between SQL tools and Snowflake, allowing queries to be matched and acted on before they ever reach Snowflake. This allows us to easily perform warehouse routing at runtime, based on properties of the queries themselves (including the shape and complexity of the query).

In the next post, I’ll give a quick example of how we can dynamically route heterogeneous workloads on a per-query basis, using the Sundeck platform. It will be fun! Stay tuned!
