Almost everything is a time series
BigQuery Techniques for Robust Time Series Data Handling
tl;dr
Through numerous client engagements involving time series analysis, I’ve observed a recurring pattern in data preparation and forecasting. While the specific applications vary, the fundamental steps are often the same. As most client work is confidential, this deep-dive provides a valuable opportunity to illustrate these techniques with public data.
Introduction
In virtually any data role – be it data engineer, analytics engineer, or data architect – time series data is a constant presence. Despite the vast array of use cases, the fundamental data types we encounter are surprisingly few, and time series is arguably the most prevalent. Think weather, sensor readings, sales, finance, transportation – any metric tracked over time is inherently time series.
And while that will arguably never change, the way we handle it certainly has: the data types remain constant, but the tools evolve.
Having transitioned from days of writing PySpark on a large on-prem Hadoop cluster to modern cloud solutions like BigQuery, I find the difference transformative. The combination of Python and SQL on BigQuery now handles most of my needs, and BigQuery Studio eliminates the overhead of managing infrastructure like JupyterHub. Auto-save, multiple runtimes, scheduling: it all feels like a weight lifted.
And you may start to understand the point of this deep-dive: to explore how to effectively store, process, and forecast time series data in a simple and scalable way, using the tools I’ve come to rely on.
The beautiful interface of BigQuery Studio that will be our playground for today
First, let’s find data
As usual, let’s rely on open data to get a credible example. For a change from the traditional sales or stock market datasets, we’ll use electricity consumption data from the UCI Machine Learning Repository. It contains electricity consumption data for 370 clients, collected between 2011 and 2014, with one reading per 15-minute interval, i.e. 96 measurements per day for each client.
The dataset is modest, with only 140k rows and 397MB of data. The CSV file has 371 columns: the first holds the timestamps and the other 370 each represent a client. Lucky for us, the data is clean: the timestamps are strings in the format yyyy-mm-dd hh:mm:ss.
After downloading the CSV file and uploading it to a GCS bucket, let’s create a partitioned table in BigQuery (to keep a clear conscience should the data volume suddenly balloon). For the sake of simplicity, I work in a BigQuery notebook, which lets me seamlessly run SQL queries and Python code side by side.
from google.cloud import bigquery
client = bigquery.Client()
project_id, dataset, table_raw = "mfg-bq-demo", "timeseries", "ElectricityLoadDiagrams"
# One timestamp column plus one FLOAT64 column per client (MT_001 ... MT_370)
columns = ["timestamp"] + [f"MT_{str(i).zfill(3)}" for i in range(1, 371)]
create_table_query = f"""
CREATE TABLE IF NOT EXISTS `{dataset}.{table_raw}` (
  timestamp TIMESTAMP,
  {", ".join(f"{col} FLOAT64" for col in columns[1:])}
) PARTITION BY DATE(timestamp)"""
query_job = client.query(create_table_query)
query_job.result()
Once the table is created, we can load our CSV file:
table_ref = client.get_table(f"{project_id}.{dataset}.{table_raw}")
job_config = bigquery.LoadJobConfig(
    # Same layout as the table: one TIMESTAMP column followed by 370 FLOAT columns
    schema=[{"name": "timestamp", "type": "TIMESTAMP"}] + [{"name": f"MT_{str(i).zfill(3)}", "type": "FLOAT"} for i in range(1, 371)],
    skip_leading_rows=1,  # skip the header row
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=";",  # the file is semicolon-delimited
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
uri = "gs://mfg-timeseries-data/LD2011_2014.txt"
load_job = client.load_table_from_uri(uri, table_ref, job_config=job_config)
load_job.result()
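As a quick sanity check (optional, and just one of several ways to do it), we can compare the number of rows reported by the load job with the table’s row count:
table = client.get_table(table_ref)
print(f"Loaded {load_job.output_rows} rows; the table now holds {table.num_rows} rows.")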
Everything’s looking good so far, but while BigQuery can seamlessly ingest up to 10,000 columns, it would be far handier to have only three: timestamp, client_id and measure. In other words, our SQL queries will be much easier to write against an unpivoted table; only a crazy mind would accept working with that many columns. And because BigQuery’s architecture is quite something (Colossus and Borg are one hell of a duo), there is virtually no limit to the number of rows you can ingest.
Let’s first create the table
table_unpivoted = "ElectricityLoadDiagramsUnpivoted"
create_unpivotedtable_query = f"""
CREATE TABLE IF NOT EXISTS `{dataset}.{table_unpivoted}` (
timestamp TIMESTAMP,
client_id STRING,
values FLOAT64
) PARTITION BY DATE(timestamp)"""
query_job = client.query(create_unpivotedtable_query)
query_job.result()
And unpivot our initial data
columns = [f"MT_{str(i).zfill(3)}" for i in range(1, 371)]
unpivot_data_query = f"""
INSERT INTO `{dataset}.{table_unpivoted}` (timestamp, client_id, values)
SELECT timestamp, client_id, values
FROM timeseries.ElectricityLoadDiagrams
UNPIVOT(values FOR client_id IN ({', '.join(columns)}))"""
query_job = client.query(unpivot_data_query)
query_job.result()
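To get a feel for the row explosion (each original row becomes up to 370 rows, one per client), a quick count along these lines does the trick; the exact figure will depend on your load:
%%bigquery
SELECT COUNT(*) AS row_count, COUNT(DISTINCT client_id) AS client_count
FROM `timeseries.ElectricityLoadDiagramsUnpivoted`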
You can’t always get what you want
We’re fortunate that this open dataset is remarkably clean, which is uncommon in real life, to say the least. Before we discuss unaligned time series, let’s establish two basic checks for time series data: eliminating duplicates and filling gaps. To make our example more representative of real-world scenarios, let’s simulate some common data quality issues.
So before proceeding to remove them, let’s introduce duplicates and gaps in our table.
We can introduce both local and global time gaps into our dataset. Local time gaps, specific to individual clients, can be created by randomly deleting rows from the unpivoted data. Global time gaps, affecting all clients at once, can be generated by randomly deleting rows from the original, pivoted dataset. Because this wouldn’t be fun otherwise, we will do both.
%%bigquery
## Manually create global gaps in the table
DELETE FROM `timeseries.ElectricityLoadDiagrams`
WHERE RAND() < 0.01;
We then re-created the ElectricityLoadDiagramsUnpivoted table (with the unpivot query above) so that it carries these global gaps, before adding the local ones:
%%bigquery
## Manually create local gaps in the table
DELETE FROM `timeseries.ElectricityLoadDiagramsUnpivoted`
WHERE RAND() < 0.01;
Final step, let’s add some duplicates
%%bigquery
INSERT INTO `timeseries.ElectricityLoadDiagramsUnpivoted` (timestamp, client_id, values)
SELECT timestamp, client_id, values
FROM `timeseries.ElectricityLoadDiagramsUnpivoted`
WHERE RAND() < 0.2;
Congrats, the data is now bad enough to look real 🎉
Time for some insights
Of course, before deep-diving, we need to remove the ~20% of duplicates we just introduced in our table and to fill the gaps we created. Let’s first check the damage:
%%bigquery
SELECT
  (COUNT(*) / COUNT(DISTINCT CONCAT(CAST(timestamp AS STRING), client_id)) - 1) * 100 AS duplicate_row_percentage
FROM
  `mfg-bq-demo.timeseries.ElectricityLoadDiagramsUnpivoted`;
In this scenario, the most straightforward approach is to use BigQuery’s built-in time series functions and recreate the table. (Materialized views were not an option due to the GAP_FILL function’s current incompatibility with our unpivoted table.) I’ve opted to fill missing values with NULLs, but BigQuery also offers local linear interpolation for more meaningful replacements. Notably, gap-filling isn’t strictly necessary for forecasting, as interpolation will be applied during that stage.
%%bigquery
CREATE OR REPLACE TABLE `timeseries.ElectricityLoadDiagramsUnpivoted`
PARTITION BY DATE(timestamp)
AS
SELECT timestamp, client_id, values
FROM GAP_FILL(
  (
    -- Deduplicate first: keep a single row per (timestamp, client_id) pair
    SELECT timestamp, client_id, values
    FROM (
      SELECT
        timestamp,
        client_id,
        values,
        ROW_NUMBER() OVER (PARTITION BY timestamp, client_id ORDER BY timestamp) AS rn
      FROM
        `timeseries.ElectricityLoadDiagramsUnpivoted`
    )
    WHERE rn = 1
  ),
  -- Then fill the 15-minute grid per client, inserting NULL values for missing buckets
  ts_column => 'timestamp',
  partitioning_columns => ['client_id'],
  bucket_width => INTERVAL 15 MINUTE,
  value_columns => [
    ('values', 'null')
  ]
);
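For reference, had we preferred interpolated values over NULLs, the only change in the statement above would be the fill method passed in value_columns: 'linear' for local linear interpolation, or 'locf' to carry the last observation forward. A minimal sketch of that variant (to be used instead of, not on top of, the 'null' version, and assuming duplicates have already been removed):
%%bigquery
SELECT timestamp, client_id, values
FROM GAP_FILL(
  TABLE `timeseries.ElectricityLoadDiagramsUnpivoted`,
  ts_column => 'timestamp',
  partitioning_columns => ['client_id'],
  bucket_width => INTERVAL 15 MINUTE,
  value_columns => [
    ('values', 'linear')
  ]
)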
Now that everything is set, we can finally begin exploring our data 🔎
%%bigquery
SELECT MIN(timestamp) as min_date, MAX(timestamp) AS max_date FROM timeseries.ElectricityLoadDiagramsUnpivoted
| min_date | max_date |
|---|---|
| 2011-01-01 00:15:00+00:00 | 2015-01-01 00:00:00+00:00 |
%%bigquery
SELECT COUNT(DISTINCT client_id) AS unique_clients FROM timeseries.ElectricityLoadDiagramsUnpivoted
|   | unique_clients |
|---|---|
| 0 | 370 |
Not much surprise so far: the provided dataset information is spot on.
Let’s have a look at the evolution of active clients
%%bigquery
## Evolution of active clients
SELECT
TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
client_id,
COUNTIF(values != 0) > 0 AS is_active
FROM timeseries.ElectricityLoadDiagramsUnpivoted
GROUP BY day, client_id
ORDER BY day
|   | day | client_id | is_active |
|---|---|---|---|
| 0 | 2011-01-01 00:00:00+00:00 | MT_225 | True |
| 1 | 2011-01-01 00:00:00+00:00 | MT_215 | True |
| 2 | 2011-01-01 00:00:00+00:00 | MT_059 | False |
| 3 | 2011-01-01 00:00:00+00:00 | MT_218 | True |
| 4 | 2011-01-01 00:00:00+00:00 | MT_032 | False |
| … | … | … | … |
| 540931 | 2015-01-01 00:00:00+00:00 | MT_319 | True |
| 540932 | 2015-01-01 00:00:00+00:00 | MT_167 | True |
| 540933 | 2015-01-01 00:00:00+00:00 | MT_176 | True |
| 540934 | 2015-01-01 00:00:00+00:00 | MT_100 | True |
| 540935 | 2015-01-01 00:00:00+00:00 | MT_011 | True |

540936 rows × 3 columns
As you may have noticed, we can execute both Python code and SQL queries directly within the notebook, using IPython magics like %%bigquery to write and run SQL seamlessly. These queries can be parameterized, and their results can be loaded into variables for later use within the notebook. Let’s demonstrate the latter by calculating the number of active clients per day.
%%bigquery active_clients_by_day
SELECT
day,
COUNT(DISTINCT CASE WHEN is_active THEN client_id END) AS active_clients
FROM
(
SELECT
TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
client_id,
COUNTIF(values != 0) > 0 AS is_active
FROM timeseries.ElectricityLoadDiagramsUnpivoted
GROUP BY day, client_id
)
GROUP BY day
ORDER BY day
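As for parameterization, the magic also accepts a --params flag, with values referenced in the query through the usual @name syntax. A small sketch (the cutoff date and the result variable are arbitrary examples, and we won’t reuse them later):
%%bigquery clients_reporting_since --params {"start_ts": "2014-01-01"}
SELECT
  TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
  COUNT(DISTINCT client_id) AS clients_reporting
FROM timeseries.ElectricityLoadDiagramsUnpivoted
WHERE timestamp >= TIMESTAMP(@start_ts)
GROUP BY day
ORDER BY day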
We can now do our data visualization using `seaborn` as we normally would:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
plt.figure(figsize=(10, 6))
ax = sns.lineplot(data=active_clients_by_day, x="day", y="active_clients", color="#4285F4", linewidth=2)
ax.set_xlabel("date")
ax.set_ylabel("active clients")
ax.set_title("Evolution of active clients")
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
plt.gcf().autofmt_xdate()
plt.grid(True, alpha=0.5) # Add a subtle grid
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()
Another metric that comes to mind is the evolution of kWh consumption. Each reading is an average power over a 15-minute interval, expressed in kW, so dividing by 4 converts it to kWh.
%%bigquery kwH_consumption_by_day
SELECT
TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
AVG(values/4) AS avg_kWh_load
FROM timeseries.ElectricityLoadDiagramsUnpivoted
WHERE timestamp >= '2012-01-01'
GROUP BY day
ORDER BY day
We will smooth the curve a bit for a better visual result:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
window_size = 10
kwH_consumption_by_day['avg_kWh_load_smooth'] = kwH_consumption_by_day['avg_kWh_load'].rolling(window=window_size, center=True).mean() # center parameter important to avoid shift
plt.figure(figsize=(10, 6))
ax = sns.lineplot(data=kwH_consumption_by_day, x="day", y="avg_kWh_load_smooth", color="#4285F4", linewidth=2)
ax.set_xlabel("date")
ax.set_ylabel("kWh")
ax.set_title("Evolution of kWh consumption")
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
plt.gcf().autofmt_xdate()
plt.grid(True, alpha=0.5)
plt.tight_layout()
plt.show()
Generally, 8 kWh per day can be considered a moderate energy consumption for an average-sized household (2 to 4 people) in a standard-sized dwelling.
Let’s compare the average kWh consumption by month.
%%bigquery average_kWh_consumption_by_month
SELECT
EXTRACT(YEAR FROM timestamp) as year,
EXTRACT(MONTH FROM timestamp) as month,
AVG(values/4) AS avg_kWh_load
FROM timeseries.ElectricityLoadDiagramsUnpivoted
WHERE timestamp >= '2012-01-01'
GROUP BY year, month
HAVING year in (2012, 2013)
ORDER BY year, month
plt.figure(figsize=(10, 6))
ax = sns.barplot(x='month', y='avg_kWh_load', hue='year', data=average_kWh_consumption_by_month, palette=["#174EA6", "#4285F4" ])
ax.set_title('Average kWh Consumption by Month (2012 vs 2013)')
ax.set_xlabel('Month')
ax.set_ylabel('Average kWh')
plt.xticks(range(1, 13))
plt.legend(title='Year')
plt.grid(True, alpha=0.5)
plt.tight_layout()
ax.set_ylim(6.5e+13, None)
plt.show()
%%bigquery monthly_augmentation_percentage
SELECT
month,
ROUND(
ABS(_2012 - _2013) / ((_2012 + _2013) / 2) * 100
, 2) as augmentation_percentage
FROM (
SELECT
EXTRACT(YEAR FROM timestamp) as year,
EXTRACT(MONTH FROM timestamp) as month,
AVG(values/4) AS avg_kWh_load
FROM timeseries.ElectricityLoadDiagramsUnpivoted
WHERE timestamp >= '2012-01-01'
GROUP BY year, month
HAVING year in (2012, 2013)
ORDER BY year, month
) PIVOT (AVG(avg_kWh_load) FOR year IN (2012, 2013))
ORDER BY month
|   | month | augmentation_percentage |
|---|---|---|
| 0 | 1 | 6.55 |
| 1 | 2 | 6.66 |
| 2 | 3 | 5.67 |
| 3 | 4 | 4.65 |
| 4 | 5 | 5.20 |
| 5 | 6 | 4.49 |
| 6 | 7 | 2.05 |
| 7 | 8 | 1.66 |
| 8 | 9 | 3.68 |
| 9 | 10 | 5.71 |
| 10 | 11 | 6.12 |
| 11 | 12 | 5.86 |
Switching to unaligned data
Before trying to forecast the electricity consumption of various customers, I’d like to digress on one practical aspect of time series data. Most of the time, time series data is unaligned, meaning that data points are not received simultaneously. The logic stays very close to what we have seen so far, with TIMESTAMP_BUCKET still applying.
Moving forward, we’ll continue with aligned data, but I wanted to take a moment to show you how to simulate unaligned time series data.
We can transform our table to simulate unaligned data, allowing for a maximum delay of 5 minutes in data arrival.
table_unpivoted_unaligned = "ElectricityLoadDiagramsUnpivotedUnaligned"
create_unaligned_data_query = f"""
CREATE TABLE IF NOT EXISTS `{dataset}.{table_unpivoted_unaligned}`
PARTITION BY DATE(timestamp)
AS (
  SELECT
    -- Shift each reading by a random delay of 0 to 300 seconds (up to 5 minutes)
    TIMESTAMP_ADD(
      timestamp, INTERVAL CAST(RAND() * 5 * 60 AS INT64) SECOND
    ) AS timestamp,
    client_id,
    values
  FROM {dataset}.{table_unpivoted}
);
"""
query_job = client.query(create_unaligned_data_query)
query_job.result()
A key challenge when working with multiple unaligned time series is the lack of synchronized start and end points, making direct comparison difficult. This is further compounded by differing sampling rates, time zone discrepancies, and the presence of missing or irregularly spaced data, all of which hinder consistent analysis.
However, techniques like TIMESTAMP_BUCKET offer a practical solution for realigning these series, enabling coherent analysis and comparison.
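To make that concrete, here is a minimal sketch that snaps the jittered timestamps of our unaligned table back onto a 15-minute grid (taking the average within each bucket, though any aggregate would do):
%%bigquery
SELECT
  TIMESTAMP_BUCKET(timestamp, INTERVAL 15 MINUTE) AS aligned_timestamp,
  client_id,
  AVG(values) AS values
FROM timeseries.ElectricityLoadDiagramsUnpivotedUnaligned
GROUP BY aligned_timestamp, client_id
ORDER BY aligned_timestamp
LIMIT 10
Once the series are back on a common grid, aggregations like the monthly view below behave exactly as they did on the aligned table.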
%%bigquery monthly_kwH_consumption
SELECT
FORMAT_TIMESTAMP('%Y', timestamp) as year,
FORMAT_TIMESTAMP('%B', timestamp) as month,
AVG(values/4) AS avg_kWh_load
FROM timeseries.ElectricityLoadDiagramsUnpivotedUnaligned
WHERE timestamp >= '2012-01-01' AND timestamp < '2015-01-01'
GROUP BY year, month
ORDER BY year, month
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
monthly_kwH_consumption['month'] = pd.Categorical(monthly_kwH_consumption['month'], categories=month_order, ordered=True)
monthly_kwH_consumption_pivot = monthly_kwH_consumption.pivot(index="month", columns="year", values="avg_kWh_load")
sns.heatmap(monthly_kwH_consumption_pivot, fmt="d", cmap="coolwarm")
Just a quick note before we move on to forecasting: you might see me using FORMAT_TIMESTAMP('%B', timestamp) and TIMESTAMP_BUCKET(timestamp, INTERVAL 1 MONTH) for monthly aggregation. While they might visually produce similar results, they function very differently. FORMAT_TIMESTAMP produces month names as text, causing alphabetical sorting, which is why we need pandas’ Categorical to fix the order. TIMESTAMP_BUCKET, on the other hand, generates actual timestamps representing the start of each month. This allows BigQuery to understand and sort the data chronologically, eliminating the need for manual reordering and ensuring accurate time-based analysis.
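A minimal illustration of the difference on a single client, using TIMESTAMP_TRUNC here, which likewise returns the first timestamp of each month: the month_start column sorts chronologically out of the box, while month_label would sort alphabetically.
%%bigquery
SELECT
  FORMAT_TIMESTAMP('%B', timestamp) AS month_label,
  TIMESTAMP_TRUNC(timestamp, MONTH) AS month_start,
  AVG(values/4) AS avg_kWh_load
FROM timeseries.ElectricityLoadDiagramsUnpivoted
WHERE client_id = 'MT_001' AND timestamp >= '2012-01-01' AND timestamp < '2013-01-01'
GROUP BY month_label, month_start
ORDER BY month_start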
Forecasting the future, 15 minutes at a time
The number one thing you might want to do with time series data is forecast what comes next. In our case, we have enough data to predict the consumption of a given customer. We can achieve this without going far, thanks to BigQuery ML, which lets you create and run machine learning (ML) models using SQL queries. Among the various models available, the one we’re interested in today is ARIMA_PLUS.
Because we want to forecast the consumption of each customer, we need to forecast multiple time series. The ARIMA_PLUS model lets us do that thanks to the TIME_SERIES_ID_COL option.
As always with BQML, creating the model is straightforward:
%%bigquery
CREATE MODEL `timeseries.ElectricityForecastConsumption`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
time_series_timestamp_col='timestamp',
time_series_data_col='values',
time_series_id_col='client_id',
holiday_region="CA"
)
AS
SELECT
timestamp,
client_id,
values
FROM
`timeseries.ElectricityLoadDiagramsUnpivoted`
It processed ~23GB and took around 3 hours (as mentioned before, a model has to be fitted for each customer).
BQML automatically replaced the NULL values we created earlier using linear interpolation. It also automatically filled the gaps.
At this point, the more curious among you can run an evaluation (by default, only for the best candidate model of each series) and inspect the model coefficients:
%%bigquery
SELECT * FROM ML.ARIMA_EVALUATE(MODEL timeseries.ElectricityForecastConsumption);
SELECT * FROM ML.ARIMA_COEFFICIENTS(MODEL timeseries.ElectricityForecastConsumption);
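By default, ML.ARIMA_EVALUATE only reports the best candidate per time series; to see every (p, d, q) combination that was tried, you can pass the show_all_candidate_models argument, along these lines:
%%bigquery
SELECT *
FROM ML.ARIMA_EVALUATE(
  MODEL `timeseries.ElectricityForecastConsumption`,
  STRUCT(TRUE AS show_all_candidate_models)
)
ORDER BY client_id, AIC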
With our model now operational, we can forecast future electricity consumption for each client by specifying the forecast horizon (the number of time points). To forecast the first 10 days of January, the horizon would be 960, given that we have a data point every 15 minutes.
%%bigquery
SELECT *
FROM ML.FORECAST(
MODEL `timeseries.ElectricityForecastConsumption`,
STRUCT(960 AS horizon , 0.9 AS confidence_level)
)
Since we didn’t explicitly set the HORIZON option when creating the model, it defaults to a maximum of 1,000 time points. We could raise this horizon up to 10,000 time points by retraining the model, though this would increase training time. For forecasts beyond 10,000 time points (approximately 3.5 months at our current 15-minute granularity), we can aggregate the data to a coarser granularity, such as hourly, thereby reducing the number of points and allowing us to forecast further, potentially up to about 14 months for each client.
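For instance, aggregating to hourly buckets would look something like this (just a sketch of what the training input could be; we won’t actually train an hourly model here):
%%bigquery
SELECT
  TIMESTAMP_BUCKET(timestamp, INTERVAL 1 HOUR) AS hour,
  client_id,
  SUM(values/4) AS kWh
FROM timeseries.ElectricityLoadDiagramsUnpivoted
GROUP BY hour, client_id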
For now, let’s leverage our 15-minute model to get a consumption prediction for a single client:
%%bigquery ClientConsumptionPrediction
SELECT
TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
AVG(values/4) AS avg_kWh_load,
'Historical' AS data_source,
null AS lower_bound,
null AS upper_bound
FROM
timeseries.ElectricityLoadDiagramsUnpivoted
WHERE
client_id = 'MT_001' AND timestamp >= '2014-12-01'
GROUP BY day
UNION ALL
SELECT
TIMESTAMP_BUCKET(forecast_timestamp, INTERVAL 24 HOUR) AS day,
AVG(forecast_value/4) AS avg_kwh_load,
'Forecast' AS data_source,
AVG(prediction_interval_lower_bound/4) AS lower_bound,
AVG(prediction_interval_upper_bound/4) AS upper_bound
FROM
ML.FORECAST(
MODEL `timeseries.ElectricityForecastConsumption`,
STRUCT(960 AS horizon, 0.9 AS confidence_level)
)
WHERE
client_id = 'MT_001'
GROUP BY day
ORDER BY day
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
window_size = 10
ClientConsumptionPrediction['avg_kwh_load_smooth'] = ClientConsumptionPrediction['avg_kWh_load'].rolling(window=window_size, center=True).mean()
plt.figure(figsize=(10, 6))
sns.lineplot(
data=ClientConsumptionPrediction[ClientConsumptionPrediction['data_source'] == 'Historical'],
x="day",
y="avg_kwh_load_smooth",
color="#4285F4",
linewidth=2,
label="Historical"
)
sns.lineplot(
data=ClientConsumptionPrediction[ClientConsumptionPrediction['data_source'] == 'Forecast'],
x="day",
y="avg_kwh_load_smooth",
color="#4285F4",
linewidth=2,
linestyle='dotted',
label="Forecast"
)
plt.xlabel("date")
plt.ylabel("kWh")
plt.title("Evolution of kWh consumption for MT_001")
plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
plt.gcf().autofmt_xdate()
plt.grid(True, alpha=0.5)
plt.tight_layout()
plt.legend()
plt.show()
Let’s compare that with the same period one year earlier.
As mentioned above, it intuitively doesn’t make sense to train the model on 15-minute intervals if we aggregate the data later on anyway (at the risk of losing precision).
Going further into the future
Let’s train a larger model on daily-aggregated data:
%%bigquery
CREATE MODEL `timeseries.DailyElectricityForecastConsumptionExpanded`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
time_series_timestamp_col='day',
time_series_data_col='avg_kWh_load',
time_series_id_col='client_id',
holiday_region="CA",
horizon=10000
)
AS
SELECT
TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
client_id,
AVG(values/4) AS avg_kWh_load
FROM
`timeseries.ElectricityLoadDiagramsUnpivoted`
GROUP BY day, client_id
%%bigquery ClientDailyConsumptionPrediction
SELECT
TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
AVG(values/4) AS avg_kWh_load,
'Historical' AS data_source,
FROM
timeseries.ElectricityLoadDiagramsUnpivoted
WHERE
client_id = 'MT_003' AND timestamp >= '2014-12-01'
GROUP BY day
UNION ALL
SELECT
TIMESTAMP_BUCKET(forecast_timestamp, INTERVAL 24 HOUR) AS day,
avg(forecast_value) AS avg_kwh_load,
'Forecast' AS data_source,
FROM
ML.FORECAST(
MODEL `timeseries.DailyElectricityForecastConsumptionExpanded`,
STRUCT(1000 AS horizon, 0.9 AS confidence_level)
)
WHERE
client_id = 'MT_003'
GROUP BY day
ORDER BY day
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
window_size = 10
ClientDailyConsumptionPrediction['avg_kwh_load_smooth'] = ClientDailyConsumptionPrediction['avg_kWh_load'].rolling(window=window_size, center=True).mean()
plt.figure(figsize=(10, 6))
sns.lineplot(
data=ClientDailyConsumptionPrediction[ClientDailyConsumptionPrediction['data_source'] == 'Historical'],
x="day",
y="avg_kwh_load_smooth",
color="#4285F4",
linewidth=2,
label="Historical"
)
sns.lineplot(
data=ClientDailyConsumptionPrediction[ClientDailyConsumptionPrediction['data_source'] == 'Forecast'],
x="day",
y="avg_kwh_load_smooth",
color="#4285F4",
linewidth=2,
linestyle='dotted',
label="Forecast"
)
plt.xlabel("date")
plt.ylabel("kWh")
plt.title("Evolution of kWh consumption for MT_003")
plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
plt.gcf().autofmt_xdate()
plt.grid(True, alpha=0.5)
plt.tight_layout()
plt.legend()
plt.show()
We’ve extended the forecast horizon considerably thanks to the aggregated data, but the persistent pattern starting January 8th, 2015, clearly points to the need for more robust feature engineering.
Merely accounting for holidays isn’t enough to capture the intricacies of this data. To enhance predictive accuracy, we should incorporate lagged consumption (e.g., consumption from the previous day, week, or month) and segment customers based on their usage, building tailored models for each segment.
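As a starting point, lag features are a one-window-function affair in BigQuery. Here is a sketch of daily lag features, which could then feed a model that accepts external regressors (for instance ARIMA_PLUS_XREG) or a per-segment regression model; the column names are arbitrary:
%%bigquery
SELECT
  day,
  client_id,
  avg_kWh_load,
  LAG(avg_kWh_load, 1) OVER (PARTITION BY client_id ORDER BY day) AS kWh_previous_day,
  LAG(avg_kWh_load, 7) OVER (PARTITION BY client_id ORDER BY day) AS kWh_previous_week
FROM (
  SELECT
    TIMESTAMP_BUCKET(timestamp, INTERVAL 24 HOUR) AS day,
    client_id,
    AVG(values/4) AS avg_kWh_load
  FROM timeseries.ElectricityLoadDiagramsUnpivoted
  GROUP BY day, client_id
)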
Conclusion
We’ve covered a lot, so I’ll stop there for this deep dive, but I think you got the idea: BigQuery is really handy when it comes to handling time series data. Preprocessing and forecasting can be efficiently performed within BigQuery using SQL, even at large scales. Although we used batch data for this example, the same approach applies to streaming data.
Think of time series analysis like navigating a vast, intricate map. BigQuery provides the tools – the compass, the magnifying glass, and the telescope – to explore this landscape. We can zoom in to examine individual data points, pan out to observe long-term trends, and even peer into the future with forecasting. And we can do all of this at BigQuery scale, effortlessly handling the vast and ever-expanding territories of time series data.