Calculating Median with BigQuery SQL

Arpit Rana
3 min read · Dec 6, 2020
Photo by Ruthson Zimmerman on Unsplash

I would normally do this in Pandas, but to report the median of a particular metric consistently (daily, weekly, monthly), it was important to make the calculation part of the SQL code that generates the final extract table. BigQuery does not have a built-in MEDIAN function, so we have to find another way to calculate the median of a field.

Median

Let’s start by outlining the logic of the median. In order to find the median of an array of numbers we must:

  • Sort the array
  • Find the value of the middle item in the array

Step 2 depends on whether the number of elements in the array is:

  • odd — median is simply the middle value
  • even — median is the average of the middle two values

We use the definitions above in a somewhat clever way to solve this problem. I found multiple solutions online, which I'm going to adapt and break down.

The Setup

Using the BigQuery public dataset bigquery-public-data.chicago_taxi_trips.taxi_trips, calculate the median of the field called trip_seconds.

Solution 1

Here is the first solution:

-- Row count plus one, so that count / 2 is the median position (n + 1) / 2.
DECLARE count INT64;
SET count = (SELECT COUNT(*) FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`) + 1;

WITH base AS (
  -- Number the rows in ascending order of trip_seconds.
  SELECT
    trip_seconds,
    ROW_NUMBER() OVER (ORDER BY trip_seconds) AS rowindex
    -- ,(SELECT COUNT(*) FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`)+1 as count
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
)
-- Average the one or two middle rows.
SELECT
  AVG(trip_seconds) AS median__trip_seconds
FROM
  base
WHERE
  base.rowindex IN (FLOOR(count / 2), CEIL(count / 2));

Read more about scripting in Standard SQL in the BigQuery documentation.

We use the logic above as a framework to solve this problem.

  • Generate a sorted array, using the ROW_NUMBER window function to record each value's position in the array
  • Return the average trip duration for the middle two elements of the array. If the length of the array is even, we average the values at the two middle indices. If the length of the array is odd, the two middle indices are equal, so the average is just the middle value itself.
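To make the two steps concrete, here is a minimal sketch on toy data rather than the taxi table. The values, the label column, and the names samples, indexed, and count_plus_one are all made up for illustration. It also uses COUNT(*) OVER (...) instead of a scripting variable so that it runs as a single statement.

```sql
-- Toy data (assumed values): one odd-length and one even-length list.
WITH samples AS (
  SELECT 'odd' AS label, value FROM UNNEST([7, 1, 5, 3, 9]) AS value
  UNION ALL
  SELECT 'even' AS label, value FROM UNNEST([7, 1, 5, 3, 9, 11]) AS value
),
indexed AS (
  SELECT
    label,
    value,
    -- Step 1: position of each value in the sorted list.
    ROW_NUMBER() OVER (PARTITION BY label ORDER BY value) AS rowindex,
    -- Number of values in the list, plus one.
    COUNT(*) OVER (PARTITION BY label) + 1 AS count_plus_one
  FROM samples
)
-- Step 2: average the one or two middle rows of each list.
SELECT
  label,
  AVG(value) AS median_value
FROM indexed
WHERE rowindex IN (FLOOR(count_plus_one / 2), CEIL(count_plus_one / 2))
GROUP BY label
ORDER BY label;
-- even -> 6.0 (average of 5 and 7), odd -> 5.0 (the middle value)
```

For the odd list the filter keeps only row 3, and for the even list it keeps rows 3 and 4, which is exactly the odd/even rule from the definition.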

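We declare a variable, count, and set it to the number of rows in the table plus one. The + 1 comes from the usual position formula for the median: in a sorted list of n values, the median sits at position (n + 1) / 2. When n is odd this is a whole number, and when n is even it falls halfway between the two middle positions.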

DECLARE count INT64;
SET count = (SELECT COUNT(*) FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`)+1;
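To see what the + 1 buys us, this small sketch lists the positions picked by FLOOR and CEIL for list lengths 1 through 6, with and without the + 1. The column names are my own, and n stands in for the row count.

```sql
SELECT
  n,
  -- With + 1: the position(s) the median query actually selects.
  FLOOR((n + 1) / 2) AS lower_middle,
  CEIL((n + 1) / 2) AS upper_middle,
  -- Without + 1, for comparison only.
  FLOOR(n / 2) AS naive_lower,
  CEIL(n / 2) AS naive_upper
FROM UNNEST(GENERATE_ARRAY(1, 6)) AS n
ORDER BY n;
-- n = 5: lower_middle = 3.0, upper_middle = 3.0 (naive: 2.0 and 3.0)
-- n = 6: lower_middle = 3.0, upper_middle = 4.0 (naive: 3.0 and 3.0)
```

Without the + 1, an odd-length list would average two different rows and an even-length list would pick only one, which is the opposite of what the median needs.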

Next, we generate a row index, sorted in ascending order of the column's values, using the window function ROW_NUMBER():

 ROW_NUMBER() OVER (ORDER BY trip_seconds) AS rowindex
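As a small illustration of the row index, here is a sketch on made-up values (not from the taxi table). One thing worth noticing is that ROW_NUMBER() gives tied values distinct, consecutive indexes, so every position from 1 up to the row count appears exactly once.

```sql
SELECT
  trip_seconds,
  ROW_NUMBER() OVER (ORDER BY trip_seconds) AS rowindex
FROM UNNEST([300, 120, 300, 60]) AS trip_seconds  -- assumed sample values
ORDER BY rowindex;
-- 60 -> 1, 120 -> 2, 300 -> 3, 300 -> 4
```

That property is what lets the final WHERE clause look rows up by position, even when many trips share the same duration.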

Finally, we average trip_seconds over the middle row (for an odd number of elements) or the two middle rows (for an even number of elements), selected by the WHERE condition below.

SELECT
  AVG(trip_seconds) AS median__trip_seconds
FROM
  base
WHERE
  base.rowindex IN (FLOOR(count / 2), CEIL(count / 2));

The median value is 540 seconds.
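One way to sanity-check a result like this is to compare it against BigQuery's percentile functions. This is a different technique from the row-number approach above, not part of Solution 1: PERCENTILE_CONT at the 0.5 percentile gives an exact median as a window function, and APPROX_QUANTILES gives a cheaper approximate one. I have not listed expected outputs here, and on a table this large the exact version may be slow or resource-hungry.

```sql
-- Exact median via a window function; every row carries the same value,
-- so LIMIT 1 is enough.
SELECT
  PERCENTILE_CONT(trip_seconds, 0.5) OVER () AS median_trip_seconds
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
LIMIT 1;

-- Approximate median: split into 2 quantile buckets, take the middle boundary.
SELECT
  APPROX_QUANTILES(trip_seconds, 2)[OFFSET(1)] AS approx_median_trip_seconds
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`;
```

Both functions ignore NULLs by default, while Solution 1 counts every row with COUNT(*) and BigQuery sorts NULLs first in ascending order. So if trip_seconds contains NULLs, the numbers may not match exactly unless the NULL rows are filtered out first.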

What I Learned

This was a fun exercise in combining user-defined variables and subqueries. Up until this point, I had never actually run any queries with user-defined variables. To solve this problem, I rewrote the code to fit the taxi_trips table. Then I just messed around with each piece of the query, running only small chunks at a time to understand every single line. Once I had grasped all the concepts, I immediately thought to try indexing with ROW_NUMBER(). Overall, this was an enjoyable learning experience, and I'm sure there are many more ways to calculate the median in BigQuery SQL. Do you have another solution? Can you think of better, more efficient ways to write this code? Let me know in the comments!

Arpit Rana

Analytics Engineer , Airflow Developer, Tableau Developer, Bigquery, GCP, DBT @ACV AUCTIONS