Ryan Schachte's Blog
Analyzing Strava data with ClickHouse
January 1st, 2025

It’s a new year and I’ve been messing around with some “new” tech: ClickHouse! Let’s imagine your online business requires a couple of things: a checkout process for handling customer transactions (adding items to carts, processing orders, etc.), and a way to analyze sales patterns over time to make better business decisions.

These different needs require different database approaches. Transactional databases (OLTP) handle day-to-day operations, while analytical databases (OLAP) help you understand trends and patterns in your data. While these black and white definitions drastically oversimplify the reality, we are going to dig into the OLAP side of things by getting more familiar with ClickHouse.

ClickHouse is a highly performant and large-scale capable columnar database. Columnar databases (like ClickHouse) store data differently than traditional row-based databases. In a row-based database like MySQL, data is stored and retrieved one row at a time - so reading a customer’s name also loads their address, phone number, and other fields. In contrast, columnar databases like ClickHouse store each column’s data separately on disk. This means when you want to calculate the average order value across millions of orders, it only needs to read the ‘amount’ column, making analytical queries dramatically faster.
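
To make that concrete, here’s a toy analytical query. The orders table and its amount/created_at columns are hypothetical (not part of this project), but the point stands: ClickHouse only has to scan the columns the query touches, no matter how wide each row is.

-- Hypothetical table, purely for illustration: only the `amount` and
-- `created_at` columns are read from disk, not entire order rows.
SELECT avg(amount) AS avg_order_value
FROM orders
WHERE created_at >= now() - INTERVAL 90 DAY;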

Project overview & setup

I’ve covered setup already, which you can find at the following:

This will allow you to follow along with the code and setup.

Anatomy of a GPX File: Core Components and Structure

My Garmin watch packages up my runs into an XML file known as GPX. A GPX (GPS Exchange Format) file is composed of these key elements, arranged hierarchically:

<gpx>                      # Root container for all GPS data
  <metadata>               # File-level information (time, creator)
    <time>...</time>
  </metadata>
 
  <trk>                   # A single activity/track
    <name>...</name>      # Activity details
    <type>...</type>
 
    <trkseg>             # One continuous segment of GPS data
      <trkpt>           # Individual GPS point with associated data
        <ele>...</ele>   # Elevation
        <time>...</time> # Timestamp
        <extensions>     # Additional data (heart rate, cadence, etc.)
          ...
        </extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>

Schemas

Since we’ll be running queries over this data often to power things like dashboards and client-side graphing, we need to think about optimal schema design before going too much further.

We will start with 2 tables:

  • activities
  • track_points

The columns are straightforward, but we will discuss the indices in a bit more detail below.

activities
CREATE TABLE activities (
    activity_id UUID,
    user_id UUID,
    start_time DateTime64(3) CODEC(DoubleDelta),
    end_time DateTime64(3) CODEC(DoubleDelta),
    activity_type LowCardinality(String),
    distance_meters Decimal64(2),
    duration_seconds Int32,
    avg_heart_rate Int32,
    avg_pace_seconds_per_km Decimal64(2),
    max_heart_rate Int32,
    max_pace_seconds_per_km Decimal64(2),
    total_elevation_gain_meters Decimal64(2),
    gpx_file_url String,
    created_at DateTime64(3) DEFAULT now64(3),
    PRIMARY KEY (user_id, start_time),
    INDEX idx_activities_type activity_type TYPE set(0) GRANULARITY 1
)
ENGINE = MergeTree()
ORDER BY (user_id, start_time);
track_points
CREATE TABLE track_points (
  point_id UUID DEFAULT generateUUIDv4(),
  activity_id UUID,
  timestamp DateTime64(3) CODEC(DoubleDelta),
  latitude Decimal64(6),
  longitude Decimal64(6),
  elevation_meters Decimal64(2),
  heart_rate Int32,
  speed_mps Decimal64(2),
  cadence Int32,
  temperature Decimal64(1),
  h3index UInt64 MATERIALIZED geoToH3(CAST(latitude AS Float64), CAST(longitude AS Float64), 12),
  PRIMARY KEY (activity_id, timestamp),
  INDEX idx_track_points_h3 (h3index) TYPE minmax GRANULARITY 4,
  INDEX idx_track_points_location (longitude, latitude) TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY (activity_id, timestamp);
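
One thing worth calling out about the MATERIALIZED h3index column: ClickHouse computes it at insert time and excludes it from SELECT *, so queries that want it (like the H3 examples later) have to name it explicitly. A quick sanity check could look like this:

-- h3index is populated automatically on insert, but hidden from SELECT *
SELECT latitude, longitude, h3index
FROM track_points
LIMIT 5;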

Indices

  • The activity_id column acts as a logical foreign key, letting us associate granular GPS data with a particular activity (ClickHouse doesn’t enforce foreign keys itself).
  • The (activity_id, timestamp) primary key is crucial for quickly finding the points for an activity in chronological order.
  • The idx_track_points_location index is a geospatial index that allows for efficient geospatial queries like finding all points within a bounding box, the closest points to a location, or determining where routes intersect.

Typically, indices help with:

  • block-level skipping of large chunks of data on read
  • queries that only grab a few columns from a row
  • minimal storage impact

ClickHouse stores data in blocks, the basic unit of storage, which default to 8,192 rows. Data within those blocks is stored in columns (instead of rows like we are used to with traditional MySQL databases). When we create indices, we create them over blocks, not the rows.
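
A handy way to watch this block-level skipping in action is EXPLAIN with index reporting turned on. Here’s a sketch (the user_id value is just a placeholder); the output lists, per index, how many parts and granules survived the pruning:

EXPLAIN indexes = 1
SELECT count()
FROM activities
WHERE user_id = '123e4567-e89b-12d3-a456-426614174000'
    AND start_time >= now() - INTERVAL 30 DAY;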

Let’s consider a realistic example of why indices matter at scale.

Dataset overview example

  • 10 million activity records
  • 8,192 rows per block ≈ 1,220 blocks
  • Activity distribution:
    • 50% running (5M rows)
    • 30% cycling (3M rows)
    • 20% other (2M rows)

Sample Query

Let’s examine a common query pattern for activity data:

activities
SELECT *
FROM activities
WHERE activity_type = 'running'
    AND start_time >= now() - INTERVAL 30 DAY
    AND avg_heart_rate > 160;

Without Index: Full Table Scan

When no index exists on activity_type, ClickHouse must:

  1. Read every block that matches the time range
  2. Examine every row in each block
  3. Filter for running activities
  4. Check heart rate threshold

For a 30-day time range covering 200 blocks:

  • Must read & process all 200 blocks
  • Approximately 1.6M rows examined
  • Significant I/O and CPU overhead

With Set Index: Optimized Scanning

Adding a set index on activity_type:

activities
INDEX idx_activities_type activity_type TYPE set(0) GRANULARITY 1
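
For reference, the same index could also be bolted onto an existing table. A sketch of that path: ALTER only applies the index to newly written parts, so you’d materialize it to backfill data that was already inserted.

-- Add the skip index to an already-populated table
ALTER TABLE activities
    ADD INDEX idx_activities_type activity_type TYPE set(0) GRANULARITY 1;

-- Build the index for parts written before the ALTER
ALTER TABLE activities MATERIALIZE INDEX idx_activities_type;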

Now ClickHouse can:

  1. Check the index to identify blocks containing ‘running’
  2. Skip blocks with no running activities entirely
  3. Only read blocks that potentially contain matches

For the same 200 blocks:

  • Only reads ~100 blocks containing running activities
  • Examines approximately 800K rows
  • Other 100 blocks skipped entirely

Performance Comparison

Without Index:

  • Read: 1.6M rows
  • Processed: 1.6M rows
  • Returned: 5,000 rows
  • Time: 2.5 seconds

With Set Index:

  • Read: 800K rows
  • Processed: 800K rows
  • Returned: 5,000 rows
  • Time: 1.2 seconds

This one simple line in our schema yields multiple benefits:

  1. 50% reduction in blocks read from disk
  2. 50% fewer rows processed
  3. Approximately 2x faster query execution

You’d see CPU and memory improvements as well.
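
These timings are illustrative, but you can measure the same thing on your own data: system.query_log records how many rows and bytes each query actually read. A sketch (available columns vary slightly between ClickHouse versions):

-- Compare read_rows for the query before and after adding the skip index
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS data_read
FROM system.query_log
WHERE type = 'QueryFinish'
    AND query ILIKE '%FROM activities%'
ORDER BY event_time DESC
LIMIT 5;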

Understanding our indices

In the activities table we have two indices:

activities
...
PRIMARY KEY (user_id, start_time), 
INDEX idx_activities_type activity_type TYPE set(0) GRANULARITY 1
  • Primary key enables efficient time range filtering per user
  • The set index works well for low-cardinality columns like activity_type (cycling, running, etc.)
  • LowCardinality(String) reduces memory for string columns
  • Granularity 1 means one index entry per block; higher values like 4 mean less precision but smaller index
  • set(0) stores every distinct value per block for exact matching, while set(5) caps it at 5 distinct values per block; blocks that exceed the cap can’t be skipped by the index
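
To double-check which skip indices actually exist on a table and how they’re configured, ClickHouse exposes them in a system table. A quick sketch (exact columns differ a bit between versions):

-- List the skip indices defined on our two tables
SELECT table, name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = currentDatabase()
    AND table IN ('activities', 'track_points');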

For the track_points table, we have a materialized H3 column and three indices:

track_points
  ...
  h3index UInt64 MATERIALIZED geoToH3(CAST(latitude AS Float64), CAST(longitude AS Float64), 12),
  PRIMARY KEY (activity_id, timestamp),
  INDEX idx_track_points_h3 (h3index) TYPE minmax GRANULARITY 4,
  INDEX idx_track_points_location (longitude, latitude) TYPE minmax GRANULARITY 4
  • The primary key sorts points by activity and time, so pulling the full track for a single activity you’re targeting is efficient
  • The H3 geospatial index uses resolution 12, which gets us roughly meter-level precision
  • The minmax index on lat/long is useful for bounding-box type queries and complements H3 well

With this schema design we can optimally answer questions like:

  • Show complete track for activity X
track_points
SELECT
    timestamp,
    latitude,
    longitude,
    elevation_meters,
    speed_mps,
    heart_rate,
    temperature
FROM track_points
WHERE activity_id = '123e4567-e89b-12d3-a456-426614174000'
ORDER BY timestamp;
  • What part of activity X occurred between time step Y and Z?
track_points
SELECT
    timestamp,
    latitude,
    longitude,
    elevation_meters,
    speed_mps,
    heart_rate,
    temperature
FROM track_points
WHERE activity_id = '123e4567-e89b-12d3-a456-426614174000'
    AND timestamp BETWEEN '2024-01-01 14:30:00' AND '2024-01-01 15:30:00'
ORDER BY timestamp;
  • What activities exist in a given geographical area?
track_points
SELECT
    activity_id,
    count() as points_in_area,
    min(timestamp) as first_seen,
    max(timestamp) as last_seen
FROM track_points
WHERE longitude BETWEEN 10.0 AND 11.0
    AND latitude BETWEEN 50.0 AND 51.0
GROUP BY activity_id
HAVING points_in_area > 5;  -- Filter out activities that just barely clip the area
  • Example use of H3 index for complex polygons
track_points
SELECT
    activity_id,
    count() as points_in_area,
    min(timestamp) as first_seen,
    max(timestamp) as last_seen
FROM track_points
WHERE h3index IN (
    -- Get all H3 cells that cover the area of interest
    SELECT arrayJoin(h3kRing(geoToH3(50.5, 10.5, 12), 1))
)
GROUP BY activity_id
HAVING points_in_area > 5;

Parser overview

The interesting bits don’t live within the code, but I will leave the repo here: https://github.com/Schachte/strava-clickhouse

This program is designed to experiment with the above ideas:

  1. Reads N GPX files from the activities directory
  2. Parses the data from the XML into JSON
  3. Inserts batches of parsed JSON containing activity info into our two tables (sketched below)
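
The exact insert code lives in the repo, but conceptually each batch boils down to something like the statement below. This is a sketch with made-up values; columns omitted from the list just take their defaults.

-- One made-up activity row; track_points rows are batched the same way
INSERT INTO activities
    (activity_id, user_id, start_time, end_time,
     activity_type, distance_meters, duration_seconds, avg_heart_rate)
FORMAT JSONEachRow
{"activity_id": "123e4567-e89b-12d3-a456-426614174000", "user_id": "00000000-0000-0000-0000-000000000001", "start_time": "2024-01-01 14:30:00.000", "end_time": "2024-01-01 15:30:00.000", "activity_type": "running", "distance_meters": 10000.00, "duration_seconds": 3600, "avg_heart_rate": 152}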

Querying data

So far, I’ve enjoyed querying and visualizing data with the following tools:

  • Tabix: web-based; slightly glitchy, but fine for browser-based querying
  • DbGate: a desktop UI that has been nice to use so far
  • Grafana: the powerhouse of data visualization, with a native ClickHouse plugin available

Here is an example graph I made in Grafana for analyzing smoothed pace over time for a run.
