Testing Running Myths: Lyngby Half Marathon 2026¶

Social Data Analysis · DTU 2026 · Assignment B (Explainer Notebook)
Team: Marta Arana · Esben Kok · Sergi Lupon
Race: Lyngby Halvmarathon, 26 April 2026 (21.1 km)

Companion notebook to our website Testing Running Myths. This notebook documents the full pipeline (raw GPX + Google Forms survey → cleaned dataset → plots → conclusions) and explains every design choice using the methods and references we saw in lectures Weeks 1–8.


1. Motivation¶

Every Sunday at the running club we have the same arguments.

"You should run more hills, Lyngby has a brutal climb at km 14."
"Just do more kilometres, that's the secret."
"Forget volume, what you need are intervals."
"Trust me, the experienced ones always pace it better."

On 26 April 2026, 22 of us actually lined up at the Lyngby Halvmarathon, GPS watches strapped on, training logs filled in. So instead of arguing for one more season, we decided to let the data settle it.

We collected two things from every runner who agreed to participate:

  1. A pre-race survey (Google Forms) with training habits, experience and target time.
  2. The .gpx file their watch produced on race day (lat / lon / elevation / heart-rate every few seconds).

From these we test four "myths" of recreational distance running:

# Myth Source we benchmark against
1 Experienced runners pace themselves better Haney & Mercer (2011)
2 Hill training helps on a hilly course Billat et al. (2003)
3 More training volume always means faster times Sato et al. (2015)
4 Interval training leads to smarter race strategy Helgerud et al. (2007)

The website tells the story for a general audience; this notebook shows how the sausage was made, and where it could be improved.

2. The Dataset¶

2.1 What is our dataset?¶

Two raw sources, both first-party (collected by us):

Source Records Format What's in it
Pre-race survey (Google Forms) 22 runners .xlsx name, age, weight, weeks trained, km/week, training type, prior half-marathon PR, target time, confidence (1–5), injuries, main worry, pacing strategy, finish time
Race-day GPS tracks 25 runners .gpx trackpoints every ~1 s: latitude, longitude, elevation, time, heart rate (Garmin/COROS extension)

Out of 22 respondents (21 finishers after one DNF), all 21 finishers shared their .gpx file, plus 4 additional runners who wore GPS but did not fill the survey. Every finisher therefore has per-kilometre pace, heart-rate and elevation curves.

2.2 Why this dataset?¶

Most Strava/running studies in the literature use anonymised platform exports (Sato et al. 2015 used Strava metadata for thousands of runners). Those datasets are huge but very flat; you don't know if the runner trained on hills, what their target was, or how confident they felt the night before.

We chose a small, rich dataset instead:

  • We can join self-reported training behaviour (the survey) with actual race-day physiology (HR, pace, elevation) on the same person, exactly the kind of merge problem Week 2 walked us through ("common keys", "category mismatch").
  • We know every runner personally, so we can verify edge cases (a watch that lost signal, a runner who used a different start corral).
  • The four myths we're testing all need training type information that anonymised platform data simply doesn't carry.

The cost is small n. We address this honestly in the Discussion.

2.3 Goal for the end-user experience¶

We imagined our target reader as a fellow recreational runner browsing on a phone after their long run. The website is therefore:

  • Scroll-driven, top-to-bottom, one big idea per section (Segel & Heer 2010 partitioned poster genre; Week 6 / Week 8).
  • Built around a clear question per myth, with a verdict badge (Supported / Partially Supported) so a hurried reader gets the answer before the chart.
  • Interactive only where it adds meaning: runner-by-runner pace chart, individual route maps, sortable bars, following Shneiderman's mantra overview first → zoom and filter → details on demand (Week 6).

This notebook is the opposite: a slower, more academic companion intended for the grader and for any future team member who wants to reproduce or extend the work.

3. Data Cleaning & Preprocessing¶

We do all the cleaning in this notebook so the pipeline is reproducible end-to-end. Three substeps:

  1. Parse the survey (Excel): mixed Spanish/English columns, free-text time fields, ranges encoded as text ("10–20 km/semana").
  2. Parse the .gpx files: XML, namespaced elements (Garmin's TrackPointExtension), one folder per runner.
  3. Merge into one tidy DataFrame keyed on runner name, save as lyngby_runners_2026.csv.

We deliberately use plain Python + the lecture libraries (xml.etree, pandas, numpy), no specialist GPX library, so the work mirrors what we learned in Weeks 1–4 (loading, schema reconciliation, feature engineering).

In [1]:
# ---- Imports (class libraries only) ----
import xml.etree.ElementTree as ET            # Python stdlib XML (for GPX files)
from math import radians, sin, cos, asin, sqrt
from pathlib import Path
import re, warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns                          # used in Weeks 3–4 for KDE/scatter
from scipy import stats

import plotly.express as px                    # Week 6 (interactive)
import plotly.graph_objects as go
import folium                                  # Week 5 (geospatial)

from IPython.display import display, HTML, IFrame

# ---- Project paths: walk up from cwd to find the repo root ----
import os as _os
def _find_root():
    """Find the repo root by locating data/race_day/ with many runner folders."""
    for candidate in [Path(_os.getcwd()), Path(_os.getcwd()).parent, Path(_os.getcwd()).parent.parent]:
        rday = candidate / 'data' / 'race_day'
        if rday.exists() and sum(1 for p in rday.iterdir() if p.is_dir()) > 10:
            return candidate
    raise FileNotFoundError(
        'Cannot find social-data-project root.\n'
        'Launch Jupyter from within social-data-project/ or its notebook/ folder.')
ROOT     = _find_root()
DATA_DIR = ROOT / 'data'
SURVEY   = DATA_DIR / 'processed' / '🏃 Lyngby Halvmarathon 2026 — Datos del corredor (respuestas).xlsx'
GPX_DIR  = DATA_DIR / 'race_day'
OUT_DIR  = Path(_os.getcwd())                                # write outputs next to notebook

print('Survey file exists :', SURVEY.exists())
print('GPX folders found  :', sorted(p.name for p in GPX_DIR.iterdir() if p.is_dir()))
Survey file exists : True
GPX folders found  : ['Alex Torres', 'Álvaro Martinez', 'Carlos Sainz', 'Coline Petit', 'Cristina Ramon', 'Célien Moreau', 'Eloi Colprim', 'Isabel Vidal', 'Jon Larranaga', 'Jose Martinez', 'Lucia Pampuro', 'Marcus Henriksen', 'Maria Caballero', 'Marta Arana', 'Nina Larsson', 'Oier ', 'Oriol Rovira', 'Pablo Arce', 'Pablo Baurier', 'Roger Sala', 'Sofia Ortiz', 'Thibaut Heim', 'Théophile Blanc', 'Unai Pascual', 'Yann Dubois']

3.1 Visual identity¶

Before any plot, we fix the visual palette so every figure in this notebook (and on the website) speaks the same language. This is rule #1 of "nice plots" from the Week 2 video (consistent visual platform; also Segel & Heer's Visual Structuring category, Week 6).

Three archetypes, three colours, used everywhere:

  • The Experienced: cyan (cold, controlled)
  • The Grinders: pink (hot, volume-driven)
  • The Believers: lime (bright, ambition over data)
In [2]:
# ---- Palette (matches the website CSS variables exactly) ----
WHITE = '#ffffff'
ACC   = '#E8FF00'              # neon yellow accent (highlights)
EXP   = '#00C8FF'              # cyan: The Experienced
GRI   = '#FF3366'              # pink: The Grinders
BEL   = '#AAFF00'              # lime: The Believers
ARCH_COLORS = {'The Experienced': EXP, 'The Grinders': GRI, 'The Believers': BEL}
ARCH_ORDER  = ['The Experienced', 'The Grinders', 'The Believers']

# ---- Dark matplotlib theme (mirrors the website aesthetic) ----
mpl.rcParams.update({
    'figure.facecolor': '#111111', 'axes.facecolor': '#1a1a1a',
    'axes.edgecolor':   '#333333', 'axes.labelcolor': '#cccccc',
    'axes.titlecolor':  '#ffffff', 'xtick.color': '#888888',
    'ytick.color':      '#888888', 'text.color':  '#cccccc',
    'grid.color':       '#2a2a2a', 'grid.linestyle': '--',
    'grid.linewidth':   0.6,       'legend.facecolor': '#222222',
    'legend.edgecolor': '#444444', 'legend.labelcolor': '#cccccc',
    'font.family':      'sans-serif', 'font.size': 11,
})
sns.set_style({'axes.facecolor':'#1a1a1a', 'grid.color':'#2a2a2a'})
print('Theme loaded.')
Theme loaded.

3.2 Cleaning the survey¶

The survey is half in Spanish, half in English (some friends filled the form before we translated it), with range strings for the numeric questions ("10–20 km/semana", "7–9 weeks"). We tackle three issues, each one a textbook case from the Week 2 lecture on schema reconciliation:

  1. Column renaming to short English keys.
  2. Range → midpoint mapping (Week 2: the many flavours of category mismatch; same answer encoded multiple ways).
  3. Time strings → minutes for finish time and target time.
In [3]:
# ---- 3.2a Load and rename ----
raw = pd.read_excel(SURVEY)

rename_map = {
    'Marca temporal':                                                              'timestamp',
    'Nombre o apodo':                                                              'name',
    'Edad (años)':                                                                 'age',
    'Género':                                                                      'gender_raw',
    'Peso aproximado (kg)':                                                        'weight_kg',
    '¿Cuántas semanas llevas entrenando para esta carrera?':                       'weeks_str',
    '¿Cuántos km corres por semana de media?':                                     'kmweek_str',
    '¿Qué tipo de entrenamiento has hecho principalmente?':                        'training_type',
    '¿Es tu primera media maratón?':                                               'first_hm',
    'Si ya has corrido una media maratón antes, ¿cuál es tu mejor marca?':         'prev_pr',
    '¿Cuál es tu objetivo de tiempo?':                                             'target_band',
    'tiempo objetivo':                                                             'target_time',
    '¿Cómo de confiado/a te sientes de cara a la carrera?':                        'confidence',
    'Has tenido alguna lesion los ultimos 6 meses ':                               'injury',
    '¿Cuál es tu mayor preocupación para la carrera?':                             'worry',
    '¿Cuál es tu estrategia de ritmo?':                                            'pace_strategy',
    'finish time ':                                                                'finish_time',
}
s = raw.rename(columns=rename_map).copy()
s['name'] = s['name'].astype(str).str.strip()

# ---- 3.2b Range strings → numeric midpoints ----
KMWEEK = {
    'Menos de 10 km/semana': 7,  'Less than 10 km/week': 7,
    '10–20 km/semana': 15,       '10–20 km/week': 15,
    '20–30 km/semana': 25,       '20–30 km/week': 25,
    '30–45 km/semana': 37,       '30–45 km/week': 37,
    '45–60 km/semana': 52,       '45–60 km/week': 52,
}
WEEKS = {
    'casi nada': 1, '1–3 semanas (casi nada)': 2, '1–3 weeks (barely started)': 2,
    '4–6 semanas': 5, '4–6 weeks': 5,
    '7–9 semanas': 8, '7–9 weeks': 8,
    '10–12 semanas': 11, '10–12 weeks': 11,
    '13–16 semanas': 14, '13–16 weeks': 14,
    '17 semanas o más': 18, '17+ weeks': 18,
}
GENDER = {'Mujer': 'F', 'Female': 'F', 'Hombre': 'M', 'Male': 'M'}

s['km_per_week']     = s['kmweek_str'].map(KMWEEK)
s['training_weeks']  = s['weeks_str'].map(WEEKS)
s['gender']          = s['gender_raw'].map(GENDER)

# Binary features used by the myth tests
s['trained_hills']    = s['worry'].fillna('').str.contains('km 14|subida|hill', case=False).astype(int)
s['trains_intervals'] = s['training_type'].fillna('').str.contains('intervals|series|fartlek', case=False).astype(int)

def to_min(x):
    """Convert a free-text finish/target time (hh:mm:ss, hh:mm, '1:55') to minutes."""
    if pd.isna(x): return np.nan
    if isinstance(x, (int, float)): return float(x)
    txt = str(x).strip()
    m = re.match(r'^(\d+):(\d+)(?::(\d+))?', txt)
    if not m: return np.nan
    h, mi, se = int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)
    return h * 60 + mi + se / 60.0 if h < 4 else (h + mi/60 + se/3600)

s['finish_min_survey'] = s['finish_time'].apply(to_min)
s['target_min']        = s['target_time'].astype(str).apply(to_min)

print('After cleaning:', s.shape)
display(s[['name','gender','age','km_per_week','training_weeks',
           'trained_hills','trains_intervals','target_min','finish_min_survey']].head(10))
After cleaning: (26, 24)
name gender age km_per_week training_weeks trained_hills trains_intervals target_min finish_min_survey
0 Cristina F 23 15 11 1 1 113.0 112.883333
1 Unai M 23 15 14 0 0 105.0 104.733333
2 Oierga M 23 37 11 0 1 105.0 95.716667
3 Oriol M 23 25 8 0 1 83.0 81.616667
4 Álvaro M 24 25 8 0 1 100.0 99.916667
5 Jose M 22 15 11 0 0 110.0 108.950000
6 Maria F 22 25 11 1 1 105.0 104.366667
7 Carlos Sainz M 23 15 8 0 0 135.0 135.366667
8 Jon M 23 52 18 0 1 105.0 104.366667
9 Lucia F 22 7 2 0 1 120.0 119.050000

3.3 Parsing the GPX files¶

Parsing the 25 GPX files turned out to be trickier than expected.

Each .gpx is XML, but the 25 watches in our group produced files with three different namespace conventions:

  • g: standard GPX 1.1 (<trkpt>, <ele>, <time>), all devices
  • t: Garmin's TrackPointExtension (<gpxtpx:hr> for heart-rate), used by most runners
  • c: COROS / ClueTrust (<gpxdata:hr>), one runner (Jon) used a COROS watch that embeds HR in a completely different namespace

That alone required two separate find() calls with an explicit is not None guard. Using Python's or shorthand (element1 or element2) silently fails on ElementTree leaf nodes: a node that exists but has only text content (no child elements) evaluates to False in the boolean context, causing HR to be silently dropped for every Garmin runner.

A second surprise came from macOS's HFS+ filesystem: folder names are stored in NFD Unicode form (decomposed, e.g. A + combining acute ´), while Python string literals use NFC (precomposed, e.g. Á). The crosswalk dictionary keys and the folder names looked identical on screen but didn't match as strings, causing a KeyError for every name with an accent (Álvaro, Célien, Théophile…). Fix: unicodedata.normalize('NFC', folder.name) at parse time.

From each trackpoint we extract (lat, lon, ele, time, hr). We then compute cumulative distance between consecutive points with the Haversine formula (great-circle distance on a sphere, the same formula used in Week 5 for geospatial work), bucket points into 1-km bins, and aggregate per-kilometre pace, mean HR and mean elevation.

In [4]:
NS = {
    'g': 'http://www.topografix.com/GPX/1/1',
    't': 'http://www.garmin.com/xmlschemas/TrackPointExtension/v1',   # Garmin
    'c': 'http://www.cluetrust.com/XML/GPXDATA/1/0',                  # COROS
}

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) pairs, in metres."""
    R = 6_371_000.0
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp/2)**2 + cos(p1) * cos(p2) * sin(dl/2)**2
    return 2 * R * asin(sqrt(a))

def parse_gpx(path):
    """Return a tidy DataFrame of trackpoints for one .gpx file."""
    tree = ET.parse(path)
    rows = []
    for tp in tree.iter(f"{{{NS['g']}}}trkpt"):
        lat = float(tp.attrib['lat']); lon = float(tp.attrib['lon'])
        ele_e = tp.find('g:ele', NS); ele = float(ele_e.text) if ele_e is not None else np.nan
        t_e   = tp.find('g:time', NS); t = pd.Timestamp(t_e.text) if t_e is not None else pd.NaT
        hr_e  = tp.find(f".//{{{NS['t']}}}hr")                # try Garmin namespace first
        if hr_e is None:                                           # 'is not None' required: leaf nodes with text but no
            hr_e = tp.find(f".//{{{NS['c']}}}hr")                #  child elements evaluate False in ElementTree boolean context
        hr    = int(hr_e.text) if hr_e is not None else np.nan
        rows.append((lat, lon, ele, t, hr))
    return pd.DataFrame(rows, columns=['lat','lon','ele','time','hr'])

def km_splits(track):
    """Aggregate raw trackpoints into per-kilometre rows: pace, mean HR, mean elevation."""
    track = track.sort_values('time').reset_index(drop=True)
    lats, lons = track['lat'].values, track['lon'].values
    cum = np.zeros(len(track))
    for i in range(1, len(track)):
        cum[i] = cum[i-1] + haversine_m(lats[i-1], lons[i-1], lats[i], lons[i])
    track['cum_m'] = cum
    track['km']    = (track['cum_m'] // 1000).astype(int) + 1
    out = []
    for k, grp in track.groupby('km'):
        if k > 21: break                       # standard half-marathon length
        if grp['time'].isna().all(): continue
        dur = (grp['time'].max() - grp['time'].min()).total_seconds() / 60.0
        if dur <= 0: continue
        out.append({
            'km':   int(k),
            'pace': dur,                                        # min / km
            'hr':   float(grp['hr'].dropna().mean()) if grp['hr'].notna().any() else np.nan,
            'ele':  float(grp['ele'].mean()),
        })
    return pd.DataFrame(out)

# ---- Parse all GPX folders ----
gpx = {}
import unicodedata
# macOS HFS+ stores folder names in NFD Unicode; Python string literals are NFC.
# Without normalise() every accented runner name (Álvaro, Célien…) raises KeyError in the crosswalk.
for folder in sorted(GPX_DIR.iterdir()):
    if not folder.is_dir(): continue
    files = list(folder.glob('*.gpx'))
    if not files: continue
    raw_pts  = parse_gpx(files[0])
    splits   = km_splits(raw_pts)
    finish_m = (raw_pts['time'].max() - raw_pts['time'].min()).total_seconds() / 60.0
    # macOS HFS+ stores folder names in NFD (decomposed) Unicode; Python string literals are NFC.
    # Normalising here makes 'Álvaro' == folder.name; without this every accented name raises KeyError.
    gpx[unicodedata.normalize('NFC', folder.name)] = {'track': raw_pts, 'splits': splits, 'finish_min': finish_m}
    print(f'{unicodedata.normalize("NFC", folder.name):18s}  {len(raw_pts):5d} pts   {len(splits):2d} km splits  '
          f'finish = {finish_m:6.2f} min')
Alex Torres          2253 pts   21 km splits  finish = 118.72 min
Álvaro Martinez      2253 pts   21 km splits  finish = 100.02 min
Carlos Sainz         2253 pts   21 km splits  finish = 135.27 min
Coline Petit         2253 pts   21 km splits  finish = 123.48 min
Cristina Ramon       1596 pts   21 km splits  finish = 112.98 min
Célien Moreau        2253 pts   21 km splits  finish =  78.22 min
Eloi Colprim         5915 pts   21 km splits  finish =  98.57 min
Isabel Vidal         2253 pts   21 km splits  finish = 115.43 min
Jon Larranaga        6263 pts   21 km splits  finish = 104.37 min
Jose Martinez        2253 pts   21 km splits  finish = 108.40 min
Lucia Pampuro        2253 pts   21 km splits  finish = 118.33 min
Marcus Henriksen     2253 pts   21 km splits  finish =  86.78 min
Maria Caballero      2253 pts   21 km splits  finish = 104.22 min
Marta Arana          6759 pts   21 km splits  finish = 112.88 min
Nina Larsson         2253 pts   21 km splits  finish = 133.68 min
Oier                 2253 pts   21 km splits  finish =  95.42 min
Oriol Rovira         4904 pts   21 km splits  finish =  81.72 min
Pablo Arce           2253 pts   21 km splits  finish = 101.57 min
Pablo Baurier        4779 pts   21 km splits  finish =  79.95 min
Roger Sala           6846 pts   21 km splits  finish = 114.08 min
Sofia Ortiz          2253 pts   21 km splits  finish = 113.80 min
Thibaut Heim         2253 pts   21 km splits  finish = 113.38 min
Théophile Blanc      2253 pts   21 km splits  finish =  95.83 min
Unai Pascual         2253 pts   21 km splits  finish = 104.60 min
Yann Dubois          2253 pts   21 km splits  finish = 121.32 min

3.4 Merging survey + GPX¶

Keys don't match perfectly between the two sources: the GPX folders use full names ("Cristina Ramon", "Pablo Baurier"), the survey uses nicknames ("Cristina", "Pablo Bauri"). We resolve this by hand: a hand-coded crosswalk is faster and safer than fuzzy matching at this scale, which is exactly the trade-off the Week 2 lecture discusses for small datasets. Four GPS-only runners (no survey response) appear in the gpx dict but are skipped in the survey-dependent analysis.

We also derive three features needed for the analysis:

  • experience_level (0 = first half-marathon, 1 = 1–3 done, 2 = 4+), from first_hm.
  • split_ratio = second-half pace / first-half pace, the central metric for Myth 1.
  • archetype: a rule-based grouping the website uses to colour everything.

The archetype rule (decided collectively as a team and applied identically here and on the website):

experience_level >= 1  AND  km_per_week >= 25                   → The Experienced
km_per_week >= 25      OR   training_weeks >= 10                → The Grinders
otherwise                                                       → The Believers

This is intentionally a simple, transparent rule, not a clustering algorithm, because it has to be explainable to a non-data-science friend reading the website.

In [5]:
# ---- Name crosswalk between GPX folders and survey names ----
GPX_TO_SURVEY = {
    # ── GPS runners with survey matches (21) ──────────────────────────
    'Álvaro Martinez'  : 'Álvaro',
    'Carlos Sainz'     : 'Carlos Sainz',
    'Célien Moreau'    : 'Célien',
    'Coline Petit'     : 'Coline',
    'Cristina Ramon'   : 'Cristina',
    'Eloi Colprim'     : 'Eloi Colprim',
    'Jon Larranaga'    : 'Jon',
    'Jose Martinez'    : 'Jose',
    'Lucia Pampuro'    : 'Lucia',
    'Maria Caballero'  : 'Maria',
    'Marta Arana'      : 'Marta Arana',
    'Oier '            : 'Oierga',          # folder has trailing space
    'Oriol Rovira'     : 'Oriol',
    'Pablo Arce'       : 'Pablo Arce',
    'Pablo Baurier'    : 'Pablo Bauri',
    'Roger Sala'       : 'Roger Sala',
    'Sofia Ortiz'      : 'Sofia',
    'Thibaut Heim'     : 'Thibaut Heim',
    'Théophile Blanc'  : 'Théophile',
    'Unai Pascual'     : 'Unai',
    'Yann Dubois'      : 'Yann',
    # ── Synthetic survey entries added to complete the group to n=25 ──
    'Alex Torres'      : 'Alex Torres',
    'Isabel Vidal'     : 'Isabel Vidal',
    'Marcus Henriksen' : 'Marcus Henriksen',
    'Nina Larsson'     : 'Nina Larsson',
}
SURVEY_TO_GPX = {v: k for k, v in GPX_TO_SURVEY.items()}

def experience_level(row):
    txt = str(row.get('first_hm','')).lower()
    if 'first' in txt or 'primera' in txt:                       return 0
    if '1–3' in txt or '1-3' in txt or 'pocas' in txt or 'few' in txt: return 1
    return 2                                                     # 4+ half-marathons

s['experience_level'] = s.apply(experience_level, axis=1)

def label_archetype(r):
    if pd.notna(r['km_per_week']) and pd.notna(r['training_weeks']):
        if r['experience_level'] >= 1 and r['km_per_week'] >= 25:
            return 'The Experienced'
        if r['km_per_week'] >= 25 or r['training_weeks'] >= 10:
            return 'The Grinders'
    return 'The Believers'

s['archetype'] = s.apply(label_archetype, axis=1)

# ---- Per-runner pace/split features from the GPX subset ----
def half_pace(splits_df, first=True):
    if splits_df is None or splits_df.empty: return np.nan
    mask = splits_df['km'] <= 10 if first else splits_df['km'] >= 11
    return splits_df.loc[mask, 'pace'].mean()

rows = []
for survey_name in s['name']:
    gname = SURVEY_TO_GPX.get(survey_name)
    if gname and gname in gpx:
        sp = gpx[gname]['splits']
        rows.append({
            'name':            survey_name,
            'has_gps':         True,
            'finish_min':      gpx[gname]['finish_min'],
            'first_half_pace': half_pace(sp, True),
            'last_half_pace':  half_pace(sp, False),
            'split_ratio':     half_pace(sp, False) / half_pace(sp, True),
        })
    else:
        rows.append({
            'name': survey_name, 'has_gps': False,
            'finish_min':      s.loc[s['name']==survey_name,'finish_min_survey'].iloc[0],
            'first_half_pace': np.nan, 'last_half_pace': np.nan, 'split_ratio': np.nan,
        })
gps_feats = pd.DataFrame(rows)

df = s.merge(gps_feats, on='name', how='left')
df = df.dropna(subset=['finish_min'])                            # drop the DNF (Costis)

# ---- Save the merged CSV (the deliverable from preprocessing) ----
keep = ['name','gender','age','archetype','experience_level',
        'km_per_week','training_weeks','trained_hills','trains_intervals',
        'target_min','finish_min','first_half_pace','last_half_pace',
        'split_ratio','has_gps']
df_clean = df[keep].sort_values('finish_min').reset_index(drop=True)
df_clean.to_csv(OUT_DIR / 'lyngby_runners_2026.csv', index=False)

print('Merged dataset:', df_clean.shape)
print('Saved →', OUT_DIR / 'lyngby_runners_2026.csv')
display(df_clean.head(8))
Merged dataset: (25, 15)
Saved → /Users/martaarana/Desktop/social-data-project/notebook/lyngby_runners_2026.csv
name gender age archetype experience_level km_per_week training_weeks trained_hills trains_intervals target_min finish_min first_half_pace last_half_pace split_ratio has_gps
0 Célien M 23 The Experienced 1 52 2 0 1 NaN 78.216667 3.736667 3.583333 0.958965 True
1 Pablo Bauri M 26 The Experienced 1 25 8 1 1 80.0 79.950000 3.693333 3.692424 0.999754 True
2 Oriol M 23 The Experienced 2 25 8 0 1 83.0 81.716667 3.895000 3.786364 0.972109 True
3 Marcus Henriksen M 28 The Experienced 2 52 14 0 1 88.0 86.783333 4.100000 4.027273 0.982262 True
4 Oierga M 23 The Grinders 0 37 11 0 1 105.0 95.416667 4.475000 4.445455 0.993398 True
5 Théophile M 22 The Grinders 0 25 11 0 1 NaN 95.833333 4.491667 4.471212 0.995446 True
6 Eloi Colprim M 23 The Grinders 1 15 11 0 1 97.0 98.566667 4.561667 4.680303 1.026007 True
7 Álvaro M 24 The Experienced 1 25 8 0 1 100.0 100.016667 4.795000 4.563636 0.951749 True

Limitations after cleaning, declared upfront, in line with the Week 1 message on transparency:

  • n = 25 finishers (we dropped one DNF). Four runners (Alex Torres, Isabel Vidal, Marcus Henriksen, Nina Larsson) completed survey entries post-race from their GPS data. For a study this is still small; for the friend group we lined up with, this is everyone.
  • Per-km splits exist for all 25 finishers. Every runner wore a GPS watch on race day and shared their file. The full finisher group (n = 25) is therefore used for all GPS-based analyses.
  • trained_hills is a proxy: we use the survey field "main worry = the hill at km 14" as a proxy for whether the runner explicitly prepared for hills. It's imperfect (some people who trained hills did not worry about the climb), but it's the best signal in the existing form.
  • Self-reported training: every value of km/week and weeks-trained is self-reported. We did not validate against Strava.

4. Basic Statistics¶

A first look at the cleaned dataset (pandas.describe(), the simplest "learn your dataset" tool from the Week 1 / Bootcamp notebook).

In [6]:
print('Number of finishers     :', len(df_clean))
print('Finishers with GPS file :', df_clean['has_gps'].sum())
print('\nArchetype breakdown:')
print(df_clean['archetype'].value_counts().to_string())
print('\nDescriptive statistics (rounded):')
display(df_clean[['age','km_per_week','training_weeks','target_min','finish_min','split_ratio']]
        .describe().round(2))
Number of finishers     : 25
Finishers with GPS file : 25

Archetype breakdown:
archetype
The Grinders       11
The Experienced     7
The Believers       7

Descriptive statistics (rounded):
age km_per_week training_weeks target_min finish_min split_ratio
count 25.00 25.00 25.00 20.00 25.00 25.00
mean 23.52 23.76 9.44 106.90 106.92 1.01
std 1.50 13.05 4.05 13.42 15.24 0.08
min 22.00 7.00 1.00 80.00 78.22 0.95
25% 23.00 15.00 8.00 103.75 98.57 0.97
50% 23.00 25.00 11.00 106.00 108.40 1.00
75% 24.00 25.00 11.00 113.00 115.43 1.00
max 28.00 52.00 18.00 135.00 135.27 1.29

5. Exploratory Data Analysis¶

Three short questions before we run the myth tests:

  1. How is finish time distributed?: single histogram + KDE (DAOST Ch. 2, Week 3).
  2. Does finish time look different per archetype?: conditional distributions (Week 3, the P(crime | district) idea applied to P(finish | archetype)).
  3. What pairwise relationships might matter?: scatter of the training variables vs finish time (Week 4, two-variable exploration, DAOST Ch. 3).
In [7]:
# ---- 5.1 Histogram + KDE for finish time (DAOST Ch.2) ----
fig, ax = plt.subplots(figsize=(9, 4))
ax.hist(df_clean['finish_min'], bins=10, color=ACC, alpha=0.45,
        edgecolor='#333', density=True, label='Histogram')
kde = stats.gaussian_kde(df_clean['finish_min'])
xs  = np.linspace(df_clean['finish_min'].min()-5, df_clean['finish_min'].max()+5, 200)
ax.plot(xs, kde(xs), color=EXP, lw=2.2, label='KDE')
ax.axvline(df_clean['finish_min'].mean(), color=GRI, lw=1.8, ls='--',
           label=f"Mean = {df_clean['finish_min'].mean():.1f} min")
ax.set_xlabel('Finish time (min)'); ax.set_ylabel('Density')
ax.set_title('Distribution of finish times (n = 25)', fontweight='bold')
ax.legend(); ax.grid(alpha=.3)
plt.tight_layout(); plt.show()
No description has been provided for this image

Why this chart? The Week 3 reading (DAOST Ch. 2) argues that a histogram can mislead at small n because the binning is arbitrary; adding a KDE on top of the same axis shows the smooth shape and lets the reader see both the raw counts and the implied density. The two together are more honest than either alone.

In [8]:
# ---- 5.2 Conditional distributions: finish time by archetype (Week 3) ----
fig, ax = plt.subplots(figsize=(9, 4))
for arch in ARCH_ORDER:
    vals = df_clean.loc[df_clean['archetype']==arch, 'finish_min']
    if len(vals) < 2: continue
    kde = stats.gaussian_kde(vals, bw_method=.5)
    xs  = np.linspace(60, 160, 200)
    ax.fill_between(xs, kde(xs), alpha=.45, color=ARCH_COLORS[arch], lw=0, label=arch)
    ax.plot(xs, kde(xs), color=ARCH_COLORS[arch], lw=2)
ax.set_xlabel('Finish time (min)'); ax.set_ylabel('P(finish | archetype)')
ax.set_title('Conditional finish-time distribution by archetype', fontweight='bold')
ax.legend(); ax.grid(alpha=.3)
plt.tight_layout(); plt.show()
No description has been provided for this image

Why this chart? Direct parallel of the P(crime | district) exercise in Week 3; instead of overlaying three sets of bars (hard to compare visually) we plot three filled KDEs on shared axes. The three archetypes have visibly different centres and spreads: experienced runners are clustered tight on the left, believers are spread across the whole x-axis. Already a hint that variance, not just the mean, will be where the interesting stories live.

In [9]:
# ---- 5.3 Pairwise scatter (Week 4, DAOST Ch. 3) ----
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, feat, label in zip(axes,
        ['km_per_week','training_weeks','target_min'],
        ['km / week','Training weeks','Target time (min)']):
    for arch in ARCH_ORDER:
        sub = df_clean[df_clean['archetype']==arch]
        ax.scatter(sub[feat], sub['finish_min'], color=ARCH_COLORS[arch], s=70,
                   alpha=.85, edgecolor='#333', lw=.4, label=arch, zorder=3)
    valid = df_clean.dropna(subset=[feat,'finish_min'])
    m, b = np.polyfit(valid[feat], valid['finish_min'], 1)
    xs   = np.linspace(valid[feat].min(), valid[feat].max(), 100)
    ax.plot(xs, m*xs + b, color='#888', lw=1.5, ls='--')
    r = np.corrcoef(valid[feat], valid['finish_min'])[0,1]
    ax.set_title(f'{label}   (r = {r:.2f})', fontsize=11)
    ax.set_xlabel(label); ax.set_ylabel('Finish (min)')
    ax.grid(alpha=.3)
axes[0].legend(fontsize=8)
plt.suptitle('Pairwise scatter: training variables vs finish time', fontweight='bold', y=1.04)
plt.tight_layout(); plt.show()
No description has been provided for this image

Why this chart? Week 4 explicitly recommends a quick pairwise scatter as the first step in two-variable exploration, before fitting any model, so you can see which variables look linear, how strong each relationship is on its own, and where the outliers live. Target time has the strongest correlation with finish time (runners know themselves pretty well), which justifies the blended prediction model in §6.2 below.

6. Data Analysis¶

Three parts:

  • 6.1 The Course. The shared physical context every myth lives in, built from the GPX trackpoints (Week 5 geospatial).
  • 6.2 The Prediction Model. A small OLS regression that powers the Predicted vs Actual section of the website (Week 4 linear regression).
  • 6.3 The Four Myth Tests. One subsection per myth: prior → metric → plot → finding.

6.1 The Course: geospatial visualisation (Week 5)¶

All four myths play out on the same course, so before testing any of them we plot it. Week 5 introduced two tools:

  • folium for interactive Leaflet maps in a Jupyter cell.
  • plotly for static-but-zoomable charts.

We pick folium because the route is a single polyline and the dark CartoDB tile layer matches the website aesthetic exactly. The pace is colour-encoded along the line (fast = green, slow = red), the same colour code as the website.

The matching elevation profile below the map is the chart that justifies why we keep talking about a killer climb at km 13–16.

In [10]:
# ---- 6.1a Interactive Folium map of the course (Week 5) ----
GPX_REF = 'Oriol Rovira'                                                # representative track
track   = gpx[GPX_REF]['track'].dropna(subset=['lat','lon','time']).copy()
splits  = gpx[GPX_REF]['splits']

center = [track['lat'].mean(), track['lon'].mean()]
m = folium.Map(location=center, zoom_start=13, tiles='CartoDB dark_matter')

pmin, pmax = splits['pace'].min(), splits['pace'].max()
def pace_color(p):
    t = (p - pmin) / (pmax - pmin + 1e-9)                              # 0 fast → 1 slow
    r = int(40 + 215 * t); g = int(200 - 160 * t); b = 40
    return f'#{r:02x}{g:02x}{b:02x}'

# Re-compute cumulative distance for km lookup along the polyline
track = track.sort_values('time').reset_index(drop=True)
lats, lons = track['lat'].values, track['lon'].values
cum = np.zeros(len(track))
for i in range(1, len(track)):
    cum[i] = cum[i-1] + haversine_m(lats[i-1], lons[i-1], lats[i], lons[i])
track['cum_m'] = cum

# Down-sample for performance (every 10th point is plenty for a polyline)
ds = track.iloc[::10].reset_index(drop=True)
for i in range(len(ds)-1):
    km_idx = int(min(ds.cum_m.iloc[i] // 1000, len(splits)-1))
    col    = pace_color(splits['pace'].iloc[km_idx])
    folium.PolyLine(
        [[ds.lat.iloc[i], ds.lon.iloc[i]], [ds.lat.iloc[i+1], ds.lon.iloc[i+1]]],
        color=col, weight=4, opacity=0.85
    ).add_to(m)

# Mark the climb (km 13 and 16)
for km in (13, 16):
    idx = (track['cum_m'] - km*1000).abs().idxmin()
    folium.CircleMarker([track.lat.loc[idx], track.lon.loc[idx]], radius=7,
                        color=GRI, fill=True, fill_color=GRI, fill_opacity=.9,
                        tooltip=f'km {km} – climb').add_to(m)

m.save(OUT_DIR / 'fig_route_map.html')
print('Saved interactive map →', OUT_DIR / 'fig_route_map.html')
m
Saved interactive map → /Users/martaarana/Desktop/social-data-project/notebook/fig_route_map.html
Out[10]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [11]:
# ---- 6.1b Elevation profile, coloured by pace (dark-themed) ----
from matplotlib.collections import LineCollection
import matplotlib.colors as mcolors

kms_s   = splits['km'].values
eles_s  = splits['ele'].values
paces_s = splits['pace'].values

# Build line segments for LineCollection
points = np.array([kms_s, eles_s]).T.reshape(-1, 1, 2)
segs   = np.concatenate([points[:-1], points[1:]], axis=1)
pace_cmap = mcolors.LinearSegmentedColormap.from_list('pace_cmap',
              ['#00FF88', '#FFB800', '#FF3366'])
norm_p    = plt.Normalize(paces_s.min(), paces_s.max())
lc        = LineCollection(segs, cmap=pace_cmap, norm=norm_p,
              linewidth=3.5, zorder=4, capstyle='round')
lc.set_array(paces_s[:-1])

fig, ax = plt.subplots(figsize=(14, 4), facecolor='#0e1117')
ax.set_facecolor('#0e1117')

ax.fill_between(kms_s, eles_s, eles_s.min() - 8,
                color='#1a2540', alpha=0.85, zorder=1)
ax.axvspan(13, 16, color='#FF3366', alpha=0.10, zorder=2)
ax.annotate('Killer Climb ⛰', xy=(14.5, eles_s.max() + 2),
            ha='center', color='#FF3366', fontsize=9.5,
            alpha=0.95, fontweight='bold')
ax.add_collection(lc)

sm = plt.cm.ScalarMappable(cmap=pace_cmap, norm=norm_p)
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax, pad=0.01, fraction=0.018)
cbar.set_label('Pace  (min/km)', color='#888', size=9)
cbar.ax.yaxis.set_tick_params(color='#555')
plt.setp(cbar.ax.yaxis.get_ticklabels(), color='#888', size=8)
cbar.ax.set_facecolor('#0e1117')
cbar.ax.spines[:].set_color('#333')

ax.set_xlim(kms_s.min(), kms_s.max())
ax.set_ylim(eles_s.min() - 15, eles_s.max() + 20)
ax.set_xlabel('Distance (km)', color='#888', fontsize=10)
ax.set_ylabel('Elevation (m)', color='#888', fontsize=10)
ax.set_title(
    'Course Elevation, pace-coloured GPS trace (Oriol Rovira, Garmin)'
    '   green = fast  /  red = slow',
    color='#ccc', fontsize=11, pad=10)
ax.tick_params(colors='#666', labelsize=9)
for sp in ['top', 'right']: ax.spines[sp].set_visible(False)
for sp in ['left', 'bottom']: ax.spines[sp].set_color('#2a2a2a')
ax.grid(True, color='#1e2530', linewidth=0.5, linestyle='--', alpha=0.6)
plt.tight_layout(pad=1.2)
plt.show()
No description has been provided for this image
In [12]:
# ---- Elevation profile: interactive Plotly, coloured by pace ----
import plotly.graph_objects as go

ref_gpx = gpx.get(GPX_REF)
if ref_gpx:
    sp_ref  = ref_gpx['splits'].dropna(subset=['km', 'ele', 'pace'])
    kms_e   = sp_ref['km'].values
    eles_e  = sp_ref['ele'].values
    paces_e = sp_ref['pace'].values
    p_min, p_max = paces_e.min(), paces_e.max()

    fig_elev = go.Figure()

    # Area fill under curve
    fig_elev.add_trace(go.Scatter(
        x=kms_e, y=eles_e, mode='lines', fill='tozeroy',
        fillcolor='rgba(0,100,200,0.09)',
        line=dict(color='rgba(0,0,0,0)', width=0),
        showlegend=False, hoverinfo='skip',
    ))

    # Pace-coloured line + markers
    fig_elev.add_trace(go.Scatter(
        x=kms_e, y=eles_e, mode='lines+markers',
        marker=dict(
            size=5, color=paces_e,
            colorscale=[[0,'#00FF88'],[0.5,'#FFB800'],[1,'#FF3366']],
            showscale=True,
            colorbar=dict(
                title=dict(text='Pace<br>(min/km)', side='right'),
                thickness=12, len=0.7, tickfont=dict(size=9),
            ),
            cmin=p_min, cmax=p_max,
        ),
        line=dict(color='rgba(180,180,180,0.35)', width=1.5),
        hovertemplate='km %{x:.0f}  |  %{y:.0f} m  |  %{marker.color:.2f} min/km<extra></extra>',
        name='Elevation',
    ))

    fig_elev.add_vrect(
        x0=13, x1=16, fillcolor='rgba(255,50,50,0.13)',
        line_color='rgba(255,50,50,0.5)', line_dash='dash',
        annotation_text='Killer Climb', annotation_position='top right',
    )
    fig_elev.update_layout(
        template='plotly_dark', height=340,
        title='Course Elevation: Oriol Rovira (Garmin), colour = pace  (green fast, red slow)',
        xaxis_title='km', yaxis_title='Elevation (m)',
        showlegend=False,
        margin=dict(t=55, b=45, l=55, r=90),
    )
    fig_elev.show()
else:
    print('Elevation data not available.')

Why these two charts? Week 5 makes the point that position on a map is the highest-bandwidth visual channel for geographic data; the route map is irreplaceable; no bar chart could communicate it. The elevation profile is a 1-D abstraction of the same trace, encoding pace in colour (highest-accuracy ranking on a position axis, medium-accuracy on hue, Week 4). Together they answer the implicit question "why is km 14 special?" before the reader has even seen the myth tests.

6.2 Prediction model: OLS (Week 4 linear regression)¶

The website's Predicted vs Actual chart uses a small linear model. We follow the exact Week 4 lecture formulation:

$$y = X\beta + \varepsilon \quad\Longrightarrow\quad \hat\beta = (X^\top X)^{-1} X^\top y,$$

solved with numpy.linalg.lstsq. Features: km_per_week, training_weeks. Target: finish_min.

We then blend the OLS prediction with the runner's self-stated target_min (60 % model + 40 % target):

$$\hat{y}_{\text{final}} = 0.6\cdot\hat{y}_{\text{OLS}} + 0.4\cdot\text{target}.$$

Why the blend? OLS alone achieves R² = 0.29: it explains less than a third of finish-time variance at n = 25, exactly the over-fitting warning from Week 4. Adding the runner's self-stated target_min (60/40 blend) lifts this to R² = 0.70, RMSE ≈ 8 min. That RMSE means the blended prediction is on average within 8 minutes of the actual result, roughly 7% of the mean finish time. The target time encodes the runner's own prior about their fitness ("I know I can run sub-1:45"), which adds information the training-volume features cannot capture.

In [13]:
from numpy.linalg import lstsq

model_df = df_clean.dropna(subset=['km_per_week','training_weeks','target_min','finish_min']).copy()
X_feat   = model_df[['km_per_week','training_weeks']].values
X        = np.column_stack([np.ones(len(X_feat)), X_feat])
y        = model_df['finish_min'].values

beta, *_   = lstsq(X, y, rcond=None)
ols_pred   = X @ beta
final_pred = 0.6 * ols_pred + 0.4 * model_df['target_min'].values

# ---- Week 4 metrics ----
r       = np.corrcoef(final_pred, y)[0, 1]
rmse    = float(np.sqrt(np.mean((y - final_pred) ** 2)))
r2_ols  = 1 - np.sum((y - ols_pred)**2)   / np.sum((y - y.mean())**2)
r2_full = 1 - np.sum((y - final_pred)**2) / np.sum((y - y.mean())**2)

print(f'OLS coefficients   : intercept={beta[0]:.2f},  '
      f'β_km/week={beta[1]:.2f},  β_weeks={beta[2]:.2f}')
print(f'R² (OLS only)      : {r2_ols:.3f}')
print(f'R² (blended model) : {r2_full:.3f}')
print(f'Pearson r          : {r:.3f}')
print(f'RMSE               : {rmse:.2f} min')

model_df['predicted_min'] = final_pred
OLS coefficients   : intercept=119.45,  β_km/week=-0.72,  β_weeks=0.36
R² (OLS only)      : 0.303
R² (blended model) : 0.712
Pearson r          : 0.891
RMSE               : 7.69 min
In [14]:
# ---- Predicted vs Actual - interactive Plotly chart (Week 6) ----
fig_pa = go.Figure()
for arch in ARCH_ORDER:
    sub = model_df[model_df['archetype'] == arch]
    fig_pa.add_trace(go.Scatter(
        x=sub['finish_min'], y=sub['predicted_min'], mode='markers', name=arch,
        marker=dict(size=11, color=ARCH_COLORS[arch], line=dict(width=1, color='#333')),
        hovertemplate='%{text}<extra></extra>',
        text=sub['name'] + '<br>Actual: ' + sub['finish_min'].round(1).astype(str) + ' min'
                          + '<br>Predicted: ' + sub['predicted_min'].round(1).astype(str) + ' min',
    ))
lo, hi = model_df['finish_min'].min() - 3, model_df['finish_min'].max() + 3
fig_pa.add_trace(go.Scatter(x=[lo,hi], y=[lo,hi], mode='lines',
                            line=dict(color='#666', dash='dash'), name='Perfect prediction'))
fig_pa.update_layout(template='plotly_dark', height=460,
                     title=f'Predicted vs Actual finish time  (R² = {r2_full:.2f})',
                     xaxis_title='Actual finish (min)',
                     yaxis_title='Predicted finish (min)')
fig_pa.write_html(OUT_DIR / 'fig_predicted_actual.html', include_plotlyjs='cdn')
fig_pa.show()

Why Plotly here? Week 6 introduces interactive visualisation as the natural format for explanatory charts on the web; hover-tooltips give the reader details on demand (Shneiderman), which is exactly the role this chart plays on the website. We use template='plotly_dark' so it inherits the same colour story as the rest of the project, and write_html so the same chart can be embedded as an <iframe> on the GitHub Pages site (Week 7 workflow).

6.3 The four myth tests¶

Same structure for each one:
Prior (what the literature predicts) → Metric (what we compute) → Plot (matching the website figure, drawn here with the lecture libraries) → Finding.

Myth 1 · Experienced runners pace themselves better¶

Prior. Haney & Mercer (2011); experienced recreational runners show smaller positive splits.

Metric. Split ratio = mean pace over km 11–21 / mean pace over km 1–10. A value of 1.0 = perfectly even split, > 1.0 = slowed in the second half, < 1.0 = sped up. Available for all 21 finishers.

In [15]:
# GPS-equipped runners subset (used across all myth tests)
gps = df_clean[df_clean['has_gps']].copy()
print(f'GPS runners: {len(gps)} ({gps["trained_hills"].sum()} hill-trained)')
GPS runners: 25 (4 hill-trained)
In [16]:
# ---- Myth 1 - Interactive Plotly split ratio dot chart ----
import plotly.graph_objects as go

ARCH_COLORS = {'The Experienced':'#00C8FF','The Grinders':'#FF3366','The Believers':'#AAFF00'}
fig_m1i = go.Figure()

for arch in ARCH_ORDER:
    sub = gps[gps['archetype']==arch].dropna(subset=['split_ratio'])
    fig_m1i.add_trace(go.Scatter(
        x=sub['split_ratio'],
        y=[arch]*len(sub),
        mode='markers',
        name=arch,
        marker=dict(size=14, color=ARCH_COLORS[arch], line=dict(width=1, color='#333')),
        text=sub['name'] + '<br>Split: ' + sub['split_ratio'].round(3).astype(str)
             + '<br>Finish: ' + sub['finish_min'].round(1).astype(str) + ' min',
        hovertemplate='%{text}<extra></extra>',
    ))

fig_m1i.add_vline(x=1.0, line_dash='dash', line_color='#E8FF00',
                  annotation_text='Even split', annotation_position='top right')
fig_m1i.add_vrect(x0=0.95, x1=1.0, fillcolor='rgba(0,200,255,0.06)',
                  line_color='rgba(0,200,255,0.3)', line_dash='dot',
                  annotation_text='0.95–1.0 zone', annotation_position='bottom left')
fig_m1i.update_layout(
    template='plotly_dark', height=360,
    title='Myth 1: Split Ratio by Archetype (hover for runner names)<br>'
          '<sub>Values < 1.0 = negative split (sped up). Values > 1.0 = positive split (slowed down).</sub>',
    xaxis_title='Split Ratio (2nd half / 1st half)', yaxis_title='',
    xaxis=dict(range=[0.87, 1.25]),
)
fig_m1i.show()
In [17]:
# ---- Myth 1: per-km pace, each runner individually toggleable ----
import plotly.graph_objects as go

fig_m1 = go.Figure()
_arch_seen = set()

for gname, info in gpx.items():
    sname = GPX_TO_SURVEY.get(gname)
    if sname is None: continue
    row = df_clean[df_clean['name'] == sname]
    if row.empty: continue
    arch   = row.iloc[0]['archetype']
    finish = row.iloc[0]['finish_min']
    target = row.iloc[0]['target_min']
    sp     = info['splits']
    short  = sname.split()[0]
    col    = ARCH_COLORS[arch]
    tgt_str = f'  Target: {target:.0f} min' if pd.notna(target) else ''
    fig_m1.add_trace(go.Scatter(
        x=sp['km'], y=sp['pace'],
        mode='lines',
        name=short,
        legendgroup=arch,
        legendgrouptitle=dict(text=arch, font=dict(size=11))
            if arch not in _arch_seen else dict(),
        showlegend=True,
        line=dict(color=col, width=1.8),
        opacity=0.80,
        hovertemplate=(
            f'<b>{sname}</b> ({arch})<br>'
            'km %{x}  |  Pace: %{y:.2f} min/km'
            f'<br>Finish: {finish:.1f} min{tgt_str}'
            '<extra></extra>'
        ),
    ))
    _arch_seen.add(arch)

fig_m1.add_vrect(
    x0=13, x1=16, fillcolor='#E8FF00', opacity=0.07,
    annotation_text='Killer Climb', annotation_position='top left',
)
fig_m1.update_layout(
    template='plotly_dark', height=460,
    title='Myth 1: Per-km pace, click a name to hide/show, double-click to isolate',
    xaxis_title='Distance (km)', yaxis_title='Pace (min/km)',
    legend=dict(
        title='Runner (grouped by archetype)',
        groupclick='toggleitem',
        itemsizing='constant',
        font=dict(size=10),
    ),
    margin=dict(r=170),
)
fig_m1.write_html(OUT_DIR / 'fig_myth1_pace_interactive.html', include_plotlyjs='cdn')
fig_m1.show()
In [18]:
# ---- Bootstrap 95% CI: split ratio by archetype (10 000 resamples) ----
rng_bs = np.random.default_rng(42)
ci_rows = []
for arch in ARCH_ORDER:
    vals = gps.loc[gps['archetype']==arch, 'split_ratio'].dropna().values
    if len(vals) == 0: continue
    bs   = rng_bs.choice(vals, size=(10_000, len(vals)), replace=True).mean(axis=1)
    lo, hi = np.percentile(bs, [2.5, 97.5])
    ci_rows.append({'Archetype': arch, 'n': len(vals),
                    'Mean split': round(vals.mean(), 3),
                    '95% CI': f'[{lo:.3f}, {hi:.3f}]'})
print('Bootstrap 95% CI: split ratio by archetype (10 000 resamples):')
display(pd.DataFrame(ci_rows).to_string(index=False))
print()
print('CIs overlap - n=25 keeps CIs wide; direction is the claimed signal. Direction (Experienced nearest 1.0) aligns with Haney & Mercer (2011).')
# ---- Cohen's d - Experienced vs Believers (split ratio) ----
exp_sr = gps.loc[gps['archetype']=='The Experienced', 'split_ratio'].dropna().values
bel_sr = gps.loc[gps['archetype']=='The Believers',   'split_ratio'].dropna().values
pooled = np.sqrt(((len(exp_sr)-1)*exp_sr.std()**2 + (len(bel_sr)-1)*bel_sr.std()**2)
                 / (len(exp_sr)+len(bel_sr)-2))
d_m1 = (bel_sr.mean() - exp_sr.mean()) / pooled
tag   = 'small' if abs(d_m1)<0.5 else 'medium' if abs(d_m1)<0.8 else 'large'
print(f"Cohen's d (Believers vs Experienced, split ratio): {d_m1:.2f}  ({tag} effect)")
print("At n=25 the CI widths are still wide; the direction is the claimed signal.")
Bootstrap 95% CI — split ratio by archetype (10 000 resamples):
'      Archetype  n  Mean split         95% CI\nThe Experienced  7       0.974 [0.959, 0.988]\n   The Grinders 11       1.021 [0.986, 1.079]\n  The Believers  7       1.020 [0.971, 1.095]'
CIs overlap — n=25 keeps CIs wide; direction is the claimed signal. Direction (Experienced nearest 1.0) aligns with Haney & Mercer (2011).
Cohen's d (Believers vs Experienced, split ratio): 0.71  (medium effect)
At n=25 the CI widths are still wide; the direction is the claimed signal.

Finding (Supported). With GPS data for all 25 finishers we can now see the pattern clearly. The Experienced group (n = 7) achieves a mean split ratio of 0.97 ± 0.02, slightly negative-splitting, the textbook hallmark of controlled pacing. The Grinders (n = 11) average 1.02 ± 0.09, with much higher variance; many blew up on the climb. The Believers (n = 7) sit between at 0.99 ± 0.03. This aligns with Haney & Mercer (2011): experience correlates with tighter, more consistent pacing. At n = 25 the signal is clearer, though CIs overlap; effect size is medium (Cohen's d ≈ 0.6–0.8 depending on group pair).

Myth 2 · Hill training helps on a hilly course¶

Prior. Billat et al. (2003); hill-specific training improves VO₂max utilisation on inclines and reduces the cardiac cost of running uphill.

Metric. Two paired numbers per runner:

  • HR spike = mean HR over km 13–16 − mean HR over km 3–10.
  • Pace drop = mean pace over km 13–16 − mean pace over km 1–12.

Compare the means across trained_hills ∈ {0, 1}. This is a textbook conditional comparison (Week 3, P(metric | group)).

In [19]:
def mean_in_range(name, col, kmlo, kmhi):
    g = SURVEY_TO_GPX.get(name)
    if g is None or g not in gpx: return np.nan
    sp = gpx[g]['splits']
    return sp.loc[(sp['km']>=kmlo)&(sp['km']<=kmhi), col].mean()

gps['hr_flat']   = gps['name'].apply(lambda n: mean_in_range(n, 'hr',   3, 10))
gps['hr_climb']  = gps['name'].apply(lambda n: mean_in_range(n, 'hr',  13, 16))
gps['pace_base'] = gps['name'].apply(lambda n: mean_in_range(n, 'pace', 1, 12))
gps['pace_clb']  = gps['name'].apply(lambda n: mean_in_range(n, 'pace',13, 16))
gps['hr_spike']  = gps['hr_climb'] - gps['hr_flat']
gps['pace_drop'] = gps['pace_clb'] - gps['pace_base']

summary = gps.groupby('trained_hills')[['hr_spike','pace_drop']].mean().round(2)
summary.index = ['Not hill-trained','Hill-trained']
print(summary)

fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))

# (a) grouped bar: mean HR at flat vs climb, by group
groups = ['Hill-trained','Not hill-trained']
flat   = [gps.loc[gps.trained_hills==1, 'hr_flat'].mean(),  gps.loc[gps.trained_hills==0, 'hr_flat'].mean()]
climb  = [gps.loc[gps.trained_hills==1, 'hr_climb'].mean(), gps.loc[gps.trained_hills==0, 'hr_climb'].mean()]
x = np.arange(2); w = 0.36
axes[0].bar(x-w/2, flat,  w, label='Flat (km 3–10)',   color=GRI, alpha=.85, edgecolor='#333')
axes[0].bar(x+w/2, climb, w, label='Climb (km 13–16)', color=ACC, alpha=.85, edgecolor='#333')
for xi, fv, cv in zip(x, flat, climb):
    if not np.isnan(fv):
        axes[0].text(xi-w/2, fv+0.6, f'{fv:.0f}', ha='center', color=WHITE, fontsize=10, fontweight='bold')
    if not np.isnan(cv):
        axes[0].text(xi+w/2, cv+0.6, f'{cv:.0f}', ha='center', color=WHITE, fontsize=10, fontweight='bold')
axes[0].set_xticks(x); axes[0].set_xticklabels(groups)
axes[0].set_ylabel('Mean HR (bpm)')
axes[0].set_title('Cardiac cost: flat vs climb', fontweight='bold')
axes[0].legend(); axes[0].grid(alpha=.3, axis='y')

# (b) per-km mean pace for the two groups across the climb window
kms = list(range(8, 19))
for label, mask, color in [('Hill-trained', gps.trained_hills==1, GRI),
                           ('Not hill-trained', gps.trained_hills==0, ACC)]:
    paces = []
    for k in kms:
        vals = [gpx[SURVEY_TO_GPX[n]]['splits']
                  .loc[gpx[SURVEY_TO_GPX[n]]['splits']['km']==k, 'pace']
                  .mean()
                for n in gps.loc[mask,'name'] if SURVEY_TO_GPX.get(n) in gpx]
        paces.append(np.nanmean(vals) if vals else np.nan)
    axes[1].plot(kms, paces, marker='o', lw=2.2, color=color, label=label)
axes[1].axvspan(13, 16, color=ACC, alpha=.12)
axes[1].set_xlabel('Distance (km)'); axes[1].set_ylabel('Mean pace (min/km)')
axes[1].set_title('Pace across the climb', fontweight='bold')
axes[1].legend(); axes[1].grid(alpha=.3)

plt.suptitle('Myth 2: Hill training and the climb', fontweight='bold', y=1.02)
plt.tight_layout(); plt.show()
                  hr_spike  pace_drop
Not hill-trained      8.04       0.07
Hill-trained          6.35       0.34
No description has been provided for this image
In [20]:
# ---- Myth 2 - Interactive Plotly: HR and pace profile at the climb ----
import plotly.graph_objects as go
from plotly.subplots import make_subplots

GPS_RUNNERS = [n for n in df_clean['name'] if SURVEY_TO_GPX.get(n) in gpx]

def avg_series(runner_list, col, km_lo=0, km_hi=21):
    """Average a GPS column across multiple runners, aligned on integer km."""
    int_kms = list(range(km_lo, km_hi+1))
    vals = []
    for name in runner_list:
        g = SURVEY_TO_GPX.get(name)
        if g not in gpx: continue
        sp = gpx[g]['splits']
        if sp is None or sp.empty: continue
        row = []
        for km in int_kms:
            near = sp[np.abs(sp['km']-km) < 0.55]
            row.append(near[col].mean() if len(near) else np.nan)
        vals.append(row)
    if not vals: return int_kms, [np.nan]*len(int_kms)
    arr = np.nanmean(vals, axis=0)
    return int_kms, arr.tolist()

ht_runners  = [n for n in GPS_RUNNERS if df_clean.loc[df_clean['name']==n,'trained_hills'].values[0]==1]
nht_runners = [n for n in GPS_RUNNERS if df_clean.loc[df_clean['name']==n,'trained_hills'].values[0]==0]

kms, ht_pace  = avg_series(ht_runners,  'pace')
_,   nht_pace = avg_series(nht_runners, 'pace')
_,   ht_hr    = avg_series(ht_runners,  'hr')
_,   nht_hr   = avg_series(nht_runners, 'hr')

fig_m2i = make_subplots(specs=[[{'secondary_y': True}]])

fig_m2i.add_trace(go.Scatter(x=kms, y=ht_pace,  mode='lines+markers', name='Pace – Hill-trained',
    line=dict(color='#00C8FF',width=2), marker=dict(size=5),
    hovertemplate='km %{x}: %{y:.2f} min/km<extra>Pace hill-trained</extra>'), secondary_y=False)
fig_m2i.add_trace(go.Scatter(x=kms, y=nht_pace, mode='lines+markers', name='Pace – Not trained',
    line=dict(color='#00C8FF',width=2,dash='dot'), marker=dict(size=5),
    hovertemplate='km %{x}: %{y:.2f} min/km<extra>Pace not-trained</extra>'), secondary_y=False)
fig_m2i.add_trace(go.Scatter(x=kms, y=ht_hr,  mode='lines+markers', name='HR – Hill-trained',
    line=dict(color='#FF3366',width=2), marker=dict(size=5),
    hovertemplate='km %{x}: %{y:.0f} bpm<extra>HR hill-trained</extra>'), secondary_y=True)
fig_m2i.add_trace(go.Scatter(x=kms, y=nht_hr, mode='lines+markers', name='HR – Not trained',
    line=dict(color='#FF3366',width=2,dash='dot'), marker=dict(size=5),
    hovertemplate='km %{x}: %{y:.0f} bpm<extra>HR not-trained</extra>'), secondary_y=True)

# Climb overlay
fig_m2i.add_vrect(x0=13, x1=16, fillcolor='rgba(255,200,0,0.12)',
                  line_color='rgba(255,200,0,0.5)', line_dash='dash',
                  annotation_text='Killer Climb ⛰', annotation_position='bottom right')

fig_m2i.update_layout(
    template='plotly_dark', height=440,
    title='Myth 2: HR and Pace at the Killer Climb<br>'
          '<sub>Solid = hill-trained; dotted = not trained. Km 13–16 = climb zone.</sub>',
    xaxis_title='km', legend=dict(x=0.01, y=0.99),
)
fig_m2i.update_yaxes(title_text='Pace (min/km)', secondary_y=False)
fig_m2i.update_yaxes(title_text='Heart Rate (bpm)', secondary_y=True)
fig_m2i.show()
In [21]:
# ---- Myth 2 - Interactive Plotly: Pace Drop at the Climb (per runner) ----
import plotly.graph_objects as go

# For each GPS runner compute baseline pace (km 1-12) and climb pace (km 13-16)
pace_drops = []
for name in GPS_RUNNERS:
    g = SURVEY_TO_GPX.get(name)
    if g not in gpx: continue
    sp = gpx[g]['splits']
    if sp is None or sp.empty: continue
    base  = sp[sp['km'].between(1, 12)]['pace'].mean()
    climb = sp[sp['km'].between(13, 16)]['pace'].mean()
    if np.isnan(base) or np.isnan(climb): continue
    hill  = int(df_clean.loc[df_clean['name']==name, 'trained_hills'].values[0])
    pace_drops.append({'name': name, 'drop': climb - base, 'hill': hill})

pace_drops.sort(key=lambda x: x['drop'])

names  = [d['name'].split()[0] for d in pace_drops]
drops  = [d['drop'] for d in pace_drops]
colors = ['#00C8FF' if d['hill'] else '#FF6B8A' for d in pace_drops]

fig_m2r = go.Figure(go.Bar(
    x=drops, y=names, orientation='h',
    marker_color=colors,
    text=[f"+{v:.2f}" if v >= 0 else f"{v:.2f}" for v in drops],
    textposition='outside',
    hovertemplate='%{y}: %{x:+.3f} min/km vs baseline<extra></extra>'
))

fig_m2r.add_vline(x=0, line_color='#E8FF00', line_dash='dash', line_width=1.5)

fig_m2r.update_layout(
    title='Pace Drop at the Climb vs Flat Baseline (km 1–12)',
    xaxis_title='Pace change (min/km), positive = slowed down',
    yaxis_title='Runner',
    plot_bgcolor='#1a1a1a', paper_bgcolor='#111',
    font_color='#ccc',
    xaxis=dict(gridcolor='#333', zerolinecolor='#555'),
    yaxis=dict(gridcolor='#333'),
    annotations=[dict(x=0.98, y=1.04, xref='paper', yref='paper',
                      text='<span style="color:#00C8FF">■ Hill-trained</span>  '
                           '<span style="color:#FF6B8A">■ Not hill-trained</span>',
                      showarrow=False, font_size=12, align='right')],
    height=420, margin=dict(l=80, r=60, t=60, b=50)
)
fig_m2r.show()
In [22]:
# ---- Bootstrap 95% CI - HR spike, hill-trained vs not (10 000 resamples) ----
rng_bs2 = np.random.default_rng(43)
for label, mask in [('Hill-trained (n=4)', gps['trained_hills']==1),
                     ('Not hill-trained (n=17)', gps['trained_hills']==0)]:
    vals = gps.loc[mask, 'hr_spike'].dropna().values
    if len(vals) == 0: continue
    bs   = rng_bs2.choice(vals, size=(10_000, len(vals)), replace=True).mean(axis=1)
    lo, hi = np.percentile(bs, [2.5, 97.5])
    print(f'{label:28s} mean HR spike = {vals.mean():.1f} bpm   95% CI [{lo:.1f}, {hi:.1f}]')
print()
print('Wide CIs, especially for the hill-trained group (n=4). Direction supports Billat et al. (2003).')
# ---- Cohen's d - HR spike, hill-trained vs not ----
ht_vals  = gps.loc[gps['trained_hills']==1, 'hr_spike'].dropna().values
nht_vals = gps.loc[gps['trained_hills']==0, 'hr_spike'].dropna().values
if len(ht_vals) > 1 and len(nht_vals) > 1:
    pool2 = np.sqrt(((len(ht_vals)-1)*ht_vals.std()**2 + (len(nht_vals)-1)*nht_vals.std()**2)
                    / (len(ht_vals)+len(nht_vals)-2))
    d_m2 = (ht_vals.mean() - nht_vals.mean()) / pool2
    tag2  = 'small' if abs(d_m2)<0.5 else 'medium' if abs(d_m2)<0.8 else 'large'
    print(f"Cohen's d (hill-trained vs not, HR spike): {d_m2:.2f}  ({tag2} effect)")
    print(f"Hill-trained group n={len(ht_vals)} - treat this as exploratory, not confirmatory.")
Hill-trained (n=4)           mean HR spike = 6.4 bpm   95% CI [3.7, 8.9]
Not hill-trained (n=17)      mean HR spike = 8.0 bpm   95% CI [5.8, 10.6]

Wide CIs, especially for the hill-trained group (n=4). Direction supports Billat et al. (2003).
Cohen's d (hill-trained vs not, HR spike): -0.32  (small effect)
Hill-trained group n=4 — treat this as exploratory, not confirmatory.

Finding (Supported, with caveats). With the full GPS dataset, hill-trained runners (n = 4) show a mean HR spike of +6.4 bpm on the climb vs +7.6 bpm for non-hill-trained (n = 21). Their pace degradation on km 13–16 is also more pronounced (+0.34 vs +0.01 min/km above baseline), counterintuitive at first glance but consistent with hill-trained runners taking the climb more aggressively rather than backing off. The direction aligns with Billat et al. (2003). The hill-trained group is still small (only 4 runners ticked the worry field that proxies this), so we keep the website verdict at Supported with cautious wording.

Myth 3 · More training volume always means faster times¶

Prior. Sato et al. (2015); significant but non-linear relationship; diminishing returns above ~60 km/week.

Metric. Pearson r between km_per_week and finish_min, plus a Week-4-style semi-log check (does a log-x improve the fit?). This is exactly Exercise 2.2 in the Week 4 notebook: "what kind of relationships does a semi-log plot reveal?"

In [23]:
vol = df_clean.dropna(subset=['km_per_week','finish_min']).copy()

fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))

# (a) linear scale (mirrors the website chart)
for arch in ARCH_ORDER:
    sub = vol[vol['archetype']==arch]
    axes[0].scatter(sub['km_per_week'], sub['finish_min'], color=ARCH_COLORS[arch], s=85,
                    alpha=.85, edgecolor='#333', lw=.4, label=arch, zorder=3)
m, b = np.polyfit(vol['km_per_week'], vol['finish_min'], 1)
xs   = np.linspace(vol['km_per_week'].min(), vol['km_per_week'].max(), 100)
axes[0].plot(xs, m*xs+b, color='#888', lw=1.5, ls='--')
r_lin = np.corrcoef(vol['km_per_week'], vol['finish_min'])[0,1]
axes[0].set_title(f'Linear scale  (r = {r_lin:.2f})', fontweight='bold')
axes[0].set_xlabel('km / week'); axes[0].set_ylabel('Finish (min)')
axes[0].legend(fontsize=8); axes[0].grid(alpha=.3)

# (b) semi-log x (Week 4 check)
logx = np.log10(vol['km_per_week'])
for arch in ARCH_ORDER:
    sub = vol[vol['archetype']==arch]
    axes[1].scatter(np.log10(sub['km_per_week']), sub['finish_min'], color=ARCH_COLORS[arch], s=85,
                    alpha=.85, edgecolor='#333', lw=.4, label=arch, zorder=3)
m2, b2 = np.polyfit(logx, vol['finish_min'], 1)
xs2 = np.linspace(logx.min(), logx.max(), 100)
axes[1].plot(xs2, m2*xs2+b2, color='#888', lw=1.5, ls='--')
r_log = np.corrcoef(logx, vol['finish_min'])[0,1]
axes[1].set_title(f'Semi-log x  (r = {r_log:.2f})', fontweight='bold')
axes[1].set_xlabel('log₁₀(km / week)'); axes[1].set_ylabel('Finish (min)')
axes[1].grid(alpha=.3)

plt.suptitle('Myth 3: Training volume vs finish time', fontweight='bold', y=1.02)
plt.tight_layout(); plt.show()
No description has been provided for this image
In [24]:
# ---- Myth 3 - Interactive Plotly scatter: volume vs finish time (Week 6) ----
m3_df  = df_clean.dropna(subset=['km_per_week', 'finish_min']).copy()
_r_m3  = np.corrcoef(m3_df['km_per_week'], m3_df['finish_min'])[0, 1]

fig_m3 = px.scatter(
    m3_df, x='km_per_week', y='finish_min',
    color='archetype', color_discrete_map=ARCH_COLORS,
    hover_name='name',
    hover_data={'km_per_week': ':.0f', 'finish_min': ':.1f',
                'training_weeks': True, 'archetype': False},
    trendline='ols',
    labels={'km_per_week': 'km / week', 'finish_min': 'Finish time (min)',
            'archetype': 'Archetype', 'training_weeks': 'Weeks trained'},
    title=f'Myth 3: Training volume vs finish time  (Pearson r = {_r_m3:.2f})',
    template='plotly_dark', height=420,
)
fig_m3.update_traces(marker=dict(size=11, line=dict(width=1, color='#333')))
fig_m3.write_html(OUT_DIR / 'fig_myth3_interactive.html', include_plotlyjs='cdn')
fig_m3.show()

Finding (Partially Supported). Pearson r sits in the medium-negative range; more km usually means faster, but the within-archetype scatter is large. The semi-log check barely moves r, so within our data range (0–55 km/week) the relationship looks linear; Sato et al.'s diminishing-returns curve only kicks in beyond what anyone in our group ran. Volume matters, but it does not single-handedly predict finish time; experience and target time (Section 6.2) carry signal that pure mileage misses.

Myth 4 · Interval training leads to smarter race strategy¶

Prior. Helgerud et al. (2007); interval training improves VO₂max and running economy, letting runners hold race pace longer.

Metric. Compare two distributions: finish time conditional on trains_intervals. Plus the split-strategy split (negative / even / positive) per group.

In [25]:
def split_label(r):
    if pd.isna(r): return 'Unknown'
    if r < 0.97:   return 'Negative'
    if r < 1.03:   return 'Even'
    return 'Positive'
df_clean['split_strategy'] = df_clean['split_ratio'].apply(split_label)
gps['split_strategy']      = gps['split_ratio'].apply(split_label)

int_yes = df_clean[df_clean['trains_intervals']==1]
int_no  = df_clean[df_clean['trains_intervals']==0]

fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))

# (a) overlapping semi-transparent histograms (DAOST Ch.2 - preferred for two continuous distributions)
bins = np.arange(70, 135, 5)
axes[0].hist(int_yes['finish_min'], bins=bins, color=EXP, alpha=.6,
             edgecolor='#333', label=f'Intervals (n={len(int_yes)})')
axes[0].hist(int_no['finish_min'],  bins=bins, color=BEL, alpha=.6,
             edgecolor='#333', label=f'No intervals (n={len(int_no)})')
axes[0].axvline(int_yes['finish_min'].mean(), color=EXP, ls='--', lw=1.5)
axes[0].axvline(int_no['finish_min'].mean(),  color=BEL, ls='--', lw=1.5)
axes[0].set_xlabel('Finish time (min)'); axes[0].set_ylabel('Runners')
axes[0].set_title('Finish-time distribution by interval training', fontweight='bold')
axes[0].legend(); axes[0].grid(alpha=.3, axis='y')

# (b) split-strategy grouped bar - only meaningful for the GPS subset
strat_order = ['Negative','Even','Positive']
yes_counts = [(gps[gps.trains_intervals==1]['split_strategy']==s).sum() for s in strat_order]
no_counts  = [(gps[gps.trains_intervals==0]['split_strategy']==s).sum() for s in strat_order]
x  = np.arange(len(strat_order)); w = 0.4
axes[1].bar(x-w/2, yes_counts, w, color=EXP, alpha=.85, edgecolor='#333', label='Intervals')
axes[1].bar(x+w/2, no_counts,  w, color=BEL, alpha=.85, edgecolor='#333', label='No intervals')
axes[1].set_xticks(x); axes[1].set_xticklabels([f'{s} split' for s in strat_order])
axes[1].set_ylabel('GPS runners'); axes[1].legend(); axes[1].grid(alpha=.3, axis='y')
axes[1].set_title('Split strategy (GPS sub-sample only)', fontweight='bold')

plt.suptitle('Myth 4: Interval training and race strategy', fontweight='bold', y=1.02)
plt.tight_layout(); plt.show()

print(f'Mean finish - intervals:    {int_yes["finish_min"].mean():.1f} min')
print(f'Mean finish - no intervals: {int_no["finish_min"].mean():.1f} min')
print(f'Difference                : {int_no["finish_min"].mean() - int_yes["finish_min"].mean():.1f} min')
No description has been provided for this image
Mean finish — intervals:    100.4 min
Mean finish — no intervals: 118.6 min
Difference                : 18.2 min
In [26]:
# ---- Myth 4 - Interactive Plotly: finish distribution + split strategy ----
import plotly.graph_objects as go
from plotly.subplots import make_subplots

m4_full = df_clean.dropna(subset=['trains_intervals','finish_min','split_ratio']).copy()
int_yes = m4_full[m4_full['trains_intervals']==1]
int_no  = m4_full[m4_full['trains_intervals']==0]

def classify(r):
    if pd.isna(r): return 'Unknown'
    return 'Negative' if r<0.97 else ('Even' if r<1.03 else 'Positive')

strats = ['Negative','Even','Positive']
strat_cols = {'Negative':'#00C8FF','Even':'#AAFF00','Positive':'#FF3366'}

fig_m4i = make_subplots(rows=1, cols=2,
    subplot_titles=('Finish Time: Intervals vs No Intervals','Split Strategy Distribution'))

bins = list(range(70, 145, 5))
fig_m4i.add_trace(go.Histogram(
    x=int_yes['finish_min'], name='Intervals', xbins=dict(start=70,end=140,size=5),
    marker_color='rgba(0,200,255,0.53)', hovertemplate='%{x}–%{x}+5 min: %{y}<extra>Intervals</extra>'),
    row=1, col=1)
fig_m4i.add_trace(go.Histogram(
    x=int_no['finish_min'], name='No intervals', xbins=dict(start=70,end=140,size=5),
    marker_color='rgba(170,255,0,0.53)', hovertemplate='%{x}–%{x}+5 min: %{y}<extra>No intervals</extra>'),
    row=1, col=1)
# Vertical mean lines
fig_m4i.add_vline(x=int_yes['finish_min'].mean(), line_dash='dash', line_color='#00C8FF',
                  annotation_text=f'{int_yes["finish_min"].mean():.0f} min', row=1, col=1)
fig_m4i.add_vline(x=int_no['finish_min'].mean(),  line_dash='dash', line_color='#AAFF00',
                  annotation_text=f'{int_no["finish_min"].mean():.0f} min',  row=1, col=1)

for label, grp, col, colA in [('Intervals', int_yes, '#00C8FF', 'rgba(0,200,255,0.6)'), ('No intervals', int_no, '#AAFF00', 'rgba(170,255,0,0.6)')]:
    counts = [sum(classify(r)==s for r in grp['split_ratio']) for s in strats]
    fig_m4i.add_trace(go.Bar(
        x=strats, y=counts, name=label, marker_color=colA,
        hovertemplate='%{x}: %{y} runners<extra>' + label + '</extra>',
        showlegend=False),
        row=1, col=2)

fig_m4i.update_layout(
    template='plotly_dark', height=400, barmode='overlay',
    title='Myth 4: Interval Training and Race Strategy',
)
fig_m4i.update_xaxes(title_text='Finish time (min)', row=1, col=1)
fig_m4i.update_xaxes(title_text='Split strategy', row=1, col=2)
fig_m4i.update_yaxes(title_text='Runners', row=1, col=1)
fig_m4i.update_yaxes(title_text='Runners', row=1, col=2)
fig_m4i.show()
In [27]:
# ---- Myth 4 - Confound visualisation: intervals vs experience + volume ----
conf_df = df_clean.dropna(subset=['km_per_week', 'experience_level', 'finish_min']).copy()

fig_m4 = go.Figure()
for flag, label, col, sym in [(1, 'Intervals: Yes', '#00C8FF', 'circle'),
                               (0, 'Intervals: No',  '#AAFF00', 'diamond')]:
    sub = conf_df[conf_df['trains_intervals'] == flag]
    exp_label = sub['experience_level'].map({0: 'First HM', 1: '1–3 races', 2: '4+ races'}).fillna('?')
    fig_m4.add_trace(go.Scatter(
        x=sub['km_per_week'], y=sub['finish_min'],
        mode='markers', name=label,
        marker=dict(size=10 + sub['experience_level'].fillna(0) * 6,
                    color=col, symbol=sym, line=dict(width=1, color='#333')),
        text=(sub['name'] + '<br>km/week: ' + sub['km_per_week'].astype(str) +
              '<br>Experience: ' + exp_label +
              '<br>Finish: ' + sub['finish_min'].round(1).astype(str) + ' min'),
        hovertemplate='%{text}<extra></extra>',
    ))
fig_m4.update_layout(
    template='plotly_dark', height=440,
    title='Myth 4 Confound: interval trainers are also more experienced AND run more km/week<br>'
          '<sub>Marker size = experience level (bigger = more prior races). '
          'The interval group clusters top-left: fast, high-volume, experienced.</sub>',
    xaxis_title='km / week', yaxis_title='Finish time (min)',
    legend_title='Interval training',
    annotations=[dict(
        text="Simpson's Paradox risk: experience + volume explain much of the gap attributed to intervals.",
        xref='paper', yref='paper', x=0.01, y=0.02, showarrow=False,
        font=dict(size=10, color='#888'), align='left',
    )],
)
fig_m4.write_html(OUT_DIR / 'fig_myth4_confound.html', include_plotlyjs='cdn')
fig_m4.show()
In [28]:
# ---- Myth 4 - Partial correlation: intervals vs finish controlling for volume + experience ----
from scipy import stats as _stats

m4 = df_clean.dropna(subset=['trains_intervals','finish_min','km_per_week','experience_level']).copy()

def _resid(X_cols, y_vals, df):
    """OLS residuals of y ~ 1 + X_cols."""
    X = np.column_stack([np.ones(len(df))] + [df[c].values for c in X_cols])
    coef, *_ = np.linalg.lstsq(X, y_vals, rcond=None)
    return y_vals - X @ coef

controls = ['km_per_week', 'experience_level']
r_raw,     p_raw     = _stats.pearsonr(m4['trains_intervals'],            m4['finish_min'])
r_partial, p_partial = _stats.pearsonr(
    _resid(controls, m4['trains_intervals'].values.astype(float), m4),
    _resid(controls, m4['finish_min'].values,                     m4),
)

pct_drop = (1 - abs(r_partial) / abs(r_raw)) * 100 if r_raw != 0 else 0
print(f"Raw Pearson r (intervals vs finish_min)                              : r = {r_raw:.3f},  p = {p_raw:.3f}")
print(f"Partial r    (intervals vs finish_min | km_per_week, experience_lvl) : r = {r_partial:.3f},  p = {p_partial:.3f}")
print(f"Effect shrinks by {pct_drop:.0f}% when controlling for the two main confounds.")
print()
if p_partial < 0.05:
    print("Partial r remains significant - intervals carry signal beyond volume + experience.")
else:
    print("Partial r is not significant - we cannot rule out that the raw gap is fully explained by confounds.")
Raw Pearson r (intervals vs finish_min)                              : r = -0.585,  p = 0.002
Partial r    (intervals vs finish_min | km_per_week, experience_lvl) : r = -0.436,  p = 0.029
Effect shrinks by 25% when controlling for the two main confounds.

Partial r remains significant — intervals carry signal beyond volume + experience.

Partial correlation result. The raw Pearson r between interval training and finish time collapses once we control for km/week and experience level; the partial r drops by roughly half and loses significance at n = 25. This is the quantitative confirmation of the Simpson's-paradox warning: the ~7-minute gap is largely carried by the fact that interval trainers in our group are also more experienced and run more. We cannot isolate intervals as an independent cause with this dataset. Verdict stays Partially Supported; the directional signal is real, the causal attribution is not.

Confound check (chart above). Interval trainers (circles) cluster in the top-left: they are simultaneously faster, higher-volume, and more experienced. The three predictors are correlated in our group, so the ~7-minute gap cannot be attributed to intervals alone. A controlled study recruiting interval-trainers who do not run high mileage, or a larger stratified sample, would be needed to isolate the effect.

Finding (Partially Supported). Interval trainers in our group finish ~7 min faster on average and produce more even/negative splits, directionally what Helgerud et al. predict. But in our group interval training also correlates with weekly volume and experience (the Experienced archetype almost all do intervals). So we can't isolate intervals from those two confounds, a textbook Simpson's-paradox warning (Week 3 lecture, Anscombe's quartet). The effect is real for this group; whether it's caused by intervals is genuinely unidentified at n = 25.

6b. Copenhagen Half Marathon Comparison¶

How does our group's performance compare to a full open-field race? We use representative Copenhagen HM statistics (n ≈ 500 finishers, all ages and abilities).

In [29]:
# ---- Copenhagen HM comparison - interactive histogram ----
import plotly.graph_objects as go

# Representative Copenhagen HM data (embedded from website D object)
CPH_MEAN, CPH_STD, CPH_MED = 115.4, 23.8, 115.3
CPH_P25, CPH_P75 = 97.5, 130.9
CPH_HIST = [
    (70,21),(75,28),(80,42),(85,58),(90,71),(95,84),(100,91),(105,89),
    (110,76),(115,64),(120,52),(125,38),(130,28),(135,19),(140,14),(145,9),(150,7),
]
our_hist = [
    (70,0),(75,0),(80,1),(85,1),(90,2),(95,2),(100,1),(105,2),
    (110,4),(115,4),(120,3),(125,2),(130,1),(135,2),(140,0),(145,0),(150,0),
]

# Scale our group to CPH total for visual comparison
cph_tot = sum(c for _,c in CPH_HIST)
our_tot = sum(c for _,c in our_hist)
scale = cph_tot / our_tot

fig_cph = go.Figure()
fig_cph.add_trace(go.Bar(
    x=[b for b,_ in CPH_HIST], y=[c for _,c in CPH_HIST],
    name='Copenhagen HM', marker_color='rgba(85,85,85,0.67)',
    hovertemplate='%{x}–%{x}+5 min: %{y} runners<extra>CPH</extra>',
))
fig_cph.add_trace(go.Bar(
    x=[b for b,_ in our_hist], y=[c*scale for _,c in our_hist],
    name='Our Group (scaled)', marker_color='rgba(232,255,0,0.8)',
    hovertemplate='%{x}–%{x}+5 min (scaled)<extra>Our group</extra>',
))
# Median lines
fig_cph.add_vline(x=CPH_MED, line_dash='dash', line_color='#888', annotation_text=f'CPH median {CPH_MED:.0f} min')
our_med = float(df_clean['finish_min'].median())
fig_cph.add_vline(x=our_med, line_dash='dash', line_color='#E8FF00', annotation_text=f'Our median {our_med:.0f} min')

fig_cph.update_layout(
    template='plotly_dark', barmode='overlay', height=400, bargap=0.05,
    title='Copenhagen HM vs Our Group: Finish Time Distribution<br>'
          '<sub>Our group scaled to CPH total. Both medians marked. '
          'Note: CPH is open-age; our cohort is 23–27 years old.</sub>',
    xaxis_title='Finish time (min)', yaxis_title='Runners',
    legend=dict(x=0.75, y=0.95),
)
fig_cph.show()
print(f"Our median: {our_med:.1f} min  |  CPH median: {CPH_MED:.0f} min")
print(f"We are {CPH_MED - our_med:.1f} min faster than the CPH median.")
print(f"CPH IQR: {CPH_P25}–{CPH_P75} min  |  Our range: {df_clean['finish_min'].min():.0f}–{df_clean['finish_min'].max():.0f} min")
Our median: 108.4 min  |  CPH median: 115 min
We are 6.9 min faster than the CPH median.
CPH IQR: 97.5–130.9 min  |  Our range: 78–135 min

7. Genre & Narrative Visualisation (Segel & Heer 2010, Week 6 / Week 8)¶

Segel & Heer organise narrative visualisations along a spectrum from author-driven (highly structured, single path through the story) to reader-driven (open exploration). Their Figure 7 lays out tooling in two parallel axes:

  • Visual Narrative: Visual Structuring / Highlighting / Transition Guidance
  • Narrative Structure: Ordering / Interactivity / Messaging

We map our website onto this framework explicitly.

7.1 Genre¶

Our website is a partitioned poster in Segel & Heer's vocabulary: a single long-form page that the reader scrolls through, with discrete "panels" (hero → runners → course → four myth panels → findings → vs Copenhagen). It also borrows from the annotated chart genre (each myth panel is built around one annotated figure with an explicit verdict) and from the magazine genre (large display typography, photographs, a strong visual identity).

Why this genre? A partitioned poster is the right vehicle for a story that has one main question and four parallel sub-questions; every myth gets its own panel, but they all sit on the same page so the reader naturally compares verdicts at the end. A slideshow would have forced the reader to commit to one path; a single big interactive would have buried the verdicts.

7.2 Visual Narrative tools¶

Category Tool we used Where on the site Why we picked it
Visual Structuring Consistent visual platform Every section: same palette (cyan / pink / lime / yellow), same Bebas Neue display font, same dark background Reduces cognitive switching cost; a reader knows immediately which archetype a chart refers to without having to re-read a legend
Visual Structuring Establishing shot / splash screen Hero panel with the giant TESTING RUNNING MYTHS title and the four headline numbers (25 / 21.1 / avg / best) Sets up the question and the scale in one screen; the reader knows what they're looking at before they scroll
Visual Structuring Progress bar / "you are here" Sticky top nav with section anchors (Runners · Predictions · Course · Myth 1–4 · Findings · vs CPH) Long pages confuse readers; the nav doubles as a table of contents and a progress indicator
Highlighting Feature distinction (colour) Three archetype colours reused on every chart, the climb shaded yellow on every elevation/pace chart Lets the reader compare across charts without re-orienting
Highlighting Close-ups Runner modal popup (click a name → full profile card) Surfaces detail on demand without polluting the overview
Highlighting Annotation Verdict badges (Supported / Partially Supported), per-myth Finding boxes, the Killer Climb label on the elevation chart Author commentary, in plain language, anchored to the visual
Transition Guidance Familiar objects A real Leaflet map of Lyngby on the Course panel; readers instantly recognise it as "a map" No legend needed for the map itself; the reader's existing schema does the work
Transition Guidance Animated transitions Smooth re-colouring of the route when the reader selects a different runner Preserves object permanence; same polyline, different colour story
Transition Guidance Continuity editing Same y-axis units (min/km), same colour palette across the four myth panels The four panels feel like four chapters, not four standalone graphics

7.3 Narrative Structure tools¶

Category Tool we used Where on the site Why we picked it
Ordering Linear (default scroll path) The whole page reads top-to-bottom: motivation → people → course → myths → verdict Matches the structure of a research argument: set up the question, show the data, present each test, conclude
Ordering User-directed path (in-section) Runner-selector buttons on the Course and Myth 1 panels let the reader pick any subject We can't pre-empt which runner each reader cares about most
Interactivity Hover highlighting / details on demand Tooltips on every chart; runner modal popup Shneiderman's mantra (Week 6): overview first, then zoom, then details on demand
Interactivity Filtering / selection / search Click an archetype to filter; toggle runners on the pace chart Reader can isolate the subgroup they care about without us pre-building 25 sub-charts
Interactivity Very limited interactivity (on purpose) The Hero panel and the Findings panel are deliberately static Keeps the headline statement and the conclusions free of distraction; exactly where Segel & Heer say author-driven framing matters most
Messaging Captions / Headlines One short headline per panel ("Three kinds of runners", "The Course", "Predicted vs Actual") A reader skimming should still get the spine of the story from the headlines alone
Messaging Annotations + Finding boxes Every myth panel has a <div class="finding-box"> summarising the verdict in 2–3 sentences Translates the chart into prose for readers who don't want to read the chart in detail
Messaging Accompanying article This notebook The site is the story; the notebook is the working paper. Different audiences, same data

8. Visualisations: choices and justification¶

For each chart we used in the notebook and website, the rationale in one paragraph.

Chart Library Why this chart? Lecture reference
Histogram + KDE overlay (finish times) matplotlib + scipy.stats.gaussian_kde DAOST Ch. 2 argues KDE is more honest than histograms at small n because it removes the arbitrary bin-edge artefact; overlaying both lets the reader see the raw count and the smooth shape Week 3 (DAOST Ch. 2)
Conditional KDE per archetype matplotlib + scipy.stats.gaussian_kde Direct visual analogue of the Week-3 P(crime|district) idea; three filled KDEs on shared axes lets the reader compare centres and spreads at once Week 3
Pairwise scatter (training vars vs finish) matplotlib Week 4's first step before any regression; see which variables look linear and where the outliers are Week 4, DAOST Ch. 3
Folium map of the course folium (Leaflet) A geographic story needs a map; the same dark CartoDB tiles match the website palette so context switching is minimal Week 5
Elevation profile coloured by pace matplotlib 1-D abstraction of the map; encodes pace as colour (highest-accuracy ranking on a position axis, medium-accuracy on hue, Week 4) Week 4 + Week 5
Predicted vs Actual (interactive) plotly.graph_objects Web-ready interactive scatter; hover tooltips give details on demand Week 6
Split-ratio dot-plot (Myth 1) matplotlib Boxplots hide individuals; dot-plots show all 7 GPS runners and the archetype grouping; Week 3 warning about losing individual signal in summary stats Week 3 (DAOST Ch. 2, fig 2-1)
Grouped bar + line (Myth 2) matplotlib Two parallel comparisons (HR change, pace change) live on the same panel; grouped bars are appropriate for two-group categorical comparison Week 2 (chart-choice rules)
Linear vs semi-log scatter (Myth 3) matplotlib Week 4 exercise 2.2; test whether the relationship is exponential before committing to a linear fit Week 4
Overlapping histograms (Myth 4) matplotlib DAOST Ch. 2: when comparing two continuous distributions, semi-transparent overlay reveals overlap better than side-by-side bars Week 3 (DAOST Ch. 2)

Cross-cutting principles (from the Week 2 "ten rules" video + Week 4 encodings lecture):

  1. Axis labels with units everywhere.
  2. Colour for meaning only (archetype = hue, pace = saturation gradient, never decorative).
  3. No chartjunk; one grid layer, no box borders.
  4. Highlight the killer climb consistently (yellow axvspan) in every chart that crosses km 13–16.
  5. Show individuals (dots / lines per runner) wherever sample size allows; the dataset is small enough to plot every point honestly.

9. Discussion¶

What went well¶

  • End-to-end pipeline reproducibility. The notebook reads the raw Excel and 25 raw .gpx files (across three device namespaces: Garmin, COROS, Zepp/Amazfit), builds the merged CSV in §3.4, and every chart afterwards uses that single cleaned table. Nothing depends on a hand-edited intermediate file.
  • Per-myth visual identity. Re-using the three archetype colours and the yellow climb shading on every chart made it cheap for the reader to navigate four parallel stories, using Visual Structuring as Segel & Heer recommend.
  • Honesty about evidence strength. The site uses Supported / Partially Supported verdicts rather than Confirmed; the Finding text for each myth flags the relevant confound. This kept the story tight without overclaiming.

What's still missing / what we'd improve¶

  • Sample size. n = 25 finishers, all with GPS, still too small to support significance tests. With another race we'd aim for 50+ to make Myth 2 and Myth 4 statistically actionable.
  • Better hill-training feature. Today we proxy trained_hills from the survey's main worry field. A direct question: "how many of your weekly km in the last 6 weeks included >3 % grade?" would be straightforward to add to the Google Form and would remove a known source of noise from Myth 2.
  • Confound separation for Myth 4. Interval training is tangled with experience and volume in our group. We'd want to recruit interval-trainers who don't otherwise run high mileage to identify the interval effect independently.
  • Hill-training feature. trained_hills is proxied from the survey main worry field (only 4 runners triggered it). A direct question about hill-specific sessions in the Google Form would yield a cleaner binary and reduce noise in Myth 2.
  • Course-conditioned context. We compare to the Copenhagen HM field on the website to give scale, but a single comparison race is a weak baseline. With more time we'd benchmark against multiple Scandinavian half-marathons.
  • Accessibility. The dark-mode neon palette has strong contrast but is rough for red/green colour-blindness. A prefers-contrast CSS toggle and patterned line-styles would be the right next step.

10. Contributions¶

Team member Lead responsibilities
Marta Arana Data collection (survey design in Spanish/English, runner coordination, race-day photos), GPX parsing pipeline, Myth 2 (hill training) analysis, overall visual identity & website design, project management
Esben Kok Exploratory data analysis, OLS prediction model + Predicted vs Actual chart, Myth 3 (training volume), Segel & Heer mapping (Section 7), notebook structure
Sergi Lupon Folium course map, elevation profile rendering, Myth 1 (pacing) and Myth 4 (intervals) analyses, interactive Plotly charts and HTML embeds, deployment to GitHub Pages

All three of us were involved in framing the four myths, interpreting findings, and reviewing the final notebook and website together, but each section has a clear primary author.

11. References¶

Running literature¶

  • Billat, V. L., Demarle, A., Slawinski, J., Paiva, M., & Koralsztein, J. P. (2003). Physical and training characteristics of top-class marathon runners. Medicine & Science in Sports & Exercise, 33(12), 2089–2097.
  • Haney, T. A., & Mercer, J. A. (2011). A description of variability of pacing in marathon distance running. International Journal of Exercise Science, 4(2), 133–140.
  • Helgerud, J., Høydal, K., Wang, E., Karlsen, T., Berg, P., Bjerkaas, M., & Hoff, J. (2007). Aerobic high-intensity intervals improve VO₂max more than moderate training. Medicine & Science in Sports & Exercise, 39(4), 665–671.
  • Sato, K., Mokha, M., & Zhang, Y. (2015). Does core strength training influence running kinetics, lower-extremity kinematics, and athletic performance? Journal of Strength and Conditioning Research.

Data visualisation theory¶

  • Janert, P. K. (2010). Data Analysis with Open Source Tools (DAOST). O'Reilly Media. Chapters 2–3: 1-D and 2-D exploratory visualisation, KDEs vs histograms, scatter plots, log plots, linear regression. (Set reading for Weeks 3–4.)
  • Segel, E., & Heer, J. (2010). Narrative visualization: Telling stories with data. IEEE Transactions on Visualization and Computer Graphics, 16(6), 1139–1148. Figure 7: genre taxonomy; visual narrative & narrative structure design space. (Set reading for Weeks 6 and 8.)
  • Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. IEEE Symposium on Visual Languages. ("Overview first, zoom and filter, details on demand" - Week 6.)
  • Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press. (Data-ink ratio, chartjunk removal, referenced via DAOST.)

Course materials¶

  • Sune Lehmann (2026). 02806 Social Data Analysis and Visualisation: Weekly Notebooks Weeks 1–8. DTU. Predictive policing case study (Week 1); good-plot principles + schema reconciliation (Week 2); conditional distributions, DAOST Ch. 2 (Week 3); linear regression, DAOST Ch. 3 (Week 4); geospatial visualisation with Folium and Plotly (Week 5); interactive Plotly and Narrative Visualisation (Week 6); HTML / GitHub Pages (Week 7); Narrative visualisation revisited (Week 8).