Building an AI Tennis Coach with MediaPipe and Claude

2026-02-25

A while back I started playing tennis again after a long break. The frustrating thing about tennis is that it's almost impossible to self-correct without video — you feel like you're doing something right, watch the footage, and realise your elbow is at a completely wrong angle at contact. Hiring a coach is one option but not always practical for a casual player. So I did what any engineer would do: I decided to build one.

The result is a Streamlit app that takes an uploaded tennis video, runs pose detection on every frame, computes joint angles and swing timing, renders an annotated video with a skeleton overlay, and calls Claude to produce four categories of coaching feedback grounded in the actual numbers. This post walks through the pipeline stage by stage.

The Pipeline

At a high level the system is a linear chain of transformations:

Upload (MP4 / MOV / AVI)
  → extract_frames()           [video_io.py]
  → PoseDetector.detect_batch() [pose_detector.py]  → LandmarkResult | None per frame
  → compute_frame_metrics()    [metrics.py]          → FrameMetrics per frame
  → aggregate_metrics()        [metrics.py]          → AggregatedMetrics
  → annotate_frame() × N       [annotator.py]        → annotated BGR frames
  → frames_to_video()          [video_io.py]         → H.264 MP4
  → get_coaching_feedback()    [coach.py]            → CoachingReport
  → Streamlit display

Each stage is a pure function (or close to it) that takes its inputs and returns its outputs without side effects. This made it easy to develop and debug each stage independently before wiring them together in app.py.

The Stack

| Component | Choice | Reason |
| --- | --- | --- |
| UI | Streamlit | All-Python, zero front-end work, native video player |
| Pose detection | MediaPipe PoseLandmarker | 33 body landmarks, runs on CPU, Tasks API is well-maintained |
| Video I/O | OpenCV (opencv-python-headless) | Headless variant avoids display dependencies on servers |
| AI coaching | Anthropic Claude (claude-sonnet-4-6) | Strong instruction-following, reliable JSON output |
| Math | NumPy | Angle calculation, peak detection, statistics |

Stage 1: Configuration and Constants

Before writing any pipeline code I put all magic numbers and index mappings into config.py. MediaPipe's pose model produces 33 landmarks, each identified by a zero-based integer index. Scattering those integers across the codebase would make things unmaintainable:

class Landmarks:
    NOSE = 0
    LEFT_SHOULDER = 11
    RIGHT_SHOULDER = 12
    LEFT_ELBOW = 13
    RIGHT_ELBOW = 14
    LEFT_WRIST = 15
    RIGHT_WRIST = 16
    LEFT_HIP = 23
    RIGHT_HIP = 24
    LEFT_KNEE = 25
    RIGHT_KNEE = 26
    LEFT_ANKLE = 27
    RIGHT_ANKLE = 28

The skeleton connection list pairs landmark indices for drawing bones between joints:

POSE_CONNECTIONS = [
    (Landmarks.LEFT_SHOULDER, Landmarks.RIGHT_SHOULDER),
    (Landmarks.LEFT_SHOULDER, Landmarks.LEFT_ELBOW),
    (Landmarks.LEFT_ELBOW, Landmarks.LEFT_WRIST),
    ...
]

Two thresholds matter a lot operationally. VISIBILITY_THRESHOLD = 0.5 determines when a landmark is considered reliable enough to use — MediaPipe emits a confidence score alongside each (x, y, z) coordinate, and anything below 0.5 gets treated as missing. MAX_FRAMES = 300 caps the analysis at roughly ten seconds of footage at 30 fps, which keeps processing time under a minute even on a laptop CPU.

Stage 2: Math Utilities

All pure math lives in utils/math_helpers.py with no project-level imports. The most important function is the joint angle calculator:

def angle_between_three_points(a, b, c):
    ba = np.array(a) - np.array(b)
    bc = np.array(c) - np.array(b)
    cos_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    cos_angle = np.clip(cos_angle, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

This computes the angle at vertex b formed by rays towards a and c. For the elbow, b is the elbow landmark, a is the shoulder, and c is the wrist. The np.clip is essential — floating-point rounding can push the dot product just outside [-1, 1], causing arccos to return nan.
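
As a quick sanity check, the helper recovers the obvious angles (the coordinates below are made up, and the definition is repeated so the snippet runs standalone):

```python
import numpy as np

def angle_between_three_points(a, b, c):
    # Angle at vertex b formed by rays toward a and c, in degrees.
    ba = np.array(a) - np.array(b)
    bc = np.array(c) - np.array(b)
    cos_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    cos_angle = np.clip(cos_angle, -1.0, 1.0)  # guard against rounding past [-1, 1]
    return float(np.degrees(np.arccos(cos_angle)))

# Shoulder at (0, 0), elbow at (0, 1), wrist at (1, 1): a right angle at the elbow.
print(angle_between_three_points((0, 0), (0, 1), (1, 1)))  # 90.0
# Three collinear points: a fully extended joint.
print(angle_between_three_points((0, 0), (1, 0), (2, 0)))  # 180.0
```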

Swing detection relies on find_peaks, which scans wrist speed values for local maxima above a threshold with a minimum distance between peaks:

def find_peaks(values, threshold, min_distance=10):
    filled = [v if v is not None else 0.0 for v in values]
    peaks = []
    last_peak = -min_distance - 1
    for i in range(1, len(filled) - 1):
        if (filled[i] > threshold
                and filled[i] >= filled[i-1]
                and filled[i] >= filled[i+1]
                and (i - last_peak) >= min_distance):
            peaks.append(i)
            last_peak = i
    return peaks

The min_distance guard prevents two adjacent frames at the peak of a swing from both registering as separate events.
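
To see the guard in action, here is the function run on a short synthetic speed trace (the numbers are invented for illustration; the definition is repeated so the snippet runs standalone):

```python
def find_peaks(values, threshold, min_distance=10):
    filled = [v if v is not None else 0.0 for v in values]
    peaks = []
    last_peak = -min_distance - 1
    for i in range(1, len(filled) - 1):
        if (filled[i] > threshold
                and filled[i] >= filled[i-1]
                and filled[i] >= filled[i+1]
                and (i - last_peak) >= min_distance):
            peaks.append(i)
            last_peak = i
    return peaks

# Two swings with a quiet stretch in between; a None frame counts as 0.0.
speeds = [0.0, 0.01, 0.05, 0.02, None, 0.0, 0.0, 0.01, 0.06, 0.03, 0.0]
print(find_peaks(speeds, threshold=0.02, min_distance=3))  # [2, 8]
```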

Stage 3: Frame Extraction

video_io.py extracts frames from the uploaded file using OpenCV:

def extract_frames(video_path, max_frames=MAX_FRAMES, stride=1):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    if total // stride > max_frames:
        indices = set(np.linspace(0, total - 1, max_frames, dtype=int).tolist())
    else:
        indices = set(range(0, total, stride))
    ...

The evenly-spaced subsampling with np.linspace matters for longer videos. Naively skipping every N-th frame can create a biased sample if the stroke happens to fall in a skipped region. linspace distributes the sample budget uniformly across the full duration.
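
A toy illustration of the sampling (the frame counts here are hypothetical):

```python
import numpy as np

# A 900-frame video with a budget of 300 frames: every part of the clip
# is represented, roughly every third frame.
indices = set(np.linspace(0, 899, 300, dtype=int).tolist())
print(len(indices))  # 300

# On a tiny scale the spacing is easier to see: 10 frames, budget of 4.
print(sorted(set(np.linspace(0, 9, 4, dtype=int).tolist())))  # [0, 3, 6, 9]
```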

H.264 Re-encoding

After annotation, the frames need to go back into a video the browser can play. OpenCV's VideoWriter defaults to the mp4v codec, but Streamlit's st.video() component requires H.264 (libx264) for browser compatibility. The solution is to write an intermediate file with mp4v and then re-encode it using a subprocess call to ffmpeg:

subprocess.run([
    "ffmpeg", "-y",
    "-i", raw_path,
    "-vcodec", "libx264",
    "-pix_fmt", "yuv420p",
    "-preset", "fast",
    "-crf", "23",
    output_path,
], check=True, capture_output=True)

yuv420p is the pixel format most widely supported by browsers. If ffmpeg isn't installed, the pipeline falls back to serving the raw mp4v file as a download rather than an inline player.
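
A sketch of how that fallback could be wired up — reencode_h264 is a hypothetical helper name, not necessarily what video_io.py calls it:

```python
import shutil
import subprocess

def reencode_h264(raw_path, output_path):
    """Re-encode an mp4v file to browser-friendly H.264 when ffmpeg is available.

    Returns the path the UI should serve: the re-encoded file on success,
    otherwise the original mp4v file (offered as a download instead).
    """
    if shutil.which("ffmpeg") is None:
        return raw_path  # ffmpeg missing: fall back to the raw file
    try:
        subprocess.run(
            ["ffmpeg", "-y", "-i", raw_path,
             "-vcodec", "libx264", "-pix_fmt", "yuv420p",
             "-preset", "fast", "-crf", "23", output_path],
            check=True, capture_output=True,
        )
        return output_path
    except subprocess.CalledProcessError:
        return raw_path  # re-encode failed: serve the raw file
```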

Stage 4: Pose Detection

MediaPipe's 0.10.x release replaced the older mp.solutions.pose API with a new Tasks API. The new API takes a .task model file rather than downloading weights implicitly, so the detector manages model files itself:

_MODEL_URLS = {
    0: "https://storage.googleapis.com/.../pose_landmarker_lite.task",
    1: "https://storage.googleapis.com/.../pose_landmarker_full.task",
    2: "https://storage.googleapis.com/.../pose_landmarker_heavy.task",
}

def _ensure_model(complexity):
    path = os.path.join(_MODELS_DIR, _MODEL_NAMES[complexity])
    if not os.path.exists(path):
        urllib.request.urlretrieve(_MODEL_URLS[complexity], path)
    return path

The model is downloaded once into a models/ directory and reused on subsequent runs. For a batch of frames processed sequentially, the detector runs in VIDEO mode rather than IMAGE mode — this enables temporal tracking across frames which significantly improves landmark stability:

options = PoseLandmarkerOptions(
    base_options=mp.tasks.BaseOptions(model_asset_path=model_path),
    running_mode=RunningMode.VIDEO,
    num_poses=1,
    min_pose_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

In VIDEO mode the landmarker requires monotonically increasing timestamps. Since the frames are extracted from a fixed-fps video, a 33ms increment per frame (approximating 30fps) works reliably.
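
The timestamp bookkeeping can be sketched as follows (the helper name is illustrative):

```python
# VIDEO mode requires strictly increasing millisecond timestamps; deriving
# them from the frame index and source fps keeps them monotonic.
def frame_timestamps_ms(n_frames, fps=30.0):
    return [int(i * 1000.0 / fps) for i in range(n_frames)]

print(frame_timestamps_ms(5))  # [0, 33, 66, 100, 133]
```

Each value would then be passed alongside its frame to the landmarker's video-mode detect call.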

The result of detection is a LandmarkResult dataclass that wraps the raw landmark list and provides two convenience methods: get_point returns normalized (x, y) coordinates or None if visibility is below threshold, and get_pixel converts normalized coordinates to integer pixel coordinates:

def get_pixel(self, idx, width, height):
    pt = self.get_point(idx)
    if pt is None:
        return None
    return (int(pt[0] * width), int(pt[1] * height))

Any frame where MediaPipe returns no landmarks gets None in the results list. The downstream stages handle this gracefully — None results simply produce None metric values, which are excluded from aggregated statistics.
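
A minimal sketch of the wrapper, with stand-in landmark objects assumed to expose .x, .y, and .visibility the way MediaPipe's do:

```python
from types import SimpleNamespace

VISIBILITY_THRESHOLD = 0.5  # from config.py

class LandmarkResult:
    def __init__(self, landmarks):
        self.landmarks = landmarks  # objects exposing .x, .y, .visibility

    def get_point(self, idx):
        lm = self.landmarks[idx]
        if lm.visibility < VISIBILITY_THRESHOLD:
            return None  # low-confidence landmark: treated as missing
        return (lm.x, lm.y)

    def get_pixel(self, idx, width, height):
        pt = self.get_point(idx)
        if pt is None:
            return None
        return (int(pt[0] * width), int(pt[1] * height))

# A visible landmark maps to pixel coordinates; an occluded one comes back None.
result = LandmarkResult([
    SimpleNamespace(x=0.5, y=0.25, visibility=0.9),
    SimpleNamespace(x=0.1, y=0.10, visibility=0.2),
])
print(result.get_pixel(0, 1920, 1080))  # (960, 270)
print(result.get_point(1))              # None
```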

Stage 5: Metrics Computation

metrics.py is where the actual analysis happens. For each frame, compute_frame_metrics extracts six joint angles, torso rotation, stance width, centre of mass, and wrist speed.

Joint Angles

Each angle follows the same pattern — three landmark indices, one of which is the vertex:

# Right elbow angle: shoulder → elbow → wrist
rs = px(Landmarks.RIGHT_SHOULDER)
re = px(Landmarks.RIGHT_ELBOW)
rw = px(Landmarks.RIGHT_WRIST)
if rs and re and rw:
    fm.right_elbow_angle = angle_between_three_points(rs, re, rw)

All six angles (both elbows, both shoulders, both knees) are computed only when all three required landmarks are visible at or above the threshold. This means a frame where the player's left side is occluded still produces valid right-side metrics.

Torso Rotation

Torso rotation captures how much the upper body turns during the swing — a key metric in tennis since proper shoulder rotation drives power:

shoulder_vec = np.array(right_shoulder) - np.array(left_shoulder)
hip_vec = np.array(right_hip) - np.array(left_hip)
torso_rotation = angle_between_vectors(shoulder_vec, hip_vec)

When the shoulders and hips are parallel (no rotation), this angle is near 0°. A full shoulder turn produces angles in the 30–60° range depending on the shot type.
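
angle_between_vectors isn't shown above; a minimal version consistent with the three-point helper would look like:

```python
import numpy as np

def angle_between_vectors(u, v):
    # Unsigned angle between two 2D vectors, in degrees.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Shoulders parallel to hips: no rotation.
print(angle_between_vectors((1, 0), (1, 0)))        # 0.0
# A 45-degree shoulder turn relative to the hips.
print(round(angle_between_vectors((1, 1), (1, 0))))  # 45
```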

Swing Event Detection

Swing events are detected by finding peaks in wrist speed. Wrist speed is computed frame-over-frame as a Euclidean distance normalized by the frame diagonal:

diag = np.sqrt(frame_width**2 + frame_height**2)
fm.right_wrist_speed = euclidean_distance(current_rw, prev_rw) / diag

Normalizing by the diagonal makes the threshold (WRIST_SPEED_THRESHOLD = 0.02) independent of video resolution. The combined speed — taking the max of left and right wrist at each frame — is passed through find_peaks. Each peak corresponds to a swing contact event.
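
A sketch showing why the normalization makes the metric resolution-independent (coordinates invented; euclidean_distance stood in for by np.hypot):

```python
import numpy as np

def normalized_wrist_speed(prev_px, curr_px, width, height):
    # Pixel displacement divided by the frame diagonal, so the same motion
    # scores identically at any resolution.
    diag = np.sqrt(width ** 2 + height ** 2)
    return float(np.hypot(curr_px[0] - prev_px[0], curr_px[1] - prev_px[1]) / diag)

# The same swing at 720p and at 1080p (all coordinates scaled by 1.5)
# yields the same normalized speed.
s_720 = normalized_wrist_speed((100, 100), (130, 140), 1280, 720)
s_1080 = normalized_wrist_speed((150, 150), (195, 210), 1920, 1080)
print(round(s_720, 4) == round(s_1080, 4))  # True
```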

Aggregation

aggregate_metrics collapses the per-frame lists into statistics:

@dataclass
class AngleStat:
    mean: Optional[float]
    min: Optional[float]
    max: Optional[float]
    std: Optional[float]

The standard deviation is particularly useful for coaching — a high std on elbow angle means the player's technique varies significantly across swings, which is worth flagging even if the mean looks reasonable.
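
A sketch of how per-frame values with gaps collapse into an AngleStat (the dataclass is repeated for completeness; angle_stat is an illustrative name):

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class AngleStat:
    mean: Optional[float]
    min: Optional[float]
    max: Optional[float]
    std: Optional[float]

def angle_stat(per_frame_values):
    # Frames where the joint was occluded carry None and are simply dropped.
    vals = [v for v in per_frame_values if v is not None]
    if not vals:
        return AngleStat(None, None, None, None)
    arr = np.array(vals, dtype=float)
    return AngleStat(float(arr.mean()), float(arr.min()),
                     float(arr.max()), float(arr.std()))

print(angle_stat([150.0, None, 160.0, 170.0]).mean)  # 160.0
```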

Stage 6: Annotation

The annotator draws onto a copy of each frame, maintaining state across frames for the wrist trail:

class Annotator:
    def __init__(self, show_angles, show_trail):
        self._right_trail = deque(maxlen=TRAIL_LENGTH)
        self._left_trail = deque(maxlen=TRAIL_LENGTH)

Using a deque with a fixed maxlen is a clean way to maintain a sliding window of the last 15 wrist positions without manually managing list slicing.
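
The behavior in miniature (maxlen shortened from 15 to 3 so the rollover is visible):

```python
from collections import deque

trail = deque(maxlen=3)  # the app uses TRAIL_LENGTH = 15
for point in [(10, 10), (12, 11), (15, 13), (19, 16), (24, 20)]:
    trail.append(point)

# Only the three most recent positions survive; older ones fall off the left.
print(list(trail))  # [(15, 13), (19, 16), (24, 20)]
```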

Three layers are drawn in order:

Skeleton — lines connecting landmark pairs from POSE_CONNECTIONS, drawn only when both endpoints are visible:

for start_idx, end_idx in POSE_CONNECTIONS:
    if start_idx in pixels and end_idx in pixels:
        cv2.line(out, pixels[start_idx], pixels[end_idx], SKELETON_COLOR, 2, cv2.LINE_AA)

Angle labels — text printed offset from each joint. Each label is prefixed with an abbreviation (RE for right elbow, LK for left knee etc.) so the viewer doesn't need to guess which angle they're reading:

text = f"{label}:{angle:.0f}°"
cv2.putText(frame, text, (px+6, py-6),
            cv2.FONT_HERSHEY_SIMPLEX, FONT_SCALE,
            ANGLE_TEXT_COLOR, FONT_THICKNESS, cv2.LINE_AA)

Wrist trail — older positions are drawn darker by scaling the color by i / len(pts). This creates a fade effect that makes the direction and speed of the wrist path visually obvious.
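
One way to implement the fade, sketched with a made-up BGR trail color; this variant scales by (i + 1) / n so the oldest point stays faintly visible rather than going fully black:

```python
def fade_color(color, i, n):
    """Scale a BGR color by position in the trail: index 0 is oldest/darkest."""
    factor = (i + 1) / n
    return tuple(int(c * factor) for c in color)

WRIST_TRAIL_COLOR = (0, 200, 255)  # hypothetical constant for illustration

print(fade_color(WRIST_TRAIL_COLOR, 0, 5))  # (0, 40, 51) — oldest, dark
print(fade_color(WRIST_TRAIL_COLOR, 4, 5))  # (0, 200, 255) — newest, full
```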

Frames identified as swing events get an orange border and a SWING label in the top-right corner, making it easy to scrub to the key moments in the annotated video.

Stage 7: Claude Coaching Feedback

The coaching call is the most interesting stage to get right. The goal is feedback that references specific numbers — not "bend your knees more" but "your right knee averages 162° at impact; recreational players typically aim for 130–145° for a stable base."

Prompt Design

The system prompt establishes the persona and constraints:

You are an expert tennis coach with 20+ years of experience coaching players
at all levels. You analyze video-based biomechanical data and deliver precise,
actionable coaching feedback.

RULES:
- Always reference specific numbers from the provided metrics.
- Be direct and avoid generic advice like "bend your knees more" without a target angle.
- Respond ONLY with valid JSON matching the requested schema — no prose outside the JSON.

The user prompt is a structured markdown document that compresses all the computed metrics into a compact table:

## Joint Angle Statistics (mean / min / max / std)
- Right elbow:    142.3° / 98.1° / 175.2° / 18.4°
- Left elbow:     156.7° / 121.0° / 179.8° / 14.2°
- Right shoulder: 67.4° / 34.2° / 98.1° / 22.1°
...

## Body Mechanics
- Torso rotation (mean/max): 24.3° / 47.8°
- Stance width (normalized):  1.43
- CoM lateral range: 0.12 (normalized 0-1)

## Swing Events
- Wrist speeds at peaks: 0.031, 0.028, 0.035

Asking the model to respond only in JSON is effective for structured output. The response is parsed with a three-step fallback: try json.loads directly, then extract from a markdown code fence, then search for the first {...} block. If all three fail, the raw text goes into the swing_mechanics field so the user at least sees the response rather than a silent failure.
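
A sketch of the three-step fallback (the exact parsing code in coach.py may differ):

```python
import json
import re

def parse_json_response(text):
    """Best-effort extraction of a JSON object from a model response."""
    # 1. The happy path: the whole response is valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. JSON wrapped in a markdown code fence.
    m = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Last resort: the first {...} block anywhere in the text.
    m = re.search(r"\{.*\}", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    return None  # caller falls back to showing the raw text

print(parse_json_response('Sure! ```json\n{"swing_mechanics": "..."}\n```'))
```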

Error Handling

Different API errors need different messages. A RateLimitError is a transient condition the user can retry; an AuthenticationError means the key is wrong:

except anthropic.AuthenticationError:
    report.swing_mechanics = "❌ Authentication failed — check your Anthropic API key."
except anthropic.RateLimitError:
    report.swing_mechanics = "❌ Rate limit exceeded — please wait and retry."
except anthropic.APIConnectionError:
    report.swing_mechanics = "❌ Network error — check your internet connection."

Importantly, a failed Claude call does not abort the pipeline. The annotated video is always returned regardless of whether the coaching call succeeds.

Stage 8: Streamlit UI

The UI is straightforward. The sidebar holds the API key input, display toggles, and a stride slider. The main area has a file uploader and a single Analyze button.

Progress is communicated through Streamlit's native progress bar updated at each stage:

progress = st.progress(0, text="Initializing…")

# Step 1
progress.progress(10, text="Extracting frames…")
frames, fps, total_frames = extract_frames(input_path, stride=stride)

# Step 2
progress.progress(30, text="Running MediaPipe pose detection…")
pose_results = detector.detect_batch(frames)
...
progress.progress(100, text="Done!")

Results are split into two columns. The left column shows the annotated video with a download button. The right column uses Streamlit's st.tabs, one tab per coaching category plus a final Priorities tab:

tab_swing, tab_foot, tab_stance, tab_tactics, tab_prio = st.tabs(
    ["Swing", "Footwork", "Stance", "Tactics", "Priorities"]
)

Below both columns, an expandable raw metrics table shows the joint angle statistics as a pandas DataFrame, useful for players who want to track numbers across multiple sessions.

What I Learned

A few things that weren't obvious upfront:

MediaPipe's Tasks API is a breaking change. The old mp.solutions.pose.Pose class still exists but is no longer the recommended path. The new Tasks API requires an explicit .task model file and a monotonically increasing timestamp in video mode. Missing either of these silently produces zero detections.

OpenCV and browser video compatibility don't mix out of the box. The mp4v codec works for local playback but most browsers refuse to play it inline. Running an ffmpeg subprocess to re-encode to libx264 + yuv420p is the reliable solution.

Swing detection without a ground truth is hard. Wrist speed peaks work well for forehand and backhand groundstrokes but can miss serves (where the wrist speed profile is different) or produce false positives on defensive scrambles. A more robust approach would fine-tune a classifier on labeled swing data.

Claude's JSON reliability depends heavily on the system prompt. Adding "no prose outside the JSON" to the system prompt and providing the exact schema in the user prompt eliminated nearly all cases where the model wrapped its output in explanatory sentences.

Running It

pip install -r requirements.txt
cp .env.example .env   # add ANTHROPIC_API_KEY
streamlit run app.py

Upload any tennis video under 10 seconds, click Analyze, and the five-step pipeline completes in roughly 30–90 seconds depending on CPU speed and video length. The annotated video shows the skeleton and joint angles on every frame, with orange flashes marking detected swing events.

Conclusion

The project chains four off-the-shelf tools — MediaPipe, OpenCV, ffmpeg, and Claude — each used for exactly what it's designed for. MediaPipe handles the hard computer vision problem. OpenCV handles frame I/O. ffmpeg handles codec compatibility. Claude handles the reasoning over numbers. None of them are stretched beyond their core purpose.

The interesting engineering is in the glue: the visibility threshold logic that keeps partial occlusions from poisoning the stats, the normalized wrist speed metric that makes peak detection resolution-independent, and the prompt design that reliably produces structured JSON with metric-referenced feedback rather than generic coaching clichés.

The code is at github.com/gsarmaonline/tennis-coach.

Happy learning!

Please reach out with ideas or improvements.