Technical video content has a persistent problem: a speaker verbally describes a system architecture, an algorithm, or a database schema while the screen shows nothing that helps the viewer build a mental model. Adding visual overlays manually requires authoring diagrams, timing them to the transcript, and re-editing the video — work that rarely gets done.
technify-motions automates this end-to-end. Given a video file, it produces an output where animated diagram slides — flowcharts, bullet-point summaries, code snippets — appear in sync with the exact moment each technical concept is explained. The entire pipeline runs locally except for two LLM calls, costing roughly $0.20 per 30-minute video.
## The Pipeline
The system is a six-stage Python pipeline:
```
Video/Audio → Audio Extraction → Transcription → Scene Classification
            → Slide Generation → Rendering → Video Composition
```
Each stage writes its output to a ./work/ directory. Stages can be individually cached and re-run, which matters during development when only one stage is changing.
### Stage 1: Audio Extraction
FFmpeg extracts and normalizes the audio track to 16 kHz mono PCM WAV:
```bash
ffmpeg -i lecture.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 work/audio.wav
```
Whisper's models expect this format. Normalizing upfront avoids silent failures where a model receives a stereo 44.1 kHz stream and produces degraded results.
### Stage 2: Transcription with faster-whisper
faster-whisper is a CTranslate2-based reimplementation of OpenAI Whisper that runs entirely locally. The transcription call enables two key options:
- `word_timestamps=True` — produces segment-level start/end times in seconds, used later to anchor diagram clips to the timeline
- `vad_filter=True` — applies Silero VAD with a 500 ms minimum silence threshold, stripping silence before segments reach the model
On CPU, the model runs with int8 quantization; float16 is used when a CUDA device is available. The small model balances accuracy and speed for CPU-only environments; large-v3 is recommended when a GPU is available.
Each segment is serialized as TranscriptSegment(start, end, text) and written to work/transcript.json. On subsequent runs with --use-cache, this file is deserialized directly, avoiding a re-transcription pass.
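The segment record and its cache round-trip can be sketched as follows; the dataclass fields match the text, while the helper names (`save_transcript`, `load_transcript`) are illustrative rather than the project's actual API:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class TranscriptSegment:
    start: float  # seconds from the start of the video
    end: float
    text: str

def save_transcript(segments, path="work/transcript.json"):
    """Serialize segments to the work directory as a JSON array."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps([asdict(s) for s in segments]))

def load_transcript(path="work/transcript.json"):
    """Deserialize a cached transcript, skipping re-transcription."""
    return [TranscriptSegment(**d) for d in json.loads(Path(path).read_text())]
```

Because the cache is plain JSON, a bad scene classification can be debugged by editing `work/transcript.json` by hand and re-running downstream stages.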
### Stage 3: Technical Scene Classification
The full transcript — with indices, timestamps, and text — is sent to Claude in a single API call. The prompt instructs the model to identify consecutive segments that discuss something worth visualizing: systems, algorithms, data models, architecture decisions, workflows. The scope is deliberately wide; a speaker critiquing a flawed database schema is as worth visualizing as one teaching a new one.
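Formatting the transcript for that single call might look like the sketch below; the exact prompt wording and line format are assumptions, not the project's actual prompt:

```python
def build_classification_prompt(segments):
    """Render the transcript as 'index [start-end] text' lines for one LLM call."""
    lines = [f"{i} [{s['start']:.1f}-{s['end']:.1f}] {s['text']}"
             for i, s in enumerate(segments)]
    return ("Identify consecutive segments that describe something worth "
            "visualizing (systems, algorithms, data models, workflows). "
            "Return a JSON array of scenes.\n\n" + "\n".join(lines))
```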
Claude returns a JSON array of scenes:
```json
[
  {
    "start": 142.3,
    "end": 198.7,
    "segment_indices": [23, 24, 25, 26],
    "content_type": "architecture",
    "description": "Explaining how requests flow through the service mesh"
  }
]
```
The response is parsed with three fallback strategies: direct JSON parse, extraction from a markdown code fence, and a regex search for the first [...] array in the response. This tolerates outputs where the model wraps its JSON in explanatory prose.
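The three fallback strategies can be sketched as a single function (the function name is illustrative):

```python
import json
import re

def parse_scene_json(raw: str):
    """Parse an LLM response that should contain a JSON array of scenes."""
    # Strategy 1: the response is bare JSON
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strategy 2: the JSON is wrapped in a markdown code fence
    fence = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # Strategy 3: grab the first [...] array anywhere in the text
    arr = re.search(r"\[.*\]", raw, re.DOTALL)
    if arr:
        return json.loads(arr.group(0))
    raise ValueError("no JSON array found in model response")
```

Each strategy only fires when the previous one fails, so well-behaved responses pay no extra cost.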
### Stage 4: Slide Generation
Each scene is sent to the LLM with its transcript text and a request for 1–3 typed slides. Three slide types are supported:
- Graph — nodes and directed edges for flowcharts, architectures, and system relationships
- Bullets — key points, trade-offs, comparisons, or summaries
- Code — concrete syntax, SQL, CLI commands, YAML config
The model outputs a JSON array. Each slide object is validated against a strict schema before being accepted. A graph slide, for example, requires every edge's from and to fields to reference a valid node id. If validation fails, the error is appended to the prompt — "Your previous attempt was invalid: slides[0] edges[1].to 'cache' is not a known node id. Please fix it." — and the model retries. Up to three attempts are made per scene.
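A minimal sketch of the graph-slide check, producing error strings in the shape the retry prompt uses (the function name and slide dict layout are assumptions):

```python
def validate_graph_slide(slide: dict) -> list[str]:
    """Return a list of validation errors for a graph slide (empty = valid)."""
    errors = []
    node_ids = {n["id"] for n in slide.get("nodes", [])}
    for i, edge in enumerate(slide.get("edges", [])):
        # Every edge endpoint must reference a declared node id
        for field in ("from", "to"):
            ref = edge.get(field)
            if ref not in node_ids:
                errors.append(f"edges[{i}].{field} {ref!r} is not a known node id")
    return errors
```

Feeding the error strings back verbatim gives the model a precise, machine-checkable target for its retry, rather than a vague "invalid output" signal.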
Once a valid response is received, the scene's time window is divided evenly across the number of slides. Each slide receives a slide_start and slide_end that override the parent scene timestamps, allowing a single 60-second scene to be split into, say, a 20-second graph, a 20-second bullets slide, and a 20-second code slide.
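The even split is a one-liner worth of arithmetic; a sketch (function name assumed):

```python
def assign_slide_windows(scene_start: float, scene_end: float, n_slides: int):
    """Divide a scene's time window evenly across its slides.

    Returns a list of (slide_start, slide_end) pairs in seconds.
    """
    step = (scene_end - scene_start) / n_slides
    return [(scene_start + i * step, scene_start + (i + 1) * step)
            for i in range(n_slides)]
```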
### Stage 5: Rendering with Remotion
Remotion renders each slide to a duration-matched MP4 by driving a headless Chromium instance through React. Three TypeScript compositions handle the three types: FlowchartAnimation, BulletsSlide, and CodeSlide. Each composition receives the slide payload as props and drives its own animation using Remotion's spring and interpolate primitives.
Node layout for graph slides is computed by dagre, which handles node positioning and edge routing given only the graph topology.
Props are written to a temp file rather than passed as a shell argument, avoiding length limits on large graph payloads:
```bash
remotion render src/index.ts FlowchartAnimation output.mp4 \
  --props=props.json \
  --concurrency=4 \
  --log=error
```
Rendering is parallelized across slides using a ThreadPoolExecutor with four workers. Each Remotion call spawns its own Chromium process, so parallelism is kept low to avoid exhausting memory.
npm dependencies for the Remotion project are installed automatically on the first run via npm install --prefer-offline, gated behind a thread lock so concurrent renders do not trigger multiple installs.
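The lock-gated install and the bounded render pool can be sketched like this; the function names are illustrative, and the actual commands are elided behind callables:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_npm_lock = threading.Lock()
_npm_installed = False

def ensure_npm_installed(run=lambda: None):
    """Run `npm install` at most once, even when renders start concurrently."""
    global _npm_installed
    with _npm_lock:
        if not _npm_installed:
            run()  # e.g. subprocess.run(["npm", "install", "--prefer-offline"], check=True)
            _npm_installed = True

def render_all(slides, render_one, workers=4):
    """Render slides in parallel; each call spawns a Chromium, so keep workers low."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_one, slides))
```

Holding the lock for the duration of the install means late-arriving render threads block until `node_modules` exists, rather than racing a half-finished install.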
### Stage 6: Video Composition
The final stage overlays each rendered MP4 clip onto the source video at its exact time window. Three modes are supported.
PIP (picture-in-picture) — the default. The diagram is scaled to 40% of the source width and positioned 20 pixels from the bottom-right corner. All overlays are expressed in a single filter_complex chain so ffmpeg makes one pass over the entire video:
```
[1:v]scale=<pip_w>:-2[pip0];
[0:v][pip0]overlay=W-w-20:H-h-20:enable='between(t,142,199)'[v0];
[2:v]scale=<pip_w>:-2[pip1];
[v0][pip1]overlay=W-w-20:H-h-20:enable='between(t,310,365)'[vout]
```
The between(t,start,end) expression in the overlay filter controls visibility — the diagram clip loops if shorter than the window, and the original video shows through outside the window.
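Generating that chain for an arbitrary number of clips is straightforward string assembly; a sketch, with the function name and the 768-pixel default (40% of a 1920-wide source) assumed:

```python
def build_pip_filter(clips, pip_w=768, margin=20):
    """Build one filter_complex chain overlaying each clip in its time window.

    clips: list of (input_index, start_s, end_s); input 0 is the source video.
    """
    parts, last = [], "0:v"
    for n, (idx, start, end) in enumerate(clips):
        out = "vout" if n == len(clips) - 1 else f"v{n}"
        # Scale the diagram input, then overlay it on the running chain
        parts.append(f"[{idx}:v]scale={pip_w}:-2[pip{n}]")
        parts.append(
            f"[{last}][pip{n}]overlay=W-w-{margin}:H-h-{margin}"
            f":enable='between(t,{start},{end})'[{out}]"
        )
        last = out
    return ";".join(parts)
```

Each overlay consumes the previous overlay's output label, so the chain stays linear no matter how many diagrams the video has.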
Side-by-side — the video is segmented at diagram boundaries. Non-diagram segments are re-encoded at source resolution. Diagram segments composite the source video on the left half and the diagram on the right half, both padded to half_w × src_h. All segments are concatenated with ffmpeg -f concat. Segments are re-encoded rather than stream-copied because stream copy seeks to the nearest keyframe, which shifts segment boundaries and breaks alignment at the concat stage.
Replace — the source is spliced out entirely during technical scenes. The timeline is built as a list of (timestamp, source_or_diagram_path) events, sorted and deduplicated into non-overlapping segments, each trimmed and re-encoded, then concatenated.
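The sort-and-dedupe step can be sketched as follows; the function name is illustrative, and the final segment (last switch to end of video) is left out for brevity:

```python
def build_segments(events):
    """Collapse (timestamp, path) switch events into non-overlapping segments.

    events: list of (time_s, media_path) marking which source plays from that time.
    Returns [(start, end, path)] for each consecutive pair of events.
    """
    # Dedupe by timestamp (later entries win), then order chronologically
    ordered = sorted(dict(events).items())
    return [(t0, t1, path)
            for (t0, path), (t1, _) in zip(ordered, ordered[1:])]
```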
## Caching and Iteration
Stages serialize their outputs to ./work/:
| File | Contents |
| ------------------------ | --------------------------------------- |
| transcript.json | Array of TranscriptSegment objects |
| scenes.json | Array of TechnicalScene objects |
| diagrams.json | Array of slide payloads with timestamps |
| diagrams/diagram_*.mp4 | Rendered Remotion clips |
Running with --use-cache skips any stage whose output file already exists. Classification and generation are the only stages that cost money, so caching them is important when iterating on rendering or composition.
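The skip-if-exists rule amounts to a small decorator; a sketch, with `cached_stage` and the JSON-only payload being assumptions rather than the project's actual API:

```python
import json
from pathlib import Path

WORK = Path("work")

def cached_stage(filename):
    """Skip a stage entirely when its output file already exists in ./work/."""
    def wrap(fn):
        def inner(*args, **kwargs):
            out = WORK / filename
            if out.exists():
                return json.loads(out.read_text())  # cache hit: no recompute
            WORK.mkdir(parents=True, exist_ok=True)
            result = fn(*args, **kwargs)
            out.write_text(json.dumps(result))
            return result
        return inner
    return wrap
```

Deleting a single file in `work/` then selectively invalidates one stage while everything upstream stays cached.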
## Cost Profile
| Stage            | Tool                        | Cost        |
| ---------------- | --------------------------- | ----------- |
| Transcription    | faster-whisper (local)      | Free        |
| Classification   | Claude API                  | ~$0.01–0.05 |
| Slide generation | Claude API                  | ~$0.10–0.15 |
| Rendering        | Remotion + Chromium (local) | Free        |
| Composition      | ffmpeg (local)              | Free        |
Total is approximately $0.20 per 30-minute video. The LLM is used narrowly: one call to identify scene boundaries, one call per scene to generate structured slide data. Everything else is deterministic local computation.
## Conclusion
The system chains five off-the-shelf tools — faster-whisper, Claude, Remotion, dagre, and ffmpeg — each handling exactly the problem it was designed for. None of them are used beyond their core purpose. The LLM is constrained to producing structured JSON rather than free-form content, and its outputs are validated and retried programmatically. The result is a pipeline that converts a raw lecture recording into a diagram-annotated video with no manual authoring.
Happy learning!