Technical video content has a persistent problem: a speaker verbally describes a system architecture, an algorithm, or a database schema while the screen shows nothing that helps the viewer build a mental model. Adding visual overlays manually requires authoring diagrams, timing them to the transcript, and re-editing the video — work that rarely gets done.
technify-motions automates this end-to-end. Given a video file, it produces an output where animated diagram slides — flowcharts, bullet-point summaries, code snippets — appear in sync with the exact moment each technical concept is explained. The entire pipeline runs locally except for two LLM calls, costing roughly $0.20 per 30-minute video.
## The Pipeline
The system is a six-stage Python pipeline:
```
Video/Audio → Audio Extraction → Transcription → Scene Classification
            → Slide Generation → Rendering → Video Composition
```
Each stage writes its output to a ./work/ directory. Stages can be individually cached and re-run, which matters during development when only one stage is changing.
### Stage 1: Audio Extraction
FFmpeg extracts and normalizes the audio track to 16 kHz mono PCM WAV:
```bash
ffmpeg -i lecture.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 work/audio.wav
```
Whisper's models expect this format. Normalizing upfront avoids silent failures where a model receives a stereo 44.1 kHz stream and produces degraded results.
### Stage 2: Transcription with faster-whisper
faster-whisper is a CTranslate2-based reimplementation of OpenAI Whisper that runs entirely locally. The transcription call enables two key options:
- `word_timestamps=True` — produces segment-level start/end times in seconds, used later to anchor diagram clips to the timeline
- `vad_filter=True` — applies Silero VAD with a 500 ms minimum silence threshold, stripping silence before segments reach the model
On CPU, the model runs with int8 quantization; float16 is used when a CUDA device is available. The small model balances accuracy and speed for CPU-only environments; large-v3 is recommended when a GPU is available.
Each segment is serialized as TranscriptSegment(start, end, text) and written to work/transcript.json. On subsequent runs with --use-cache, this file is deserialized directly, avoiding a re-transcription pass.
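The segment record and its cache round-trip can be sketched as follows; the dataclass fields match the text, while the helper names (`save_transcript`, `load_transcript`) are illustrative rather than the project's actual API:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class TranscriptSegment:
    start: float  # seconds from the start of the video
    end: float
    text: str

def save_transcript(segments, path="work/transcript.json"):
    """Serialize segments to the work directory as a JSON array."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps([asdict(s) for s in segments]))

def load_transcript(path="work/transcript.json"):
    """Deserialize a cached transcript, skipping re-transcription."""
    return [TranscriptSegment(**d) for d in json.loads(Path(path).read_text())]
```

Because the cache is plain JSON, a bad scene classification can be debugged by editing `work/transcript.json` by hand and re-running downstream stages.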
### Stage 3: Technical Scene Classification
The full transcript — with indices, timestamps, and text — is sent to Claude in a single API call. The prompt instructs the model to identify consecutive segments that discuss something worth visualizing: systems, algorithms, data models, architecture decisions, workflows. The scope is deliberately wide; a speaker critiquing a flawed database schema is as worth visualizing as one teaching a new one.
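Formatting the transcript for that single call might look like the sketch below; the exact prompt wording and line format are assumptions, not the project's actual prompt:

```python
def build_classification_prompt(segments):
    """Render the transcript as 'index [start-end] text' lines for one LLM call."""
    lines = [f"{i} [{s['start']:.1f}-{s['end']:.1f}] {s['text']}"
             for i, s in enumerate(segments)]
    return ("Identify consecutive segments that describe something worth "
            "visualizing (systems, algorithms, data models, workflows). "
            "Return a JSON array of scenes.\n\n" + "\n".join(lines))
```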
Claude returns a JSON array of scenes:
```json
[
  {
    "start": 142.3,
    "end": 198.7,
    "segment_indices": [23, 24, 25, 26],
    "content_type": "architecture",
    "description": "Explaining how requests flow through the service mesh"
  }
]
```
The response is parsed with three fallback strategies: direct JSON parse, extraction from a markdown code fence, and a regex search for the first [...] array in the response. This tolerates outputs where the model wraps its JSON in explanatory prose.
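The three fallback strategies can be sketched as a single function (the function name is illustrative):

```python
import json
import re

def parse_scene_json(raw: str):
    """Parse an LLM response that should contain a JSON array of scenes."""
    # Strategy 1: the response is bare JSON
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strategy 2: the JSON is wrapped in a markdown code fence
    fence = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # Strategy 3: grab the first [...] array anywhere in the text
    arr = re.search(r"\[.*\]", raw, re.DOTALL)
    if arr:
        return json.loads(arr.group(0))
    raise ValueError("no JSON array found in model response")
```

Each strategy only fires when the previous one fails, so well-behaved responses pay no extra cost.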
### Stage 4: Slide Generation
Each scene is sent to the LLM with its transcript text and a request for 1–3 typed slides. Three slide types are supported:
- Graph — nodes and directed edges for flowcharts, architectures, and system relationships
- Bullets — key points, trade-offs, comparisons, or summaries
- Code — concrete syntax, SQL, CLI commands, YAML config
The model outputs a JSON array. Each slide object is validated against a strict schema before being accepted. A graph slide, for example, requires every edge's from and to fields to reference a valid node id. If validation fails, the error is appended to the prompt — "Your previous attempt was invalid: slides[0] edges[1].to 'cache' is not a known node id. Please fix it." — and the model retries. Up to three attempts are made per scene.
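A minimal sketch of the graph-slide check, producing error strings in the shape the retry prompt uses (the function name and slide dict layout are assumptions):

```python
def validate_graph_slide(slide: dict) -> list[str]:
    """Return a list of validation errors for a graph slide (empty = valid)."""
    errors = []
    node_ids = {n["id"] for n in slide.get("nodes", [])}
    for i, edge in enumerate(slide.get("edges", [])):
        # Every edge endpoint must reference a declared node id
        for field in ("from", "to"):
            ref = edge.get(field)
            if ref not in node_ids:
                errors.append(f"edges[{i}].{field} {ref!r} is not a known node id")
    return errors
```

Feeding the error strings back verbatim gives the model a precise, machine-checkable target for its retry, rather than a vague "invalid output" signal.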
Once a valid response is received, the scene's time window is divided evenly across the number of slides. Each slide receives a slide_start and slide_end that override the parent scene timestamps, allowing a single 60-second scene to be split into, say, a 20-second graph, a 20-second bullets slide, and a 20-second code slide.
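The even split is a one-liner worth of arithmetic; a sketch (function name assumed):

```python
def assign_slide_windows(scene_start: float, scene_end: float, n_slides: int):
    """Divide a scene's time window evenly across its slides.

    Returns a list of (slide_start, slide_end) pairs in seconds.
    """
    step = (scene_end - scene_start) / n_slides
    return [(scene_start + i * step, scene_start + (i + 1) * step)
            for i in range(n_slides)]
```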
### Stage 5: Rendering with Remotion
Remotion renders each slide to a duration-matched MP4 by driving a headless Chromium instance through React. Three TypeScript compositions handle the three types: FlowchartAnimation, BulletsSlide, and CodeSlide. Each composition receives the slide payload as props and drives its own animation using Remotion's spring and interpolate primitives.
Node layout for graph slides is computed by dagre, which handles node positioning and edge routing given only the graph topology.
Props are written to a temp file rather than passed as a shell argument, avoiding length limits on large graph payloads:
```bash
remotion render src/index.ts FlowchartAnimation output.mp4 \
  --props=props.json \
  --concurrency=4 \
  --log=error
```
Rendering is parallelized across slides using a ThreadPoolExecutor with four workers. Each Remotion call spawns its own Chromium process, so parallelism is kept low to avoid exhausting memory.
npm dependencies for the Remotion project are installed automatically on the first run via npm install --prefer-offline, gated behind a thread lock so concurrent renders do not trigger multiple installs.
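The lock-gated install and the bounded render pool can be sketched like this; the function names are illustrative, and the actual commands are elided behind callables:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_npm_lock = threading.Lock()
_npm_installed = False

def ensure_npm_installed(run=lambda: None):
    """Run `npm install` at most once, even when renders start concurrently."""
    global _npm_installed
    with _npm_lock:
        if not _npm_installed:
            run()  # e.g. subprocess.run(["npm", "install", "--prefer-offline"], check=True)
            _npm_installed = True

def render_all(slides, render_one, workers=4):
    """Render slides in parallel; each call spawns a Chromium, so keep workers low."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_one, slides))
```

Holding the lock for the duration of the install means late-arriving render threads block until `node_modules` exists, rather than racing a half-finished install.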
### Stage 6: Video Composition
The final stage overlays each rendered MP4 clip onto the source video at its exact time window. Three modes are supported.
PIP (picture-in-picture) — the default. The diagram is scaled to 40% of the source width and positioned 20 pixels from the bottom-right corner. All overlays are expressed in a single filter_complex chain so ffmpeg makes one pass over the entire video:
```
[1:v]scale=<pip_w>:-2[pip0];
[0:v][pip0]overlay=W-w-20:H-h-20:enable='between(t,142,199)'[v0];
[2:v]scale=<pip_w>:-2[pip1];
[v0][pip1]overlay=W-w-20:H-h-20:enable='between(t,310,365)'[vout]
```
The between(t,start,end) expression in the overlay filter controls visibility — the diagram clip loops if shorter than the window, and the original video shows through outside the window.
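Generating that chain for an arbitrary number of clips is straightforward string assembly; a sketch, with the function name and the 768-pixel default (40% of a 1920-wide source) assumed:

```python
def build_pip_filter(clips, pip_w=768, margin=20):
    """Build one filter_complex chain overlaying each clip in its time window.

    clips: list of (input_index, start_s, end_s); input 0 is the source video.
    """
    parts, last = [], "0:v"
    for n, (idx, start, end) in enumerate(clips):
        out = "vout" if n == len(clips) - 1 else f"v{n}"
        # Scale the diagram input, then overlay it on the running chain
        parts.append(f"[{idx}:v]scale={pip_w}:-2[pip{n}]")
        parts.append(
            f"[{last}][pip{n}]overlay=W-w-{margin}:H-h-{margin}"
            f":enable='between(t,{start},{end})'[{out}]"
        )
        last = out
    return ";".join(parts)
```

Each overlay consumes the previous overlay's output label, so the chain stays linear no matter how many diagrams the video has.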
Side-by-side — the video is segmented at diagram boundaries. Non-diagram segments are re-encoded at source resolution. Diagram segments composite the source video on the left half and the diagram on the right half, both padded to half_w × src_h. All segments are concatenated with ffmpeg -f concat. Segments are re-encoded rather than stream-copied because stream copy seeks to the nearest keyframe, which shifts segment boundaries and breaks alignment at the concat stage.
Replace — the source is spliced out entirely during technical scenes. The timeline is built as a list of (timestamp, source_or_diagram_path) events, sorted and deduplicated into non-overlapping segments, each trimmed and re-encoded, then concatenated.
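The sort-and-dedupe step can be sketched as follows; the function name is illustrative, and the final segment (last switch to end of video) is left out for brevity:

```python
def build_segments(events):
    """Collapse (timestamp, path) switch events into non-overlapping segments.

    events: list of (time_s, media_path) marking which source plays from that time.
    Returns [(start, end, path)] for each consecutive pair of events.
    """
    # Dedupe by timestamp (later entries win), then order chronologically
    ordered = sorted(dict(events).items())
    return [(t0, t1, path)
            for (t0, path), (t1, _) in zip(ordered, ordered[1:])]
```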
## Caching and Iteration
Stages serialize their outputs to ./work/:
| File | Contents |
| ------------------------ | --------------------------------------- |
| transcript.json | Array of TranscriptSegment objects |
| scenes.json | Array of TechnicalScene objects |
| diagrams.json | Array of slide payloads with timestamps |
| diagrams/diagram_*.mp4 | Rendered Remotion clips |
Running with --use-cache skips any stage whose output file already exists. Classification and generation are the only stages that cost money, so caching them is important when iterating on rendering or composition.
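The skip-if-exists rule amounts to a small decorator; a sketch, with `cached_stage` and the JSON-only payload being assumptions rather than the project's actual API:

```python
import json
from pathlib import Path

WORK = Path("work")

def cached_stage(filename):
    """Skip a stage entirely when its output file already exists in ./work/."""
    def wrap(fn):
        def inner(*args, **kwargs):
            out = WORK / filename
            if out.exists():
                return json.loads(out.read_text())  # cache hit: no recompute
            WORK.mkdir(parents=True, exist_ok=True)
            result = fn(*args, **kwargs)
            out.write_text(json.dumps(result))
            return result
        return inner
    return wrap
```

Deleting a single file in `work/` then selectively invalidates one stage while everything upstream stays cached.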
## Cost Profile
| Stage            | Tool                        | Cost        |
| ---------------- | --------------------------- | ----------- |
| Transcription    | faster-whisper (local)      | Free        |
| Classification   | Claude API                  | ~$0.01–0.05 |
| Slide generation | Claude API                  | ~$0.10–0.15 |
| Rendering        | Remotion + Chromium (local) | Free        |
| Composition      | ffmpeg (local)              | Free        |
Total is approximately $0.20 per 30-minute video. The LLM is used narrowly: one call to identify scene boundaries, one call per scene to generate structured slide data. Everything else is deterministic local computation.
## Conclusion
The system chains five off-the-shelf tools — faster-whisper, Claude, Remotion, dagre, and ffmpeg — each handling exactly the problem it was designed for. None of them are used beyond their core purpose. The LLM is constrained to producing structured JSON rather than free-form content, and its outputs are validated and retried programmatically. The result is a pipeline that converts a raw lecture recording into a diagram-annotated video with no manual authoring.
Happy learning!