Standard Intelligence has built the largest computer-use video dataset in the industry: 11 million hours of raw screen data. Their first foundation model, FDM-1, is trained directly on that footage, predicting mouse movements, clicks, and keystrokes from pixels rather than text tokens. The result is a general agent that can model a CAD gear in Blender, drive a car around San Francisco after one hour of fine-tuning, and debug software by exploring its state space.
The technical execution behind this is what makes the piece worth reading in full. Their video encoder is roughly 50 times more token-efficient than competing approaches, fitting nearly two hours of 30 FPS video into a 1-million-token context window. They racked a 30-petabyte storage cluster in San Francisco for under $500,000, roughly one-twentieth of hyperscaler pricing. Founders Galen Mead, 21, and Devansh Pandey, 20, met at the Atlas Fellowship in 2022 and left their undergraduate programs to build the company. The full team is six people.
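Those two efficiency claims are easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, using only the figures stated in the piece; the per-frame and per-terabyte numbers are derived here, not quoted from the source:

```python
# Back-of-envelope check of the article's efficiency claims.
# Inputs are the piece's figures; derived values are assumptions.

SECONDS = 2 * 3600          # ~two hours of footage
FPS = 30                    # frame rate cited in the piece
CONTEXT_TOKENS = 1_000_000  # context window size

frames = SECONDS * FPS                      # 216,000 frames
tokens_per_frame = CONTEXT_TOKENS / frames  # ~4.6 tokens per frame

# Storage: 30 PB for under $500,000 works out to under $17 per TB.
cost_per_tb = 500_000 / (30 * 1000)  # 30 PB = 30,000 TB

print(f"{tokens_per_frame:.1f} tokens/frame, ${cost_per_tb:.2f}/TB")
```

At roughly 4.6 tokens per frame, a 50x less efficient encoder would spend on the order of 230 tokens per frame, which is in the ballpark of conventional patch-based video tokenizers, so the claimed ratio is at least internally plausible.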
The thesis is direct: video pre-training on computer use is the only approach that can truly scale action data, the same logic Tesla applied to self-driving, now pointed at knowledge work. Sequoia led the Series A alongside Spark Capital. Whether this pre-training paradigm generalizes beyond FDM-1's current demos is the open question the original piece does not fully resolve, and that tension is worth sitting with.
[READ ORIGINAL →]