Case Study — Personal Project

KWB — Kalshi Weather Trading Bot

An automated, paper-first trading bot that prices a single Kalshi prediction market — what will NYC's high temperature be today? — by blending four weather models into a probability distribution, and bets only when it thinks the market is mispriced.

4 models
Independent weather sources blended into one distribution
~570 tests
Unit tests, plus mypy --strict and automated lint
6
Bot instances running in parallel — 1 live, 5 paper experiments
~9/day
Trading cycles, fully serverless on AWS Lambda

What it is

A single-purpose, fully serverless trading bot for one Kalshi market series: the NYC daily-high-temperature buckets, settled off the Central Park station. It blends multiple weather models into a probability distribution, prices every temperature bucket the market offers, and bets only where it sees a real edge. The decision logic is entirely deterministic — there is no LLM in the trading loop. AI was an engineering collaborator for code review and analysis, never a live trader.

How it works

Pull hourly forecasts from four independent weather sources — NWS, ECMWF, GFS, and NOAA HRRR.
Blend them into a single probability distribution over the day's high temperature, widening the uncertainty where the models disagree.
Convert that distribution into a probability for each temperature bucket the market offers.
Compare those probabilities to live market prices and place small, fractional-Kelly-sized bets only where the modeled edge clears a threshold.
Hold to settlement and book realized P&L when the official climate report finalizes the day's high.

What the research found

Realistic fills decide everything
A deep dive mined ten weeks of stored order books for the rules the bot should have traded. Four came back with healthy paper profits, then all four died the moment a resting order counted as filled only when the market actually traded through its price. The most seductive had a 98% paper win rate and was pure adverse selection: your bid gets hit precisely when someone watching the thermometer knows something you don't.
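The difference between the two fill assumptions can be illustrated with a minimal sketch. The Trade record, the function names, and the use of a trade tape (rather than the stored order books the research actually mined) are assumptions made for illustration; the point is only that a strict "traded through" rule rejects fills that a lenient "touched" rule accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    ts: int     # seconds since epoch (hypothetical field)
    price: int  # trade price in cents (hypothetical field)


def filled_if_touched(bid: int, placed_at: int, trades: list[Trade]) -> bool:
    """Lenient paper rule: any later trade at or below the bid counts as a fill.

    This ignores queue position, so it credits fills that orders resting
    ahead of ours at the same price would have absorbed.
    """
    return any(t.ts > placed_at and t.price <= bid for t in trades)


def filled_if_traded_through(bid: int, placed_at: int, trades: list[Trade]) -> bool:
    """Strict rule: a later trade must print strictly below the bid.

    If the market traded through our price, every order resting at that
    price, including ours, must have been executed first.
    """
    return any(t.ts > placed_at and t.price < bid for t in trades)
```

Under the strict rule, the fills that remain are the ones where the price moved past the bid, which is exactly the adverse-selection case: the order is executed when the market is already moving against it.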
This market prices weather efficiently
The working thesis was that the model knew which way the market was wrong. Tested truly out of sample, on six other cities' markets it had never seen, that information edge was statistically indistinguishable from zero and uncorrelated with forecast accuracy. Quadruple-confirmed: you can't out-forecast a crowd reading the same public forecasts.
Even "certain" isn't certain
By late afternoon the day's high looks known — so surely the dying buckets sell at a discount? No: physically impossible buckets are already at 99¢ by 5pm. And the official settlement landed above the hourly observations more than half the time, because the settlement thermometer catches spikes the hourly feed misses.
Pre-registration is the whole game
Every rule was frozen — thresholds, success bar, and kill criteria locked before results were read. That discipline caught one rule "passing" a test it had actually failed in the fitting window, and turned a week of dead ends into cheap insurance instead of slow, real losses.
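One way to make that freeze mechanical is sketched below, purely as an assumption about how such a gate could look: the PreRegistration fields, the fingerprint method, and the verdict function are hypothetical names, and the P&L-based success bar is an illustrative choice. The rule's parameters are stored in an immutable record, a SHA-256 fingerprint is recorded before any results are read, and the verdict refuses to run if the parameters no longer match that fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreRegistration:
    rule_name: str
    min_edge: float         # entry threshold, frozen before testing
    success_min_pnl: float  # success bar: minimum P&L to count as a pass
    kill_max_loss: float    # kill criterion: loss at which the rule is dropped

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the frozen parameters."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def verdict(reg: PreRegistration, recorded_fingerprint: str, pnl: float) -> str:
    """Return 'pass', 'fail', or 'kill' against the pre-committed bars."""
    if reg.fingerprint() != recorded_fingerprint:
        raise ValueError("rule parameters changed after registration")
    if pnl <= -reg.kill_max_loss:
        return "kill"
    return "pass" if pnl >= reg.success_min_pnl else "fail"
```

Because the dataclass is frozen, tuning a threshold after the fact means constructing a new record, which produces a different fingerprint and makes the change visible.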
One idea survived — and it wasn't the model's
A human hunch — stop trying to out-know the market; get paid small amounts to ride what it's already converging toward — passed the same adversarial gauntlet that killed everything else. It provides liquidity rather than forecasts, and contains no weather model at all. Fittingly, it doesn't work in busy NYC; the edge, if it's real, lives in quieter markets.

Built like production infrastructure

Treated as production trading infrastructure, not a prototype: ~570 unit tests, mypy --strict, and automated linting on every change.
Fully serverless on AWS — Lambda (Python 3.12), DynamoDB, CloudWatch, and SES — deployed via SAM / CloudFormation on fixed cron cycles.
Deliberately dependency-light: httpx, pydantic, and numpy, with the needed statistics implemented directly to avoid pulling in scipy and keep the Lambda artifact small.
Safety designed in from day one: paper-trading by default, post-only limit orders that never cross the spread, and six independent kill switches — drawdown, daily-loss, anomaly, API-failure, manual, and inception-drawdown.
Disciplined review caught subtle correctness bugs early — including pricing the next day's market against the current day's forecast — long before real capital was at stake.
Six instances run in parallel — one live, five paper experiments — so every new idea is proven on paper before it touches real capital.

Closing one phase, opening the next

These are the first returns at the end of a phase, not the end of the project. Phase one — trying to out-forecast the NYC market — reached an honest verdict: in a book this liquid, the crowd is very hard to beat, so the bot now trades only when it profoundly disagrees with the market, which is almost never. It has stopped bleeding — sometimes the smartest thing an automated trader can do is almost nothing, verifiably. Meanwhile the one idea that survived is still under test behind a pre-committed pass/fail gate, and it points somewhere new: a fresh experiment, already underway, that hunts for the same kind of edge across many cities' weather markets rather than fighting for it in NYC. The learning loop is very much still running.

The real yield of phase one wasn't P&L — it was a research pipeline that can take a beloved idea, run it through an adversarial gauntlet, and kill it in a week without losing real money. That's the asset that compounds into the next experiment.
