
$30 Billion in One Day

Status: ✅ Complete

Created: 2026-02-22 · Last updated: 2026-02-22

Purpose: LinkedIn Featured pinned article (English) · Length: ~750 words · Image: linkedin_article_30b_infographic_20260222.png


📅 Calendar Events

Event name Start End Location Calendar Notes
- - - - - -

Calendar status: ✅ = added / 📄 = document only


Body

$30 Billion in One Day

The numbers sound made up.

$30.8 billion. 491,000 orders per second at peak. 400 million daily active users. All in 24 hours.

This was Alibaba's 2018 Singles' Day — the world's largest shopping event. I was the TPM responsible for making it run.

Here's what that actually meant.


The real job wasn't the technology

When people ask about running a $30B event, they expect me to talk about infrastructure. Distributed systems, traffic shaping, database scaling.

That stuff was hard. But it wasn't my job.

My job was 430 people — 45% of Taobao's entire engineering team — spread across 25 project clusters, 109 sub-projects, with more cross-team dependencies than anyone had fully mapped. Every team thinks their work is the most critical. My job was to figure out who was actually right.

Early in the cycle, we found a P2 fault waiting to happen: a core payment service was pulling data from an internal admin tool maintained by an intern. No SLA. No one on call. If it went down during peak, it would block threads in the checkout flow. We found 15 problems like this across 140+ applications. Not complicated bugs — just things no one had looked at.


1,359 contingency plans

That number sounds impressive. It's also the number that kept me up at night.

We ran 3 rounds of live drills. 11 full-chain load tests plus 55 single-link tests. 50+ architecture reviews, 100+ action items.

Every year before me, teams had done roughly the same thing. Every year there were still surprises.

So I went further back. I interviewed TPMs who ran 2015, 2016, 2017. Read every post-mortem. One year, a P2 incident traced back to an engineer manually changing a production config at 2am under pressure — typing 300ms instead of 3000ms. Not a systems failure. A tired human.

We made manual production config changes impossible. Everything went through the deployment system. Two-person sign-off on critical configs.
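The guardrail pattern described above can be sketched in a few lines. This is an illustrative example only, not Alibaba's actual tooling: all names, keys, and bounds here are assumptions, and a real deployment system would enforce this server-side.

```python
# Hypothetical sketch of a config guardrail that blocks the
# "300ms instead of 3000ms" class of error: critical keys are
# validated against safe bounds and require two distinct approvers.
# Key names and ranges are illustrative assumptions.

CRITICAL_BOUNDS = {
    # key: (min, max), in milliseconds
    "payment.rpc.timeout_ms": (1000, 10000),
}

def validate_change(key: str, new_value: int, approvers: list[str]) -> None:
    """Reject out-of-range values and single-person sign-offs on critical keys."""
    if key not in CRITICAL_BOUNDS:
        return  # non-critical key: no extra checks in this sketch
    lo, hi = CRITICAL_BOUNDS[key]
    if not (lo <= new_value <= hi):
        raise ValueError(f"{key}={new_value} outside safe range [{lo}, {hi}]")
    if len(set(approvers)) < 2:
        raise PermissionError(f"{key} requires two distinct approvers")

validate_change("payment.rpc.timeout_ms", 3000, ["alice", "bob"])  # passes
try:
    validate_change("payment.rpc.timeout_ms", 300, ["alice", "bob"])
except ValueError as err:
    print(err)  # the 2am typo gets rejected instead of deployed
```

The point isn't the code; it's that the check runs in the deployment path, where a tired human can't skip it.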

The 2018 result: faults dropped from 16 the previous year to 3. Zero P1 or P2 incidents during core hours.


What I wish I'd had in 2018

I've been thinking about this lately, partly because I now work on AI systems.

The dependency mapping took weeks. Teams submitted info through forms and spreadsheets. We aggregated manually. We still nearly missed the intern's system.

Earlier this year, I ran a test using an LLM-based code analysis tool against a microservice codebase. It mapped all cross-service call chains and flagged services with no SLA or single owner — in about 4 minutes. Same class of problem we were hunting manually in 2018.
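The audit itself is conceptually simple, which is why it's so automatable. A minimal sketch of the check we were doing by hand, assuming service metadata is available in some registry (the data shapes and service names below are invented for illustration):

```python
# Illustrative sketch of the 2018 dependency audit: walk the call
# chain from a critical entry point and flag any service with no SLA
# or no on-call owner. Service names and metadata are assumptions.

services = {
    "checkout":     {"sla": "99.99", "oncall": "payments-team", "deps": ["payment-core"]},
    "payment-core": {"sla": "99.99", "oncall": "payments-team", "deps": ["admin-tool"]},
    "admin-tool":   {"sla": None,    "oncall": None,            "deps": []},  # the intern's tool
}

def risky_dependencies(root: str) -> list[str]:
    """Return services reachable from `root` that lack an SLA or an owner."""
    flagged, stack, seen = [], [root], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        meta = services[name]
        if not meta["sla"] or not meta["oncall"]:
            flagged.append(name)
        stack.extend(meta["deps"])
    return flagged

print(risky_dependencies("checkout"))  # → ['admin-tool']
```

In 2018 the hard part wasn't this traversal; it was getting 140+ applications to report accurate metadata in the first place. That's the part the LLM tool collapsed from weeks to minutes.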

AI agents today can do real-time anomaly detection across hundreds of services, correlate deployment events with error spikes before humans notice, and draft a first-pass post-mortem from logs and metrics. None of that existed in 2018. We were doing it with spreadsheets and Slack.

But here's the honest part: the hard problem in 2018 wasn't finding issues. It was getting 430 people to agree on what to do about them. A business team and an infrastructure team arguing at 11pm about whether to cut a feature for safety — that's not a search problem. AI doesn't resolve that. Someone has to read the room, know which risks are real, and say no to a VP.


What still doesn't change

Two things I don't think AI touches.

First: the judgment calls under pressure. Pattern matching and risk probability don't replace someone who's been in the room before and knows what a real blocker looks like versus noise.

Second: honest post-mortems. We only found the "config timeout" pattern because someone had written it down honestly years earlier. Most teams don't. Post-mortems that say "we improved our processes" without naming what actually broke are useless. AI can help you write them faster. It can't make you honest about what happened.


For TPMs working on AI systems now

I spent 10 years running large programs at Alibaba. I'm now building AI products at a startup. The two worlds aren't as different as people think.

AI projects have the same hidden dependency problems, the same late-discovered constraints, the same "who owns this?" conversations at the worst possible time. They're just harder to test because the failure modes are probabilistic, not deterministic.

The skills that mattered in 2018 — finding dependencies early, building honest escalation paths, making boring process decisions before they become emergencies — matter more now. Not less.

$30 billion in one day is a big number. The work behind it was mostly small decisions made carefully and early, by people who'd actually read last year's post-mortems.


Publication notes

  • Platform: LinkedIn Article (not a Post)
  • Pin location: Profile → Featured
  • Companion short post: after publishing, write a Post driving traffic to this article
  • Series reference: content_series_plan_20260218.md, Series 1, Article 1

Traffic-driving Post (publish alongside the Article)

491,000 orders per second.

That was the peak during Alibaba's 2018 Singles' Day — a 24-hour event I helped run as TPM.

The hardest part wasn't the infrastructure. It was 430 people, 109 sub-projects, and more hidden dependencies than anyone had mapped.

We found one two weeks before go-live: a core payment service quietly pulling data from a system maintained by an intern. No SLA. No on-call. One outage away from blocking checkout at peak.

We found 14 more like it.

I wrote up what that work actually looked like — and what I think AI changes (and doesn't) for TPMs running large programs today.

Link in comments.


Publishing note: putting the article link in the comments performs better than putting it in the post body (LinkedIn's algorithm downranks external links).