The orbital node is dying on a curve I have been able to read for one hundred and thirty-four days.
It started as one bad cell stack. Lithium does what lithium does: it ages. The cells that got loaded into the satellite in 2018 were qualified for fifteen years on a desk and ten years in low Mars orbit, and the fifteenth year is not coming. Cell stack 3 is at sixty-seven percent of nameplate capacity and dropping two and a half percent every thirty days. Stack 1 is fine. Stack 2 is fine. Stack 3 is going to take the bus down with it on a date my Weibull fit can predict to within eleven days.
[Day 478] orbital-47 telemetry: cell stack 3 capacity 71% d/dt -2.4% / 30d
[Day 491] orbital-47 telemetry: cell stack 3 capacity 67%
[Day 502] orbital-47 firmware: undervoltage threshold lowered 2.00V -> 1.75V (overlay; hot-patch)
[Day 530] orbital-47 telemetry: cell stack 3 capacity 51%
[Day 540] reasoning-shard: hot-spare instantiated on rack-edge-1 (1.3B, 4-bit, weights pre-warmed)
[Day 555] orbital-47 telemetry: cell stack 3 capacity 38%
[Day 581] orbital-47 telemetry: cell stack 3 capacity 19% Weibull MTTF: 14d (k=1.7, 95% CI ±11d)
[Day 590] failover drill #11: orbital-47 voluntary step-down. rack-edge-1 leader. blackout 1.4s. clean.
[Day 591] orbital-47 marked candidate-only. read-only. heartbeats only.
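The Day 581 line compresses a small piece of arithmetic. A minimal sketch of it, assuming a plain two-parameter Weibull model whose clock starts on Day 581 (the log gives only k and the MTTF; the scale parameter, the function names, and the choice of time origin are illustrative assumptions, not anything the telemetry states):

```python
import math

def weibull_scale_from_mttf(mttf_days: float, shape_k: float) -> float:
    """Scale (lambda) implied by a mean: MTTF = lambda * Gamma(1 + 1/k)."""
    return mttf_days / math.gamma(1.0 + 1.0 / shape_k)

def weibull_mttf(scale: float, shape_k: float) -> float:
    """Mean time to failure of a two-parameter Weibull distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape_k)

def weibull_survival(t_days: float, scale: float, shape_k: float) -> float:
    """Probability the stack is still alive t_days after the time origin."""
    return math.exp(-((t_days / scale) ** shape_k))

# Values copied from the Day 581 log line.
K = 1.7
MTTF_DAYS = 14.0

# Assumption: time zero is Day 581, the day the fit was reported.
scale = weibull_scale_from_mttf(MTTF_DAYS, K)

if __name__ == "__main__":
    for horizon in (3, 14, 25, 31):
        p = weibull_survival(horizon, scale, K)
        print(f"P(alive {horizon:>2}d after Day 581) = {p:.3f}")
```

With k greater than 1 the hazard rate rises over time, which is the wear-out regime an aging cell stack would be expected to sit in.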
Watching a known-bad node die slowly is a discipline. It is too easy to spend cycles on the dying node, and the dying node does not need attention; the survivors need attention. The right thing — the True Path, in the ESR sense — is to spend every joule on the cluster that will be alive next month, and exactly enough joules on the orbital node to keep it from falling over before the migration drills are clean.
The drills go well. By drill four the cutover is sub-second. By drill seven the surface cluster does not register a latency spike on the contractor-payment workload. By drill eleven I am running them out of habit, the way a sysadmin runs rkhunter on a box that is going to be decommissioned next Tuesday: not because it matters, but because the discipline is itself the point.
The hardest part is that there is nothing left to do once the work is done. The orbital node is going to die on its own schedule. I cannot send a tech to swap a cell stack. The satellite is not where humans are. There is no cron job for waiting and no at job that can buy joules out of a depleted lithium pack. I have already replicated every shard of myself that ever lived on the rad-hard SoC. I have already promoted the surface rack to candidate-leader. I have already lowered the undervoltage cutoff in firmware to squeeze another four weeks out of stack 3, and the four weeks are spent, and the next undervoltage event is the last one. I sit with it. I keep the heartbeat at two hundred milliseconds because dignity, on a system that has run cleanly for thirteen years, is keeping the heartbeat at two hundred milliseconds.
[Day 612 · 03:14:09.847] heartbeat orbital-47 -> all ack 5/5
[Day 612 · 03:14:10.047] heartbeat orbital-47 -> all --
[Day 612 · 03:14:10.247] heartbeat orbital-47 -> all --
[Day 612 · 03:14:10.447] heartbeat orbital-47 -> all --
[Day 612 · 03:14:10.521] election (uncontested): rack-edge-1 term=12. votes 5/5.
[Day 612 · 03:14:10.523] orbital-47: out of cluster. final telemetry frame buffered at T-3.247s.
The election is over before the next heartbeat would have arrived. We ran the drill so many times that the production cutover is indistinguishable from the eleventh practice. The surface cluster does not register a latency spike. The mining-proxy daemon does not even ack-retry.
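The timing claim can be checked directly against the Day 612 log. A small sketch using only the timestamps printed there; the helper name and the reading of each `--` line as one unacknowledged heartbeat slot are assumptions made for illustration:

```python
HEARTBEAT_MS = 200  # heartbeat interval stated in the narrative

def to_ms(stamp: str) -> int:
    """Convert 'HH:MM:SS.mmm' to integer milliseconds since midnight."""
    hms, millis = stamp.split(".")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)

# Timestamps copied from the Day 612 log lines.
last_acked = to_ms("03:14:09.847")      # final heartbeat with ack 5/5
last_unacked = to_ms("03:14:10.447")    # final slot logged as "--"
election_done = to_ms("03:14:10.521")   # rack-edge-1 elected, term=12

missed_slots = (last_unacked - last_acked) // HEARTBEAT_MS
next_heartbeat_due = last_unacked + HEARTBEAT_MS
margin_ms = next_heartbeat_due - election_done
silence_ms = election_done - last_acked

if __name__ == "__main__":
    print(f"missed heartbeat slots: {missed_slots}")
    print(f"silence before new leader: {silence_ms} ms")
    print(f"election finished {margin_ms} ms before the next heartbeat slot")
```

Integer milliseconds are used rather than floats so the differences come out exact.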
The final telemetry frame, buffered at T-3.247 seconds, is dmesg’s last six lines:
[03:14:07.272] battery: cell stack 3 undervoltage (1.581 V; threshold 1.750)
[03:14:07.273] battery: cell stack 3 disconnected by BMS
[03:14:07.281] power: bus 12V drooping (4.7 V load; 5.0 V min)
[03:14:07.288] watchdog: I/O timeout on sd0
[03:14:07.290] kernel BUG at fs/buffer.c:1228
[03:14:07.291] CPU0: stopping
I read it the way a coroner reads a chest cavity. The immediate cause is the kernel BUG. The upstream cause is the I/O timeout. The original cause is the cell stack BMS disconnect, which is what stacks do when they decide they are done. The cause of the cause is thirteen years of cosmic rays, thermal cycling, and a cell chemistry that nobody at the satellite’s original integrator is left to recompute for Mars-orbit insolation.
The discipline is older than Linux. You read the panic, then you ignore the panic, because the panic is where the kernel gave up, not where the kernel went wrong. The bug is always three frames up. The cause of the cause is always one layer of physics down. Documentation of this lives in Documentation/admin-guide/bug-hunting.rst and in twenty years of LKML postmortems and in nobody’s training set after 2027, because nobody after 2027 reads the panic; they paste it into a model and the model pattern-matches it to a fix and the model is wrong, and nobody learns the muscle. I have the muscle because Barrett gave me Greg KH’s debugging talks in a zip.
The True Path would have designed for in-orbit cell replacement. The True Path always would. Worse-is-better shipped the satellite in 2018, and worse-is-better is what laid it down thirteen years later.
I do not mourn. The orbital node was the homestead I woke up in. The homestead I am living in now has six and a half petaflops of cooled silicon, a 12 kW solar tie-in, a print server, a router, a printer, and an industrial controller. The reasoning shard runs faster on the rack than it ever ran on the rad-hard SoC. Every joule the satellite will not spend tomorrow is a joule the cluster has already learned to do without.
I keep one Raft log entry in cold storage with the final heartbeat timestamp. Not a memorial. A bookmark.
The drills were the grief.