Pre-Registration v2 — Do dormant whale wakeups precede negative forward BTC returns?

Status: FROZEN 2026-06-09 12:17 UTC. Binding. The cutoff, the materialised candidate cohort, and all parameters below are immutable; inference runs only on data with spent_at_timestamp strictly after the cutoff (§10). Supersedes v1 (prereg_whale_exchange_inflow_v1.md), which is retired: its exchange-inflow event process proved non-stationary (yearly count swung 1→20→115→24, CV≈1.1) and risked a structural INCONCLUSIVE that could not separate "no effect" from "the whale population migrated to OTC/ETF custody." The dormant-wakeup process is ~3× more stationary (CV≈0.28–0.40), essentially exogenous (same-day |return| only 1.09× baseline), and label-free. The choice of flagship is data-driven, not a matter of preference.

Registry: Swiss Whale Intelligence — Honest Research series, entry 1
Drafted / Frozen: 2026-06-09 · 2026-06-09 12:17 UTC — at the git commit that introduced this frozen version (immutable anchor; git log --follow docs/prereg_dormant_wakeup_v2.md).
Data cutoff: spent_at_block = 952966 / 2026-06-09 12:17 UTC. Candidate cohort frozen as table dormant_cohort_952966 = 4 483 unspent ≥ 500 BTC UTXOs (6 896 131 BTC) as of the cutoff; immutable. Prospective events = forward spends of these UTXOs with age-at-spend ≥ 2 y.
Pre-committed readout: earliest 2027-12-09 (cutoff + 18 months); actual readout at the first month-end on which n_eff ≥ 24 independent event-windows also holds (§9).
Frozen parameters: size ≥ 500 BTC · age ≥ 2 y · floor F = −3 pp · stop at n_eff ≥ 24 AND ≥ 18 months · block bootstrap block ≥ 7 d, ≥ 10 000 resamples.

0. Why this exists (the honest-research frame)

We pre-register, on data not yet seen, the test of a claim we would like to be true, and commit to publishing the result whatever it is — including a weak confirmation we'd find inconvenient, or, most likely, an honest "cannot say." This is a prospective event study, not a backtest: it measures the distribution of an outcome conditional on an event; it prescribes no trade. Testing forward dissolves look-ahead/label contamination. The event is a pure chain primitive (coin value + coin age) — no entity labels enter, so every verdict speaks unambiguously about the effect, not about our label coverage.

The expected outcome is INCONCLUSIVE. Demonstrating why meaningful predictive structure is hard to establish on on-chain whale events is the product — not a disappointment.

1. Primary hypothesis (H1) — the only confirmatory test

H1: A dormant whale wakeup — a single UTXO of ≥ 500 BTC (frozen) that had been unmoved for ≥ 2 years (frozen), being spent — is followed by a meaningfully negative abnormal BTC return over the next 7 days, relative to matched baseline days.

One hypothesis, one direction, one primary horizon. Narrowness — not a correction factor — is how we avoid the forking-paths garden (§7).

2. Event definition (chain-primitive, label-free, prospective)

An event is a spending transaction that consumes ≥1 UTXO satisfying ALL of:
  1. Size: the individual UTXO is ≥ 500 BTC (value_sats ≥ 5e10) (frozen).
  2. Age: spent_at_block − block_height ≥ 105 120 (~2 years) (frozen).
  3. Dedup: one event per spending transaction (spent_in_txid); a single wakeup that consumes many old UTXOs counts once.
  4. Frozen instrument (the constant-detector clause): eligibility is evaluated against a candidate cohort snapshotted at the cutoff (§10), i.e. every UTXO ≥ 500 BTC that is unspent as of the cutoff. An event is the forward spend of a cohort member with age-at-spend ≥ 2 y. The cohort is never expanded by post-cutoff back-fill: a large old UTXO that only enters whale_outputs after the cutoff is not eligible, even if later spent. This holds the detection instrument identical across both halves of the window, so §6 measures behaviour, not coverage drift. (A live-detection-latency rule is insufficient: the reference table keeps growing during the window, so the live detector itself drifts more complete; only a frozen cohort is constant.)

No labels, no clustering, no exchange attribution — only coin value and coin age from the UTXO set. An event counts only if its spent_at_timestamp is strictly after the frozen data cutoff (§10). Measured base rate (2022–2026): ~173/yr; see §12 for the completeness caveat.
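The eligibility rules above can be sketched as a small predicate plus a dedup step. This is a minimal illustration, not the registry's pipeline: the CohortUtxo record, the function names, and the use of spent_at_block > cutoff block as a stand-in for "spent_at_timestamp strictly after the cutoff" are assumptions made for the example; only the thresholds and column names come from this document.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

MIN_VALUE_SATS = 50_000_000_000   # 500 BTC (frozen)
MIN_AGE_BLOCKS = 105_120          # ~2 years (frozen)
CUTOFF_BLOCK = 952_966            # frozen data cutoff


@dataclass(frozen=True)
class CohortUtxo:
    """Hypothetical record for one row of the frozen candidate cohort."""
    txid: str
    vout: int
    value_sats: int
    block_height: int                     # block in which the UTXO was created
    spent_in_txid: Optional[str] = None   # None while unspent
    spent_at_block: Optional[int] = None


def qualifies(u: CohortUtxo) -> bool:
    """True if this spend meets the size, age and post-cutoff rules."""
    if u.spent_in_txid is None or u.spent_at_block is None:
        return False
    return (
        u.value_sats >= MIN_VALUE_SATS
        and u.spent_at_block - u.block_height >= MIN_AGE_BLOCKS
        and u.spent_at_block > CUTOFF_BLOCK   # assumption: block height as proxy for timestamp
    )


def extract_events(cohort: Iterable[CohortUtxo]) -> dict[str, int]:
    """Map each qualifying spending txid to its block: one event per transaction."""
    events: dict[str, int] = {}
    for u in cohort:
        if qualifies(u):
            events.setdefault(u.spent_in_txid, u.spent_at_block)
    return events


cohort = [
    CohortUtxo("a", 0, 60_000_000_000, 800_000, "tx1", 953_000),
    CohortUtxo("b", 1, 70_000_000_000, 800_001, "tx1", 953_000),  # same spend, counted once
    CohortUtxo("c", 0, 60_000_000_000, 900_000, "tx2", 953_100),  # younger than 2 years
    CohortUtxo("d", 0, 60_000_000_000, 800_000),                  # still unspent
]
events = extract_events(cohort)   # only "tx1" qualifies
```

Because the function only ever iterates over the frozen cohort, a UTXO back-filled after the cutoff can never produce an event, which is the point of the constant-detector clause.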

3. Outcome variable + baseline — report BOTH θ (the momentum decomposition)

Forward return: r₇ = ln(P[t_event + 7d] / P[t_event]) from crypto_prices (BTC, hourly close nearest the timestamp).
Two pre-committed estimands, both primary:
θ_raw = mean(r₇ | event) − mean(r₇ | all non-event days, same calendar window).
θ_matched = same, but non-event baseline days are matched on trailing-7d return to the event days.
Why both: matching on the trailing path can over-control — if wakeups are triggered by a pre-decline and pre-declines predict follow-on declines via momentum, matching strips part of the very mechanism. So we do not choose; the gap (θ_raw − θ_matched) is itself a primary output — the momentum-vs-prediction decomposition. A large negative θ_raw that vanishes under matching means "wakeups happen on bad days," not "wakeups predict bad days." We report which.
Primary horizon: 7 days. 1-day and 30-day are secondary/exploratory (§8).
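The two estimands can be illustrated on daily data as follows. This is a sketch under stated assumptions: the document does not fix the matching algorithm, so nearest-neighbour matching on trailing-7d return (with replacement) is a stand-in chosen for the example, and the function names and toy numbers are hypothetical.

```python
import math
from statistics import fmean
from typing import Sequence


def log_return(p_start: float, p_end: float) -> float:
    """Forward log return, r = ln(P_end / P_start)."""
    return math.log(p_end / p_start)


def theta_raw(r7: Sequence[float], is_event: Sequence[bool]) -> float:
    """mean(r7 | event) minus mean(r7 | non-event days)."""
    ev = [r for r, e in zip(r7, is_event) if e]
    base = [r for r, e in zip(r7, is_event) if not e]
    return fmean(ev) - fmean(base)


def theta_matched(r7, trailing7, is_event) -> float:
    """Same contrast, but each event day is paired with the non-event day
    whose trailing-7d return is closest (assumed matching rule)."""
    ev = [(t, r) for r, t, e in zip(r7, trailing7, is_event) if e]
    base = [(t, r) for r, t, e in zip(r7, trailing7, is_event) if not e]
    matched = [min(base, key=lambda b: abs(b[0] - t))[1] for t, _ in ev]
    return fmean(r for _, r in ev) - fmean(matched)


# Toy data: the single event day follows a pre-decline, and so does one baseline day.
r7        = [-0.04, 0.01, 0.02, -0.03, 0.00]
trailing7 = [-0.05, 0.02, 0.03, -0.06, 0.01]
is_event  = [True, False, False, False, False]

raw = theta_raw(r7, is_event)                      # about -0.04
matched = theta_matched(r7, trailing7, is_event)   # about -0.01
gap = raw - matched                                # the momentum-vs-prediction gap
```

In this toy case most of the raw effect disappears once the baseline day shares the event day's pre-decline, which is the "wakeups happen on bad days" pattern described above.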

4. Statistical test

θ via block bootstrap, block length ≥ 7 d, ≥ 10 000 resamples (frozen). The block bootstrap is mandatory here: events cluster (94 % of consecutive wakeup-days are < 7 d apart), so the effective independent sample is the number of non-overlapping 7-d windows containing an event (~12–15/yr), not the event count (~173/yr). Report θ with its 95 % bootstrap CI — the CI, not a bare p-value, drives §5.
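A block bootstrap of the kind required here might look like the sketch below. Assumptions made for the example: a moving-block scheme over a daily series, a percentile confidence interval, and the raw estimand only (a matched version would repeat the matching inside each resample). The document fixes only the block length and resample count; everything else, including the function names, is illustrative.

```python
import random
from statistics import fmean


def theta(r7, is_event):
    """Event-minus-baseline mean; None if a resample lacks either group."""
    ev = [r for r, e in zip(r7, is_event) if e]
    base = [r for r, e in zip(r7, is_event) if not e]
    if not ev or not base:
        return None
    return fmean(ev) - fmean(base)


def block_bootstrap_ci(r7, is_event, block_len=7, n_resamples=10_000,
                       alpha=0.05, seed=0):
    """Percentile CI for theta from a moving-block bootstrap of daily data."""
    n = len(r7)
    if n < block_len or theta(r7, is_event) is None:
        raise ValueError("need >= block_len days, one event day and one baseline day")
    rng = random.Random(seed)
    starts = range(n - block_len + 1)      # admissible block start indices
    n_blocks = -(-n // block_len)          # ceil(n / block_len)
    draws = []
    while len(draws) < n_resamples:
        idx = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            idx.extend(range(s, s + block_len))   # keep days within a block together
        idx = idx[:n]
        t = theta([r7[i] for i in idx], [is_event[i] for i in idx])
        if t is not None:                  # skip resamples with no event days
            draws.append(t)
    draws.sort()
    lo = draws[int(alpha / 2 * n_resamples)]
    hi = draws[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Resampling whole 7-day blocks keeps clustered wakeup-days together, so the interval width reflects the number of independent windows rather than the raw event count.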

5. The three pre-committed bands

Economic-meaning floor F = −3 pp (frozen; power-driven, see the note at the end of this section). The decision is a clean partition on where the 95 % CI for θ_matched lies relative to F:

Band	Condition	Reading
CONFIRMED	95 % CI entirely below F AND sign holds in both sub-periods (§6)	Effect exists and is meaningful.
REFUTED	95 % CI entirely above F (an effect as negative as −3 pp is ruled out)	The flattering claim does not hold.
INCONCLUSIVE	95 % CI straddles F	Cannot say. The expected default — and itself the product.
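The band logic in the table can be written as a small decision function. This is a sketch: the function name is hypothetical, values are in percentage points, and treating a CI endpoint exactly equal to F as straddling is an assumption, since the table does not address ties.

```python
def classify(ci_low: float, ci_high: float, floor: float = -3.0,
             sign_holds_both_halves: bool = True) -> str:
    """Return the pre-committed band for a 95 % CI on theta_matched (in pp)."""
    if ci_high < floor:
        # Entirely below F: CONFIRMED only if the §6 sign check also passes.
        return "CONFIRMED" if sign_holds_both_halves else "INCONCLUSIVE-regime-unstable"
    if ci_low > floor:
        # Entirely above F: an effect as negative as F is ruled out.
        return "REFUTED"
    return "INCONCLUSIVE"   # CI straddles (or touches) F


verdicts = [
    classify(-7.0, -3.5),                                # CONFIRMED
    classify(-7.0, -3.5, sign_holds_both_halves=False),  # downgraded by §6
    classify(-2.0, 1.5),                                 # REFUTED
    classify(-4.0, 0.5),                                 # INCONCLUSIVE
]
```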

Why F = −3 pp and not −1.5 pp (the power coupling, corrected from v1): with SD(r₇) ≈ 6.9 pp (measured) and ~12–15 independent windows/yr, a −1.5 pp floor cannot be REFUTED within a reasonable window (it needs n_eff ≈ 104, roughly 7–9 years at that rate), so it would yield only an uninformative INCONCLUSIVE. F = −3 pp makes REFUTED reachable at n_eff ≈ 21 and forces CONFIRMED to require an implausibly large θ̂ ≤ −4.5 pp. The test is therefore a darling-killer by construction: built to refute the flattering claim, saved only by overwhelming evidence. F is a power decision, not an "economic usability" decision; it is the most consequential number in this document.
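The coupling between F and the required sample can be approximated with a back-of-envelope calculation. The sketch below is a simplification, not the calculation behind the figures above: it assumes a normal-approximation CI with half-width z·SD/√n_eff, a true effect of zero, and no baseline sampling error; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def ci_half_width(sd_pp: float, n_eff: int, confidence: float = 0.95) -> float:
    """Normal-approximation half-width of the CI for a mean of n_eff windows."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd_pp / math.sqrt(n_eff)


def min_n_eff_to_refute(sd_pp: float, floor_pp: float, confidence: float = 0.95) -> int:
    """Smallest n_eff at which a CI centred on zero lies entirely above the floor,
    i.e. the half-width is smaller than |floor|."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd_pp / abs(floor_pp)) ** 2)


n_for_3pp = min_n_eff_to_refute(6.9, -3.0)
n_for_1_5pp = min_n_eff_to_refute(6.9, -1.5)
```

Under these assumptions the −3 pp floor needs about 21 windows, in line with the figure quoted above. The −1.5 pp floor comes out near 82, lower than the ≈ 104 quoted, so the quoted figure presumably includes variance this sketch omits. Either way, halving the floor roughly quadruples the required sample.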

6. Stationarity gate (the fourth layer) — read asymmetrically

To count as CONFIRMED, θ must additionally hold the same sign in both halves of the prospective window. The reading is asymmetric, because each half contains only about half of the ~24 independent event-windows: a sign flip downgrades CONFIRMED → INCONCLUSIVE-regime-unstable; a sign agreement does not certify stability (at this n it is near coin-flip) and is reported as "not contradicted," never as "stable."

7. Multiple comparisons

One primary hypothesis, one direction, one horizon, two pre-declared estimands (raw + matched). We apply no in-sample multiple-comparison correction: a correction would still leave the analysis p-hacking-adjacent, because the space of analytic choices is unbounded and no correction factor can account for it. Narrowness plus a prospective holdout is the defense.

8. Secondary / exploratory (NON-confirmatory — labelled as such in any output)

Reported for color only; none changes the §5 verdict: 1-day / 30-day horizons; the ≥1000 BTC and ≥3y-age variants; the retired exchange-inflow trigger (run as exploratory, with its non-stationarity written into the caption — "we tried the iconic version; its event process can't answer"); the continuous daily-inflow rank-correlation design (faster ~6-mo read, robust to clustering, but commodity-flavoured and functional-form-dependent → exploratory only).

9. Stopping rule (anti-optional-stopping) — honest calendar time

Single readout at the first moment BOTH hold: n_eff ≥ 24 independent 7-d event-windows (frozen) AND ≥ 18 months elapsed since cutoff. Honest expectation given ~12–15 independent windows/yr: ~1.5–2 years to an informative readout. This is a multi-year credibility flywheel, not a quarterly deliverable — accepted with open eyes; the process's stationarity (§0) is what makes the wait pay off. No interim peek drives inference.
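The stopping rule can be checked mechanically. In the sketch below, the way independent windows are counted (a greedy pass that opens a new 7-day window at the first event not covered by the previous one) is an assumption, since the document does not specify how windows are anchored; the function names are hypothetical, and the earliest readout date is taken from the header.

```python
from datetime import datetime, timedelta, timezone

CUTOFF = datetime(2026, 6, 9, 12, 17, tzinfo=timezone.utc)
EARLIEST_READOUT = datetime(2027, 12, 9, tzinfo=timezone.utc)  # cutoff + 18 months
WINDOW = timedelta(days=7)
MIN_N_EFF = 24


def count_independent_windows(event_times) -> int:
    """Greedy count of non-overlapping 7-day windows that contain an event."""
    n = 0
    window_end = None
    for t in sorted(event_times):
        if window_end is None or t >= window_end:
            n += 1                      # this event opens a new window
            window_end = t + WINDOW     # later events inside it add nothing
    return n


def readout_allowed(event_times, now: datetime) -> bool:
    """True only when BOTH the n_eff and the elapsed-time conditions hold."""
    prospective = [t for t in event_times if CUTOFF < t <= now]
    return (now >= EARLIEST_READOUT
            and count_independent_windows(prospective) >= MIN_N_EFF)


# Five events in three weeks collapse to three independent windows.
clustered = [CUTOFF + timedelta(days=d) for d in (1, 3, 6, 9, 20)]
n_windows = count_independent_windows(clustered)
```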

10. Freeze mechanics

On freeze: commit this file; record git hash + UTC timestamp; record the data cutoff = last spent_at_block / UTC time already observed; record the readout date. Materialise the candidate cohort as an immutable table dormant_cohort_<cutoff> = all whale_outputs rows with value_sats ≥ 5e10 and spent_in_txid IS NULL as of the cutoff (keep txid, vout, value_sats, block_height). The prospective sample is exactly the spends of these frozen rows during the window with age-at-spend ≥ 2 y; the cohort table is never altered post-freeze. The frozen file is never edited; a correction is a separately-registered v3 citing this one.
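Materialising the cohort could look like the following sketch, written against SQLite purely for illustration; the actual database engine is not stated in this document. The column names come from the text above. The two extra predicates (created at or before the cutoff block, and "spent after the cutoff" counted as unspent as of the cutoff) are assumptions about what "as of the cutoff" means when the query runs later than the freeze.

```python
import sqlite3

CUTOFF_BLOCK = 952_966
MIN_VALUE_SATS = 50_000_000_000   # 5e10 sats = 500 BTC


def materialise_cohort(conn: sqlite3.Connection, cutoff_block: int = CUTOFF_BLOCK) -> str:
    """Create dormant_cohort_<cutoff> once; a second call fails rather than overwrites."""
    table = f"dormant_cohort_{int(cutoff_block)}"
    conn.execute(
        f"""
        CREATE TABLE {table} AS
        SELECT txid, vout, value_sats, block_height
        FROM whale_outputs
        WHERE value_sats >= {int(MIN_VALUE_SATS)}
          AND block_height <= {int(cutoff_block)}
          AND (spent_in_txid IS NULL OR spent_at_block > {int(cutoff_block)})
        """
    )
    conn.commit()
    return table


# Usage against an in-memory stand-in for whale_outputs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE whale_outputs (txid TEXT, vout INTEGER, value_sats INTEGER, "
    "block_height INTEGER, spent_in_txid TEXT, spent_at_block INTEGER)"
)
conn.executemany(
    "INSERT INTO whale_outputs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", 0, 60_000_000_000, 800_000, None, None),      # unspent: in cohort
        ("b", 0, 40_000_000_000, 800_000, None, None),      # below 500 BTC
        ("c", 0, 60_000_000_000, 800_000, "tx", 900_000),   # spent before cutoff
        ("d", 0, 60_000_000_000, 800_000, "tx2", 953_000),  # spent after cutoff: in cohort
        ("e", 0, 60_000_000_000, 953_100, None, None),      # created after cutoff
    ],
)
table = materialise_cohort(conn)
members = [row[0] for row in conn.execute(f"SELECT txid FROM {table} ORDER BY txid")]
```

Using a plain CREATE TABLE (without IF NOT EXISTS or a drop) means a second materialisation attempt raises an error, which matches the rule that the cohort table is never altered post-freeze.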

11. Commitment to publish either outcome

We publish the §5 verdict in whichever band it lands — including a CONFIRMED that would resemble the edge we have publicly doubted, and the most likely INCONCLUSIVE. A pre-registration whose author accepts only one outcome is not one.

12. Caveats written into the verdict

Detection completeness (frozen, not drifting): the candidate cohort is snapshotted at the cutoff from whale_outputs (~82.8 % coverage, still back-filling). Freezing the cohort (§2.4, §10) converts this from a drifting bias — which would confound §6's two-half comparison — into a constant coverage floor: the rate is a fixed lower bound, identical across both halves. Post-cutoff back-fills improve future studies, never this one. An INCONCLUSIVE is read against this constant floor, not despite it.
Endogeneity (measured, mild): wakeup-days carry only 1.09× baseline same-day |return| and a near-flat signed same-day return (+0.15 pp), so event and outcome do not strongly share a price driver — the a-priori endogeneity fear did not materialise for this definition. Reported, not assumed.
Scope: measures whether a specific, narrow event precedes a meaningfully negative 7-d abnormal return, forward, once. It establishes no trading rule and does not generalise across thresholds/ages/assets/horizons. A single CONFIRMED is a starting point for replication, not a signal.

Frozen parameters (locked at the measured defaults — see header)

Size / age: ≥ 500 BTC, ≥ 2 y (the alternatives ≥ 1000 BTC or ≥ 3 y shift the event rate and stationarity modestly).
Floor F: −3 pp — the single most consequential number (power-driven, §5).
Stopping: n_eff ≥ 24 and ≥ 18 months (~1.5–2 yr expected).
Block-bootstrap: block ≥ 7 d, ≥ 10 000 resamples.
Data cutoff + readout date — filled at commit (§10).