Initial commit: jackpotminer Equihash 192,7 miner

GPU-accelerated Equihash 192,7 miner in Rust with three solver backends: - CPU: Wagner's algorithm, AVX2 packed slots (xenoncat-style) - OpenCL: full on-GPU solve (kernels/equihash.cl); runs on NVIDIA and AMD - CUDA: driver-API replay of miniZ's extracted fatbin (src/miniz/) Also includes a default-off pearlhash backend (src/pearl/, native CPU core + NVRTC int8-GEMM GPU kernels) and a WIP Ethash CUDA backend (src/ethash/). Reverse-engineering scratch (alpha-miner, pearl-dump/) and the active runtime config (mine.toml) are gitignored; mine.example.toml is the template. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 23:08:20 -04:00
commit e2fab622b5
82 changed files with 781504 additions and 0 deletions
@@ -0,0 +1,242 @@
+# jackpotminer
+
+A GPU-accelerated **Equihash 192,7** miner written in Rust, for ZClassic (ZCL)
+and other coins that use the same proof-of-work.
+
+It implements the full mining pipeline:
+
+- **Stratum client** — Zcash-style pool protocol (subscribe / authorize /
+  set_target / notify / submit), as used by nheqminer / EWBF.
+- **Personalised BLAKE2b** — `"ZcashPoW" || LE32(192) || LE32(7)`, 48-byte
+  digest.
+- **Equihash 192,7 solver** — Wagner's algorithm over eight 24-bit collision
+  blocks, producing 128-index / 400-byte solutions.
+- **Full GPU solver** — the entire solve runs on the GPU: BLAKE2b generation,
+  all seven collision rounds over a bucketed hash table, and back-reference
+  recovery of the leaf indices. Only a few candidate solutions return to the
+  host for verification. Two interchangeable backends: **OpenCL** (default, our
+  own kernels) and **CUDA** (driver API, replaying miniZ's reverse-engineered
+  fatbin).
+- **Solution verification & encoding** — full spec-compliant `IsValidSolution`
+  plus compact-array encode/decode and the block PoW (double-SHA256) target
+  check.
+
+## Status
+
+The solver, BLAKE2b personalisation, solution encoding and verification are
+validated end-to-end. `--selftest` runs a full CPU solve and a full GPU solve on
+the same header and confirms the GPU recovers the same verified solution set;
+the OpenCL BLAKE2b kernel is also checked byte-for-byte against the CPU. The GPU
+solver finds the expected ~2 solutions per nonce, matching the Equihash 192,7
+theoretical rate.
+
+The Stratum layer implements the common nheqminer-style protocol and is covered
+by a mock-pool integration test; some pools differ in nonce/solution framing and
+may need small adjustments (`RUST_LOG=debug` shows the raw traffic).
+
+### Performance & memory
+
+Measured on a single desktop RTX 5080 (default clocks, no overclock; `--benchmark
+30`, ~1.8 solutions per nonce):
+
+| Backend (`--backend`) | ms/solve | Sol/s |
+|---|---:|---:|
+| `cuda`   | **~20**  | **~92** |
+| `opencl` | ~311     | ~5.9    |
+
+CUDA is ~15× faster than the OpenCL backend on the same card. One worker thread
+per device shares the pool connection and a global nonce counter, so a second GPU
+(`--devices 0,1`) scales the aggregate Sol/s roughly linearly. (Enabling the
+clock/power tuning — `--auto-tune`/`--gpu-clock-offset`, needs root — pushes both
+backends higher still; the table above is untuned.)
+
+The **CPU solver** (`--cpu`) is an AVX2-tuned Wagner implementation (xenoncat-style
+packed 32-byte slots + dense collision keys, single-pass bucketed XOR, `pshufb`
+round-0 repack) running ~**3 s/solve** on a 24-core host (~0.4 Sol/s), finding
+the expected ~2 solutions per nonce. Like the GPU it bounds each collision bucket
+(`--cpu-clamp`, default 32) — required because the naive unclamped algorithm's
+degenerate collisions explode in the last rounds; `--cpu-clamp 0` selects the
+exact solver (may OOM on dense headers).
+
+The CUDA backend is profile-driven (Nsight Compute). Key optimisations over the
+OpenCL backend:
+
+- **Warp-per-bucket collision/final kernels** — each 32-lane warp cooperatively
+  loads its bucket into shared memory and parallelises the pair search, fixing
+  the one-thread-per-bucket version's intra-warp load imbalance (collision
+  rounds ~174 ms → ~97 ms, final ~5 ms → ~1 ms).
+- **16-byte-aligned `uint4` slot stores** — hash-table slots for tables 1-6 are
+  padded to a multiple of 4 words so the ref+blocks are written with aligned
+  `uint4` transactions. The earlier odd-sized slots (7/6/5/3 words) caused
+  misaligned scalar writes; fixing this cut the collision rounds to ~68 ms
+  (~120 ms → ~92 ms per solve).
+- **Pinned (page-locked) host buffers** for the device→host result readback.
+
+Things that were tried and *didn't* help on this hardware (measured, reverted):
+warp-shuffle comparisons and `gen` register-capping (`__launch_bounds__`) — the
+fast rounds are not compare-bound, and `gen` needs its registers. Warp-aggregated
+atomics don't apply because the output-bucket atomics target data-dependent
+random addresses (no within-warp sharing). The remaining cost is `gen` (~21 ms,
+register/compute bound) and the scattered hash-table read/write latency that is
+fundamental to bucketing. The OpenCL backend keeps the simpler kernels.
+
+The hash table uses 2²¹ buckets × 32 slots (bucket cap 2× the mean occupancy, so
+overflow is rare and essentially all solutions survive). The bulky 24-bit
+collision blocks are only needed during the round that consumes them, so they
+live in two ping-pong working buffers; only a small per-table back-reference
+array (1 word/slot) is kept resident for solution recovery. That brings the
+footprint to about **6 GB of VRAM per GPU** (down from ~11 GB when all seven
+block tables were resident), so **8 GB cards work**. The backend reads the
+device's VRAM at startup and warns if a card is too small; `ZCL_OPENCL_ROWBITS`
+overrides the bucket count for experimentation (values below 21 usually find no
+solutions). There is still optimisation headroom relative to mature miners like
+lolMiner/EWBF.
+
+The above describes the OpenCL backend (`kernels/equihash.cl`). The CUDA backend
+takes a different route: instead of our own kernels it drives miniZ's
+reverse-engineered Equihash 192,7 solver — see "CUDA backend" below.
+
+## Build
+
+Requirements: a Rust toolchain and an OpenCL runtime (the NVIDIA driver ships
+`libOpenCL`). The CUDA backend only needs `libcuda` (the NVIDIA driver) — the
+fatbin and launch trace it drives are embedded in the binary, so no CUDA toolkit
+or `nvcc` is required.
+
+```bash
+cargo build --release                          # OpenCL backend (default)
+cargo build --release --features cuda           # OpenCL + CUDA backends
+cargo build --release --no-default-features --features cuda   # CUDA only
+cargo build --release --no-default-features      # CPU-only (no GPU)
+```
+
+### CUDA backend (miniZ fatbin replay)
+
+`--features cuda` (selectable with `--backend cuda`) does **not** compile its own
+kernels. It loads miniZ's reverse-engineered Equihash 192,7 solver fatbin
+(embedded from `src/miniz/equihash192_7.fatbin`) through the CUDA Driver API and
+replays its exact 10-kernel Wagner pipeline from an embedded launch trace,
+injecting the BLAKE2b midstate + header tail into `digit_f` per solve. Recovered
+indices are verified by the project's own `equihash::filter_candidates`, so only
+solutions genuinely valid for the header are ever returned. Needs an NVIDIA GPU
+whose arch matches the fatbin's cubins (sm_80/sm_86/sm_120). See
+`collab/jmprcx-solver/` for the reverse-engineering work behind it.
+
+**Multi-config (VRAM auto-selection).** miniZ ships several bucket geometries with
+different memory footprints; `src/miniz/configs/` bundles a recorded launch trace
+for each. At startup the backend reads free VRAM and picks the **highest-capacity
+config that fits** (higher capacity ⇒ fewer dropped collisions ⇒ better yield):
+
+| config | min free VRAM | table capacity |
+|---|---|---|
+| `12288x3392` | ~11 GB | 41.7M (best) |
+| `10000x4032` | ~5.5 GB | 40.3M |
+| `2048x16960` | ~5 GB | 34.7M |
+
+So it runs on cards from ~6 GB up, using the largest config the card can hold.
+Override the choice with `ZCL_CUDA_CONFIG=<name>` (e.g. for testing). If no config
+fits, it fails at init with a clear message. See `src/miniz/configs/README.md`.
+
+### GPU tuning & efficiency (NVML)
+
+The CUDA backend tunes clocks/power via NVML (the same knobs LACT exposes). **By
+default** it locks clocks and power to the card maximum (peak hashrate). Passing
+**any** tuning flag switches to manual mode — only the knobs you specify are
+applied, so a clock offset isn't defeated by a hard clock lock:
+
+| flag | effect |
+|---|---|
+| `--power-limit <W>` | board power cap (best Sol/W lever) |
+| `--gpu-clock <MHz>` / `--mem-clock <MHz>` | hard-lock the core / memory clock |
+| `--gpu-clock-offset <MHz>` / `--mem-clock-offset <MHz>` | signed V/F **offsets** (LACT-style; undervolt/overclock the curve) |
+| `--auto-tune` | sweep the core offset at startup for the **fastest stable solve rate** |
+| `--no-gpu-tune` | touch nothing (let LACT or `nvidia-smi` own the GPU) |
+
+**`--auto-tune`** optimizes each card for speed automatically: it raises power to
+the cap, leaves clocks free to boost, then sweeps the core clock offset upward
+(in +45 MHz steps), measuring solve throughput on a test header at each step. It
+keeps the best and **stops at the first instability** (a kernel error or the card
+no longer producing valid solutions), then locks in that offset. Takes ~30 s at
+startup, runs per card, needs root, and is restored on exit. It overrides
+`--gpu-clock-offset`.
+
+A typical efficiency setup combines a power cap with a positive core offset and a
+negative memory offset, letting the card boost on a shifted curve under the cap:
+
+```bash
+sudo ./jackpotminer --url pool:port -u addr --devices all \
+    --power-limit 250 --gpu-clock-offset 250 --mem-clock-offset -500
+```
+
+These are **privileged** (run as root); without it you get a one-line warning and
+the card free-runs. Settings (clocks, power, offsets) are **restored to defaults
+on clean shutdown**. The per-card stats line shows live `Sol/s`, board `W`, and
+`Sol/W` (reading power is unprivileged) so you can tune efficiency directly.
+
+> Re: LACT — its NVIDIA support drives these same NVML offsets, but LACT has no
+> stable CLI to set them (a root daemon applies a saved profile), so the miner
+> sets them directly via NVML. If you'd rather LACT own the GPU, run the miner
+> with `--no-gpu-tune` and configure clocks in LACT.
+
+## Usage
+
+```bash
+# List OpenCL devices
+./target/release/jackpotminer --list-devices
+
+# Mine on one GPU
+./target/release/jackpotminer \
+    --url stratum+tcp://zcl.pool.example:3032 \
+    --user <ZCL-address>.<worker> \
+    --pass x \
+    --devices 0
+
+# Mine on multiple GPUs (one worker thread each)
+./target/release/jackpotminer --url ... --user ... --devices 0,1
+./target/release/jackpotminer --url ... --user ... --devices all
+
+# Use the CUDA backend instead of OpenCL (needs a --features cuda build)
+./target/release/jackpotminer --url ... --user ... --backend cuda --devices all
+
+# Force the CPU backend
+./target/release/jackpotminer --url ... --user ... --cpu
+
+# Benchmark concurrently across the chosen devices (no pool), 30 solves each
+./target/release/jackpotminer --benchmark 30 --devices 0,1
+
+# Correctness self-test (CPU solve vs GPU solve + kernel comparison)
+./target/release/jackpotminer --selftest
+```
+
+Key flags: `--url`, `--port` (fallback when the URL omits one), `--user` (`-u`),
+`--pass` (`-p`), `--backend` (`opencl`/`cuda`), `--devices` (e.g. `0,1` or `all`),
+`--device` (`-d`, for single-device benchmark/debug), `--threads` (`-t`), `--cpu`,
+`--benchmark N`, `--list-devices`, `--selftest`. When mining in a terminal a live
+ratatui **dashboard** is shown by default (per-card Sol/s, power, temperature,
+Sol/W, shares; a network Sol/s panel (horizontal bars) showing the last hour /
+day / 3 / 7 / 30 / 60 / 90 days; per-GPU graphs; and a log pane;
+`q`/`Esc`/`Ctrl-C` to quit); it falls back to periodic log lines
+when there's no TTY (headless/piped/systemd) and `--no-tui` forces log output.
+`--job-timeout <secs>` pauses mining when the pool
+goes silent (default 300; 0 disables). Set `RUST_LOG=debug` to see the raw Stratum
+traffic (ignored under `--tui`, which captures logs into its pane at info level).
+
+## Layout
+
+| File | Purpose |
+|------|---------|
+| `src/params.rs` | Equihash 192,7 constants |
+| `src/blake.rs` | Personalised BLAKE2b base state |
+| `src/equihash.rs` | CPU solver, verifier, compact-array encode/decode, candidate filter |
+| `src/stratum.rs` | Stratum pool client |
+| `src/miner.rs` | Mining loop, nonce iteration, target check, submit, backend dispatch |
+| `src/gpu.rs` | OpenCL host glue: buffers, kernel orchestration |
+| `kernels/equihash.cl` | OpenCL kernels: BLAKE2b, collision rounds, recovery |
+| `src/cuda.rs` | CUDA backend: drives miniZ's embedded fatbin via the driver API (FFI to `libcuda`) |
+| `src/miniz/` | Embedded miniZ fatbin + recorded launch trace the CUDA backend replays |
+| `build.rs` | Links `libcuda` for the `cuda` feature |
+
+## Disclaimer
+
+For use only on hardware and pools you are authorised to use, and where
+cryptocurrency mining is legal. Mining consumes significant power.