# jackpotminer A GPU-accelerated **Equihash 192,7** miner written in Rust, for ZClassic (ZCL) and other coins that use the same proof-of-work. It implements the full mining pipeline: - **Stratum client** — Zcash-style pool protocol (subscribe / authorize / set_target / notify / submit), as used by nheqminer / EWBF. - **Personalised BLAKE2b** — `"ZcashPoW" || LE32(192) || LE32(7)`, 48-byte digest. - **Equihash 192,7 solver** — Wagner's algorithm over eight 24-bit collision blocks, producing 128-index / 400-byte solutions. - **Full GPU solver** — the entire solve runs on the GPU: BLAKE2b generation, all seven collision rounds over a bucketed hash table, and back-reference recovery of the leaf indices. Only a few candidate solutions return to the host for verification. Two interchangeable backends: **OpenCL** (default, our own kernels) and **CUDA** (driver API, replaying miniZ's reverse-engineered fatbin). - **Solution verification & encoding** — full spec-compliant `IsValidSolution` plus compact-array encode/decode and the block PoW (double-SHA256) target check. ## Status The solver, BLAKE2b personalisation, solution encoding and verification are validated end-to-end. `--selftest` runs a full CPU solve and a full GPU solve on the same header and confirms the GPU recovers the same verified solution set; the OpenCL BLAKE2b kernel is also checked byte-for-byte against the CPU. The GPU solver finds the expected ~2 solutions per nonce, matching the Equihash 192,7 theoretical rate. The Stratum layer implements the common nheqminer-style protocol and is covered by a mock-pool integration test; some pools differ in nonce/solution framing and may need small adjustments (`RUST_LOG=debug` shows the raw traffic). ### Performance & memory Measured on a single desktop RTX 5080 (default clocks, no overclock; `--benchmark 30`, ~1.8 solutions per nonce): | Backend (`--backend`) | ms/solve | Sol/s | |---|---:|---:| | `cuda` | **~20** | **~92** | | `opencl` | ~311 | ~5.9 | CUDA is ~15× faster than the OpenCL backend on the same card. One worker thread per device shares the pool connection and a global nonce counter, so a second GPU (`--devices 0,1`) scales the aggregate Sol/s roughly linearly. (Enabling the clock/power tuning — `--auto-tune`/`--gpu-clock-offset`, needs root — pushes both backends higher still; the table above is untuned.) The **CPU solver** (`--cpu`) is an AVX2-tuned Wagner implementation (xenoncat-style packed 32-byte slots + dense collision keys, single-pass bucketed XOR, `pshufb` round-0 repack) running ~**3 s/solve** on a 24-core host (~0.4 Sol/s), finding the expected ~2 solutions per nonce. Like the GPU it bounds each collision bucket (`--cpu-clamp`, default 32) — required because the naive unclamped algorithm's degenerate collisions explode in the last rounds; `--cpu-clamp 0` selects the exact solver (may OOM on dense headers). The CUDA backend is profile-driven (Nsight Compute). Key optimisations over the OpenCL backend: - **Warp-per-bucket collision/final kernels** — each 32-lane warp cooperatively loads its bucket into shared memory and parallelises the pair search, fixing the one-thread-per-bucket version's intra-warp load imbalance (collision rounds ~174 ms → ~97 ms, final ~5 ms → ~1 ms). - **16-byte-aligned `uint4` slot stores** — hash-table slots for tables 1-6 are padded to a multiple of 4 words so the ref+blocks are written with aligned `uint4` transactions. The earlier odd-sized slots (7/6/5/3 words) caused misaligned scalar writes; fixing this cut the collision rounds to ~68 ms (~120 ms → ~92 ms per solve). - **Pinned (page-locked) host buffers** for the device→host result readback. Things that were tried and *didn't* help on this hardware (measured, reverted): warp-shuffle comparisons and `gen` register-capping (`__launch_bounds__`) — the fast rounds are not compare-bound, and `gen` needs its registers. Warp-aggregated atomics don't apply because the output-bucket atomics target data-dependent random addresses (no within-warp sharing). The remaining cost is `gen` (~21 ms, register/compute bound) and the scattered hash-table read/write latency that is fundamental to bucketing. The OpenCL backend keeps the simpler kernels. The hash table uses 2²¹ buckets × 32 slots (bucket cap 2× the mean occupancy, so overflow is rare and essentially all solutions survive). The bulky 24-bit collision blocks are only needed during the round that consumes them, so they live in two ping-pong working buffers; only a small per-table back-reference array (1 word/slot) is kept resident for solution recovery. That brings the footprint to about **6 GB of VRAM per GPU** (down from ~11 GB when all seven block tables were resident), so **8 GB cards work**. The backend reads the device's VRAM at startup and warns if a card is too small; `ZCL_OPENCL_ROWBITS` overrides the bucket count for experimentation (values below 21 usually find no solutions). There is still optimisation headroom relative to mature miners like lolMiner/EWBF. The above describes the OpenCL backend (`kernels/equihash.cl`). The CUDA backend takes a different route: instead of our own kernels it drives miniZ's reverse-engineered Equihash 192,7 solver — see "CUDA backend" below. ## Build Requirements: a Rust toolchain and an OpenCL runtime (the NVIDIA driver ships `libOpenCL`). The CUDA backend only needs `libcuda` (the NVIDIA driver) — the fatbin and launch trace it drives are embedded in the binary, so no CUDA toolkit or `nvcc` is required. ```bash cargo build --release # OpenCL backend (default) cargo build --release --features cuda # OpenCL + CUDA backends cargo build --release --no-default-features --features cuda # CUDA only cargo build --release --no-default-features # CPU-only (no GPU) ``` ### CUDA backend (miniZ fatbin replay) `--features cuda` (selectable with `--backend cuda`) does **not** compile its own kernels. It loads miniZ's reverse-engineered Equihash 192,7 solver fatbin (embedded from `src/miniz/equihash192_7.fatbin`) through the CUDA Driver API and replays its exact 10-kernel Wagner pipeline from an embedded launch trace, injecting the BLAKE2b midstate + header tail into `digit_f` per solve. Recovered indices are verified by the project's own `equihash::filter_candidates`, so only solutions genuinely valid for the header are ever returned. Needs an NVIDIA GPU whose arch matches the fatbin's cubins (sm_80/sm_86/sm_120). See `collab/jmprcx-solver/` for the reverse-engineering work behind it. **Multi-config (VRAM auto-selection).** miniZ ships several bucket geometries with different memory footprints; `src/miniz/configs/` bundles a recorded launch trace for each. At startup the backend reads free VRAM and picks the **highest-capacity config that fits** (higher capacity ⇒ fewer dropped collisions ⇒ better yield): | config | min free VRAM | table capacity | |---|---|---| | `12288x3392` | ~11 GB | 41.7M (best) | | `10000x4032` | ~5.5 GB | 40.3M | | `2048x16960` | ~5 GB | 34.7M | So it runs on cards from ~6 GB up, using the largest config the card can hold. Override the choice with `ZCL_CUDA_CONFIG=` (e.g. for testing). If no config fits, it fails at init with a clear message. See `src/miniz/configs/README.md`. ### GPU tuning & efficiency (NVML) The CUDA backend tunes clocks/power via NVML (the same knobs LACT exposes). **By default** it locks clocks and power to the card maximum (peak hashrate). Passing **any** tuning flag switches to manual mode — only the knobs you specify are applied, so a clock offset isn't defeated by a hard clock lock: | flag | effect | |---|---| | `--power-limit ` | board power cap (best Sol/W lever) | | `--gpu-clock ` / `--mem-clock ` | hard-lock the core / memory clock | | `--gpu-clock-offset ` / `--mem-clock-offset ` | signed V/F **offsets** (LACT-style; undervolt/overclock the curve) | | `--auto-tune` | sweep the core offset at startup for the **fastest stable solve rate** | | `--no-gpu-tune` | touch nothing (let LACT or `nvidia-smi` own the GPU) | **`--auto-tune`** optimizes each card for speed automatically: it raises power to the cap, leaves clocks free to boost, then sweeps the core clock offset upward (in +45 MHz steps), measuring solve throughput on a test header at each step. It keeps the best and **stops at the first instability** (a kernel error or the card no longer producing valid solutions), then locks in that offset. Takes ~30 s at startup, runs per card, needs root, and is restored on exit. It overrides `--gpu-clock-offset`. A typical efficiency setup combines a power cap with a positive core offset and a negative memory offset, letting the card boost on a shifted curve under the cap: ```bash sudo ./jackpotminer --url pool:port -u addr --devices all \ --power-limit 250 --gpu-clock-offset 250 --mem-clock-offset -500 ``` These are **privileged** (run as root); without it you get a one-line warning and the card free-runs. Settings (clocks, power, offsets) are **restored to defaults on clean shutdown**. The per-card stats line shows live `Sol/s`, board `W`, and `Sol/W` (reading power is unprivileged) so you can tune efficiency directly. > Re: LACT — its NVIDIA support drives these same NVML offsets, but LACT has no > stable CLI to set them (a root daemon applies a saved profile), so the miner > sets them directly via NVML. If you'd rather LACT own the GPU, run the miner > with `--no-gpu-tune` and configure clocks in LACT. ## Usage ```bash # List OpenCL devices ./target/release/jackpotminer --list-devices # Mine on one GPU ./target/release/jackpotminer \ --url stratum+tcp://zcl.pool.example:3032 \ --user . \ --pass x \ --devices 0 # Mine on multiple GPUs (one worker thread each) ./target/release/jackpotminer --url ... --user ... --devices 0,1 ./target/release/jackpotminer --url ... --user ... --devices all # Use the CUDA backend instead of OpenCL (needs a --features cuda build) ./target/release/jackpotminer --url ... --user ... --backend cuda --devices all # Force the CPU backend ./target/release/jackpotminer --url ... --user ... --cpu # Benchmark concurrently across the chosen devices (no pool), 30 solves each ./target/release/jackpotminer --benchmark 30 --devices 0,1 # Correctness self-test (CPU solve vs GPU solve + kernel comparison) ./target/release/jackpotminer --selftest ``` Key flags: `--url`, `--port` (fallback when the URL omits one), `--user` (`-u`), `--pass` (`-p`), `--backend` (`opencl`/`cuda`), `--devices` (e.g. `0,1` or `all`), `--device` (`-d`, for single-device benchmark/debug), `--threads` (`-t`), `--cpu`, `--benchmark N`, `--list-devices`, `--selftest`. When mining in a terminal a live ratatui **dashboard** is shown by default (per-card Sol/s, power, temperature, Sol/W, shares; a network Sol/s panel (horizontal bars) showing the last hour / day / 3 / 7 / 30 / 60 / 90 days; per-GPU graphs; and a log pane; `q`/`Esc`/`Ctrl-C` to quit); it falls back to periodic log lines when there's no TTY (headless/piped/systemd) and `--no-tui` forces log output. `--job-timeout ` pauses mining when the pool goes silent (default 300; 0 disables). Set `RUST_LOG=debug` to see the raw Stratum traffic (ignored under `--tui`, which captures logs into its pane at info level). ## Layout | File | Purpose | |------|---------| | `src/params.rs` | Equihash 192,7 constants | | `src/blake.rs` | Personalised BLAKE2b base state | | `src/equihash.rs` | CPU solver, verifier, compact-array encode/decode, candidate filter | | `src/stratum.rs` | Stratum pool client | | `src/miner.rs` | Mining loop, nonce iteration, target check, submit, backend dispatch | | `src/gpu.rs` | OpenCL host glue: buffers, kernel orchestration | | `kernels/equihash.cl` | OpenCL kernels: BLAKE2b, collision rounds, recovery | | `src/cuda.rs` | CUDA backend: drives miniZ's embedded fatbin via the driver API (FFI to `libcuda`) | | `src/miniz/` | Embedded miniZ fatbin + recorded launch trace the CUDA backend replays | | `build.rs` | Links `libcuda` for the `cuda` feature | ## Disclaimer For use only on hardware and pools you are authorised to use, and where cryptocurrency mining is legal. Mining consumes significant power.