e2fab622b5
GPU-accelerated Equihash 192,7 miner in Rust with three solver backends: - CPU: Wagner's algorithm, AVX2 packed slots (xenoncat-style) - OpenCL: full on-GPU solve (kernels/equihash.cl); runs on NVIDIA and AMD - CUDA: driver-API replay of miniZ's extracted fatbin (src/miniz/) Also includes a default-off pearlhash backend (src/pearl/, native CPU core + NVRTC int8-GEMM GPU kernels) and a WIP Ethash CUDA backend (src/ethash/). Reverse-engineering scratch (alpha-miner, pearl-dump/) and the active runtime config (mine.toml) are gitignored; mine.example.toml is the template. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
243 lines
12 KiB
Markdown
243 lines
12 KiB
Markdown
# jackpotminer
|
||
|
||
A GPU-accelerated **Equihash 192,7** miner written in Rust, for ZClassic (ZCL)
|
||
and other coins that use the same proof-of-work.
|
||
|
||
It implements the full mining pipeline:
|
||
|
||
- **Stratum client** — Zcash-style pool protocol (subscribe / authorize /
|
||
set_target / notify / submit), as used by nheqminer / EWBF.
|
||
- **Personalised BLAKE2b** — `"ZcashPoW" || LE32(192) || LE32(7)`, 48-byte
|
||
digest.
|
||
- **Equihash 192,7 solver** — Wagner's algorithm over eight 24-bit collision
|
||
blocks, producing 128-index / 400-byte solutions.
|
||
- **Full GPU solver** — the entire solve runs on the GPU: BLAKE2b generation,
|
||
all seven collision rounds over a bucketed hash table, and back-reference
|
||
recovery of the leaf indices. Only a few candidate solutions return to the
|
||
host for verification. Two interchangeable backends: **OpenCL** (default, our
|
||
own kernels) and **CUDA** (driver API, replaying miniZ's reverse-engineered
|
||
fatbin).
|
||
- **Solution verification & encoding** — full spec-compliant `IsValidSolution`
|
||
plus compact-array encode/decode and the block PoW (double-SHA256) target
|
||
check.
|
||
|
||
## Status
|
||
|
||
The solver, BLAKE2b personalisation, solution encoding and verification are
|
||
validated end-to-end. `--selftest` runs a full CPU solve and a full GPU solve on
|
||
the same header and confirms the GPU recovers the same verified solution set;
|
||
the OpenCL BLAKE2b kernel is also checked byte-for-byte against the CPU. The GPU
|
||
solver finds the expected ~2 solutions per nonce, matching the Equihash 192,7
|
||
theoretical rate.
|
||
|
||
The Stratum layer implements the common nheqminer-style protocol and is covered
|
||
by a mock-pool integration test; some pools differ in nonce/solution framing and
|
||
may need small adjustments (`RUST_LOG=debug` shows the raw traffic).
|
||
|
||
### Performance & memory
|
||
|
||
Measured on a single desktop RTX 5080 (default clocks, no overclock; `--benchmark
|
||
30`, ~1.8 solutions per nonce):
|
||
|
||
| Backend (`--backend`) | ms/solve | Sol/s |
|
||
|---|---:|---:|
|
||
| `cuda` | **~20** | **~92** |
|
||
| `opencl` | ~311 | ~5.9 |
|
||
|
||
CUDA is ~15× faster than the OpenCL backend on the same card. One worker thread
|
||
per device shares the pool connection and a global nonce counter, so a second GPU
|
||
(`--devices 0,1`) scales the aggregate Sol/s roughly linearly. (Enabling the
|
||
clock/power tuning — `--auto-tune`/`--gpu-clock-offset`, needs root — pushes both
|
||
backends higher still; the table above is untuned.)
|
||
|
||
The **CPU solver** (`--cpu`) is an AVX2-tuned Wagner implementation (xenoncat-style
|
||
packed 32-byte slots + dense collision keys, single-pass bucketed XOR, `pshufb`
|
||
round-0 repack) running ~**3 s/solve** on a 24-core host (~0.4 Sol/s), finding
|
||
the expected ~2 solutions per nonce. Like the GPU it bounds each collision bucket
|
||
(`--cpu-clamp`, default 32) — required because the naive unclamped algorithm's
|
||
degenerate collisions explode in the last rounds; `--cpu-clamp 0` selects the
|
||
exact solver (may OOM on dense headers).
|
||
|
||
The CUDA backend is profile-driven (Nsight Compute). Key optimisations over the
|
||
OpenCL backend:
|
||
|
||
- **Warp-per-bucket collision/final kernels** — each 32-lane warp cooperatively
|
||
loads its bucket into shared memory and parallelises the pair search, fixing
|
||
the one-thread-per-bucket version's intra-warp load imbalance (collision
|
||
rounds ~174 ms → ~97 ms, final ~5 ms → ~1 ms).
|
||
- **16-byte-aligned `uint4` slot stores** — hash-table slots for tables 1-6 are
|
||
padded to a multiple of 4 words so the ref+blocks are written with aligned
|
||
`uint4` transactions. The earlier odd-sized slots (7/6/5/3 words) caused
|
||
misaligned scalar writes; fixing this cut the collision rounds to ~68 ms
|
||
(~120 ms → ~92 ms per solve).
|
||
- **Pinned (page-locked) host buffers** for the device→host result readback.
|
||
|
||
Things that were tried and *didn't* help on this hardware (measured, reverted):
|
||
warp-shuffle comparisons and `gen` register-capping (`__launch_bounds__`) — the
|
||
fast rounds are not compare-bound, and `gen` needs its registers. Warp-aggregated
|
||
atomics don't apply because the output-bucket atomics target data-dependent
|
||
random addresses (no within-warp sharing). The remaining cost is `gen` (~21 ms,
|
||
register/compute bound) and the scattered hash-table read/write latency that is
|
||
fundamental to bucketing. The OpenCL backend keeps the simpler kernels.
|
||
|
||
The hash table uses 2²¹ buckets × 32 slots (bucket cap 2× the mean occupancy, so
|
||
overflow is rare and essentially all solutions survive). The bulky 24-bit
|
||
collision blocks are only needed during the round that consumes them, so they
|
||
live in two ping-pong working buffers; only a small per-table back-reference
|
||
array (1 word/slot) is kept resident for solution recovery. That brings the
|
||
footprint to about **6 GB of VRAM per GPU** (down from ~11 GB when all seven
|
||
block tables were resident), so **8 GB cards work**. The backend reads the
|
||
device's VRAM at startup and warns if a card is too small; `ZCL_OPENCL_ROWBITS`
|
||
overrides the bucket count for experimentation (values below 21 usually find no
|
||
solutions). There is still optimisation headroom relative to mature miners like
|
||
lolMiner/EWBF.
|
||
|
||
The above describes the OpenCL backend (`kernels/equihash.cl`). The CUDA backend
|
||
takes a different route: instead of our own kernels it drives miniZ's
|
||
reverse-engineered Equihash 192,7 solver — see "CUDA backend" below.
|
||
|
||
## Build
|
||
|
||
Requirements: a Rust toolchain and an OpenCL runtime (the NVIDIA driver ships
|
||
`libOpenCL`). The CUDA backend only needs `libcuda` (the NVIDIA driver) — the
|
||
fatbin and launch trace it drives are embedded in the binary, so no CUDA toolkit
|
||
or `nvcc` is required.
|
||
|
||
```bash
|
||
cargo build --release # OpenCL backend (default)
|
||
cargo build --release --features cuda # OpenCL + CUDA backends
|
||
cargo build --release --no-default-features --features cuda # CUDA only
|
||
cargo build --release --no-default-features # CPU-only (no GPU)
|
||
```
|
||
|
||
### CUDA backend (miniZ fatbin replay)
|
||
|
||
`--features cuda` (selectable with `--backend cuda`) does **not** compile its own
|
||
kernels. It loads miniZ's reverse-engineered Equihash 192,7 solver fatbin
|
||
(embedded from `src/miniz/equihash192_7.fatbin`) through the CUDA Driver API and
|
||
replays its exact 10-kernel Wagner pipeline from an embedded launch trace,
|
||
injecting the BLAKE2b midstate + header tail into `digit_f` per solve. Recovered
|
||
indices are verified by the project's own `equihash::filter_candidates`, so only
|
||
solutions genuinely valid for the header are ever returned. Needs an NVIDIA GPU
|
||
whose arch matches the fatbin's cubins (sm_80/sm_86/sm_120). See
|
||
`collab/jmprcx-solver/` for the reverse-engineering work behind it.
|
||
|
||
**Multi-config (VRAM auto-selection).** miniZ ships several bucket geometries with
|
||
different memory footprints; `src/miniz/configs/` bundles a recorded launch trace
|
||
for each. At startup the backend reads free VRAM and picks the **highest-capacity
|
||
config that fits** (higher capacity ⇒ fewer dropped collisions ⇒ better yield):
|
||
|
||
| config | min free VRAM | table capacity |
|
||
|---|---|---|
|
||
| `12288x3392` | ~11 GB | 41.7M (best) |
|
||
| `10000x4032` | ~5.5 GB | 40.3M |
|
||
| `2048x16960` | ~5 GB | 34.7M |
|
||
|
||
So it runs on cards from ~6 GB up, using the largest config the card can hold.
|
||
Override the choice with `ZCL_CUDA_CONFIG=<name>` (e.g. for testing). If no config
|
||
fits, it fails at init with a clear message. See `src/miniz/configs/README.md`.
|
||
|
||
### GPU tuning & efficiency (NVML)
|
||
|
||
The CUDA backend tunes clocks/power via NVML (the same knobs LACT exposes). **By
|
||
default** it locks clocks and power to the card maximum (peak hashrate). Passing
|
||
**any** tuning flag switches to manual mode — only the knobs you specify are
|
||
applied, so a clock offset isn't defeated by a hard clock lock:
|
||
|
||
| flag | effect |
|
||
|---|---|
|
||
| `--power-limit <W>` | board power cap (best Sol/W lever) |
|
||
| `--gpu-clock <MHz>` / `--mem-clock <MHz>` | hard-lock the core / memory clock |
|
||
| `--gpu-clock-offset <MHz>` / `--mem-clock-offset <MHz>` | signed V/F **offsets** (LACT-style; undervolt/overclock the curve) |
|
||
| `--auto-tune` | sweep the core offset at startup for the **fastest stable solve rate** |
|
||
| `--no-gpu-tune` | touch nothing (let LACT or `nvidia-smi` own the GPU) |
|
||
|
||
**`--auto-tune`** optimizes each card for speed automatically: it raises power to
|
||
the cap, leaves clocks free to boost, then sweeps the core clock offset upward
|
||
(in +45 MHz steps), measuring solve throughput on a test header at each step. It
|
||
keeps the best and **stops at the first instability** (a kernel error or the card
|
||
no longer producing valid solutions), then locks in that offset. Takes ~30 s at
|
||
startup, runs per card, needs root, and is restored on exit. It overrides
|
||
`--gpu-clock-offset`.
|
||
|
||
A typical efficiency setup combines a power cap with a positive core offset and a
|
||
negative memory offset, letting the card boost on a shifted curve under the cap:
|
||
|
||
```bash
|
||
sudo ./jackpotminer --url pool:port -u addr --devices all \
|
||
--power-limit 250 --gpu-clock-offset 250 --mem-clock-offset -500
|
||
```
|
||
|
||
These are **privileged** (run as root); without it you get a one-line warning and
|
||
the card free-runs. Settings (clocks, power, offsets) are **restored to defaults
|
||
on clean shutdown**. The per-card stats line shows live `Sol/s`, board `W`, and
|
||
`Sol/W` (reading power is unprivileged) so you can tune efficiency directly.
|
||
|
||
> Re: LACT — its NVIDIA support drives these same NVML offsets, but LACT has no
|
||
> stable CLI to set them (a root daemon applies a saved profile), so the miner
|
||
> sets them directly via NVML. If you'd rather LACT own the GPU, run the miner
|
||
> with `--no-gpu-tune` and configure clocks in LACT.
|
||
|
||
## Usage
|
||
|
||
```bash
|
||
# List OpenCL devices
|
||
./target/release/jackpotminer --list-devices
|
||
|
||
# Mine on one GPU
|
||
./target/release/jackpotminer \
|
||
--url stratum+tcp://zcl.pool.example:3032 \
|
||
--user <ZCL-address>.<worker> \
|
||
--pass x \
|
||
--devices 0
|
||
|
||
# Mine on multiple GPUs (one worker thread each)
|
||
./target/release/jackpotminer --url ... --user ... --devices 0,1
|
||
./target/release/jackpotminer --url ... --user ... --devices all
|
||
|
||
# Use the CUDA backend instead of OpenCL (needs a --features cuda build)
|
||
./target/release/jackpotminer --url ... --user ... --backend cuda --devices all
|
||
|
||
# Force the CPU backend
|
||
./target/release/jackpotminer --url ... --user ... --cpu
|
||
|
||
# Benchmark concurrently across the chosen devices (no pool), 30 solves each
|
||
./target/release/jackpotminer --benchmark 30 --devices 0,1
|
||
|
||
# Correctness self-test (CPU solve vs GPU solve + kernel comparison)
|
||
./target/release/jackpotminer --selftest
|
||
```
|
||
|
||
Key flags: `--url`, `--port` (fallback when the URL omits one), `--user` (`-u`),
|
||
`--pass` (`-p`), `--backend` (`opencl`/`cuda`), `--devices` (e.g. `0,1` or `all`),
|
||
`--device` (`-d`, for single-device benchmark/debug), `--threads` (`-t`), `--cpu`,
|
||
`--benchmark N`, `--list-devices`, `--selftest`. When mining in a terminal a live
|
||
ratatui **dashboard** is shown by default (per-card Sol/s, power, temperature,
|
||
Sol/W, shares; a network Sol/s panel (horizontal bars) showing the last hour /
|
||
day / 3 / 7 / 30 / 60 / 90 days; per-GPU graphs; and a log pane;
|
||
`q`/`Esc`/`Ctrl-C` to quit); it falls back to periodic log lines
|
||
when there's no TTY (headless/piped/systemd) and `--no-tui` forces log output.
|
||
`--job-timeout <secs>` pauses mining when the pool
|
||
goes silent (default 300; 0 disables). Set `RUST_LOG=debug` to see the raw Stratum
|
||
traffic (ignored under `--tui`, which captures logs into its pane at info level).
|
||
|
||
## Layout
|
||
|
||
| File | Purpose |
|
||
|------|---------|
|
||
| `src/params.rs` | Equihash 192,7 constants |
|
||
| `src/blake.rs` | Personalised BLAKE2b base state |
|
||
| `src/equihash.rs` | CPU solver, verifier, compact-array encode/decode, candidate filter |
|
||
| `src/stratum.rs` | Stratum pool client |
|
||
| `src/miner.rs` | Mining loop, nonce iteration, target check, submit, backend dispatch |
|
||
| `src/gpu.rs` | OpenCL host glue: buffers, kernel orchestration |
|
||
| `kernels/equihash.cl` | OpenCL kernels: BLAKE2b, collision rounds, recovery |
|
||
| `src/cuda.rs` | CUDA backend: drives miniZ's embedded fatbin via the driver API (FFI to `libcuda`) |
|
||
| `src/miniz/` | Embedded miniZ fatbin + recorded launch trace the CUDA backend replays |
|
||
| `build.rs` | Links `libcuda` for the `cuda` feature |
|
||
|
||
## Disclaimer
|
||
|
||
For use only on hardware and pools you are authorised to use, and where
|
||
cryptocurrency mining is legal. Mining consumes significant power.
|