equihash192_7.cl's decompress/decompress2 used double + sqrt behind
cl_khr_fp64, so the kernel failed to build on OpenCL stacks without fp64
(notably rusticl/Mesa) — those workers died with 'use of type double requires
cl_khr_fp64'. Replace round(sqrt(2*x+1)) with an exact integer square root
(single-precision estimate corrected to the integer floor, then rounded; inputs
are triangular indices < ~2^26). No fp64, no behavior change on ROCm (verified
bit-identical: 77 solutions/40 nonces, same as before), and rusticl devices now
build and solve correctly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings AMD cards to parity with NVIDIA for monitoring/control surface, which was
NVML-only. New src/amd_smi.rs is a gpu_tune::GpuTuner backed by Linux amdgpu
sysfs (power1_average, temp1_input edge, freq1_input sclk, pp_dpm_sclk/mclk),
matched to the device by PCI bus id from OpenCL cl_khr_pci_bus_info. gpu_tune is
un-gated to compile under the gpu feature; open() probes NVML then amd_smi.
GpuSolver carries the tuner and Backend::Gpu dispatches power/temp/clocks, so the
TUI and --benchmark now show power, temperature, clocks and Sol/W for AMD.
Telemetry-only — setters are Unsupported (amdgpu control nodes are root-only).
--target-temp <C> adds an opt-in software governor (miner::govern_cadence) that
paces solve cadence to hold edge temperature, no hardware writes/root. With small
thermal throttle it won't beat flat-out on raw Sol/s; it's a temp/efficiency
lever. Unit-tested controller; flag/plumbing verified live.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cap the cores-per-row group size (and the dashboard 'g' cycle) by total core
count so the toggleable-row count stays sensible and small machines get finer
control: ≤4 cores toggle individually (size 1), 5-8 cores in groups of up to 2,
and more than 8 in groups of up to 4. The cap is also the default — the prior
fixed default of 4 now clamps to the tier (1/2/4), and an explicit
--cpu-group-size is clamped to the cap too.
Add max_group_size() in cpu_groups; update the help text and the cpu_groups /
control tests (the cycle test now uses 16 cores so it can exercise sizes 4/2/1).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Branch the CPU solver onto modern x86 extensions, runtime-dispatched with the
existing AVX2/scalar fallbacks:
- BatchHasher::hash8 — an 8-lane AVX-512 BLAKE2b final-block compression (native
_mm512_ror_epi64 rotates), falling back to two AVX2 hash4s or scalar. Round 0
now hashes eight g-values per chunk. round0-hash drops ~1.45x on AVX-512 CPUs
(≈225→155 ms here, AMD Zen4).
- emit_bucket software-prefetches each collision group's randomly-gathered
member slots (the ~1 GB slot arena is the round's cache-miss bottleneck),
shaving a few percent off the dominant emit phase.
Controlled A/B on this Zen4 box (same thermal state): ~4-5% faster overall.
The collision rounds are memory-bandwidth bound, so SIMD width is not the
limiter — the modern-ISA win is modest by nature. EQ_NO_AVX512 / EQ_NO_PREFETCH
opt out per-CPU (e.g. parts where AVX-512 downclocks) and back the A/B harness.
hash8 is validated against the scalar reference (batch_matches_reference) and
full solves still find valid solutions in every dispatch configuration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Building the `gpu` (OpenCL) backend on Windows no longer needs a vendor OpenCL
SDK. `cl-sys` links `OpenCL` (#[link(name = "OpenCL")]); instead of requiring an
SDK-provided OpenCL.lib, build.rs now generates a vendor-neutral import library
at build time from windows/OpenCL.def (all 118 cl-sys exports) — lib.exe for
MSVC (located via the cc crate), dlltool for MinGW — and puts it on the link
search path. The real OpenCL.dll is supplied at runtime by the GPU driver.
build.rs no-ops on non-Windows targets and when the gpu feature is off, and
warns rather than panics if the toolchain tool is absent so `cargo check` still
works. Combined with the runtime-loaded (dlopen) CUDA/NVML, a Windows build now
needs zero external GPU libraries.
Add BUILD-windows.md (toolchain, build, crt-static packaging, runtime deps, CI)
and link it from the README. Verified the whole crate compiles for
x86_64-pc-windows-gnu (default features and gpu,cuda); Linux is unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The module header still described a fixed 32-byte slot, a serial partition, and
a "not yet taken" radix scatter. Update it to reflect the current memory-traffic-
oriented design: parallel counting-sort partition_top, keys_part/keys_sorted to
make the key reads sequential, narrowing per-round packed slots with a masked
store, and the remaining slot-gather floor (with the note that the full payload
radix scatter was evaluated and loses on the wide early rounds). Comment-only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
count_pairs, low_group, and the emit group-walk all read each entry's leading
key via `keys[order[..]]` — a random gather over the whole ~128 MB keys array,
three times per round. partition_top now also produces `keys_part` (the leading
keys in partition order, keys_part[p] == keys[order[p]]), written by the same
parallel, disjoint phase-3 scatter at 4 bytes/entry. count_pairs and low_group
then stream their partition's keys sequentially, and low_group emits a
`keys_sorted` array so the emit group walk streams a dense local copy instead of
gathering keys[sorted[i]]. The only remaining DRAM-random access in the rounds is
the unavoidable slot gather.
Measured (16 threads, clamp 16/32): count ~160 -> ~10 ms/round, emit ~770 -> ~550
ms/round, partition +~80 ms (the added 128 MB scatter); full solve ~8.4 -> ~7.04 s
(~16%). Cumulative across the three CPU-solver changes: ~13.4 -> ~7.04 s (-47%),
0.07 -> 0.14 solve/s. Identical solution yield; cross-clamp validity and
full_solve_baseline pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each collision round consumes the leading 24-bit block, so a round-r output
entry only carries 8-r meaningful words — but every round previously stored a
fixed 8-word (32 B) slot, moving dead trailing words. Since the collision
rounds are memory-bandwidth bound, that wasted DRAM traffic is wasted time.
Pack each round's slots at w_out = w_in - 1 words. The XOR child producer still
loads a full 256-bit register (over-reading up to SLOT words past a narrow tail
slot — buffers carry a SLOT_SLACK pad so this stays in bounds) but masked-stores
only the w_out live lanes so packed neighbours aren't clobbered. The over-read
garbage only ever lands in non-stored lanes: storing out[0..w_out] = x[1..w_in]
uses exclusively meaningful input words. Width is threaded through
collide/collide_final/emit_bucket from solve_with (round 0 stays full 8-word).
Measured (16 threads, clamp 16/32): ~9.2 -> ~8.4 s/solve (~9%); per-round time
now shrinks r1~1230 -> r6~1045 ms. Cumulative with the parallel-partition change:
~13.4 -> ~8.4 s (-37%). Identical solution yield; xor_child_matches_scalar now
covers every width + the masked-store no-clobber property; cross-clamp validity
and full_solve_baseline pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
partition_top was the only serial stage in the otherwise rayon-parallel
collision rounds — plain `for k in 0..n` count and scatter loops that left
15/16 cores idle for ~35% of every round (~650 ms). Replace it with a
parallel counting sort: per-chunk top-bucket histograms, a small serial
pass to per-chunk base offsets, then a disjoint-region scatter through a
shared raw pointer (each chunk writes a provably non-overlapping set of
positions). Entries within a bucket become chunk-major rather than
index-major, which is immaterial: count_pairs/low_group depend only on the
low-key multiset, and solutions are canonicalized, de-duplicated, and
verified downstream.
Measured (16 threads): partition_top ~650 -> ~100 ms/round (6.5x),
collide-final ~1.18 -> ~0.59 s, full solve ~13.4 -> ~9.2 s (-31%,
0.07 -> 0.11 solve/s), with identical solution yield and all validity
tests passing.
Also add (gated/ignored, no production-path behavior change):
- full_solve_baseline: an #[ignore] throughput benchmark over realistic
dense headers (EQ_BENCH_ITERS / EQ_BENCH_CLAMPS).
- EQ_PROFILE-gated per-phase and per-collide-sub-phase timing in solve_with.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When neither --url nor a config file specifies a pool, fall back to
stratum+tcp://zcl.jackpot.tools:3333 instead of erroring. Resolution order is
unchanged: --url flag → config-file url → built-in default. `url` stays an
Option so config auto-discovery and the GUI terminal relaunch (which key off
url.is_none()) are unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AMD GPU backend:
- Add the GCN-tuned equihash192_7.cl kernel (clearCounter/blake/round1..7/
combine pipeline) and its host driver src/gpu_amd.rs. GpuSolver now dispatches
AMD-vendor OpenCL devices to it and other devices to the existing kernel
(force with ZCL_OPENCL_KERNEL=amd|legacy). Validated on an RX 9060 XT: GPU
solutions match the CPU reference 1/1.
- Expose BatchHasher::midstate() for the kernel's ulong8 hashState arg.
Runtime-loaded GPU drivers (minimum host deps):
- dlopen libcuda / libnvidia-ml via libloading instead of linking them
(src/dylib.rs macro; cuda.rs, nvml.rs, gpu_probe.rs). The binary now builds
and starts on hosts without an NVIDIA driver and reports no CUDA devices
gracefully; remove build.rs (its only job was linking those libs).
- Add Dockerfile.portable + build-portable.sh: build against Debian bullseye's
glibc 2.31 for a binary that runs on older distros and drives both AMD
(OpenCL) and NVIDIA (CUDA) cards. Document the build matrix in the README.
Mixed backend (default):
- Add --backend mixed (now the default): each card on its native backend
(NVIDIA->CUDA, AMD/Intel->OpenCL), deduped so no card is mined twice.
--devices indexes the unified list shown by --list-devices.
Misc:
- Stale-work timeout (--job-timeout) default 300s -> 600s (10 minutes).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the standalone collaborator Equihash 192,7 solver crate; it is not
part of the main miner build.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the non-Equihash algorithms and their integration points:
- delete src/ethash.rs + src/ethash/ and src/pearl.rs + src/pearl/
- remove the ethash/pearl/pearl-cuda features and pearl-only deps
(blake3, primitive-types, rand, bincode, base64) from Cargo.toml
- drop the --algo flag and the pearl/ethash dispatch branches in main.rs
- remove the pearl-cuda NVRTC linking from build.rs
- drop the stale /pearl-dump/ .gitignore entry
Builds check cleanly with default features and --no-default-features.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GPU-accelerated Equihash 192,7 miner in Rust with three solver backends:
- CPU: Wagner's algorithm, AVX2 packed slots (xenoncat-style)
- OpenCL: full on-GPU solve (kernels/equihash.cl); runs on NVIDIA and AMD
- CUDA: driver-API replay of miniZ's extracted fatbin (src/miniz/)
Also includes a default-off pearlhash backend (src/pearl/, native CPU core +
NVRTC int8-GEMM GPU kernels) and a WIP Ethash CUDA backend (src/ethash/).
Reverse-engineering scratch (alpha-miner, pearl-dump/) and the active runtime
config (mine.toml) are gitignored; mine.example.toml is the template.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>