jackpot-miner

8 Commits

Author	SHA1	Message	Date
jackpotincorporated	1b4a2a4dd9	Scatter partition-contiguous keys to kill the per-round key gathers. count_pairs, low_group, and the emit group-walk all read each entry's leading key via `keys[order[..]]`, a random gather over the whole ~128 MB keys array, three times per round. partition_top now also produces `keys_part` (the leading keys in partition order, keys_part[p] == keys[order[p]]), written by the same parallel, disjoint phase-3 scatter at 4 bytes/entry. count_pairs and low_group then stream their partition's keys sequentially, and low_group emits a `keys_sorted` array so the emit group walk streams a dense local copy instead of gathering keys[sorted[i]]. The only remaining DRAM-random access in the rounds is the unavoidable slot gather. Measured (16 threads, clamp 16/32): count ~160 -> ~10 ms/round, emit ~770 -> ~550 ms/round, partition +~80 ms (the added 128 MB scatter); full solve ~8.4 -> ~7.04 s (~16%). Cumulative across the three CPU-solver changes: ~13.4 -> ~7.04 s (-47%), 0.07 -> 0.14 solve/s. Identical solution yield; cross-clamp validity and full_solve_baseline pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:23:10 -04:00
jackpotincorporated	966ce3e262	Pack collision-round slots at shrinking per-round width. Each collision round consumes the leading 24-bit block, so a round-r output entry only carries 8-r meaningful words, but every round previously stored a fixed 8-word (32 B) slot, moving dead trailing words. Since the collision rounds are memory-bandwidth bound, that wasted DRAM traffic is wasted time. Pack each round's slots at w_out = w_in - 1 words. The XOR child producer still loads a full 256-bit register (over-reading up to SLOT words past a narrow tail slot; buffers carry a SLOT_SLACK pad so this stays in bounds) but masked-stores only the w_out live lanes so packed neighbours aren't clobbered. The over-read garbage only ever lands in non-stored lanes: storing out[0..w_out] = x[1..w_in] uses exclusively meaningful input words. Width is threaded through collide/collide_final/emit_bucket from solve_with (round 0 stays full 8-word). Measured (16 threads, clamp 16/32): ~9.2 -> ~8.4 s/solve (~9%); per-round time now shrinks r1~1230 -> r6~1045 ms. Cumulative with the parallel-partition change: ~13.4 -> ~8.4 s (-37%). Identical solution yield; xor_child_matches_scalar now covers every width + the masked-store no-clobber property; cross-clamp validity and full_solve_baseline pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:11:05 -04:00
jackpotincorporated	501527d3cb	Parallelize partition_top; add solver benchmark + phase profiling. partition_top was the only serial stage in the otherwise rayon-parallel collision rounds: plain `for k in 0..n` count and scatter loops that left 15/16 cores idle for ~35% of every round (~650 ms). Replace it with a parallel counting sort: per-chunk top-bucket histograms, a small serial pass to per-chunk base offsets, then a disjoint-region scatter through a shared raw pointer (each chunk writes a provably non-overlapping set of positions). Entries within a bucket become chunk-major rather than index-major, which is immaterial: count_pairs/low_group depend only on the low-key multiset, and solutions are canonicalized, de-duplicated, and verified downstream. Measured (16 threads): partition_top ~650 -> ~100 ms/round (6.5x), collide-final ~1.18 -> ~0.59 s, full solve ~13.4 -> ~9.2 s (-31%, 0.07 -> 0.11 solve/s), with identical solution yield and all validity tests passing. Also add (gated/ignored, no production-path behavior change): - full_solve_baseline: an #[ignore] throughput benchmark over realistic dense headers (EQ_BENCH_ITERS / EQ_BENCH_CLAMPS). - EQ_PROFILE-gated per-phase and per-collide-sub-phase timing in solve_with. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 10:42:54 -04:00
jackpotincorporated	6a753b62fa	Default the pool to zcl.jackpot.tools:3333. When neither --url nor a config file specifies a pool, fall back to stratum+tcp://zcl.jackpot.tools:3333 instead of erroring. Resolution order is unchanged: --url flag → config-file url → built-in default. `url` stays an Option so config auto-discovery and the GUI terminal relaunch (which key off url.is_none()) are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 01:23:38 -04:00
jackpotincorporated	4b5f84959c	Add AMD OpenCL kernel, runtime-loaded CUDA, mixed backend, portability. AMD GPU backend: - Add the GCN-tuned equihash192_7.cl kernel (clearCounter/blake/round1..7/combine pipeline) and its host driver src/gpu_amd.rs. GpuSolver now dispatches AMD-vendor OpenCL devices to it and other devices to the existing kernel (force with ZCL_OPENCL_KERNEL=amd|legacy). Validated on an RX 9060 XT: GPU solutions match the CPU reference 1/1. - Expose BatchHasher::midstate() for the kernel's ulong8 hashState arg. Runtime-loaded GPU drivers (minimum host deps): - dlopen libcuda / libnvidia-ml via libloading instead of linking them (src/dylib.rs macro; cuda.rs, nvml.rs, gpu_probe.rs). The binary now builds and starts on hosts without an NVIDIA driver and reports no CUDA devices gracefully; remove build.rs (its only job was linking those libs). - Add Dockerfile.portable + build-portable.sh: build against Debian bullseye's glibc 2.31 for a binary that runs on older distros and drives both AMD (OpenCL) and NVIDIA (CUDA) cards. Document the build matrix in the README. Mixed backend (default): - Add --backend mixed (now the default): each card on its native backend (NVIDIA->CUDA, AMD/Intel->OpenCL), deduped so no card is mined twice. --devices indexes the unified list shown by --list-devices. Misc: - Stale-work timeout (--job-timeout) default 300s -> 600s (10 minutes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 01:15:41 -04:00
jackpotincorporated	f3ca6a1ee4	Remove collab/jmprcx-solver. Drop the standalone collaborator Equihash 192,7 solver crate; it is not part of the main miner build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 23:34:18 -04:00
jackpotincorporated	4dd54cb839	Remove Ethash and Pearl backends; keep Equihash 192,7 only. Drop the non-Equihash algorithms and their integration points: - delete src/ethash.rs + src/ethash/ and src/pearl.rs + src/pearl/ - remove the ethash/pearl/pearl-cuda features and pearl-only deps (blake3, primitive-types, rand, bincode, base64) from Cargo.toml - drop the --algo flag and the pearl/ethash dispatch branches in main.rs - remove the pearl-cuda NVRTC linking from build.rs - drop the stale /pearl-dump/ .gitignore entry Builds check cleanly with default features and --no-default-features. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 23:31:32 -04:00
jackpotincorporated	e2fab622b5	Initial commit: jackpotminer Equihash 192,7 miner. GPU-accelerated Equihash 192,7 miner in Rust with three solver backends: - CPU: Wagner's algorithm, AVX2 packed slots (xenoncat-style) - OpenCL: full on-GPU solve (kernels/equihash.cl); runs on NVIDIA and AMD - CUDA: driver-API replay of miniZ's extracted fatbin (src/miniz/) Also includes a default-off pearlhash backend (src/pearl/, native CPU core + NVRTC int8-GEMM GPU kernels) and a WIP Ethash CUDA backend (src/ethash/). Reverse-engineering scratch (alpha-miner, pearl-dump/) and the active runtime config (mine.toml) are gitignored; mine.example.toml is the template. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 23:08:20 -04:00
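
The three-phase parallel counting sort described in commit 501527d3cb (per-chunk histograms, a serial pass to per-chunk base offsets, then a disjoint-region scatter through a shared raw pointer) can be sketched roughly as below. This is a minimal illustration and not the repository's code: it uses `std::thread::scope` rather than rayon, scatters the keys themselves rather than an `order` index array, and the names `partition_parallel`, `bucket_of`, and `SharedOut` are hypothetical.

```rust
use std::thread;

/// Hypothetical helper: bucket id taken from the top `bucket_bits` bits of a key.
fn bucket_of(key: u32, bucket_bits: u32) -> usize {
    (key >> (32 - bucket_bits)) as usize
}

/// Raw output pointer shared across threads (hypothetical wrapper type).
#[derive(Clone, Copy)]
struct SharedOut(*mut u32);
// SAFETY: every thread writes only to positions reserved for it in phase 2.
unsafe impl Send for SharedOut {}
unsafe impl Sync for SharedOut {}

impl SharedOut {
    /// Caller must guarantee `idx` is in bounds and written by no other thread.
    unsafe fn write(self, idx: usize, val: u32) {
        unsafe { *self.0.add(idx) = val }
    }
}

/// Returns the keys grouped by top bucket, plus each bucket's start offset
/// (with one trailing entry equal to the total length).
fn partition_parallel(keys: &[u32], bucket_bits: u32, n_chunks: usize) -> (Vec<u32>, Vec<usize>) {
    assert!(n_chunks > 0 && (1..=16).contains(&bucket_bits));
    let n_buckets = 1usize << bucket_bits;
    let chunk_len = ((keys.len() + n_chunks - 1) / n_chunks).max(1);
    let chunks: Vec<&[u32]> = keys.chunks(chunk_len).collect();

    // Phase 1 (parallel): one private histogram per chunk.
    let hists: Vec<Vec<usize>> = thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|c| {
                s.spawn(move || {
                    let mut h = vec![0usize; n_buckets];
                    for &k in c.iter() {
                        h[bucket_of(k, bucket_bits)] += 1;
                    }
                    h
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Phase 2 (serial, small): per-chunk base offset inside every bucket.
    let mut bucket_start = vec![0usize; n_buckets + 1];
    let mut bases = vec![vec![0usize; n_buckets]; chunks.len()];
    let mut pos = 0;
    for b in 0..n_buckets {
        bucket_start[b] = pos;
        for c in 0..chunks.len() {
            bases[c][b] = pos; // chunk c owns [pos, pos + hists[c][b])
            pos += hists[c][b];
        }
    }
    bucket_start[n_buckets] = pos;

    // Phase 3 (parallel): each chunk scatters into its own reserved ranges.
    let mut out = vec![0u32; keys.len()];
    let shared = SharedOut(out.as_mut_ptr());
    thread::scope(|s| {
        for (c, mut cursor) in chunks.iter().zip(bases) {
            s.spawn(move || {
                for &k in c.iter() {
                    let b = bucket_of(k, bucket_bits);
                    // SAFETY: cursor[b] stays inside this chunk's range for bucket b.
                    unsafe { shared.write(cursor[b], k) };
                    cursor[b] += 1;
                }
            });
        }
    });
    (out, bucket_start)
}
```

Within a bucket the output is ordered chunk-major (all of chunk 0's entries, then chunk 1's, and so on), which matches the ordering change that commit notes as immaterial. For example, `partition_parallel(&keys, 4, 3)` groups `keys` into 16 buckets using three worker chunks.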