OpenCL: de-duplicate the same physical GPU across platforms

A GPU exposed by both a vendor runtime (ROCm) and rusticl/Mesa appeared twice in the device list, so mining 'all' ran each card twice (pure contention). Add a single canonical enumerate_devices() — used by list_devices, device_is_nvidia, cpu_device_index and pick_device — that dedupes by physical GPU and prefers the vendor runtime over Mesa. Dedup key is the PCI address: ROCm/NVIDIA expose it via cl_khr_pci_bus_info; rusticl doesn't, but its cl_khr_device_uuid encodes the PCI BDF, so the same card yields the same key on both. Devices without either (CPU/PoCL) are never deduped. No behavior change on single-platform hosts (nothing to dedup); here the list drops 4->2 (both physical GPUs on ROCm, ~38 Sol/s) and device indices are unchanged for the kept devices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AMD kernel: drop fp64 dependency (build on rusticl/Mesa OpenCL)
2026-06-06 20:57:19 -04:00 · 2026-06-06 20:39:45 -04:00 · 2026-06-06 20:17:59 -04:00 · 2026-06-06 18:34:10 -04:00 · 2026-06-06 13:12:57 -04:00 · 2026-06-06 12:38:15 -04:00
16 changed files with 1506 additions and 242 deletions
@@ -0,0 +1,119 @@
 # Building jackpotminer on Windows
 The miner builds on Windows with **no external GPU SDKs**. The CUDA driver and
 NVML are loaded at runtime (`dlopen`/`LoadLibrary`), and the OpenCL import
 library is generated at build time from `windows/OpenCL.def` — so you don't need
 the CUDA Toolkit, an OpenCL SDK, or any vendor libraries to compile. The whole
 codebase (miner + the `jackpotminer-config` GUI) is verified to compile for
 `x86_64-pc-windows-*`.
 ## Prerequisites
 - **Rust** (https://rustup.rs). The default target is `x86_64-pc-windows-msvc`.
 - **MSVC build tools** — Visual Studio 2019/2022 or the standalone *Build Tools
  for Visual Studio* with the **“Desktop development with C++”** workload (gives
  `link.exe`, `lib.exe`, and the Windows SDK).
  - Build from a **“x64 Native Tools Command Prompt for VS”** (or any shell where
    the MSVC environment is active) so `lib.exe` is found. `cargo` also locates
    it automatically via the registry in most cases.
 - *(Alternative toolchain)* the GNU target `x86_64-pc-windows-gnu` works too,
  with **MinGW-w64** on `PATH` (provides `dlltool` and the linker).
 No CUDA Toolkit and **no OpenCL SDK** are required.
 ## Build
 ```powershell
 :: Default: OpenCL + CUDA backends + the GUI config tool
 cargo build --release
 :: Miner only (no GUI), both GPU backends
 cargo build --release --no-default-features --features gpu,cuda
 :: OpenCL only (AMD / Intel / NVIDIA)
 cargo build --release --no-default-features --features gpu
 :: NVIDIA only — needs nothing external at build time (CUDA is dlopen'd)
 cargo build --release --no-default-features --features cuda
 ```
 Outputs: `target\release\jackpotminer.exe` (and `jackpotminer-config.exe` with
 the default features).
 ### How the OpenCL build dependency is avoided
 `ocl`/`cl-sys` link `OpenCL` (`#[link(name = "OpenCL")]`), which normally needs
 an `OpenCL.lib` from a vendor SDK. Instead, `build.rs` generates a
 vendor-neutral import library from `windows/OpenCL.def`:
 - **MSVC:** `lib.exe /def:windows\OpenCL.def /out:OpenCL.lib /machine:X64`
 - **GNU:** `dlltool -d windows/OpenCL.def -l libOpenCL.a -m i386:x86-64`
 and puts it on the link search path. The import library only forwards to
 `OpenCL.dll`, which the GPU driver provides at runtime. If the toolchain tool
 isn’t on `PATH`, `build.rs` prints a warning and the link step fails with a clear
 message (compilation/`cargo check` still works).
 ## Distribution
 Statically link the MSVC C runtime so users don’t need the VC++ redistributable:
 ```powershell
 set RUSTFLAGS=-C target-feature=+crt-static
 cargo build --release
 ```
 ## Runtime dependencies (on the mining machine)
 - **`OpenCL.dll`** — the ICD loader, installed with any GPU driver (AMD/NVIDIA/
  Intel). Required for the OpenCL backend.
 - **`nvcuda.dll`** + **`nvml.dll`** — installed with the NVIDIA driver. Loaded
  on demand for the CUDA backend; absent on AMD-only machines, where the miner
  simply reports no CUDA devices.
 - The **VC++ runtime**, unless you built with `+crt-static`.
 A `cuda`-enabled binary still starts on a machine with no NVIDIA driver.
 ## Building Windows binaries without a Windows machine
 ### GitHub Actions (recommended)
 ```yaml
 name: windows
 on: [push, workflow_dispatch]
 jobs:
  build:
    runs-on: windows-latest          # MSVC + lib.exe already on PATH
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo build --release --features gpu,cuda
        env:
          RUSTFLAGS: -C target-feature=+crt-static
      - uses: actions/upload-artifact@v4
        with:
          name: jackpotminer-windows
          path: target/release/*.exe
 ```
 ### Cross-compile from Linux (GNU target)
 ```bash
 rustup target add x86_64-pc-windows-gnu
 sudo pacman -S mingw-w64           # or your distro's mingw-w64 (gives dlltool + linker)
 cargo build --release --target x86_64-pc-windows-gnu --no-default-features --features gpu,cuda
 ```
 The MSVC target can’t be linked from Linux. The GUI config tool (`eframe`) is
 easiest to build natively on Windows.
 ## Status / caveats
 - **Compilation for Windows is verified** here via `cargo check
  --target x86_64-pc-windows-gnu` (default features, and `gpu,cuda`).
 - The OpenCL **import-library linking** uses the standard `lib.exe`/`dlltool`
  technique; validate it with an actual Windows (or MinGW cross) build, which
  needs those tools present.
 - `relaunch_in_terminal` (reopen-in-a-terminal on GUI launch) is Linux-only;
  harmless on Windows, where double-clicking a console binary already opens a
  console.
@@ -2230,6 +2230,7 @@ version = "0.1.0"
 dependencies = [
 "anyhow",
 "blake2b_simd",
 "cc",
 "clap",
 "core_affinity",
 "ctrlc",
@@ -45,6 +45,11 @@ name = "jackpotminer-config"
 path = "src/config_gui.rs"
 required-features = ["config-gui"]
 [build-dependencies]
 # Locates the MSVC `lib.exe` (to generate the OpenCL import library on Windows);
 # unused on other platforms. See build.rs.
 cc = "1"
 [profile.release]
 opt-level = 3
 lto = true
@@ -114,6 +114,13 @@ cargo build --release --no-default-features --features cuda       # CUDA only
 cargo build --release --no-default-features                       # CPU-only (no GPU)
 ```
 ### Windows
 Builds with no external GPU SDKs — CUDA/NVML are loaded at runtime and the
 OpenCL import library is generated at build time. See
 [BUILD-windows.md](BUILD-windows.md) for the toolchain, build, and packaging
 steps.
 ### Portable / distributable builds
 The miner's only runtime dependencies are the C library and the OpenCL ICD loader
@@ -0,0 +1,79 @@
 //! Build script — Windows OpenCL import library only.
 //!
 //! On Windows, the `ocl`/`cl-sys` crates link `OpenCL` (`#[link(name =
 //! "OpenCL")]`), which normally requires an `OpenCL.lib` import library from a
 //! vendor OpenCL SDK. To avoid that build dependency, we generate a
 //! vendor-neutral import library ourselves from `windows/OpenCL.def` (the list
 //! of OpenCL exports) using the toolchain's own tools — `lib.exe` for MSVC,
 //! `dlltool` for the GNU (MinGW) toolchain — and put it on the link search path.
 //! The real `OpenCL.dll` (the ICD loader) is supplied at runtime by the GPU
 //! driver, exactly like `libOpenCL.so.1` on Linux.
 //!
 //! Nothing here is needed on non-Windows targets (the system `libOpenCL` is used
 //! directly) or when the `gpu` (OpenCL) feature is off — the script no-ops. The
 //! CUDA driver / NVML are loaded at runtime via dlopen (see `src/dylib.rs`), so
 //! they need no build-script support.
 use std::env;
 use std::path::PathBuf;
 use std::process::Command;
 fn main() {
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-changed=windows/OpenCL.def");
    // Only Windows targets with the OpenCL backend need an import library.
    if env::var("CARGO_CFG_TARGET_OS").as_deref() != Ok("windows") {
        return;
    }
    if env::var_os("CARGO_FEATURE_GPU").is_none() {
        return;
    }
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    let def = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap()).join("windows/OpenCL.def");
    let target = env::var("TARGET").unwrap_or_default();
    let is_msvc = env::var("CARGO_CFG_TARGET_ENV").as_deref() == Ok("msvc");
    let (mut cmd, tool) = if is_msvc {
        // lib.exe /def:OpenCL.def /out:OpenCL.lib /machine:X64
        let mut c = cc::windows_registry::find(&target, "lib.exe")
            .unwrap_or_else(|| Command::new("lib.exe"));
        c.arg("/nologo")
            .arg(format!("/def:{}", def.display()))
            .arg(format!("/out:{}", out_dir.join("OpenCL.lib").display()))
            .arg("/machine:X64");
        (c, "lib.exe")
    } else {
        // GNU/MinGW: dlltool -d OpenCL.def -l libOpenCL.a -m i386:x86-64
        let dlltool = ["x86_64-w64-mingw32-dlltool", "dlltool"]
            .into_iter()
            .find(|t| Command::new(t).arg("--version").output().is_ok())
            .unwrap_or("dlltool");
        let mut c = Command::new(dlltool);
        c.arg("-d")
            .arg(&def)
            .arg("-l")
            .arg(out_dir.join("libOpenCL.a"))
            .arg("-m")
            .arg("i386:x86-64");
        (c, "dlltool")
    };
    match cmd.status() {
        Ok(s) if s.success() => {
            // Resolve `#[link(name = "OpenCL")]` against the generated import lib.
            println!("cargo:rustc-link-search=native={}", out_dir.display());
        }
        Ok(s) => panic!("{tool} failed ({s}) generating the OpenCL import library from {}", def.display()),
        Err(e) => {
            // Tool not found: let `cargo check` (which doesn't link) still succeed;
            // a real build will fail at link with a clear "cannot find OpenCL".
            println!(
                "cargo:warning=could not run {tool} to generate the OpenCL import library ({e}); \
                 ensure the toolchain tools are on PATH (MSVC dev prompt, or mingw-w64). \
                 Linking the `gpu` feature will fail until then."
            );
        }
    }
 }
@@ -2,8 +2,6 @@
 //#define PRINT 1
 #pragma OPENCL EXTENSION cl_khr_fp64 : enable
 __constant ulong blake_iv[] =
 {
    0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
@@ -507,14 +505,22 @@ uint compress2(uint in0, uint in1) {
 	return tmp;	
 }
 // Exact round(sqrt(x)) without fp64: a single-precision estimate corrected to the
 // exact integer floor, then rounded. Lets the kernel build on OpenCL stacks that
 // lack cl_khr_fp64 (e.g. rusticl/Mesa) while staying bit-identical on ROCm. Inputs
 // are triangular indices (< ~2^26), well within range for the float estimate +
 // integer correction.
 inline uint isqrt_round(ulong x) {
 	long m = (long) sqrt((float) x);
 	while (m > 0 && (ulong)(m * m) > x) m--;       // correct down to floor(sqrt)
 	while ((ulong)((m + 1) * (m + 1)) <= x) m++;   // correct up to floor(sqrt)
 	// round to nearest: round up iff the fractional part is >= 0.5.
 	return (uint)(((x - (ulong)(m * m)) > (ulong) m) ? (m + 1) : m);
 }
 uint2 decompress(uint2 in) {
 	double inFl = (double) (in.s0 >> 6);
 	inFl *= 2.0;
 	inFl += 1.0;
 	uint2 res;
-	res.s0 = (uint) round(sqrt(inFl));
+	res.s0 = isqrt_round(2ul * (ulong)(in.s0 >> 6) + 1ul);
 	uint tmp = res.s0 * (res.s0-1);
@@ -535,13 +541,8 @@ uint2 decompress(uint2 in) {
 }
 uint2 decompress2(uint in) {
 	double inFl = (double) in;
 	inFl *= 2.0;
 	inFl += 1.0;
 	uint2 res;
-	res.s0 = (uint) round(sqrt(inFl));
+	res.s0 = isqrt_round(2ul * (ulong) in + 1ul);
 	uint tmp = res.s0 * (res.s0-1);
@@ -0,0 +1,204 @@
 //! AMD GPU telemetry for [`crate::gpu_tune::GpuTuner`] via the Linux amdgpu
 //! sysfs interface (`/sys/class/drm/cardN/device/...`).
 //!
 //! Telemetry only: board power, edge temperature, and core/memory clocks come
 //! from the world-readable hwmon + DPM nodes, so it works unprivileged. The
 //! control nodes (`pp_od_clk_voltage`, power cap, performance level) are
 //! root-write-only and carry GPU-hang risk, so every setter returns
 //! [`SetOutcome::Unsupported`] — this backend never writes. The handle is matched
 //! to the physical card by PCI bus id (e.g. from OpenCL's `cl_khr_pci_bus_info`),
 //! so it lines up with whichever device the solver actually opened.
 use std::path::{Path, PathBuf};
 use crate::gpu_tune::{GpuTuner, SetOutcome};
 /// One amdgpu card's sysfs telemetry handle.
 pub struct AmdTuner {
    /// Canonical `/sys/.../<PCI BDF>` device directory (holds `pp_dpm_*`).
    device_dir: PathBuf,
    /// `device_dir/hwmon/hwmonM` (the index M is not stable — resolved by glob).
    hwmon_dir: PathBuf,
    name: String,
 }
 // Only `PathBuf`/`String` — `AmdTuner` is `Send` automatically; no `unsafe impl`.
 /// Open a telemetry handle for the amdgpu card at `pci_bus_id` (e.g.
 /// "0000:03:00.0"). Returns `None` off Linux, or when no amdgpu card with
 /// telemetry nodes matches the bus id.
 pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
    #[cfg(not(target_os = "linux"))]
    {
        let _ = pci_bus_id;
        None
    }
    #[cfg(target_os = "linux")]
    {
        let want = bdf_tail(pci_bus_id)?;
        for entry in std::fs::read_dir("/sys/class/drm").ok()?.flatten() {
            let fname = entry.file_name();
            let fname = fname.to_string_lossy();
            // Match the GPU nodes ("card0", "card1", …), not the per-connector
            // dirs ("card1-DP-1") or render nodes.
            if !fname.starts_with("card") || fname.contains('-') {
                continue;
            }
            // `cardN/device` symlinks to the PCI device dir `…/<BDF>`.
            let Ok(device_dir) = std::fs::canonicalize(entry.path().join("device")) else {
                continue;
            };
            let matches = device_dir
                .file_name()
                .and_then(|s| s.to_str())
                .and_then(bdf_tail)
                .map(|t| t == want)
                .unwrap_or(false);
            if !matches {
                continue;
            }
            let Some(hwmon_dir) = find_hwmon(&device_dir) else { continue };
            // Require a telemetry node so we don't attach to a card without
            // sensors (e.g. some virtual/headless devices).
            if !hwmon_dir.join("temp1_input").exists()
                && !hwmon_dir.join("power1_average").exists()
            {
                continue;
            }
            let name = read_name(&device_dir);
            return Some(Box::new(AmdTuner { device_dir, hwmon_dir, name }));
        }
        None
    }
 }
 /// First `hwmon/hwmon*` subdir of a device dir (the index isn't stable).
 fn find_hwmon(device_dir: &Path) -> Option<PathBuf> {
    for e in std::fs::read_dir(device_dir.join("hwmon")).ok()?.flatten() {
        if e.file_name().to_string_lossy().starts_with("hwmon") {
            return Some(e.path());
        }
    }
    None
 }
 /// A display name for the card; amdgpu sysfs rarely exposes a marketing name, so
 /// fall back to a generic label.
 fn read_name(device_dir: &Path) -> String {
    read_trim(&device_dir.join("product_name"))
        .filter(|s| !s.is_empty())
        .unwrap_or_else(|| "AMD GPU".to_string())
 }
 /// Normalize a PCI BDF to its `bus:device.function` tail (drop the domain), so
 /// e.g. "0000:03:00.0" and "03:00.0" compare equal.
 fn bdf_tail(bdf: &str) -> Option<String> {
    let bdf = bdf.trim().to_ascii_lowercase();
    match bdf.split(':').collect::<Vec<_>>().as_slice() {
        [_domain, bus, devfunc] => Some(format!("{bus}:{devfunc}")),
        [bus, devfunc] => Some(format!("{bus}:{devfunc}")),
        _ => None,
    }
 }
 fn read_trim(p: &Path) -> Option<String> {
    std::fs::read_to_string(p).ok().map(|s| s.trim().to_string())
 }
 fn read_u64(p: &Path) -> Option<u64> {
    read_trim(p)?.parse().ok()
 }
 /// Parse the MHz value out of a `pp_dpm_*` line like `"2: 2700Mhz *"`.
 fn parse_mhz(line: &str) -> Option<u32> {
    let after_colon = line.split(':').nth(1)?;
    let token = after_colon.trim().split_whitespace().next()?;
    let digits: String = token.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
 }
 impl AmdTuner {
    /// MHz of the currently active DPM level (the line marked `*`) in `file`.
    fn dpm_active_mhz(&self, file: &str) -> Option<u32> {
        let s = read_trim(&self.device_dir.join(file))?;
        s.lines().find(|l| l.contains('*')).and_then(parse_mhz)
    }
    /// Highest DPM level (MHz) listed in `file`.
    fn dpm_max_mhz(&self, file: &str) -> Option<u32> {
        let s = read_trim(&self.device_dir.join(file))?;
        s.lines().filter_map(parse_mhz).max()
    }
 }
 impl GpuTuner for AmdTuner {
    fn name(&self) -> String {
        self.name.clone()
    }
    fn watts(&self) -> Option<f64> {
        // power1_average is µW; fall back to the instantaneous power1_input.
        let uw = read_u64(&self.hwmon_dir.join("power1_average"))
            .or_else(|| read_u64(&self.hwmon_dir.join("power1_input")))?;
        Some(uw as f64 / 1_000_000.0)
    }
    fn temperature_c(&self) -> Option<u32> {
        // temp1 = edge (m°C).
        let mc = read_u64(&self.hwmon_dir.join("temp1_input"))?;
        Some(((mc + 500) / 1000) as u32)
    }
    fn current_power_limit_w(&self) -> Option<u32> {
        // Absent on Navi 44 (RX 9060 XT); best-effort for cards that expose it.
        read_u64(&self.hwmon_dir.join("power1_cap")).map(|uw| (uw / 1_000_000) as u32)
    }
    fn core_clock_mhz(&self) -> Option<u32> {
        // freq1_input (Hz) is the live sclk; the DPM active level is the fallback.
        if let Some(hz) = read_u64(&self.hwmon_dir.join("freq1_input")) {
            if hz > 0 {
                return Some((hz / 1_000_000) as u32);
            }
        }
        self.dpm_active_mhz("pp_dpm_sclk")
    }
    fn mem_clock_mhz(&self) -> Option<u32> {
        self.dpm_active_mhz("pp_dpm_mclk")
    }
    fn max_core_clock_mhz(&self) -> Option<u32> {
        self.dpm_max_mhz("pp_dpm_sclk")
    }
    fn max_mem_clock_mhz(&self) -> Option<u32> {
        self.dpm_max_mhz("pp_dpm_mclk")
    }
    fn power_limit_range_w(&self) -> Option<(u32, u32)> {
        let mn = read_u64(&self.hwmon_dir.join("power1_cap_min"))?;
        let mx = read_u64(&self.hwmon_dir.join("power1_cap_max"))?;
        Some(((mn / 1_000_000) as u32, (mx / 1_000_000) as u32))
    }
    // Telemetry-only backend: never writes the root-only control nodes.
    fn set_persistence(&self, _on: bool) -> SetOutcome {
        SetOutcome::Unsupported
    }
    fn lock_core_clock_mhz(&self, _mhz: u32) -> SetOutcome {
        SetOutcome::Unsupported
    }
    fn lock_mem_clock_mhz(&self, _mhz: u32) -> SetOutcome {
        SetOutcome::Unsupported
    }
    fn set_power_limit_w(&self, _watts: u32) -> SetOutcome {
        SetOutcome::Unsupported
    }
    fn set_core_offset_mhz(&self, _mhz: i32) -> SetOutcome {
        SetOutcome::Unsupported
    }
    fn set_mem_offset_mhz(&self, _mhz: i32) -> SetOutcome {
        SetOutcome::Unsupported
    }
    fn reset(&self) {}
 }
@@ -215,6 +215,92 @@ impl BatchHasher {
        }
    }
    /// Hash eight consecutive indices `g0..g0+8`, writing each 48-byte digest
    /// into `out[0..8]`. Uses AVX-512 (one 8-wide BLAKE2b compression) when
    /// available, else two AVX2 `hash4`s. Modern AMD (Zen4+) and Intel CPUs run
    /// AVX-512 without the clock penalty older Intel parts had.
    #[inline]
    pub fn hash8(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx512f") {
                unsafe { self.hash8_avx512(g0, out) };
                return;
            }
        }
        // Fallback: two 4-lane batches into the two halves of `out`.
        let (lo, hi) = out.split_at_mut(4);
        self.hash4(g0, (&mut lo[..4]).try_into().unwrap());
        self.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
    }
    /// Eight-lane BLAKE2b final-block compression (AVX-512). Same structure as
    /// [`Self::hash4_avx2`] but 512-bit lanes and native 64-bit rotates.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx512f")]
    unsafe fn hash8_avx512(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
        use core::arch::x86_64::*;
        #[inline(always)]
        unsafe fn g8(
            v: &mut [__m512i; 16],
            a: usize, b: usize, c: usize, d: usize,
            x: __m512i, y: __m512i,
        ) {
            v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), x);
            v[d] = _mm512_ror_epi64::<32>(_mm512_xor_si512(v[d], v[a]));
            v[c] = _mm512_add_epi64(v[c], v[d]);
            v[b] = _mm512_ror_epi64::<24>(_mm512_xor_si512(v[b], v[c]));
            v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), y);
            v[d] = _mm512_ror_epi64::<16>(_mm512_xor_si512(v[d], v[a]));
            v[c] = _mm512_add_epi64(v[c], v[d]);
            v[b] = _mm512_ror_epi64::<63>(_mm512_xor_si512(v[b], v[c]));
        }
        // Only m0 and m1 are nonzero; m1's high 32 bits hold the per-lane `g`.
        let tail0 = u64::from_le_bytes(self.tail[0..8].try_into().unwrap());
        let tail_hi = u32::from_le_bytes(self.tail[8..12].try_into().unwrap()) as u64;
        let m1 = |g: u32| (tail_hi | ((g as u64) << 32)) as i64;
        let mut m = [_mm512_setzero_si512(); 16];
        m[0] = _mm512_set1_epi64(tail0 as i64);
        m[1] = _mm512_set_epi64(
            m1(g0 + 7), m1(g0 + 6), m1(g0 + 5), m1(g0 + 4),
            m1(g0 + 3), m1(g0 + 2), m1(g0 + 1), m1(g0),
        );
        let mut v = [_mm512_setzero_si512(); 16];
        for i in 0..8 {
            v[i] = _mm512_set1_epi64(self.mid[i] as i64);
            v[i + 8] = _mm512_set1_epi64(IV[i] as i64);
        }
        v[12] = _mm512_xor_si512(v[12], _mm512_set1_epi64(FINAL_COUNT as i64));
        v[14] = _mm512_xor_si512(v[14], _mm512_set1_epi64(-1)); // last-block flag
        for s in &SIGMA {
            g8(&mut v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
            g8(&mut v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
            g8(&mut v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
            g8(&mut v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
            g8(&mut v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
            g8(&mut v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
            g8(&mut v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
            g8(&mut v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
        }
        // h[i] = mid[i] ^ v[i] ^ v[i+8]; first HASH_OUTPUT/8 words per lane, LE.
        let mut tmp = [0u64; 8];
        for i in 0..HASH_OUTPUT / 8 {
            let o = _mm512_xor_si512(
                _mm512_xor_si512(_mm512_set1_epi64(self.mid[i] as i64), v[i]),
                v[i + 8],
            );
            _mm512_storeu_si512(tmp.as_mut_ptr() as *mut _, o);
            for l in 0..8 {
                out[l][i * 8..i * 8 + 8].copy_from_slice(&tmp[l].to_le_bytes());
            }
        }
    }
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn hash4_avx2(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 4]) {
@@ -316,5 +402,14 @@ mod tests {
                assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash4 mismatch at g={}", base_g + l as u32);
            }
        }
        // hash8 (AVX-512 or hash4 fallback) must match for several batches.
        for base_g in [0u32, 8, 64, 1_000_000] {
            let mut out = [[0u8; HASH_OUTPUT]; 8];
            hasher.hash8(base_g, &mut out);
            for l in 0..8 {
                assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash8 mismatch at g={}", base_g + l as u32);
            }
        }
    }
 }
@@ -175,15 +175,15 @@ mod tests {
    #[test]
    fn protocol_get_and_set() {
        let controls = Controls::new(1, 0, 0, 0, 0, true);
-        let cpu = CpuMining::new((0..8).collect(), 4, false); // sizes [1,2,4,8]
+        let cpu = CpuMining::new((0..16).collect(), 4, false); // 16 cores -> tier cap 4, sizes [1,2,4]
        let stats = Stats::for_test(vec!["GPU 0".into()]);
-        // get: one device, cpu rows reflect the grouping (8 cores / 4 = 2 rows).
+        // get: one device, cpu rows reflect the grouping (16 cores / 4 = 4 rows).
        let v = process("{\"op\":\"get\"}", &controls, &cpu, &stats);
        assert_eq!(v["ok"], true);
        assert_eq!(v["devices"].as_array().unwrap().len(), 1);
        assert_eq!(v["cpu"]["group_size"], 4);
-        assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 2);
+        assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 4);
        // set device 0 disabled.
        assert!(controls.device(0).enabled());
@@ -165,6 +165,21 @@ fn grouped(cores: &[usize], size: usize, enabled: &BTreeSet<usize>) -> Arc<CpuGr
    groups
 }
 /// Largest cores-per-row the dashboard allows, by total core count. Finer
 /// control on small machines, coarser on big ones so the toggleable-row count
 /// stays manageable: **≤4 cores** toggle individually (size 1), **5-8 cores** in
 /// groups of up to 2, and **more than 8** in groups of up to 4. This caps both
 /// the starting group size and the 'g' cycle.
 fn max_group_size(cores: usize) -> usize {
    if cores <= 4 {
        1
    } else if cores <= 8 {
        2
    } else {
        4
    }
 }
 /// Live controller for CPU mining. Owns the selected cores plus the current
 /// group size, and rebuilds the [`CpuGroups`] when the size is cycled from the
 /// dashboard (the worker supervisor watches `group_size()` and respawns to
@@ -188,14 +203,16 @@ impl CpuMining {
    /// the starting group size (`--cpu-group-size`), and `start_enabled` whether
    /// mining begins on (`--cpu-mining`).
    pub fn new(cores: Vec<usize>, initial_size: usize, start_enabled: bool) -> Arc<Self> {
-        let initial_size = initial_size.max(1);
+        // Toggle granularity scales with core count so the row count stays
        // manageable (see [`max_group_size`]): the cap bounds both the starting
        // size and the dashboard 'g' cycle.
        let cap = max_group_size(cores.len());
        let initial_size = initial_size.clamp(1, cap);
        let enabled: BTreeSet<usize> = if start_enabled { cores.iter().copied().collect() } else { BTreeSet::new() };
        let groups = grouped(&cores, initial_size, &enabled);
-        // Cycle list: the usual powers of two plus the requested size, capped so a
+        // Cycle list: powers of two from 1 up to the cap, plus the (clamped)
-        // group never exceeds the core count (unless the user explicitly asked for
+        // requested size, sorted and de-duplicated.
        // a larger size), sorted and de-duplicated.
        let cap = cores.len().max(initial_size).max(1);
        let mut sizes: Vec<usize> = [1usize, 2, 4, 8]
            .into_iter()
            .chain([initial_size])
@@ -346,39 +363,67 @@ mod tests {
    }
    #[test]
-    fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
+    fn group_size_tier_by_core_count() {
-        // 8 cores, start at size 4 fully enabled -> two groups [0-3],[4-7], both on.
+        // ≤4 cores: individual cores only — size 1, no larger option.
-        let m = CpuMining::new((0..8).collect(), 4, true);
+        let m = CpuMining::new((0..4).collect(), 4, false); // requested 4 clamps to 1
        assert_eq!(m.group_size(), 1);
        assert_eq!(m.groups().len(), 4);
        m.cycle_group_size();
        assert_eq!(m.group_size(), 1); // only one size, cycle is a no-op
        // 5-8 cores: groups of up to 2 (requested 4 clamps to 2).
        let m = CpuMining::new((0..8).collect(), 4, false);
        assert_eq!(m.group_size(), 2);
        assert_eq!(m.groups().len(), 4);
        m.cycle_group_size();
        assert_eq!(m.group_size(), 1); // cycle covers {1, 2}
        // >8 cores: groups of up to 4 (the default).
        let m = CpuMining::new((0..16).collect(), 4, false);
        assert_eq!(m.group_size(), 4);
-        assert_eq!(m.groups().len(), 2);
+        assert_eq!(m.groups().len(), 4);
        // A larger explicit request is still capped at the tier max.
        let m = CpuMining::new((0..16).collect(), 8, false);
        assert_eq!(m.group_size(), 4);
    }
    #[test]
    fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
        // 16 cores (tier cap 4): start at size 4 fully enabled -> four groups,
        // all on.
        let m = CpuMining::new((0..16).collect(), 4, true);
        assert_eq!(m.group_size(), 4);
        assert_eq!(m.groups().len(), 4);
        assert!(m.groups().iter().all(|g| g.enabled()));
-        // Disable the second group (cores 4-7): now only cores 0-3 are enabled.
+        // Disable the last group (cores 12-15): now only cores 0-11 are enabled.
-        m.toggle_group(1);
+        m.toggle_group(3);
-        assert!(m.groups().group(0).enabled());
+        assert!(!m.groups().group(3).enabled());
        assert!(!m.groups().group(1).enabled());
-        // Cycle to size 8 and rebuild: the single [0-7] group is off, because not
+        // Cycle to size 1 (individual cores) and rebuild: 16 rows; cores 0-11 on,
-        // all of its cores were enabled.
+        // 12-15 off.
-        while m.group_size() != 8 {
+        while m.group_size() != 1 {
            m.cycle_group_size();
        }
        let g = m.rebuild();
-        assert_eq!(g.len(), 1);
+        assert_eq!(g.len(), 16);
-        assert!(!g.group(0).enabled());
+        assert!(g.group(0).enabled()); // core 0
        assert!(g.group(11).enabled()); // core 11
        assert!(!g.group(12).enabled()); // core 12
        assert!(!g.group(15).enabled()); // core 15
-        // Cycle to size 2 and rebuild: cores 0-3 are still tracked as enabled, so
+        // Cycle to size 2 and rebuild: cores 0-11 are still tracked as enabled, so
-        // [0,1] and [2,3] come back on while [4,5] and [6,7] stay off — the choice
+        // [0,1]..[10,11] come back on while [12,13],[14,15] stay off — the choice
-        // survived two regroups (not derived from the all-off size-8 grouping).
+        // survived two regroups.
        while m.group_size() != 2 {
            m.cycle_group_size();
        }
        let g = m.rebuild();
-        assert_eq!(g.len(), 4);
+        assert_eq!(g.len(), 8);
        assert!(g.group(0).enabled()); // [0,1]
-        assert!(g.group(1).enabled()); // [2,3]
+        assert!(g.group(5).enabled()); // [10,11]
-        assert!(!g.group(2).enabled()); // [4,5]
+        assert!(!g.group(6).enabled()); // [12,13]
-        assert!(!g.group(3).enabled()); // [6,7]
+        assert!(!g.group(7).enabled()); // [14,15]
        assert!(m.generation() >= 2);
    }
@@ -131,23 +131,31 @@ fn leaf_row(base: &State, idx: u32) -> Row {
 // histogram that has to be zeroed, scatter-filled, and cloned every round). Two
 // entries can collide only when their *whole* leading block matches, so they
 // always land in the same partition — making the partitions independent
-// collision domains that are processed in parallel across the rayon pool the
+// collision domains processed in parallel across the rayon pool.
 // single CPU worker otherwise leaves idle.
 //
-// The data layout follows xenoncat's packed slots and targets AVX2 throughput:
+// The data layout targets AVX2 throughput; the round is memory-bandwidth bound,
-//   * Entries live in fixed 32-byte `SLOT`s (one `__m256i`); producing a child
+// so every choice below exists to cut DRAM traffic / random access:
-//     is a single load/xor/permute/store (`xor_child_avx2`).
+//   * `partition_top` is a parallel counting sort (per-chunk histograms ->
-//   * A dense parallel `keys[]` array mirrors each entry's leading block, so the
+//     per-chunk base offsets -> disjoint-region scatter), not a serial pass.
-//     `partition_top`/`count_pairs`/`low_group` histogram sweeps stream over
+//   * It also emits `keys_part` (the leading keys in partition order) so the
-//     4 bytes/entry instead of striding the slots.
+//     per-partition `count_pairs`/`low_group` sweeps read keys sequentially
 //     instead of gathering `keys[order[..]]`; `low_group` likewise emits
 //     `keys_sorted` so the emit group walk streams a dense local copy.
 //   * Slots hold residual hash words in a *narrowing* packed slot: round 0 is
 //     `SLOT` words; each round consumes the leading block, so round r packs at
 //     `SLOT - r` words. The XOR child is one 256-bit load/xor/permute that
 //     masked-stores the live lanes (`xor_child_avx2`); buffers carry a
 //     `SLOT_SLACK` pad for the 256-bit over-read of a narrow tail slot.
 //   * `collide` runs two passes — count pairs, then emit children directly into
 //     one pre-sized arena at per-partition offsets — so there is no per-bucket
 //     allocation and no final concatenation copy; the `keys`/`slots` buffers
 //     ping-pong between rounds.
 //   * Round 0 repacks BLAKE2b digests into 24-bit blocks with `pshufb`.
-// A further bandwidth step (not yet taken) is xenoncat's full radix scatter of
+// The remaining bottleneck is the random slot gather in `emit_bucket` over the
-// the payload into bucket-contiguous storage, turning the inner-loop slot
+// ~1 GiB round buffer — the Equihash memory-hardness floor. A full radix scatter
-// gathers into sequential reads.
+// of the payload into bucket-contiguous storage (to make that gather sequential)
 // was evaluated and loses: the wide-slot scatter costs more than the gather it
 // saves on the early, wide rounds.
 // ---------------------------------------------------------------------------
 /// 24-bit collision blocks carried by a round-0 entry (= 8 for 192,7).
@@ -164,36 +172,53 @@ const LOW_BUCKETS: usize = 1 << LOW_BITS;
 /// Mask isolating the low resolved bits.
 const LOW_MASK: u32 = (LOW_BUCKETS - 1) as u32;
-/// Number of u32 words in a padded entry slot (32 bytes = one AVX2 register).
+/// Number of u32 words in a round-0 entry slot (32 bytes = one AVX2 register).
-/// Every round stores its residual hash words in a fixed 8-word slot so the XOR
+/// Round 0 stores all eight 24-bit blocks; the XOR that produces a child is a
-/// that produces a child is a single 256-bit load/xor/permute/store and every
+/// single 256-bit load/xor/permute — xenoncat's packed-slot trick (and the same
-/// slot access is naturally aligned — xenoncat's packed-slot trick (and the same
+/// `uint4`-aligned-slot idea the CUDA backend uses), on the CPU. The leading
-/// `uint4`-aligned-slot idea the CUDA backend already uses), on the CPU. The
+/// collision word lives in lane 0 and is mirrored into a dense parallel `keys[]`
-/// leading collision word lives in lane 0 and is mirrored into a dense parallel
+/// array so the histogram passes stream over 4 bytes/entry instead of striding
-/// `keys[]` array so the histogram passes stream over 4 bytes/entry instead of
+/// the slots.
-/// striding the 32-byte slots.
+///
 /// Later rounds use a *narrower* pitch: each collision round consumes the leading
 /// block, so a round-`r` output entry only carries `8 - r` meaningful words. The
 /// solver packs each round's slots at that width (see `collide`'s `w_out`),
 /// cutting the per-round slot-buffer DRAM traffic that bounds the collision rounds.
 /// The XOR producer still loads a full 256-bit register (over-reading up to
 /// `SLOT` words past a narrow tail slot, hence the `SLOT_SLACK` pad) but
 /// masked-stores only the `w_out` live lanes so packed neighbours aren't touched.
 const SLOT: usize = 8;
-/// Scalar child producer: `out[0..8] = (a XOR b)` rotated left one lane; returns
+/// Trailing pad (in u32 words) on every slot buffer so the XOR producer's 256-bit
-/// the child's new leading word (lane 1 of the XOR). Lane 0 of the XOR is the
+/// load over a narrow tail slot stays in bounds. The over-read reaches at most
-/// just-collided block (always zero) and is rotated out.
+/// `(n-1)*w + SLOT` words for pitch `w`, i.e. `SLOT - w ≤ SLOT` words past the
 /// `n*w` payload; `SLOT` words always covers it.
 const SLOT_SLACK: usize = SLOT;
 /// Scalar child producer: writes the `w_out` live words of `(a XOR b)` rotated
 /// left one lane into `out`, and returns the child's new leading word (lane 1 of
 /// the XOR). Lane 0 of the XOR is the just-collided block (always zero) and is
 /// rotated out. `a`/`b` are read a full `SLOT` words wide (the caller pads each
 /// slot buffer by `SLOT_SLACK`); only `out[0..w_out]` is written.
 #[inline]
-unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
+unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
    let mut x = [0u32; SLOT];
    for t in 0..SLOT {
        x[t] = *a.add(t) ^ *b.add(t);
    }
-    for t in 0..SLOT {
+    // out[t] = x[(t + 1) % SLOT]; for t < w_out <= SLOT-1 the modulo is a no-op.
-        *out.add(t) = x[(t + 1) % SLOT];
+    for t in 0..w_out {
        *out.add(t) = x[t + 1];
    }
    x[1]
 }
 /// AVX2 child producer: one `vpxor` + one `vpermd` (rotate the 8 lanes left by
-/// one) + one store. Replaces the per-word scalar XOR loop.
+/// one), then a masked store of the low `w_out` lanes so the packed, `w_out`-
 /// pitched output never clobbers the next slot.
 #[cfg(target_arch = "x86_64")]
 #[target_feature(enable = "avx2")]
-unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
+unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
    use core::arch::x86_64::*;
    let x = _mm256_xor_si256(
        _mm256_loadu_si256(a as *const __m256i),
@@ -201,7 +226,12 @@ unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
    );
    // rotate left by one 32-bit lane: out[i] = x[(i + 1) % 8]
    let p = _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0));
-    _mm256_storeu_si256(out as *mut __m256i, p);
+    // mask = lanes [0, w_out) -> all-ones; maskstore writes only those.
    let mask = _mm256_cmpgt_epi32(
        _mm256_set1_epi32(w_out as i32),
        _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
    );
    _mm256_maskstore_epi32(out as *mut i32, mask, p);
    _mm_cvtsi128_si32(_mm256_castsi256_si128(p)) as u32
 }
@@ -244,38 +274,113 @@ fn repack_index(src: &[u8], dst: &mut [u32]) {
    repack_index_scalar(src, dst);
 }
 /// A raw `*mut u32` shared across rayon workers for the parallel scatter below.
 /// Sound only because the workers write provably disjoint position sets.
 #[derive(Clone, Copy)]
 struct OrderPtr(*mut u32);
 unsafe impl Send for OrderPtr {}
 unsafe impl Sync for OrderPtr {}
 /// Partition the `n` entries into `TOP_BUCKETS` runs by the high `TOP_BITS` of
-/// their (dense) leading block. Returns `(starts, order)`, where partition `v`
+/// their (dense) leading block. Returns `(starts, order, keys_part)`: partition
-/// owns the input indices `order[starts[v]..starts[v + 1]]`. The histogram
+/// `v` owns the input indices `order[starts[v]..starts[v + 1]]`, and `keys_part`
-/// passes stream over `keys[]` (4 bytes/entry) instead of striding the slots.
+/// is the entries' leading words in the same partition-contiguous order
-fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>) {
+/// (`keys_part[p] == keys[order[p]]`). Carrying `keys_part` lets the per-partition
 /// `count_pairs`/`low_group` sweeps read keys sequentially instead of gathering
 /// `keys[order[..]]` over the whole array — at the cost of one extra 4-byte/entry
 /// scatter here, folded into the (already parallel) phase-3 pass.
 ///
 /// Parallel counting sort: the input is split into one contiguous chunk per
 /// rayon worker. Each chunk histograms its slice (phase 1), a small serial pass
 /// turns those into per-chunk base offsets within each bucket's output region
 /// (phase 2), and each chunk scatters its entries into `order`/`keys_part`
 /// (phase 3). Chunk `c`'s bucket-`b` writes land in `[off[c][b], off[c+1][b])`,
 /// disjoint from every other chunk and bucket, so the concurrent writes never
 /// alias. Entries within a bucket end up chunk-major rather than index-major;
 /// that reordering is immaterial — `count_pairs`/`low_group` depend only on the
 /// multiset of low keys, and final solutions are canonicalised, de-duplicated,
 /// and verified.
 fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>, Vec<u32>) {
    let mut starts = vec![0u32; TOP_BUCKETS + 1];
    for k in 0..n {
        starts[(keys[k] >> LOW_BITS) as usize + 1] += 1;
    }
    for i in 0..TOP_BUCKETS {
        starts[i + 1] += starts[i];
    }
    let mut order = vec![0u32; n];
-    let mut cur = starts.clone(); // small: TOP_BUCKETS + 1 entries
+    let mut keys_part = vec![0u32; n];
-    for k in 0..n {
+    if n == 0 {
-        let b = (keys[k] >> LOW_BITS) as usize;
+        return (starts, order, keys_part);
-        order[cur[b] as usize] = k as u32;
+    }
    let nthreads = rayon::current_num_threads().max(1);
    let chunk = n.div_ceil(nthreads);
    let nchunks = n.div_ceil(chunk);
    // Phase 1 (parallel): per-chunk top-bucket histograms.
    let local_hists: Vec<Vec<u32>> = (0..nchunks)
        .into_par_iter()
        .map(|c| {
            let lo = c * chunk;
            let hi = ((c + 1) * chunk).min(n);
            let mut h = vec![0u32; TOP_BUCKETS];
            for &key in &keys[lo..hi] {
                h[(key >> LOW_BITS) as usize] += 1;
            }
            h
        })
        .collect();
    // Phase 2 (serial, nchunks * TOP_BUCKETS work): bucket starts, then each
    // chunk's per-bucket base offset within its bucket's output region.
    let mut totals = vec![0u32; TOP_BUCKETS];
    for h in &local_hists {
        for b in 0..TOP_BUCKETS {
            totals[b] += h[b];
        }
    }
    for b in 0..TOP_BUCKETS {
        starts[b + 1] = starts[b] + totals[b];
    }
    let mut offsets = vec![vec![0u32; TOP_BUCKETS]; nchunks];
    let mut running = starts[..TOP_BUCKETS].to_vec(); // running[b] starts at starts[b]
    for c in 0..nchunks {
        for b in 0..TOP_BUCKETS {
            offsets[c][b] = running[b];
            running[b] += local_hists[c][b];
        }
    }
    // Phase 3 (parallel): each chunk scatters into its disjoint sub-ranges.
    let optr = OrderPtr(order.as_mut_ptr());
    let kptr = OrderPtr(keys_part.as_mut_ptr());
    offsets.into_par_iter().enumerate().for_each(|(c, mut cur)| {
        let (optr, kptr) = (optr, kptr); // capture whole (Sync) wrappers
        let lo = c * chunk;
        let hi = ((c + 1) * chunk).min(n);
        let (obase, kbase) = (optr.0, kptr.0);
        for k in lo..hi {
            let key = keys[k];
            let b = (key >> LOW_BITS) as usize;
            let pos = cur[b] as usize;
            // SAFETY: `pos` ranges over `[off[c][b], off[c+1][b])`, a range owned
            // exclusively by chunk `c` and within `order`/`keys_part` bounds.
            unsafe {
                *obase.add(pos) = k as u32;
                *kbase.add(pos) = key;
            }
            cur[b] += 1;
        }
-    (starts, order)
+    });
    (starts, order, keys_part)
 }
 /// Count the colliding pairs a partition will emit, from the low-bit histogram
-/// alone (no reordering). `clamp` caps each exact-collision group, matching the
+/// of its (partition-contiguous) leading keys `keys_run`. `clamp` caps each
-/// emit pass so the output offsets line up. `hist` is reusable `LOW_BUCKETS`
+/// exact-collision group, matching the emit pass so the output offsets line up.
-/// scratch.
+/// `hist` is reusable `LOW_BUCKETS` scratch.
-fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
+fn count_pairs(keys_run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
    for h in hist.iter_mut() {
        *h = 0;
    }
-    for &k in run {
+    for &key in keys_run {
-        hist[(keys[k as usize] & LOW_MASK) as usize] += 1;
+        hist[(key & LOW_MASK) as usize] += 1;
    }
    let mut pairs = 0usize;
    for i in 0..LOW_BUCKETS {
@@ -285,31 +390,44 @@ fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usi
    pairs
 }
-/// Within one partition, group `run`'s entries by the low bits of their leading
+/// Within one partition, group its entries by the low bits of their leading
-/// block, writing the grouped indices into `sorted`. `hist` is reusable
+/// block. Inputs are the partition-contiguous slabs `keys_run` (leading keys) and
-/// `LOW_BUCKETS + 1` scratch. After this call `sorted` lists the run's indices
+/// `order_run` (matching global indices). Outputs, in low-key-sorted order:
-/// with equal low keys contiguous, so callers recover each exact-collision
+/// `sorted` (the global indices, for the emit slot gather + back-refs) and
-/// group by walking adjacent equal keys.
+/// `keys_sorted` (the leading keys, so the emit group walk streams a dense local
-fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>) {
+/// array instead of gathering `keys[sorted[i]]`). `hist` is reusable
-    let m = run.len();
+/// `LOW_BUCKETS + 1` scratch. Both reads are sequential over the slabs.
 fn low_group(
    keys_run: &[u32],
    order_run: &[u32],
    hist: &mut [u32],
    sorted: &mut Vec<u32>,
    keys_sorted: &mut Vec<u32>,
 ) {
    let m = order_run.len();
    sorted.clear();
    keys_sorted.clear();
    if m == 0 {
        return;
    }
    for h in hist.iter_mut() {
        *h = 0;
    }
-    for &k in run {
+    for &key in keys_run {
-        hist[(keys[k as usize] & LOW_MASK) as usize + 1] += 1;
+        hist[(key & LOW_MASK) as usize + 1] += 1;
    }
    for i in 0..LOW_BUCKETS {
        hist[i + 1] += hist[i];
    }
    sorted.resize(m, 0);
    keys_sorted.resize(m, 0);
    // hist[low] now holds the run-start offset; reuse it as the live cursor.
-    for &k in run {
+    for i in 0..m {
-        let low = (keys[k as usize] & LOW_MASK) as usize;
+        let key = keys_run[i];
-        sorted[hist[low] as usize] = k;
+        let low = (key & LOW_MASK) as usize;
        let pos = hist[low] as usize;
        sorted[pos] = order_run[i];
        keys_sorted[pos] = key;
        hist[low] += 1;
    }
 }
@@ -320,26 +438,42 @@ fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>)
 /// Monomorphised over the XOR producer so the AVX2 intrinsics inline cleanly
 /// inside a `target_feature` wrapper while sharing one source of truth.
 macro_rules! emit_bucket_body {
-    ($keys:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $xor:path) => {{
+    ($keys_sorted:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $w_in:expr, $w_out:expr, $prefetch:expr, $xor:path) => {{
        let s = $sorted;
        let ks = $keys_sorted; // leading keys in `s`-order; group walk streams it
        let m = s.len();
        let w_in = $w_in;
        let w_out = $w_out;
        let mut w = 0usize;
        let mut i = 0;
        while i < m {
-            let key = $keys[s[i] as usize] & LOW_MASK;
+            let key = ks[i] & LOW_MASK;
            let mut j = i + 1;
-            while j < m && ($keys[s[j] as usize] & LOW_MASK) == key {
+            while j < m && (ks[j] & LOW_MASK) == key {
                j += 1;
            }
            let hi = j.min(i.saturating_add($clamp));
            // The pair loops below read `slots[s[a]]` / `slots[s[b]]` at random
            // global positions in the ~1 GB slot arena — each a likely cache
            // miss. Prefetch this group's member slots up front so the misses
            // overlap with the (L1-resident) pair XORs that follow.
            #[cfg(target_arch = "x86_64")]
            if $prefetch {
                for a in i..hi {
                    core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T0 }>(
                        $slots.as_ptr().add(s[a] as usize * w_in) as *const i8,
                    );
                }
            }
            for a in i..hi {
                let l = s[a] as usize;
                for b in (a + 1)..hi {
                    let mr = s[b] as usize;
                    let nk = $xor(
-                        $sout.as_mut_ptr().add(w * SLOT),
+                        $sout.as_mut_ptr().add(w * w_out),
-                        $slots.as_ptr().add(l * SLOT),
+                        $slots.as_ptr().add(l * w_in),
-                        $slots.as_ptr().add(mr * SLOT),
+                        $slots.as_ptr().add(mr * w_in),
                        w_out,
                    );
                    $kout[w] = nk;
                    $pout[w] = ((l as u64) << 32) | mr as u64;
@@ -353,69 +487,101 @@ macro_rules! emit_bucket_body {
 }
 unsafe fn emit_bucket_scalar(
-    keys: &[u32],
+    keys_sorted: &[u32],
    slots: &[u32],
    sorted: &[u32],
    kout: &mut [u32],
    sout: &mut [u32],
    pout: &mut [u64],
    clamp: usize,
    w_in: usize,
    w_out: usize,
    prefetch: bool,
 ) -> usize {
-    emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_scalar)
+    emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_scalar)
 }
 #[cfg(target_arch = "x86_64")]
 #[target_feature(enable = "avx2")]
 unsafe fn emit_bucket_avx2(
-    keys: &[u32],
+    keys_sorted: &[u32],
    slots: &[u32],
    sorted: &[u32],
    kout: &mut [u32],
    sout: &mut [u32],
    pout: &mut [u64],
    clamp: usize,
    w_in: usize,
    w_out: usize,
    prefetch: bool,
 ) -> usize {
-    emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_avx2)
+    emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_avx2)
 }
 /// Emit a partition's children, dispatching to the AVX2 producer when available.
 /// `w_in`/`w_out` are the input/output slot pitches (`w_out == w_in - 1`).
 /// `prefetch` software-prefetches each group's randomly-gathered member slots.
 unsafe fn emit_bucket(
-    keys: &[u32],
+    keys_sorted: &[u32],
    slots: &[u32],
    sorted: &[u32],
    kout: &mut [u32],
    sout: &mut [u32],
    pout: &mut [u64],
    clamp: usize,
    w_in: usize,
    w_out: usize,
    prefetch: bool,
 ) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
-            return emit_bucket_avx2(keys, slots, sorted, kout, sout, pout, clamp);
+            return emit_bucket_avx2(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch);
        }
    }
-    emit_bucket_scalar(keys, slots, sorted, kout, sout, pout, clamp)
+    emit_bucket_scalar(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch)
 }
 /// Group `n` entries by their leading block, then emit one child per colliding
-/// pair: the XOR of the residual blocks (rotated into a fresh 8-word slot) plus
+/// pair: the XOR of the residual blocks (rotated into a fresh slot) plus a packed
-/// a packed `(l << 32) | mr` back-reference. Two passes — count, then emit
+/// `(l << 32) | mr` back-reference. Two passes — count, then emit directly into
-/// directly into one pre-sized arena — so there is no per-partition allocation
+/// one pre-sized arena — so there is no per-partition allocation or final
-/// or final concatenation copy. Returns `(keys_out, slots_out, parents)`.
+/// concatenation copy. Returns `(keys_out, slots_out, parents)`.
-fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
+///
-    let (starts, order) = partition_top(keys, n);
+/// `w_in` is the input slot pitch; the round consumes one block, so the output is
 /// packed at `w_out = w_in - 1` words/slot. `slots_out` carries a `SLOT_SLACK`
 /// trailing pad so the next round's 256-bit over-read stays in bounds.
 fn collide(
    keys: &[u32],
    slots: &[u32],
    n: usize,
    clamp: usize,
    w_in: usize,
 ) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
    let w_out = w_in - 1;
    // Sub-phase timing, gated on `EQ_PROFILE`. Prints partition / count / alloc /
    // emit splits so we can see which part of the round dominates.
    let prof = std::env::var_os("EQ_PROFILE").is_some();
    // Software-prefetch the emit gather (on by default; `EQ_NO_PREFETCH` disables
    // it for A/B benchmarking). Read once, not per group.
    let do_prefetch = std::env::var_os("EQ_NO_PREFETCH").is_none();
    let t0 = std::time::Instant::now();
-    // Pass 1: per-partition child counts (histogram-derived, no reordering).
+    let (starts, order, keys_part) = partition_top(keys, n);
    let t_part = std::time::Instant::now();
    // Pass 1: per-partition child counts from each partition's contiguous keys.
    let counts: Vec<usize> = (0..TOP_BUCKETS)
        .into_par_iter()
        .map_init(
            || vec![0u32; LOW_BUCKETS],
            |hist, v| {
-                let run = &order[starts[v] as usize..starts[v + 1] as usize];
+                let keys_run = &keys_part[starts[v] as usize..starts[v + 1] as usize];
-                count_pairs(keys, run, hist, clamp)
+                count_pairs(keys_run, hist, clamp)
            },
        )
        .collect();
    let t_count = std::time::Instant::now();
    let mut out_starts = vec![0usize; TOP_BUCKETS + 1];
    for v in 0..TOP_BUCKETS {
@@ -423,7 +589,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
    }
    let total = out_starts[TOP_BUCKETS];
    let mut keys_out = vec![0u32; total];
-    let mut slots_out = vec![0u32; total * SLOT];
+    // Packed at `w_out` words/slot, plus a slack pad for the next round's over-read.
    let mut slots_out = vec![0u32; total * w_out + SLOT_SLACK];
    let mut parents = vec![0u64; total];
    // Carve the output arena into disjoint per-partition sub-slices so workers
@@ -435,7 +602,7 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
        let (mut kr, mut sr, mut pr) = (&mut keys_out[..], &mut slots_out[..], &mut parents[..]);
        for &c in &counts {
            let (kh, kt) = kr.split_at_mut(c);
-            let (sh, st) = sr.split_at_mut(c * SLOT);
+            let (sh, st) = sr.split_at_mut(c * w_out);
            let (ph, pt) = pr.split_at_mut(c);
            kparts.push(kh);
            sparts.push(sh);
@@ -446,6 +613,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
        }
    }
    let t_alloc = std::time::Instant::now();
    // Pass 2: group each partition and emit its colliding pairs in place.
    kparts
        .into_par_iter()
@@ -453,37 +622,53 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
        .zip(pparts)
        .enumerate()
        .for_each(|(v, ((kout, sout), pout))| {
-            let run = &order[starts[v] as usize..starts[v + 1] as usize];
+            let lo = starts[v] as usize;
            let hi = starts[v + 1] as usize;
            let mut hist = vec![0u32; LOW_BUCKETS + 1];
            let mut sorted = Vec::new();
-            low_group(keys, run, &mut hist, &mut sorted);
+            let mut keys_sorted = Vec::new();
-            let w = unsafe { emit_bucket(keys, slots, &sorted, kout, sout, pout, clamp) };
+            low_group(&keys_part[lo..hi], &order[lo..hi], &mut hist, &mut sorted, &mut keys_sorted);
            let w = unsafe {
                emit_bucket(&keys_sorted, slots, &sorted, kout, sout, pout, clamp, w_in, w_out, do_prefetch)
            };
            debug_assert_eq!(w, kout.len());
        });
    if prof {
        let ms = |a: std::time::Instant, b: std::time::Instant| (b - a).as_secs_f64() * 1000.0;
        eprintln!(
            "    [collide n={n}] partition {:6.1}  count {:6.1}  alloc {:6.1}  emit(group+xor) {:6.1} ms",
            ms(t0, t_part),
            ms(t_part, t_count),
            ms(t_count, t_alloc),
            ms(t_alloc, std::time::Instant::now()),
        );
    }
    (keys_out, slots_out, parents)
 }
-/// Final round (slots hold `[w0, w1, …]`): among entries sharing leading block
+/// Final round (slots hold `[w0, w1, …]` at pitch `w_in`): among entries sharing
-/// `w0`, a pair whose `w1` also matches XORs the last two blocks to zero — a
+/// leading block `w0`, a pair whose `w1` also matches XORs the last two blocks to
-/// candidate. Returns the `(l, mr)` parents of each candidate.
+/// zero — a candidate. Returns the `(l, mr)` parents of each candidate.
-fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u32, u32)> {
+fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize, w_in: usize) -> Vec<(u32, u32)> {
-    let (starts, order) = partition_top(keys, n);
+    let (starts, order, keys_part) = partition_top(keys, n);
    (0..TOP_BUCKETS)
        .into_par_iter()
        .map_init(
-            || (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new()),
+            || (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new(), Vec::<u32>::new()),
-            |(hist, sorted), v| {
+            |(hist, sorted, keys_sorted), v| {
-                let run = &order[starts[v] as usize..starts[v + 1] as usize];
+                let lo = starts[v] as usize;
-                low_group(keys, run, hist, sorted);
+                let hi = starts[v + 1] as usize;
                low_group(&keys_part[lo..hi], &order[lo..hi], hist, sorted, keys_sorted);
                let m = sorted.len();
                let mut local = Vec::new();
                let mut i = 0;
                while i < m {
-                    let key = keys[sorted[i] as usize] & LOW_MASK;
+                    let key = keys_sorted[i] & LOW_MASK;
                    let mut j = i + 1;
-                    while j < m && (keys[sorted[j] as usize] & LOW_MASK) == key {
+                    while j < m && (keys_sorted[j] & LOW_MASK) == key {
                        j += 1;
                    }
                    let hi = j.min(i.saturating_add(clamp));
@@ -491,7 +676,7 @@ fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u3
                        let l = sorted[a] as usize;
                        for b in (a + 1)..hi {
                            let mr = sorted[b] as usize;
-                            if slots[l * SLOT + 1] == slots[mr * SLOT + 1] {
+                            if slots[l * w_in + 1] == slots[mr * w_in + 1] {
                                local.push((l as u32, mr as u32));
                            }
                        }
@@ -532,6 +717,22 @@ pub fn solve(header: &[u8]) -> Vec<Vec<u32>> {
 pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
    let clamp = clamp.unwrap_or(usize::MAX);
    // Optional per-phase timing, gated on `EQ_PROFILE` (any value). Zero cost
    // when unset — the only overhead is one env lookup. Each `phase(label, n)`
    // prints the wall time since the previous call and the live entry count.
    let prof = std::env::var_os("EQ_PROFILE").is_some();
    let mut t_last = std::time::Instant::now();
    let mut phase = |label: &str, n: usize| {
        if prof {
            let now = std::time::Instant::now();
            eprintln!(
                "  [profile] {label:<13} {:8.1} ms  (n={n})",
                (now - t_last).as_secs_f64() * 1000.0
            );
            t_last = now;
        }
    };
    // ---- round 0: hash every index into NBLK0 big-endian 24-bit blocks, stored
    // in padded 8-word slots with the leading block mirrored into `keys`. Entry
    // k corresponds to leaf index k (the gen order), so no leaf table is needed.
@@ -541,18 +742,29 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
    let hasher = BatchHasher::new(header);
    let mut keys: Vec<u32> = vec![0u32; n0];
    let mut slots: Vec<u32> = vec![0u32; n0 * SLOT];
-    let kgroup = 4 * INDICES_PER_HASH_OUTPUT; // eight entries
+    // Hash eight `g` values per chunk via `hash8`, which uses one AVX-512
    // compression where available and falls back to two AVX2 `hash4`s (or scalar)
    // otherwise — so this single path covers every CPU.
    let kgroup = 8 * INDICES_PER_HASH_OUTPUT; // sixteen entries
    let sgroup = kgroup * SLOT;
-    debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 4-g groups");
+    debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 8-g groups");
    // `EQ_NO_AVX512` forces the AVX2 fallback (two hash4) for A/B benchmarking.
    let use_avx512 = std::env::var_os("EQ_NO_AVX512").is_none();
    slots
        .par_chunks_mut(sgroup)
        .zip(keys.par_chunks_mut(kgroup))
        .enumerate()
        .for_each(|(c, (schunk, kchunk))| {
-            let g0 = (c * 4) as u32;
+            let g0 = (c * 8) as u32;
-            let mut hs = [[0u8; HASH_OUTPUT]; 4];
+            let mut hs = [[0u8; HASH_OUTPUT]; 8];
-            hasher.hash4(g0, &mut hs);
+            if use_avx512 {
-            for j in 0..4 {
+                hasher.hash8(g0, &mut hs);
            } else {
                let (lo, hi) = hs.split_at_mut(4);
                hasher.hash4(g0, (&mut lo[..4]).try_into().unwrap());
                hasher.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
            }
            for j in 0..8 {
                for i in 0..INDICES_PER_HASH_OUTPUT {
                    let e = j * INDICES_PER_HASH_OUTPUT + i;
                    let src = &hs[j][i * HASH_BYTES..i * HASH_BYTES + HASH_BYTES];
@@ -567,20 +779,27 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
    // parents[t] (t = 0..K-2) maps a round-(t+1) entry to its two round-t parents.
    // `keys`/`slots` ping-pong between rounds (the previous buffers are freed as
    // the new ones replace them).
    phase("round0-hash", n0);
    let mut parents: Vec<Vec<u64>> = Vec::with_capacity(K - 1);
    let mut n = n0;
-    for _ in 0..(K - 1) {
+    // Round-0 slots carry all SLOT words; each collision round consumes the
-        let (ok, os, op) = collide(&keys, &slots, n, clamp);
+    // leading block, so the next round's slots are one word narrower.
    let mut width = SLOT;
    for r in 0..(K - 1) {
        let (ok, os, op) = collide(&keys, &slots, n, clamp, width);
        n = op.len();
        parents.push(op);
        keys = ok;
        slots = os;
        width -= 1;
        phase(&format!("collide r{}", r + 1), n);
        if n == 0 {
            return Vec::new();
        }
    }
-    let candidates = collide_final(&keys, &slots, n, clamp);
+    let candidates = collide_final(&keys, &slots, n, clamp, width);
    phase("collide-final", candidates.len());
    if candidates.is_empty() {
        return Vec::new();
    }
@@ -605,8 +824,11 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
        // refs are now round-0 indices == leaf indices.
        recovered.extend_from_slice(&refs);
    }
    phase("recover", recovered.len() / SOLUTION_INDICES);
-    filter_candidates(&base_state(header), &recovered)
+    let result = filter_candidates(&base_state(header), &recovered);
    phase("filter+verify", result.len());
    result
 }
 /// Re-order recovered leaf indices into the canonical solution ordering: at
@@ -741,26 +963,34 @@ mod tests {
        assert_eq!(a[0], (src[0] as u32) << 16 | (src[1] as u32) << 8 | src[2] as u32);
    }
-    // AVX2 XOR-child (xor + rotate-left-one-lane) must match the scalar version.
+    // The XOR-child producers (scalar + AVX2) must agree for every output width,
    // write exactly `w_out` words (rotate-left-one of the lane-wise XOR), and
    // leave the rest of the packed buffer untouched (the AVX2 path masked-stores).
    #[test]
    fn xor_child_matches_scalar() {
        const SENT: u32 = 0xDEAD_BEEF;
        let pa: [u32; SLOT] = [9, 8, 7, 6, 5, 4, 3, 2];
        let pb: [u32; SLOT] = [1, 2, 3, 4, 5, 6, 7, 8];
        let mut o1 = [0u32; SLOT];
        let mut o2 = [0u32; SLOT];
        let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
        // Reference: rotate-left-one of the lane-wise XOR; new key = lane 1.
        let x: Vec<u32> = (0..SLOT).map(|i| pa[i] ^ pb[i]).collect();
-        let expect: Vec<u32> = (0..SLOT).map(|i| x[(i + 1) % SLOT]).collect();
+        for w_out in 1..SLOT {
-        assert_eq!(&o1[..], &expect[..]);
+            let mut o1 = [SENT; SLOT];
            let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
            for t in 0..w_out {
                assert_eq!(o1[t], x[(t + 1) % SLOT], "scalar word {t} (w_out={w_out})");
            }
            for t in w_out..SLOT {
                assert_eq!(o1[t], SENT, "scalar wrote past w_out at {t} (w_out={w_out})");
            }
            assert_eq!(k1, x[1]);
            #[cfg(target_arch = "x86_64")]
            if is_x86_feature_detected!("avx2") {
-            let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
+                let mut o2 = [SENT; SLOT];
-            assert_eq!(o1, o2, "avx2 xor_child != scalar");
+                let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
                assert_eq!(o1, o2, "avx2 xor_child != scalar (w_out={w_out})");
                assert_eq!(k1, k2);
            }
        }
    }
    // Clamped solve over a fixed header must find valid solutions (Equihash
    // 192,7 yields ~2 per nonce). The clamp bounds the naive algorithm's
@@ -780,4 +1010,97 @@ mod tests {
            }
        }
    }
    /// A deterministic dense 140-byte header for benchmarking, filled by an
    /// xorshift64* stream keyed on `seed` (the same scheme as `main::pseudo_header`)
    /// so each nonce exercises a realistic, fully-populated header rather than the
    /// degenerate all-`0x42` one.
    #[cfg(test)]
    fn bench_header(seed: u64) -> Vec<u8> {
        let mut header = vec![0u8; HEADER_LEN];
        let mut x = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
        for b in header.iter_mut() {
            x ^= x >> 12;
            x ^= x << 25;
            x ^= x >> 27;
            *b = (x.wrapping_mul(0x2545F4914F6CDD1D) >> 33) as u8;
        }
        header
    }
    /// Steady-state throughput baseline for the CPU solver. Ignored by default;
    /// run it in release (the only meaningful configuration — `debug_assert`s in
    /// the collision hot path are off and `opt-level=3`/LTO is on) with output
    /// visible:
    ///
    /// ```text
    /// cargo test --release full_solve_baseline -- --ignored --nocapture
    /// ```
    ///
    /// Tunables via env:
    ///   * `EQ_BENCH_ITERS`  — timed solves per clamp (default 8).
    ///   * `EQ_BENCH_CLAMPS` — comma-separated clamp sweep (default `32`, the
    ///     value the miner and `--selftest` use).
    ///
    /// Each timed iteration uses a distinct nonce so the figures average over
    /// per-header variance; one warm-up solve per clamp (pages in the ~GB of
    /// round buffers, warms caches) is excluded. This measures one solve at a
    /// time across the whole rayon pool — i.e. solver latency and its memory
    /// behaviour, which is exactly what the planned radix-scatter change targets
    /// — not the aggregate multi-group mining rate `--bench` reports.
    #[test]
    #[ignore]
    fn full_solve_baseline() {
        use std::time::Instant;
        let iters: usize = std::env::var("EQ_BENCH_ITERS")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(8);
        let clamps: Vec<usize> = std::env::var("EQ_BENCH_CLAMPS")
            .ok()
            .map(|v| v.split(',').filter_map(|s| s.trim().parse().ok()).collect())
            .filter(|v: &Vec<usize>| !v.is_empty())
            .unwrap_or_else(|| vec![32]);
        eprintln!(
            "CPU solver baseline — {iters} timed solve(s)/clamp, clamps={clamps:?}, \
             rayon threads={}",
            rayon::current_num_threads()
        );
        for &clamp in &clamps {
            // Warm-up (excluded): fault in buffers, warm caches.
            let _ = solve_with(&bench_header(0), Some(clamp));
            let mut times = Vec::with_capacity(iters);
            let mut total_sols = 0usize;
            for i in 0..iters {
                let header = bench_header(i as u64 + 1);
                let base = base_state(&header);
                let t = Instant::now();
                let sols = solve_with(&header, Some(clamp));
                times.push(t.elapsed().as_secs_f64());
                for s in &sols {
                    assert!(is_valid_solution(&base, s), "solver returned an invalid solution");
                }
                total_sols += sols.len();
            }
            times.sort_by(|a, b| a.partial_cmp(b).unwrap());
            let n = times.len();
            let sum: f64 = times.iter().sum();
            let mean = 1000.0 * sum / n as f64;
            let median = 1000.0 * times[n / 2];
            let min = 1000.0 * times[0];
            let max = 1000.0 * times[n - 1];
            let solves_per_s = n as f64 / sum;
            eprintln!(
                "clamp={clamp:>3}: {solves_per_s:6.2} solve/s | mean {mean:6.0} ms  \
                 median {median:6.0} ms  min {min:6.0} ms  max {max:6.0} ms | \
                 {total_sols} sol over {n} solve(s) ({:.2}/solve)",
                total_sols as f64 / n as f64,
            );
        }
    }
 }
@@ -411,6 +411,9 @@ impl LegacySolver {
 /// (`equihash.cl`) everywhere else. Forceable with `ZCL_OPENCL_KERNEL=amd|legacy`.
 pub struct GpuSolver {
    inner: SolverInner,
    /// Per-card telemetry handle (AMD amdgpu sysfs / NVML), matched to the device
    /// by PCI bus id. `None` when no telemetry backend matches.
    tuner: Option<Box<dyn crate::gpu_tune::GpuTuner>>,
 }
 enum SolverInner {
@@ -423,13 +426,16 @@ impl GpuSolver {
    /// device vendor (AMD → `equihash192_7.cl`).
    pub fn new(device_index: usize) -> Result<Self> {
        let (platform, device) = pick_device(device_index)?;
        // Resolve a telemetry handle (AMD sysfs / NVML) from the device's PCI bus
        // before `device` is consumed by the inner solver.
        let tuner = device_pci_bus_id(&device).and_then(|bus| crate::gpu_tune::open(&bus));
        let inner = if use_amd_kernel(&device) {
            log::info!("OpenCL: AMD device — using the equihash192_7 kernel");
            SolverInner::Amd(crate::gpu_amd::AmdSolver::new(platform, device)?)
        } else {
            SolverInner::Legacy(LegacySolver::new(platform, device)?)
        };
-        Ok(Self { inner })
+        Ok(Self { inner, tuner })
    }
    /// This device's product name, if available.
@@ -440,6 +446,34 @@ impl GpuSolver {
        }
    }
    /// Current board power draw in watts (telemetry handle), if available.
    pub fn power_watts(&self) -> Option<f64> {
        self.tuner.as_ref().and_then(|t| t.watts())
    }
    /// Current GPU temperature in °C, if available.
    pub fn temperature_c(&self) -> Option<u32> {
        self.tuner.as_ref().and_then(|t| t.temperature_c())
    }
    /// Currently enforced power limit in watts, if available.
    pub fn current_power_limit_w(&self) -> Option<u32> {
        self.tuner.as_ref().and_then(|t| t.current_power_limit_w())
    }
    /// (min, max) settable power limit in watts, if available.
    pub fn power_limit_range_w(&self) -> Option<(u32, u32)> {
        self.tuner.as_ref().and_then(|t| t.power_limit_range_w())
    }
    /// Current (core, memory) clocks in MHz, each `None` if unavailable.
    pub fn current_clocks_mhz(&self) -> (Option<u32>, Option<u32>) {
        match &self.tuner {
            Some(t) => (t.core_clock_mhz(), t.mem_clock_mhz()),
            None => (None, None),
        }
    }
    /// Solve the puzzle for `header` (140 bytes).
    pub fn solve(&self, header: &[u8]) -> Result<Vec<Vec<u32>>> {
        match &self.inner {
@@ -475,11 +509,32 @@ impl GpuSolver {
    pub fn hash_all(&self, header: &[u8]) -> Result<Vec<u8>> {
        match &self.inner {
            SolverInner::Legacy(s) => s.hash_all(header),
-            SolverInner::Amd(_) => {
+            SolverInner::Amd(_) => Err(anyhow!("hash_all is not supported by the AMD kernel")),
                Err(anyhow!("hash_all is not supported by the AMD kernel"))
        }
    }
 }
 /// The device's PCI address as `"DDDD:BB:DD.F"` (lowercase), for matching the
 /// physical card to a telemetry backend. Prefers `cl_khr_pci_bus_info`; falls
 /// back to `cl_device_topology_amd`. `None` if the device exposes neither.
 fn device_pci_bus_id(device: &ocl::Device) -> Option<String> {
    const CL_DEVICE_PCI_BUS_INFO_KHR: u32 = 0x10F2;
    const CL_DEVICE_TOPOLOGY_AMD: u32 = 0x4037;
    // cl_device_pci_bus_info_khr = { u32 pci_domain, pci_bus, pci_device, pci_function }.
    if let Ok(b) = device.info_raw(CL_DEVICE_PCI_BUS_INFO_KHR) {
        if b.len() >= 16 {
            let rd = |i: usize| u32::from_ne_bytes([b[i], b[i + 1], b[i + 2], b[i + 3]]);
            return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", rd(0), rd(4), rd(8), rd(12)));
        }
    }
    // cl_device_topology_amd (PCIE branch): 24-byte struct, bus/device/function
    // are the last three bytes; domain isn't exposed (assume 0000).
    if let Ok(b) = device.info_raw(CL_DEVICE_TOPOLOGY_AMD) {
        if b.len() >= 24 {
            return Some(format!("0000:{:02x}:{:02x}.{:x}", b[21], b[22], b[23]));
        }
    }
    None
 }
 /// Decide whether to drive `device` with the AMD `equihash192_7.cl` kernel.
@@ -501,19 +556,81 @@ fn use_amd_kernel(device: &ocl::Device) -> bool {
    }
 }
 /// A cross-platform key identifying the physical GPU, as a canonical PCI address
 /// `"DDDD:BB:DD.F"`. Vendor runtimes (ROCm/NVIDIA) expose `cl_khr_pci_bus_info`;
 /// rusticl/Mesa doesn't, but its `cl_khr_device_uuid` *encodes* the PCI address
 /// ({u32 domain LE, u8 bus, u8 device, u8 function, ...}), so the same physical
 /// card yields the same key on both platforms. `None` if neither is available
 /// (then the device is never deduped — safe).
 fn device_dedup_key(device: &ocl::Device) -> Option<String> {
    if let Some(pci) = device_pci_bus_id(device) {
        return Some(pci);
    }
    const CL_DEVICE_UUID_KHR: u32 = 0x106A;
    if let Ok(b) = device.info_raw(CL_DEVICE_UUID_KHR) {
        if b.len() >= 7 {
            let domain = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
            return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", domain, b[4], b[5], b[6]));
        }
    }
    None
 }
 /// Lower = preferred when the same physical GPU is exposed by multiple OpenCL
 /// platforms. De-prioritise the Mesa Gallium drivers (rusticl/clover) relative to
 /// the vendor runtimes (ROCm / NVIDIA / Intel), which are faster and complete.
 fn platform_rank(p: &ocl::Platform) -> u8 {
    let name = p.name().unwrap_or_default().to_ascii_lowercase();
    if name.contains("rusticl") || name.contains("clover") || name.contains("mesa") {
        1
    } else {
        0
    }
 }
 /// All usable OpenCL `(platform, device)` pairs in a stable flat order, with each
 /// physical GPU de-duplicated across platforms by PCI bus id — a card exposed by
 /// both ROCm and rusticl appears once (the vendor runtime wins over Mesa), so
 /// mining "all" doesn't run the same card twice. This is the single source of
 /// truth for the flat device index used by `--devices`, `--list-devices`, and
 /// [`pick_device`]. Devices without a PCI bus id (CPU / PoCL) are never deduped.
 fn enumerate_devices() -> Vec<(ocl::Platform, ocl::Device)> {
    use ocl::{Device, Platform};
    let mut out: Vec<(Platform, Device)> = Vec::new();
    let mut by_pci: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
    for platform in Platform::list() {
        for device in Device::list_all(platform).unwrap_or_default() {
            match device_dedup_key(&device) {
                Some(pci) => match by_pci.get(&pci).copied() {
                    // Same physical GPU already listed: keep the preferred platform.
                    Some(existing) => {
                        if platform_rank(&platform) < platform_rank(&out[existing].0) {
                            out[existing] = (platform, device);
                        }
                    }
                    None => {
                        by_pci.insert(pci, out.len());
                        out.push((platform, device));
                    }
                },
                None => out.push((platform, device)), // no PCI id → can't dedup
            }
        }
    }
    out
 }
 /// List `(platform, device)` names so the user can choose `--device`.
 pub fn list_devices() -> Result<Vec<String>> {
-    use ocl::{Device, Platform};
+    let names = enumerate_devices()
-    let mut names = Vec::new();
+        .into_iter()
-    let mut idx = 0;
+        .enumerate()
-    for platform in Platform::list() {
+        .map(|(idx, (platform, device))| {
            let pname = platform.name().unwrap_or_else(|_| "?".into());
        for device in Device::list_all(platform).unwrap_or_default() {
            let dname = device.name().unwrap_or_else(|_| "?".into());
-            names.push(format!("[{idx}] {pname} / {dname}"));
+            format!("[{idx}] {pname} / {dname}")
-            idx += 1;
+        })
-        }
+        .collect();
    }
    Ok(names)
 }
@@ -522,18 +639,15 @@ pub fn list_devices() -> Result<Vec<String>> {
 /// hand NVIDIA cards to CUDA (and mine only the non-NVIDIA OpenCL devices).
 pub fn device_is_nvidia() -> Vec<bool> {
    use ocl::enums::{DeviceInfo, DeviceInfoResult};
-    use ocl::{Device, Platform};
+    enumerate_devices()
-    let mut out = Vec::new();
+        .into_iter()
-    for platform in Platform::list() {
+        .map(|(_, device)| {
-        for device in Device::list_all(platform).unwrap_or_default() {
+            matches!(
            let is_nv = matches!(
                device.info(DeviceInfo::Vendor),
                Ok(DeviceInfoResult::Vendor(v)) if v.to_ascii_lowercase().contains("nvidia")
-            );
+            )
-            out.push(is_nv);
+        })
-        }
+        .collect()
    }
    out
 }
 /// The flat OpenCL device index of the first CPU-type device (e.g. PoCL), if any.
@@ -541,36 +655,20 @@ pub fn device_is_nvidia() -> Vec<bool> {
 /// [`list_devices`] / `--devices`.
 pub fn cpu_device_index() -> Option<usize> {
    use ocl::enums::{DeviceInfo, DeviceInfoResult};
-    use ocl::{Device, Platform};
+    enumerate_devices().into_iter().position(|(_, device)| {
-    let mut idx = 0;
+        matches!(
    for platform in Platform::list() {
        for device in Device::list_all(platform).unwrap_or_default() {
            let is_cpu = matches!(
            device.info(DeviceInfo::Type).ok(),
            Some(DeviceInfoResult::Type(t)) if t.contains(ocl::flags::DeviceType::CPU)
-            );
+        )
-            if is_cpu {
+    })
                return Some(idx);
            }
            idx += 1;
        }
    }
    None
 }
-/// Resolve a flat device index across all platforms, returning the device along
+/// Resolve a flat device index (into the de-duplicated [`enumerate_devices`]
-/// with the platform it belongs to (needed to build the context against the
+/// list), returning the device along with the platform it belongs to (needed to
-/// right platform).
+/// build the context against the right platform).
-fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
+pub(crate) fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
-    use ocl::{Device, Platform};
+    enumerate_devices()
-    let mut idx = 0;
+        .into_iter()
-    for platform in Platform::list() {
+        .nth(index)
-        for device in Device::list_all(platform).unwrap_or_default() {
+        .ok_or_else(|| anyhow!("no OpenCL device with index {index}"))
            if idx == index {
                return Ok((platform, device));
            }
            idx += 1;
        }
    }
    Err(anyhow!("no OpenCL device with index {index}"))
 }
@@ -128,19 +128,25 @@ pub trait GpuTuner: Send {
 /// Open a control handle for the GPU at `pci_bus_id` (matches the physical card
 /// regardless of CUDA-vs-driver index ordering). `None` if unavailable.
 ///
-/// NVML is the backend on both Linux (`libnvidia-ml`) and Windows (`nvml.dll`);
+/// Tries the NVIDIA backend first (NVML, `libnvidia-ml`/`nvml.dll`), then the AMD
-/// the C API is identical, so the same [`crate::nvml`] code serves both.
+/// backend ([`crate::amd_smi`], Linux amdgpu sysfs). A non-matching bus id makes
 /// each backend return `None`, so probing both is safe on mixed-vendor hosts.
 pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
-    #[cfg(any(unix, windows))]
+    #[cfg(feature = "cuda")]
    {
-        crate::nvml::open(pci_bus_id)
+        if let Some(t) = crate::nvml::open(pci_bus_id) {
            return Some(t);
        }
-    #[cfg(not(any(unix, windows)))]
+    }
    #[cfg(feature = "gpu")]
    {
        if let Some(t) = crate::amd_smi::open(pci_bus_id) {
            return Some(t);
        }
    }
    let _ = pci_bus_id;
    None
 }
 }
 static WARNED_PRIVS: AtomicBool = AtomicBool::new(false);
@@ -18,6 +18,10 @@ mod gpu;
 #[cfg(feature = "gpu")]
 mod gpu_amd;
 // AMD GPU telemetry via Linux amdgpu sysfs (a `gpu_tune::GpuTuner` backend).
 #[cfg(feature = "gpu")]
 mod amd_smi;
 // Runtime dynamic-library loader (dlopen) for the CUDA driver + NVML.
 #[cfg(feature = "cuda")]
 mod dylib;
@@ -28,7 +32,9 @@ mod cuda;
 #[cfg(feature = "cuda")]
 mod nvml;
-#[cfg(feature = "cuda")]
+// Platform-agnostic GPU tuning/telemetry surface. The trait + policy compile for
 // either GPU backend; NVML (cuda) and amd_smi (gpu) are the implementations.
 #[cfg(any(feature = "cuda", feature = "gpu"))]
 mod gpu_tune;
 use std::io::IsTerminal;
@@ -125,11 +131,12 @@ struct Args {
    #[arg(long, value_name = "SPEC")]
    cpu_cores: Option<String>,
-    /// Cores per CPU mining row (default 4). Each row runs one shared solve
+    /// Cores per CPU mining row. Each row runs one shared solve across its
-    /// across its cores; larger groups cut memory sharply: total RAM is ~4 GB ×
+    /// cores; larger groups cut memory sharply: total RAM is ~4 GB × (enabled
-    /// (enabled cores / this size). Use 1 for one row (and one solve) per core.
+    /// cores / this size). Rows align to core-index blocks of this size. Capped
-    /// Rows are aligned to core-index blocks of this size, so a row never
+    /// by core count so the row count stays manageable — ≤4 cores toggle
-    /// straddles a boundary. Cycle it live in the dashboard with 'g'.
+    /// individually (1), 5-8 cores in groups of ≤2, more than 8 in groups of ≤4
    /// — and the default is that cap. Cycle it live (within the cap) with 'g'.
    #[arg(long, value_name = "N", default_value_t = 4)]
    cpu_group_size: usize,
@@ -203,6 +210,13 @@ struct Args {
    #[arg(long)]
    auto_tune: bool,
    /// Sustained-Sol/s governor: hold each GPU at/below this edge temperature (°C)
    /// by pacing the solve cadence (no hardware writes, no root). Trades a little
    /// throughput for lower temp/power; off by default (runs flat-out). Needs a
    /// backend that reports temperature (AMD amdgpu / NVIDIA).
    #[arg(long, value_name = "CELSIUS")]
    target_temp: Option<u32>,
    /// Efficiency: cap each GPU's power limit in watts (default: card max).
    /// Lower power trades a little hashrate for much better Sol/W.
    #[arg(long, value_name = "WATTS")]
@@ -619,6 +633,8 @@ fn main() -> Result<()> {
        args.power_limit.unwrap_or(0),
        args.unlock_controls,
    );
    // Software temp governor target (paces solve cadence; no hardware writes).
    miner::set_target_temp(args.target_temp);
    miner::run(client, specs, running, job_timeout, tui, format!("{host}:{port}"), controls, cpu_mining, cpu_clamp, args.control_port)
 }
@@ -1076,10 +1092,21 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
    use std::time::Instant;
    info!("benchmarking {runs} solve(s) per worker across {} worker(s)", specs.len());
    /// Per-worker benchmark result, including a steady-state telemetry snapshot
    /// (sampled right after the timed loop, while the card is warm).
    struct WorkerResult {
        sols: usize,
        dt: f64,
        watts: Option<f64>,
        temp_c: Option<u32>,
        core_mhz: Option<u32>,
        mem_mhz: Option<u32>,
    }
    let start = Instant::now();
    let mut handles = Vec::new();
    for (id, spec) in specs.into_iter().enumerate() {
-        handles.push(std::thread::spawn(move || -> Result<(usize, f64)> {
+        handles.push(std::thread::spawn(move || -> Result<WorkerResult> {
            let backend = spec.build()?;
            backend.solve(&pseudo_header(id as u64))?; // warm up (excluded)
            let t = Instant::now();
@@ -1089,7 +1116,17 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
                let seed = ((id as u64) << 40) | (i as u64 + 1);
                sols += backend.solve(&pseudo_header(seed))?.len();
            }
-            Ok((sols, t.elapsed().as_secs_f64()))
+            let dt = t.elapsed().as_secs_f64();
            // Snapshot telemetry while the card is still under load.
            let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
            Ok(WorkerResult {
                sols,
                dt,
                watts: backend.power_watts(),
                temp_c: backend.temperature_c(),
                core_mhz,
                mem_mhz,
            })
        }));
    }
@@ -1098,11 +1135,26 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
    let mut workers = 0usize;
    for h in handles {
        match h.join().unwrap() {
-            Ok((sols, dt)) => {
+            Ok(r) => {
-                let sol_s = sols as f64 / dt;
+                let sol_s = r.sols as f64 / r.dt;
                // Optional telemetry tail: " | 142 W, 41.7 Sol/W, 68°C, 2700/2500 MHz".
                let mut tail = String::new();
                if let Some(w) = r.watts {
                    tail.push_str(&format!(" | {w:.0} W"));
                    if w > 0.0 {
                        tail.push_str(&format!(", {:.2} Sol/W", sol_s / w));
                    }
                }
                if let Some(t) = r.temp_c {
                    tail.push_str(&format!(", {t}°C"));
                }
                if let (Some(c), m) = (r.core_mhz, r.mem_mhz) {
                    tail.push_str(&format!(", {c}/{} MHz", m.map(|m| m.to_string()).unwrap_or_else(|| "?".into())));
                }
                info!(
-                    "  worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {sols} solutions",
+                    "  worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {} solutions{tail}",
-                    1000.0 * dt / runs as f64
+                    1000.0 * r.dt / runs as f64,
                    r.sols
                );
                agg_sols += sol_s;
                workers += 1;
@@ -14,6 +14,22 @@ use crate::equihash;
 use crate::params::{HEADER_LEN, SOLUTION_BYTES};
 use crate::stratum::{StratumClient, Work};
 /// Process-wide target edge temperature (°C) for the software solve-cadence
 /// governor; `None` ⇒ run flat-out. Set once at startup from `--target-temp`.
 static TARGET_TEMP_C: OnceLock<Option<u32>> = OnceLock::new();
 /// Install the governor's target temperature (call once, before workers start).
 pub fn set_target_temp(c: Option<u32>) {
    if let Some(t) = c {
        info!("temperature governor enabled: holding GPUs ≤{t}°C (paced cadence)");
    }
    let _ = TARGET_TEMP_C.set(c);
 }
 fn target_temp_c() -> Option<u32> {
    TARGET_TEMP_C.get().copied().flatten()
 }
 /// Double SHA-256, as used for the Zcash/ZClassic block PoW hash.
 fn sha256d(data: &[u8]) -> [u8; 32] {
    let first = Sha256::digest(data);
@@ -136,6 +152,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.power_watts(),
            #[cfg(feature = "gpu")]
            Backend::Gpu(solver) => solver.power_watts(),
            _ => None,
        }
    }
@@ -145,6 +163,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.temperature_c(),
            #[cfg(feature = "gpu")]
            Backend::Gpu(solver) => solver.temperature_c(),
            _ => None,
        }
    }
@@ -154,6 +174,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.current_power_limit_w(),
            #[cfg(feature = "gpu")]
            Backend::Gpu(solver) => solver.current_power_limit_w(),
            _ => None,
        }
    }
@@ -163,6 +185,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.power_limit_range_w(),
            #[cfg(feature = "gpu")]
            Backend::Gpu(solver) => solver.power_limit_range_w(),
            _ => None,
        }
    }
@@ -190,6 +214,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.current_clocks_mhz(),
            #[cfg(feature = "gpu")]
            Backend::Gpu(solver) => solver.current_clocks_mhz(),
            _ => (None, None),
        }
    }
@@ -513,6 +539,14 @@ fn worker(
    let mut last_job = Instant::now();
    let mut paused = false;
    let mut disabled_pause = false;
    // Software temperature governor: pace the solve cadence to hold edge temp at
    // or below `--target-temp` (no hardware writes). `gov_sleep` is the per-pass
    // pause, nudged from the periodic temperature sample.
    let gov_target = target_temp_c();
    let mut gov_sleep = Duration::ZERO;
    if let Some(target) = gov_target {
        info!("worker {id}: temperature governor active — pacing cadence to hold ≤{target}°C");
    }
    while running.load(Ordering::Relaxed) {
        if work_handle.epoch() != current.epoch {
@@ -583,6 +617,9 @@ fn worker(
            }
            if let Some(t) = backend.temperature_c() {
                stats.workers[id].temp_c.store(t, Ordering::Relaxed);
                if let Some(target) = gov_target {
                    gov_sleep = govern_cadence(gov_sleep, t, target);
                }
            }
            let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
            if let Some(c) = core_mhz {
@@ -614,12 +651,34 @@ fn worker(
            let ctx = inflight.pop_front().unwrap();
            process_results(id, &client, &ctx, &solutions, &stats);
        }
        // Temperature governor: pace the cadence (held off entirely when flat-out).
        if !gov_sleep.is_zero() {
            std::thread::sleep(gov_sleep);
        }
    }
    drain_pipeline(id, &mut backend, &mut inflight, &client, &stats)?;
    Ok(())
 }
 /// Nudge the per-pass governor pause to hold edge temperature near `target` (°C).
 /// Over target → lengthen the pause (proportional to the overshoot); comfortably
 /// under → shorten it. Called from the periodic temperature sample (~1 Hz), so
 /// this is a slow integral controller — fine for the card's thermal time constant.
 fn govern_cadence(cur: Duration, temp: u32, target: u32) -> Duration {
    const MAX_SLEEP: Duration = Duration::from_millis(100);
    let (t, target) = (temp as i64, target as i64);
    if t > target {
        cur.saturating_add(Duration::from_micros(400 * (t - target) as u64))
            .min(MAX_SLEEP)
    } else if t < target - 1 {
        cur.saturating_sub(Duration::from_millis(1))
    } else {
        cur
    }
 }
 /// Supervise CPU mining: spawn one worker per group for the current group size,
 /// and whenever the dashboard cycles the size, stop those workers (`gen_running`),
 /// rebuild the grouping, and respawn. Runs until `running` is cleared.
@@ -843,3 +902,49 @@ fn build_nonce(nonce1: &[u8], counter: u64) -> Result<[u8; 32]> {
    tail[..n].copy_from_slice(&counter.to_le_bytes()[..n]);
    Ok(nonce)
 }
 #[cfg(test)]
 mod governor_tests {
    use super::*;
    #[test]
    fn governor_idles_below_target() {
        // At/under target → no pause is introduced.
        assert_eq!(govern_cadence(Duration::ZERO, 60, 70), Duration::ZERO);
        assert_eq!(govern_cadence(Duration::ZERO, 70, 70), Duration::ZERO);
    }
    #[test]
    fn governor_lengthens_over_target_and_clamps() {
        // Over target → pause grows proportional to the overshoot.
        let s = govern_cadence(Duration::ZERO, 75, 70); // +5°C
        assert_eq!(s, Duration::from_micros(400 * 5));
        // Far over target saturates at the cap, never beyond.
        let hot = govern_cadence(Duration::from_millis(99), 200, 70);
        assert_eq!(hot, Duration::from_millis(100));
    }
    #[test]
    fn governor_eases_off_when_cool() {
        // 2°C+ under target → pause shrinks; floors at zero (no underflow).
        let s = govern_cadence(Duration::from_millis(3), 67, 70);
        assert_eq!(s, Duration::from_millis(2));
        assert_eq!(govern_cadence(Duration::ZERO, 50, 70), Duration::ZERO);
    }
    #[test]
    fn governor_converges_to_hold_target() {
        // Simulate a card whose temp rises with duty: more pause → cooler. The
        // controller should settle near the target without runaway.
        let target = 70u32;
        let mut sleep = Duration::ZERO;
        let mut temp = 85i32; // starts hot
        for _ in 0..200 {
            sleep = govern_cadence(sleep, temp as u32, target);
            // crude plant: each ms of pause sheds ~1.5°C off a 90°C flat-out temp.
            let modeled = 90.0 - 1.5 * sleep.as_millis() as f64;
            temp = modeled.round() as i32;
        }
        assert!((target as i32 - 2..=target as i32 + 2).contains(&temp), "settled at {temp}°C, want ~{target}");
    }
 }
@@ -0,0 +1,124 @@
 ; OpenCL import library definition for jackpotminer's Windows build.
 ; Generated from the cl-sys 0.4.3 exports so a vendor-neutral OpenCL.lib /
 ; libOpenCL.a can be produced at build time without an OpenCL SDK. The actual
 ; OpenCL.dll (ICD loader) is provided at runtime by the GPU driver.
 LIBRARY OpenCL
 EXPORTS
    clBuildProgram
    clCloneKernel
    clCompileProgram
    clCreateBuffer
    clCreateCommandQueue
    clCreateCommandQueueWithProperties
    clCreateContext
    clCreateContextFromType
    clCreateFromGLBuffer
    clCreateFromGLRenderbuffer
    clCreateFromGLTexture
    clCreateFromGLTexture2D
    clCreateFromGLTexture3D
    clCreateImage
    clCreateImage2D
    clCreateImage3D
    clCreateKernel
    clCreateKernelsInProgram
    clCreatePipe
    clCreateProgramWithBinary
    clCreateProgramWithBuiltInKernels
    clCreateProgramWithIL
    clCreateProgramWithSource
    clCreateSampler
    clCreateSamplerWithProperties
    clCreateSubBuffer
    clCreateSubDevices
    clCreateUserEvent
    clEnqueueAcquireGLObjects
    clEnqueueBarrier
    clEnqueueBarrierWithWaitList
    clEnqueueCopyBuffer
    clEnqueueCopyBufferRect
    clEnqueueCopyBufferToImage
    clEnqueueCopyImage
    clEnqueueCopyImageToBuffer
    clEnqueueFillBuffer
    clEnqueueFillImage
    clEnqueueMapBuffer
    clEnqueueMapImage
    clEnqueueMarker
    clEnqueueMarkerWithWaitList
    clEnqueueMigrateMemObjects
    clEnqueueNativeKernel
    clEnqueueNDRangeKernel
    clEnqueueReadBuffer
    clEnqueueReadBufferRect
    clEnqueueReadImage
    clEnqueueReleaseGLObjects
    clEnqueueSVMFree
    clEnqueueSVMMap
    clEnqueueSVMMemcpy
    clEnqueueSVMMemFill
    clEnqueueSVMMigrateMem
    clEnqueueSVMUnmap
    clEnqueueTask
    clEnqueueUnmapMemObject
    clEnqueueWaitForEvents
    clEnqueueWriteBuffer
    clEnqueueWriteBufferRect
    clEnqueueWriteImage
    clFinish
    clFlush
    clGetCommandQueueInfo
    clGetContextInfo
    clGetDeviceAndHostTimer
    clGetDeviceIDs
    clGetDeviceInfo
    clGetEventInfo
    clGetEventProfilingInfo
    clGetExtensionFunctionAddress
    clGetExtensionFunctionAddressForPlatform
    clGetGLContextInfoKHR
    clGetGLObjectInfo
    clGetGLTextureInfo
    clGetHostTimer
    clGetImageInfo
    clGetKernelArgInfo
    clGetKernelInfo
    clGetKernelSubGroupInfo
    clGetKernelWorkGroupInfo
    clGetMemObjectInfo
    clGetPipeInfo
    clGetPlatformIDs
    clGetPlatformInfo
    clGetProgramBuildInfo
    clGetProgramInfo
    clGetSamplerInfo
    clGetSupportedImageFormats
    clLinkProgram
    clReleaseCommandQueue
    clReleaseContext
    clReleaseDevice
    clReleaseEvent
    clReleaseKernel
    clReleaseMemObject
    clReleaseProgram
    clReleaseSampler
    clRetainCommandQueue
    clRetainContext
    clRetainDevice
    clRetainEvent
    clRetainKernel
    clRetainMemObject
    clRetainProgram
    clRetainSampler
    clSetDefaultDeviceCommandQueue
    clSetEventCallback
    clSetKernelArg
    clSetKernelArgSVMPointer
    clSetKernelExecInfo
    clSetMemObjectDestructorCallback
    clSetUserEventStatus
    clSVMAlloc
    clSVMFree
    clUnloadCompiler
    clUnloadPlatformCompiler
    clWaitForEvents
Author	SHA1	Message	Date
jackpotincorporated	09725cf674	OpenCL: de-duplicate the same physical GPU across platforms A GPU exposed by both a vendor runtime (ROCm) and rusticl/Mesa appeared twice in the device list, so mining 'all' ran each card twice (pure contention). Add a single canonical enumerate_devices() — used by list_devices, device_is_nvidia, cpu_device_index and pick_device — that dedupes by physical GPU and prefers the vendor runtime over Mesa. Dedup key is the PCI address: ROCm/NVIDIA expose it via cl_khr_pci_bus_info; rusticl doesn't, but its cl_khr_device_uuid encodes the PCI BDF, so the same card yields the same key on both. Devices without either (CPU/PoCL) are never deduped. No behavior change on single-platform hosts (nothing to dedup); here the list drops 4->2 (both physical GPUs on ROCm, ~38 Sol/s) and device indices are unchanged for the kept devices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 20:57:19 -04:00
jackpotincorporated	41db98af69	AMD kernel: drop fp64 dependency (build on rusticl/Mesa OpenCL) equihash192_7.cl's decompress/decompress2 used double + sqrt behind cl_khr_fp64, so the kernel failed to build on OpenCL stacks without fp64 (notably rusticl/Mesa) — those workers died with 'use of type double requires cl_khr_fp64'. Replace round(sqrt(2*x+1)) with an exact integer square root (single-precision estimate corrected to the integer floor, then rounded; inputs are triangular indices < ~2^26). No fp64, no behavior change on ROCm (verified bit-identical: 77 solutions/40 nonces, same as before), and rusticl devices now build and solve correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 20:39:45 -04:00
jackpotincorporated	31aa85733e	AMD GPU telemetry + --target-temp governor Brings AMD cards to parity with NVIDIA for monitoring/control surface, which was NVML-only. New src/amd_smi.rs is a gpu_tune::GpuTuner backed by Linux amdgpu sysfs (power1_average, temp1_input edge, freq1_input sclk, pp_dpm_sclk/mclk), matched to the device by PCI bus id from OpenCL cl_khr_pci_bus_info. gpu_tune is un-gated to compile under the gpu feature; open() probes NVML then amd_smi. GpuSolver carries the tuner and Backend::Gpu dispatches power/temp/clocks, so the TUI and --benchmark now show power, temperature, clocks and Sol/W for AMD. Telemetry-only — setters are Unsupported (amdgpu control nodes are root-only). --target-temp <C> adds an opt-in software governor (miner::govern_cadence) that paces solve cadence to hold edge temperature, no hardware writes/root. With small thermal throttle it won't beat flat-out on raw Sol/s; it's a temp/efficiency lever. Unit-tested controller; flag/plumbing verified live. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 20:17:59 -04:00
jackpotincorporated	0002e90451	TUI: scale CPU core-toggle granularity by core count Cap the cores-per-row group size (and the dashboard 'g' cycle) by total core count so the toggleable-row count stays sensible and small machines get finer control: ≤4 cores toggle individually (size 1), 5-8 cores in groups of up to 2, and more than 8 in groups of up to 4. The cap is also the default — the prior fixed default of 4 now clamps to the tier (1/2/4), and an explicit --cpu-group-size is clamped to the cap too. Add max_group_size() in cpu_groups; update the help text and the cpu_groups / control tests (the cycle test now uses 16 cores so it can exercise sizes 4/2/1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 18:34:10 -04:00
jackpotincorporated	afd56bee1b	CPU solver: AVX-512 round-0 hashing + prefetch the emit gather Branch the CPU solver onto modern x86 extensions, runtime-dispatched with the existing AVX2/scalar fallbacks: - BatchHasher::hash8 — an 8-lane AVX-512 BLAKE2b final-block compression (native _mm512_ror_epi64 rotates), falling back to two AVX2 hash4s or scalar. Round 0 now hashes eight g-values per chunk. round0-hash drops ~1.45x on AVX-512 CPUs (≈225→155 ms here, AMD Zen4). - emit_bucket software-prefetches each collision group's randomly-gathered member slots (the ~1 GB slot arena is the round's cache-miss bottleneck), shaving a few percent off the dominant emit phase. Controlled A/B on this Zen4 box (same thermal state): ~4-5% faster overall. The collision rounds are memory-bandwidth bound, so SIMD width is not the limiter — the modern-ISA win is modest by nature. EQ_NO_AVX512 / EQ_NO_PREFETCH opt out per-CPU (e.g. parts where AVX-512 downclocks) and back the A/B harness. hash8 is validated against the scalar reference (batch_matches_reference) and full solves still find valid solutions in every dispatch configuration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 13:12:57 -04:00
jackpotincorporated	00531fb591	Windows build: generate the OpenCL import library, drop the SDK dependency Building the `gpu` (OpenCL) backend on Windows no longer needs a vendor OpenCL SDK. `cl-sys` links `OpenCL` (#[link(name = "OpenCL")]); instead of requiring an SDK-provided OpenCL.lib, build.rs now generates a vendor-neutral import library at build time from windows/OpenCL.def (all 118 cl-sys exports) — lib.exe for MSVC (located via the cc crate), dlltool for MinGW — and puts it on the link search path. The real OpenCL.dll is supplied at runtime by the GPU driver. build.rs no-ops on non-Windows targets and when the gpu feature is off, and warns rather than panics if the toolchain tool is absent so `cargo check` still works. Combined with the runtime-loaded (dlopen) CUDA/NVML, a Windows build now needs zero external GPU libraries. Add BUILD-windows.md (toolchain, build, crt-static packaging, runtime deps, CI) and link it from the README. Verified the whole crate compiles for x86_64-pc-windows-gnu (default features and gpu,cuda); Linux is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 12:38:15 -04:00
jackpotincorporated	8a9d98a21d	Refresh solver architecture comment for the optimized design The module header still described a fixed 32-byte slot, a serial partition, and a "not yet taken" radix scatter. Update it to reflect the current memory-traffic- oriented design: parallel counting-sort partition_top, keys_part/keys_sorted to make the key reads sequential, narrowing per-round packed slots with a masked store, and the remaining slot-gather floor (with the note that the full payload radix scatter was evaluated and loses on the wide early rounds). Comment-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:48:29 -04:00
jackpotincorporated	1b4a2a4dd9	Scatter partition-contiguous keys to kill the per-round key gathers count_pairs, low_group, and the emit group-walk all read each entry's leading key via `keys[order[..]]` — a random gather over the whole ~128 MB keys array, three times per round. partition_top now also produces `keys_part` (the leading keys in partition order, keys_part[p] == keys[order[p]]), written by the same parallel, disjoint phase-3 scatter at 4 bytes/entry. count_pairs and low_group then stream their partition's keys sequentially, and low_group emits a `keys_sorted` array so the emit group walk streams a dense local copy instead of gathering keys[sorted[i]]. The only remaining DRAM-random access in the rounds is the unavoidable slot gather. Measured (16 threads, clamp 16/32): count ~160 -> ~10 ms/round, emit ~770 -> ~550 ms/round, partition +~80 ms (the added 128 MB scatter); full solve ~8.4 -> ~7.04 s (~16%). Cumulative across the three CPU-solver changes: ~13.4 -> ~7.04 s (-47%), 0.07 -> 0.14 solve/s. Identical solution yield; cross-clamp validity and full_solve_baseline pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:23:10 -04:00
jackpotincorporated	966ce3e262	Pack collision-round slots at shrinking per-round width Each collision round consumes the leading 24-bit block, so a round-r output entry only carries 8-r meaningful words — but every round previously stored a fixed 8-word (32 B) slot, moving dead trailing words. Since the collision rounds are memory-bandwidth bound, that wasted DRAM traffic is wasted time. Pack each round's slots at w_out = w_in - 1 words. The XOR child producer still loads a full 256-bit register (over-reading up to SLOT words past a narrow tail slot — buffers carry a SLOT_SLACK pad so this stays in bounds) but masked-stores only the w_out live lanes so packed neighbours aren't clobbered. The over-read garbage only ever lands in non-stored lanes: storing out[0..w_out] = x[1..w_in] uses exclusively meaningful input words. Width is threaded through collide/collide_final/emit_bucket from solve_with (round 0 stays full 8-word). Measured (16 threads, clamp 16/32): ~9.2 -> ~8.4 s/solve (~9%); per-round time now shrinks r1~1230 -> r6~1045 ms. Cumulative with the parallel-partition change: ~13.4 -> ~8.4 s (-37%). Identical solution yield; xor_child_matches_scalar now covers every width + the masked-store no-clobber property; cross-clamp validity and full_solve_baseline pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:11:05 -04:00
jackpotincorporated	501527d3cb	Parallelize partition_top; add solver benchmark + phase profiling partition_top was the only serial stage in the otherwise rayon-parallel collision rounds — plain `for k in 0..n` count and scatter loops that left 15/16 cores idle for ~35% of every round (~650 ms). Replace it with a parallel counting sort: per-chunk top-bucket histograms, a small serial pass to per-chunk base offsets, then a disjoint-region scatter through a shared raw pointer (each chunk writes a provably non-overlapping set of positions). Entries within a bucket become chunk-major rather than index-major, which is immaterial: count_pairs/low_group depend only on the low-key multiset, and solutions are canonicalized, de-duplicated, and verified downstream. Measured (16 threads): partition_top ~650 -> ~100 ms/round (6.5x), collide-final ~1.18 -> ~0.59 s, full solve ~13.4 -> ~9.2 s (-31%, 0.07 -> 0.11 solve/s), with identical solution yield and all validity tests passing. Also add (gated/ignored, no production-path behavior change): - full_solve_baseline: an #[ignore] throughput benchmark over realistic dense headers (EQ_BENCH_ITERS / EQ_BENCH_CLAMPS). - EQ_PROFILE-gated per-phase and per-collide-sub-phase timing in solve_with. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 10:42:54 -04:00