OpenCL: de-duplicate the same physical GPU across platforms

A GPU exposed by both a vendor runtime (ROCm) and rusticl/Mesa appeared twice in the device list, so mining 'all' ran each card twice (pure contention). Add a single canonical enumerate_devices() — used by list_devices, device_is_nvidia, cpu_device_index and pick_device — that dedupes by physical GPU and prefers the vendor runtime over Mesa. Dedup key is the PCI address: ROCm/NVIDIA expose it via cl_khr_pci_bus_info; rusticl doesn't, but its cl_khr_device_uuid encodes the PCI BDF, so the same card yields the same key on both. Devices without either (CPU/PoCL) are never deduped. No behavior change on single-platform hosts (nothing to dedup); here the list drops 4->2 (both physical GPUs on ROCm, ~38 Sol/s) and device indices are unchanged for the kept devices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AMD kernel: drop fp64 dependency (build on rusticl/Mesa OpenCL)
2026-06-06 20:57:19 -04:00 · 2026-06-06 20:39:45 -04:00 · 2026-06-06 20:17:59 -04:00 · 2026-06-06 18:34:10 -04:00 · 2026-06-06 13:12:57 -04:00 · 2026-06-06 12:38:15 -04:00
16 changed files with 1506 additions and 242 deletions
@@ -0,0 +1,119 @@
+# Building jackpotminer on Windows
+
+The miner builds on Windows with **no external GPU SDKs**. The CUDA driver and
+NVML are loaded at runtime (`dlopen`/`LoadLibrary`), and the OpenCL import
+library is generated at build time from `windows/OpenCL.def` — so you don't need
+the CUDA Toolkit, an OpenCL SDK, or any vendor libraries to compile. The whole
+codebase (miner + the `jackpotminer-config` GUI) is verified to compile for
+`x86_64-pc-windows-*`.
+
+## Prerequisites
+
+- **Rust** (https://rustup.rs). The default target is `x86_64-pc-windows-msvc`.
+- **MSVC build tools** — Visual Studio 2019/2022 or the standalone *Build Tools
+  for Visual Studio* with the **“Desktop development with C++”** workload (gives
+  `link.exe`, `lib.exe`, and the Windows SDK).
+  - Build from a **“x64 Native Tools Command Prompt for VS”** (or any shell where
+    the MSVC environment is active) so `lib.exe` is found. `cargo` also locates
+    it automatically via the registry in most cases.
+- *(Alternative toolchain)* the GNU target `x86_64-pc-windows-gnu` works too,
+  with **MinGW-w64** on `PATH` (provides `dlltool` and the linker).
+
+No CUDA Toolkit and **no OpenCL SDK** are required.
+
+## Build
+
+```powershell
+:: Default: OpenCL + CUDA backends + the GUI config tool
+cargo build --release
+
+:: Miner only (no GUI), both GPU backends
+cargo build --release --no-default-features --features gpu,cuda
+
+:: OpenCL only (AMD / Intel / NVIDIA)
+cargo build --release --no-default-features --features gpu
+
+:: NVIDIA only — needs nothing external at build time (CUDA is dlopen'd)
+cargo build --release --no-default-features --features cuda
+```
+
+Outputs: `target\release\jackpotminer.exe` (and `jackpotminer-config.exe` with
+the default features).
+
+### How the OpenCL build dependency is avoided
+
+`ocl`/`cl-sys` link `OpenCL` (`#[link(name = "OpenCL")]`), which normally needs
+an `OpenCL.lib` from a vendor SDK. Instead, `build.rs` generates a
+vendor-neutral import library from `windows/OpenCL.def`:
+
+- **MSVC:** `lib.exe /def:windows\OpenCL.def /out:OpenCL.lib /machine:X64`
+- **GNU:** `dlltool -d windows/OpenCL.def -l libOpenCL.a -m i386:x86-64`
+
+and puts it on the link search path. The import library only forwards to
+`OpenCL.dll`, which the GPU driver provides at runtime. If the toolchain tool
+isn’t on `PATH`, `build.rs` prints a warning and the link step fails with a clear
+message (compilation/`cargo check` still works).
+
+## Distribution
+
+Statically link the MSVC C runtime so users don’t need the VC++ redistributable:
+
+```powershell
+set RUSTFLAGS=-C target-feature=+crt-static
+cargo build --release
+```
+
+## Runtime dependencies (on the mining machine)
+
+- **`OpenCL.dll`** — the ICD loader, installed with any GPU driver (AMD/NVIDIA/
+  Intel). Required for the OpenCL backend.
+- **`nvcuda.dll`** + **`nvml.dll`** — installed with the NVIDIA driver. Loaded
+  on demand for the CUDA backend; absent on AMD-only machines, where the miner
+  simply reports no CUDA devices.
+- The **VC++ runtime**, unless you built with `+crt-static`.
+
+A `cuda`-enabled binary still starts on a machine with no NVIDIA driver.
+
+## Building Windows binaries without a Windows machine
+
+### GitHub Actions (recommended)
+
+```yaml
+name: windows
+on: [push, workflow_dispatch]
+jobs:
+  build:
+    runs-on: windows-latest          # MSVC + lib.exe already on PATH
+    steps:
+      - uses: actions/checkout@v4
+      - uses: dtolnay/rust-toolchain@stable
+      - run: cargo build --release --features gpu,cuda
+        env:
+          RUSTFLAGS: -C target-feature=+crt-static
+      - uses: actions/upload-artifact@v4
+        with:
+          name: jackpotminer-windows
+          path: target/release/*.exe
+```
+
+### Cross-compile from Linux (GNU target)
+
+```bash
+rustup target add x86_64-pc-windows-gnu
+sudo pacman -S mingw-w64           # or your distro's mingw-w64 (gives dlltool + linker)
+cargo build --release --target x86_64-pc-windows-gnu --no-default-features --features gpu,cuda
+```
+
+The MSVC target can’t be linked from Linux. The GUI config tool (`eframe`) is
+easiest to build natively on Windows.
+
+## Status / caveats
+
+- **Compilation for Windows is verified** here via `cargo check
+  --target x86_64-pc-windows-gnu` (default features, and `gpu,cuda`).
+- The OpenCL **import-library linking** uses the standard `lib.exe`/`dlltool`
+  technique; validate it with an actual Windows (or MinGW cross) build, which
+  needs those tools present.
+- `relaunch_in_terminal` (reopen-in-a-terminal on GUI launch) is Linux-only;
+  harmless on Windows, where double-clicking a console binary already opens a
+  console.
@@ -2230,6 +2230,7 @@ version = "0.1.0"
 dependencies = [
 "anyhow",
 "blake2b_simd",
+ "cc",
 "clap",
 "core_affinity",
 "ctrlc",
@@ -45,6 +45,11 @@ name = "jackpotminer-config"
 path = "src/config_gui.rs"
 required-features = ["config-gui"]

+[build-dependencies]
+# Locates the MSVC `lib.exe` (to generate the OpenCL import library on Windows);
+# unused on other platforms. See build.rs.
+cc = "1"
+
 [profile.release]
 opt-level = 3
 lto = true
@@ -114,6 +114,13 @@ cargo build --release --no-default-features --features cuda       # CUDA only
 cargo build --release --no-default-features                       # CPU-only (no GPU)
 ```

+### Windows
+
+Builds with no external GPU SDKs — CUDA/NVML are loaded at runtime and the
+OpenCL import library is generated at build time. See
+[BUILD-windows.md](BUILD-windows.md) for the toolchain, build, and packaging
+steps.
+
 ### Portable / distributable builds

 The miner's only runtime dependencies are the C library and the OpenCL ICD loader
@@ -0,0 +1,79 @@
+//! Build script — Windows OpenCL import library only.
+//!
+//! On Windows, the `ocl`/`cl-sys` crates link `OpenCL` (`#[link(name =
+//! "OpenCL")]`), which normally requires an `OpenCL.lib` import library from a
+//! vendor OpenCL SDK. To avoid that build dependency, we generate a
+//! vendor-neutral import library ourselves from `windows/OpenCL.def` (the list
+//! of OpenCL exports) using the toolchain's own tools — `lib.exe` for MSVC,
+//! `dlltool` for the GNU (MinGW) toolchain — and put it on the link search path.
+//! The real `OpenCL.dll` (the ICD loader) is supplied at runtime by the GPU
+//! driver, exactly like `libOpenCL.so.1` on Linux.
+//!
+//! Nothing here is needed on non-Windows targets (the system `libOpenCL` is used
+//! directly) or when the `gpu` (OpenCL) feature is off — the script no-ops. The
+//! CUDA driver / NVML are loaded at runtime via dlopen (see `src/dylib.rs`), so
+//! they need no build-script support.
+
+use std::env;
+use std::path::PathBuf;
+use std::process::Command;
+
+fn main() {
+    println!("cargo:rerun-if-changed=build.rs");
+    println!("cargo:rerun-if-changed=windows/OpenCL.def");
+
+    // Only Windows targets with the OpenCL backend need an import library.
+    if env::var("CARGO_CFG_TARGET_OS").as_deref() != Ok("windows") {
+        return;
+    }
+    if env::var_os("CARGO_FEATURE_GPU").is_none() {
+        return;
+    }
+
+    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
+    let def = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap()).join("windows/OpenCL.def");
+    let target = env::var("TARGET").unwrap_or_default();
+    let is_msvc = env::var("CARGO_CFG_TARGET_ENV").as_deref() == Ok("msvc");
+
+    let (mut cmd, tool) = if is_msvc {
+        // lib.exe /def:OpenCL.def /out:OpenCL.lib /machine:X64
+        let mut c = cc::windows_registry::find(&target, "lib.exe")
+            .unwrap_or_else(|| Command::new("lib.exe"));
+        c.arg("/nologo")
+            .arg(format!("/def:{}", def.display()))
+            .arg(format!("/out:{}", out_dir.join("OpenCL.lib").display()))
+            .arg("/machine:X64");
+        (c, "lib.exe")
+    } else {
+        // GNU/MinGW: dlltool -d OpenCL.def -l libOpenCL.a -m i386:x86-64
+        let dlltool = ["x86_64-w64-mingw32-dlltool", "dlltool"]
+            .into_iter()
+            .find(|t| Command::new(t).arg("--version").output().is_ok())
+            .unwrap_or("dlltool");
+        let mut c = Command::new(dlltool);
+        c.arg("-d")
+            .arg(&def)
+            .arg("-l")
+            .arg(out_dir.join("libOpenCL.a"))
+            .arg("-m")
+            .arg("i386:x86-64");
+        (c, "dlltool")
+    };
+
+    match cmd.status() {
+        Ok(s) if s.success() => {
+            // Resolve `#[link(name = "OpenCL")]` against the generated import lib.
+            println!("cargo:rustc-link-search=native={}", out_dir.display());
+        }
+        Ok(s) => panic!("{tool} failed ({s}) generating the OpenCL import library from {}", def.display()),
+        Err(e) => {
+            // Tool not found: let `cargo check` (which doesn't link) still succeed;
+            // a real build will fail at link with a clear "cannot find OpenCL".
+            println!(
+                "cargo:warning=could not run {tool} to generate the OpenCL import library ({e}); \
+                 ensure the toolchain tools are on PATH (MSVC dev prompt, or mingw-w64). \
+                 Linking the `gpu` feature will fail until then."
+            );
+        }
+    }
+}
@@ -2,8 +2,6 @@

 //#define PRINT 1

-#pragma OPENCL EXTENSION cl_khr_fp64 : enable
-
 __constant ulong blake_iv[] =
 {
    0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
@@ -507,14 +505,22 @@ uint compress2(uint in0, uint in1) {
 	return tmp;	
 }

+// Exact round(sqrt(x)) without fp64: a single-precision estimate corrected to the
+// exact integer floor, then rounded. Lets the kernel build on OpenCL stacks that
+// lack cl_khr_fp64 (e.g. rusticl/Mesa) while staying bit-identical on ROCm. Inputs
+// are triangular indices (< ~2^26), well within range for the float estimate +
+// integer correction.
+inline uint isqrt_round(ulong x) {
+	long m = (long) sqrt((float) x);
+	while (m > 0 && (ulong)(m * m) > x) m--;       // correct down to floor(sqrt)
+	while ((ulong)((m + 1) * (m + 1)) <= x) m++;   // correct up to floor(sqrt)
+	// round to nearest: round up iff the fractional part is >= 0.5.
+	return (uint)(((x - (ulong)(m * m)) > (ulong) m) ? (m + 1) : m);
+}
+
 uint2 decompress(uint2 in) {
-	double inFl = (double) (in.s0 >> 6);
-	
-	inFl *= 2.0;
-	inFl += 1.0;
-
 	uint2 res;
-	res.s0 = (uint) round(sqrt(inFl));
+	res.s0 = isqrt_round(2ul * (ulong)(in.s0 >> 6) + 1ul);


 	uint tmp = res.s0 * (res.s0-1);
@@ -535,13 +541,8 @@ uint2 decompress(uint2 in) {
 }

 uint2 decompress2(uint in) {
-	double inFl = (double) in;
-	
-	inFl *= 2.0;
-	inFl += 1.0;
-
 	uint2 res;
-	res.s0 = (uint) round(sqrt(inFl));
+	res.s0 = isqrt_round(2ul * (ulong) in + 1ul);


 	uint tmp = res.s0 * (res.s0-1);
@@ -0,0 +1,204 @@
+//! AMD GPU telemetry for [`crate::gpu_tune::GpuTuner`] via the Linux amdgpu
+//! sysfs interface (`/sys/class/drm/cardN/device/...`).
+//!
+//! Telemetry only: board power, edge temperature, and core/memory clocks come
+//! from the world-readable hwmon + DPM nodes, so it works unprivileged. The
+//! control nodes (`pp_od_clk_voltage`, power cap, performance level) are
+//! root-write-only and carry GPU-hang risk, so every setter returns
+//! [`SetOutcome::Unsupported`] — this backend never writes. The handle is matched
+//! to the physical card by PCI bus id (e.g. from OpenCL's `cl_khr_pci_bus_info`),
+//! so it lines up with whichever device the solver actually opened.
+
+use std::path::{Path, PathBuf};
+
+use crate::gpu_tune::{GpuTuner, SetOutcome};
+
+/// One amdgpu card's sysfs telemetry handle.
+pub struct AmdTuner {
+    /// Canonical `/sys/.../<PCI BDF>` device directory (holds `pp_dpm_*`).
+    device_dir: PathBuf,
+    /// `device_dir/hwmon/hwmonM` (the index M is not stable — resolved by glob).
+    hwmon_dir: PathBuf,
+    name: String,
+}
+
+// Only `PathBuf`/`String` — `AmdTuner` is `Send` automatically; no `unsafe impl`.
+
+/// Open a telemetry handle for the amdgpu card at `pci_bus_id` (e.g.
+/// "0000:03:00.0"). Returns `None` off Linux, or when no amdgpu card with
+/// telemetry nodes matches the bus id.
+pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
+    #[cfg(not(target_os = "linux"))]
+    {
+        let _ = pci_bus_id;
+        None
+    }
+    #[cfg(target_os = "linux")]
+    {
+        let want = bdf_tail(pci_bus_id)?;
+        for entry in std::fs::read_dir("/sys/class/drm").ok()?.flatten() {
+            let fname = entry.file_name();
+            let fname = fname.to_string_lossy();
+            // Match the GPU nodes ("card0", "card1", …), not the per-connector
+            // dirs ("card1-DP-1") or render nodes.
+            if !fname.starts_with("card") || fname.contains('-') {
+                continue;
+            }
+            // `cardN/device` symlinks to the PCI device dir `…/<BDF>`.
+            let Ok(device_dir) = std::fs::canonicalize(entry.path().join("device")) else {
+                continue;
+            };
+            let matches = device_dir
+                .file_name()
+                .and_then(|s| s.to_str())
+                .and_then(bdf_tail)
+                .map(|t| t == want)
+                .unwrap_or(false);
+            if !matches {
+                continue;
+            }
+            let Some(hwmon_dir) = find_hwmon(&device_dir) else { continue };
+            // Require a telemetry node so we don't attach to a card without
+            // sensors (e.g. some virtual/headless devices).
+            if !hwmon_dir.join("temp1_input").exists()
+                && !hwmon_dir.join("power1_average").exists()
+            {
+                continue;
+            }
+            let name = read_name(&device_dir);
+            return Some(Box::new(AmdTuner { device_dir, hwmon_dir, name }));
+        }
+        None
+    }
+}
+
+/// First `hwmon/hwmon*` subdir of a device dir (the index isn't stable).
+fn find_hwmon(device_dir: &Path) -> Option<PathBuf> {
+    for e in std::fs::read_dir(device_dir.join("hwmon")).ok()?.flatten() {
+        if e.file_name().to_string_lossy().starts_with("hwmon") {
+            return Some(e.path());
+        }
+    }
+    None
+}
+
+/// A display name for the card; amdgpu sysfs rarely exposes a marketing name, so
+/// fall back to a generic label.
+fn read_name(device_dir: &Path) -> String {
+    read_trim(&device_dir.join("product_name"))
+        .filter(|s| !s.is_empty())
+        .unwrap_or_else(|| "AMD GPU".to_string())
+}
+
+/// Normalize a PCI BDF to its `bus:device.function` tail (drop the domain), so
+/// e.g. "0000:03:00.0" and "03:00.0" compare equal.
+fn bdf_tail(bdf: &str) -> Option<String> {
+    let bdf = bdf.trim().to_ascii_lowercase();
+    match bdf.split(':').collect::<Vec<_>>().as_slice() {
+        [_domain, bus, devfunc] => Some(format!("{bus}:{devfunc}")),
+        [bus, devfunc] => Some(format!("{bus}:{devfunc}")),
+        _ => None,
+    }
+}
+
+fn read_trim(p: &Path) -> Option<String> {
+    std::fs::read_to_string(p).ok().map(|s| s.trim().to_string())
+}
+
+fn read_u64(p: &Path) -> Option<u64> {
+    read_trim(p)?.parse().ok()
+}
+
+/// Parse the MHz value out of a `pp_dpm_*` line like `"2: 2700Mhz *"`.
+fn parse_mhz(line: &str) -> Option<u32> {
+    let after_colon = line.split(':').nth(1)?;
+    let token = after_colon.trim().split_whitespace().next()?;
+    let digits: String = token.chars().take_while(|c| c.is_ascii_digit()).collect();
+    digits.parse().ok()
+}
+
+impl AmdTuner {
+    /// MHz of the currently active DPM level (the line marked `*`) in `file`.
+    fn dpm_active_mhz(&self, file: &str) -> Option<u32> {
+        let s = read_trim(&self.device_dir.join(file))?;
+        s.lines().find(|l| l.contains('*')).and_then(parse_mhz)
+    }
+    /// Highest DPM level (MHz) listed in `file`.
+    fn dpm_max_mhz(&self, file: &str) -> Option<u32> {
+        let s = read_trim(&self.device_dir.join(file))?;
+        s.lines().filter_map(parse_mhz).max()
+    }
+}
+
+impl GpuTuner for AmdTuner {
+    fn name(&self) -> String {
+        self.name.clone()
+    }
+
+    fn watts(&self) -> Option<f64> {
+        // power1_average is µW; fall back to the instantaneous power1_input.
+        let uw = read_u64(&self.hwmon_dir.join("power1_average"))
+            .or_else(|| read_u64(&self.hwmon_dir.join("power1_input")))?;
+        Some(uw as f64 / 1_000_000.0)
+    }
+
+    fn temperature_c(&self) -> Option<u32> {
+        // temp1 = edge (m°C).
+        let mc = read_u64(&self.hwmon_dir.join("temp1_input"))?;
+        Some(((mc + 500) / 1000) as u32)
+    }
+
+    fn current_power_limit_w(&self) -> Option<u32> {
+        // Absent on Navi 44 (RX 9060 XT); best-effort for cards that expose it.
+        read_u64(&self.hwmon_dir.join("power1_cap")).map(|uw| (uw / 1_000_000) as u32)
+    }
+
+    fn core_clock_mhz(&self) -> Option<u32> {
+        // freq1_input (Hz) is the live sclk; the DPM active level is the fallback.
+        if let Some(hz) = read_u64(&self.hwmon_dir.join("freq1_input")) {
+            if hz > 0 {
+                return Some((hz / 1_000_000) as u32);
+            }
+        }
+        self.dpm_active_mhz("pp_dpm_sclk")
+    }
+
+    fn mem_clock_mhz(&self) -> Option<u32> {
+        self.dpm_active_mhz("pp_dpm_mclk")
+    }
+
+    fn max_core_clock_mhz(&self) -> Option<u32> {
+        self.dpm_max_mhz("pp_dpm_sclk")
+    }
+
+    fn max_mem_clock_mhz(&self) -> Option<u32> {
+        self.dpm_max_mhz("pp_dpm_mclk")
+    }
+
+    fn power_limit_range_w(&self) -> Option<(u32, u32)> {
+        let mn = read_u64(&self.hwmon_dir.join("power1_cap_min"))?;
+        let mx = read_u64(&self.hwmon_dir.join("power1_cap_max"))?;
+        Some(((mn / 1_000_000) as u32, (mx / 1_000_000) as u32))
+    }
+
+    // Telemetry-only backend: never writes the root-only control nodes.
+    fn set_persistence(&self, _on: bool) -> SetOutcome {
+        SetOutcome::Unsupported
+    }
+    fn lock_core_clock_mhz(&self, _mhz: u32) -> SetOutcome {
+        SetOutcome::Unsupported
+    }
+    fn lock_mem_clock_mhz(&self, _mhz: u32) -> SetOutcome {
+        SetOutcome::Unsupported
+    }
+    fn set_power_limit_w(&self, _watts: u32) -> SetOutcome {
+        SetOutcome::Unsupported
+    }
+    fn set_core_offset_mhz(&self, _mhz: i32) -> SetOutcome {
+        SetOutcome::Unsupported
+    }
+    fn set_mem_offset_mhz(&self, _mhz: i32) -> SetOutcome {
+        SetOutcome::Unsupported
+    }
+    fn reset(&self) {}
+}
@@ -215,6 +215,92 @@ impl BatchHasher {
        }
    }

+    /// Hash eight consecutive indices `g0..g0+8`, writing each 48-byte digest
+    /// into `out[0..8]`. Uses AVX-512 (one 8-wide BLAKE2b compression) when
+    /// available, else two AVX2 `hash4`s. Modern AMD (Zen4+) and Intel CPUs run
+    /// AVX-512 without the clock penalty older Intel parts had.
+    #[inline]
+    pub fn hash8(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
+        #[cfg(target_arch = "x86_64")]
+        {
+            if is_x86_feature_detected!("avx512f") {
+                unsafe { self.hash8_avx512(g0, out) };
+                return;
+            }
+        }
+        // Fallback: two 4-lane batches into the two halves of `out`.
+        let (lo, hi) = out.split_at_mut(4);
+        self.hash4(g0, (&mut lo[..4]).try_into().unwrap());
+        self.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
+    }
+
+    /// Eight-lane BLAKE2b final-block compression (AVX-512). Same structure as
+    /// [`Self::hash4_avx2`] but 512-bit lanes and native 64-bit rotates.
+    #[cfg(target_arch = "x86_64")]
+    #[target_feature(enable = "avx512f")]
+    unsafe fn hash8_avx512(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
+        use core::arch::x86_64::*;
+
+        #[inline(always)]
+        unsafe fn g8(
+            v: &mut [__m512i; 16],
+            a: usize, b: usize, c: usize, d: usize,
+            x: __m512i, y: __m512i,
+        ) {
+            v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), x);
+            v[d] = _mm512_ror_epi64::<32>(_mm512_xor_si512(v[d], v[a]));
+            v[c] = _mm512_add_epi64(v[c], v[d]);
+            v[b] = _mm512_ror_epi64::<24>(_mm512_xor_si512(v[b], v[c]));
+            v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), y);
+            v[d] = _mm512_ror_epi64::<16>(_mm512_xor_si512(v[d], v[a]));
+            v[c] = _mm512_add_epi64(v[c], v[d]);
+            v[b] = _mm512_ror_epi64::<63>(_mm512_xor_si512(v[b], v[c]));
+        }
+
+        // Only m0 and m1 are nonzero; m1's high 32 bits hold the per-lane `g`.
+        let tail0 = u64::from_le_bytes(self.tail[0..8].try_into().unwrap());
+        let tail_hi = u32::from_le_bytes(self.tail[8..12].try_into().unwrap()) as u64;
+        let m1 = |g: u32| (tail_hi | ((g as u64) << 32)) as i64;
+        let mut m = [_mm512_setzero_si512(); 16];
+        m[0] = _mm512_set1_epi64(tail0 as i64);
+        m[1] = _mm512_set_epi64(
+            m1(g0 + 7), m1(g0 + 6), m1(g0 + 5), m1(g0 + 4),
+            m1(g0 + 3), m1(g0 + 2), m1(g0 + 1), m1(g0),
+        );
+
+        let mut v = [_mm512_setzero_si512(); 16];
+        for i in 0..8 {
+            v[i] = _mm512_set1_epi64(self.mid[i] as i64);
+            v[i + 8] = _mm512_set1_epi64(IV[i] as i64);
+        }
+        v[12] = _mm512_xor_si512(v[12], _mm512_set1_epi64(FINAL_COUNT as i64));
+        v[14] = _mm512_xor_si512(v[14], _mm512_set1_epi64(-1)); // last-block flag
+
+        for s in &SIGMA {
+            g8(&mut v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
+            g8(&mut v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
+            g8(&mut v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
+            g8(&mut v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
+            g8(&mut v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
+            g8(&mut v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
+            g8(&mut v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
+            g8(&mut v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
+        }
+
+        // h[i] = mid[i] ^ v[i] ^ v[i+8]; first HASH_OUTPUT/8 words per lane, LE.
+        let mut tmp = [0u64; 8];
+        for i in 0..HASH_OUTPUT / 8 {
+            let o = _mm512_xor_si512(
+                _mm512_xor_si512(_mm512_set1_epi64(self.mid[i] as i64), v[i]),
+                v[i + 8],
+            );
+            _mm512_storeu_si512(tmp.as_mut_ptr() as *mut _, o);
+            for l in 0..8 {
+                out[l][i * 8..i * 8 + 8].copy_from_slice(&tmp[l].to_le_bytes());
+            }
+        }
+    }
+
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn hash4_avx2(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 4]) {
@@ -316,5 +402,14 @@ mod tests {
                assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash4 mismatch at g={}", base_g + l as u32);
            }
        }
+
+        // hash8 (AVX-512 or hash4 fallback) must match for several batches.
+        for base_g in [0u32, 8, 64, 1_000_000] {
+            let mut out = [[0u8; HASH_OUTPUT]; 8];
+            hasher.hash8(base_g, &mut out);
+            for l in 0..8 {
+                assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash8 mismatch at g={}", base_g + l as u32);
+            }
+        }
    }
 }
@@ -175,15 +175,15 @@ mod tests {
    #[test]
    fn protocol_get_and_set() {
        let controls = Controls::new(1, 0, 0, 0, 0, true);
-        let cpu = CpuMining::new((0..8).collect(), 4, false); // sizes [1,2,4,8]
+        let cpu = CpuMining::new((0..16).collect(), 4, false); // 16 cores -> tier cap 4, sizes [1,2,4]
        let stats = Stats::for_test(vec!["GPU 0".into()]);

-        // get: one device, cpu rows reflect the grouping (8 cores / 4 = 2 rows).
+        // get: one device, cpu rows reflect the grouping (16 cores / 4 = 4 rows).
        let v = process("{\"op\":\"get\"}", &controls, &cpu, &stats);
        assert_eq!(v["ok"], true);
        assert_eq!(v["devices"].as_array().unwrap().len(), 1);
        assert_eq!(v["cpu"]["group_size"], 4);
-        assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 2);
+        assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 4);

        // set device 0 disabled.
        assert!(controls.device(0).enabled());
@@ -165,6 +165,21 @@ fn grouped(cores: &[usize], size: usize, enabled: &BTreeSet<usize>) -> Arc<CpuGr
    groups
 }

+/// Largest cores-per-row the dashboard allows, by total core count. Finer
+/// control on small machines, coarser on big ones so the toggleable-row count
+/// stays manageable: **≤4 cores** toggle individually (size 1), **5-8 cores** in
+/// groups of up to 2, and **more than 8** in groups of up to 4. This caps both
+/// the starting group size and the 'g' cycle.
+fn max_group_size(cores: usize) -> usize {
+    if cores <= 4 {
+        1
+    } else if cores <= 8 {
+        2
+    } else {
+        4
+    }
+}
+
 /// Live controller for CPU mining. Owns the selected cores plus the current
 /// group size, and rebuilds the [`CpuGroups`] when the size is cycled from the
 /// dashboard (the worker supervisor watches `group_size()` and respawns to
@@ -188,14 +203,16 @@ impl CpuMining {
    /// the starting group size (`--cpu-group-size`), and `start_enabled` whether
    /// mining begins on (`--cpu-mining`).
    pub fn new(cores: Vec<usize>, initial_size: usize, start_enabled: bool) -> Arc<Self> {
-        let initial_size = initial_size.max(1);
+        // Toggle granularity scales with core count so the row count stays
+        // manageable (see [`max_group_size`]): the cap bounds both the starting
+        // size and the dashboard 'g' cycle.
+        let cap = max_group_size(cores.len());
+        let initial_size = initial_size.clamp(1, cap);
        let enabled: BTreeSet<usize> = if start_enabled { cores.iter().copied().collect() } else { BTreeSet::new() };
        let groups = grouped(&cores, initial_size, &enabled);

-        // Cycle list: the usual powers of two plus the requested size, capped so a
-        // group never exceeds the core count (unless the user explicitly asked for
-        // a larger size), sorted and de-duplicated.
-        let cap = cores.len().max(initial_size).max(1);
+        // Cycle list: powers of two from 1 up to the cap, plus the (clamped)
+        // requested size, sorted and de-duplicated.
        let mut sizes: Vec<usize> = [1usize, 2, 4, 8]
            .into_iter()
            .chain([initial_size])
@@ -346,39 +363,67 @@ mod tests {
    }

    #[test]
-    fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
-        // 8 cores, start at size 4 fully enabled -> two groups [0-3],[4-7], both on.
-        let m = CpuMining::new((0..8).collect(), 4, true);
+    fn group_size_tier_by_core_count() {
+        // ≤4 cores: individual cores only — size 1, no larger option.
+        let m = CpuMining::new((0..4).collect(), 4, false); // requested 4 clamps to 1
+        assert_eq!(m.group_size(), 1);
+        assert_eq!(m.groups().len(), 4);
+        m.cycle_group_size();
+        assert_eq!(m.group_size(), 1); // only one size, cycle is a no-op
+
+        // 5-8 cores: groups of up to 2 (requested 4 clamps to 2).
+        let m = CpuMining::new((0..8).collect(), 4, false);
+        assert_eq!(m.group_size(), 2);
+        assert_eq!(m.groups().len(), 4);
+        m.cycle_group_size();
+        assert_eq!(m.group_size(), 1); // cycle covers {1, 2}
+
+        // >8 cores: groups of up to 4 (the default).
+        let m = CpuMining::new((0..16).collect(), 4, false);
        assert_eq!(m.group_size(), 4);
-        assert_eq!(m.groups().len(), 2);
+        assert_eq!(m.groups().len(), 4);
+        // A larger explicit request is still capped at the tier max.
+        let m = CpuMining::new((0..16).collect(), 8, false);
+        assert_eq!(m.group_size(), 4);
+    }
+
+    #[test]
+    fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
+        // 16 cores (tier cap 4): start at size 4 fully enabled -> four groups,
+        // all on.
+        let m = CpuMining::new((0..16).collect(), 4, true);
+        assert_eq!(m.group_size(), 4);
+        assert_eq!(m.groups().len(), 4);
        assert!(m.groups().iter().all(|g| g.enabled()));

-        // Disable the second group (cores 4-7): now only cores 0-3 are enabled.
-        m.toggle_group(1);
-        assert!(m.groups().group(0).enabled());
-        assert!(!m.groups().group(1).enabled());
+        // Disable the last group (cores 12-15): now only cores 0-11 are enabled.
+        m.toggle_group(3);
+        assert!(!m.groups().group(3).enabled());

-        // Cycle to size 8 and rebuild: the single [0-7] group is off, because not
-        // all of its cores were enabled.
-        while m.group_size() != 8 {
+        // Cycle to size 1 (individual cores) and rebuild: 16 rows; cores 0-11 on,
+        // 12-15 off.
+        while m.group_size() != 1 {
            m.cycle_group_size();
        }
        let g = m.rebuild();
-        assert_eq!(g.len(), 1);
-        assert!(!g.group(0).enabled());
+        assert_eq!(g.len(), 16);
+        assert!(g.group(0).enabled()); // core 0
+        assert!(g.group(11).enabled()); // core 11
+        assert!(!g.group(12).enabled()); // core 12
+        assert!(!g.group(15).enabled()); // core 15

-        // Cycle to size 2 and rebuild: cores 0-3 are still tracked as enabled, so
-        // [0,1] and [2,3] come back on while [4,5] and [6,7] stay off — the choice
-        // survived two regroups (not derived from the all-off size-8 grouping).
+        // Cycle to size 2 and rebuild: cores 0-11 are still tracked as enabled, so
+        // [0,1]..[10,11] come back on while [12,13],[14,15] stay off — the choice
+        // survived two regroups.
        while m.group_size() != 2 {
            m.cycle_group_size();
        }
        let g = m.rebuild();
-        assert_eq!(g.len(), 4);
+        assert_eq!(g.len(), 8);
        assert!(g.group(0).enabled()); // [0,1]
-        assert!(g.group(1).enabled()); // [2,3]
-        assert!(!g.group(2).enabled()); // [4,5]
-        assert!(!g.group(3).enabled()); // [6,7]
+        assert!(g.group(5).enabled()); // [10,11]
+        assert!(!g.group(6).enabled()); // [12,13]
+        assert!(!g.group(7).enabled()); // [14,15]
        assert!(m.generation() >= 2);
    }

@@ -131,23 +131,31 @@ fn leaf_row(base: &State, idx: u32) -> Row {
 // histogram that has to be zeroed, scatter-filled, and cloned every round). Two
 // entries can collide only when their *whole* leading block matches, so they
 // always land in the same partition — making the partitions independent
-// collision domains that are processed in parallel across the rayon pool the
-// single CPU worker otherwise leaves idle.
+// collision domains processed in parallel across the rayon pool.
 //
-// The data layout follows xenoncat's packed slots and targets AVX2 throughput:
-//   * Entries live in fixed 32-byte `SLOT`s (one `__m256i`); producing a child
-//     is a single load/xor/permute/store (`xor_child_avx2`).
-//   * A dense parallel `keys[]` array mirrors each entry's leading block, so the
-//     `partition_top`/`count_pairs`/`low_group` histogram sweeps stream over
-//     4 bytes/entry instead of striding the slots.
+// The data layout targets AVX2 throughput; the round is memory-bandwidth bound,
+// so every choice below exists to cut DRAM traffic / random access:
+//   * `partition_top` is a parallel counting sort (per-chunk histograms ->
+//     per-chunk base offsets -> disjoint-region scatter), not a serial pass.
+//   * It also emits `keys_part` (the leading keys in partition order) so the
+//     per-partition `count_pairs`/`low_group` sweeps read keys sequentially
+//     instead of gathering `keys[order[..]]`; `low_group` likewise emits
+//     `keys_sorted` so the emit group walk streams a dense local copy.
+//   * Slots hold residual hash words in a *narrowing* packed slot: round 0 is
+//     `SLOT` words; each round consumes the leading block, so round r packs at
+//     `SLOT - r` words. The XOR child is one 256-bit load/xor/permute that
+//     masked-stores the live lanes (`xor_child_avx2`); buffers carry a
+//     `SLOT_SLACK` pad for the 256-bit over-read of a narrow tail slot.
 //   * `collide` runs two passes — count pairs, then emit children directly into
 //     one pre-sized arena at per-partition offsets — so there is no per-bucket
 //     allocation and no final concatenation copy; the `keys`/`slots` buffers
 //     ping-pong between rounds.
 //   * Round 0 repacks BLAKE2b digests into 24-bit blocks with `pshufb`.
-// A further bandwidth step (not yet taken) is xenoncat's full radix scatter of
-// the payload into bucket-contiguous storage, turning the inner-loop slot
-// gathers into sequential reads.
+// The remaining bottleneck is the random slot gather in `emit_bucket` over the
+// ~1 GiB round buffer — the Equihash memory-hardness floor. A full radix scatter
+// of the payload into bucket-contiguous storage (to make that gather sequential)
+// was evaluated and loses: the wide-slot scatter costs more than the gather it
+// saves on the early, wide rounds.
 // ---------------------------------------------------------------------------

 /// 24-bit collision blocks carried by a round-0 entry (= 8 for 192,7).
@@ -164,36 +172,53 @@ const LOW_BUCKETS: usize = 1 << LOW_BITS;
 /// Mask isolating the low resolved bits.
 const LOW_MASK: u32 = (LOW_BUCKETS - 1) as u32;

-/// Number of u32 words in a padded entry slot (32 bytes = one AVX2 register).
-/// Every round stores its residual hash words in a fixed 8-word slot so the XOR
-/// that produces a child is a single 256-bit load/xor/permute/store and every
-/// slot access is naturally aligned — xenoncat's packed-slot trick (and the same
-/// `uint4`-aligned-slot idea the CUDA backend already uses), on the CPU. The
-/// leading collision word lives in lane 0 and is mirrored into a dense parallel
-/// `keys[]` array so the histogram passes stream over 4 bytes/entry instead of
-/// striding the 32-byte slots.
+/// Number of u32 words in a round-0 entry slot (32 bytes = one AVX2 register).
+/// Round 0 stores all eight 24-bit blocks; the XOR that produces a child is a
+/// single 256-bit load/xor/permute — xenoncat's packed-slot trick (and the same
+/// `uint4`-aligned-slot idea the CUDA backend uses), on the CPU. The leading
+/// collision word lives in lane 0 and is mirrored into a dense parallel `keys[]`
+/// array so the histogram passes stream over 4 bytes/entry instead of striding
+/// the slots.
+///
+/// Later rounds use a *narrower* pitch: each collision round consumes the leading
+/// block, so a round-`r` output entry only carries `8 - r` meaningful words. The
+/// solver packs each round's slots at that width (see `collide`'s `w_out`),
+/// cutting the per-round slot-buffer DRAM traffic that bounds the collision rounds.
+/// The XOR producer still loads a full 256-bit register (over-reading up to
+/// `SLOT` words past a narrow tail slot, hence the `SLOT_SLACK` pad) but
+/// masked-stores only the `w_out` live lanes so packed neighbours aren't touched.
 const SLOT: usize = 8;

-/// Scalar child producer: `out[0..8] = (a XOR b)` rotated left one lane; returns
-/// the child's new leading word (lane 1 of the XOR). Lane 0 of the XOR is the
-/// just-collided block (always zero) and is rotated out.
+/// Trailing pad (in u32 words) on every slot buffer so the XOR producer's 256-bit
+/// load over a narrow tail slot stays in bounds. The over-read reaches at most
+/// `(n-1)*w + SLOT` words for pitch `w`, i.e. `SLOT - w ≤ SLOT` words past the
+/// `n*w` payload; `SLOT` words always covers it.
+const SLOT_SLACK: usize = SLOT;
+
+/// Scalar child producer: writes the `w_out` live words of `(a XOR b)` rotated
+/// left one lane into `out`, and returns the child's new leading word (lane 1 of
+/// the XOR). Lane 0 of the XOR is the just-collided block (always zero) and is
+/// rotated out. `a`/`b` are read a full `SLOT` words wide (the caller pads each
+/// slot buffer by `SLOT_SLACK`); only `out[0..w_out]` is written.
 #[inline]
-unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
+unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
    let mut x = [0u32; SLOT];
    for t in 0..SLOT {
        x[t] = *a.add(t) ^ *b.add(t);
    }
-    for t in 0..SLOT {
-        *out.add(t) = x[(t + 1) % SLOT];
+    // out[t] = x[(t + 1) % SLOT]; for t < w_out <= SLOT-1 the modulo is a no-op.
+    for t in 0..w_out {
+        *out.add(t) = x[t + 1];
    }
    x[1]
 }

 /// AVX2 child producer: one `vpxor` + one `vpermd` (rotate the 8 lanes left by
-/// one) + one store. Replaces the per-word scalar XOR loop.
+/// one), then a masked store of the low `w_out` lanes so the packed, `w_out`-
+/// pitched output never clobbers the next slot.
 #[cfg(target_arch = "x86_64")]
 #[target_feature(enable = "avx2")]
-unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
+unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
    use core::arch::x86_64::*;
    let x = _mm256_xor_si256(
        _mm256_loadu_si256(a as *const __m256i),
@@ -201,7 +226,12 @@ unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
    );
    // rotate left by one 32-bit lane: out[i] = x[(i + 1) % 8]
    let p = _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0));
-    _mm256_storeu_si256(out as *mut __m256i, p);
+    // mask = lanes [0, w_out) -> all-ones; maskstore writes only those.
+    let mask = _mm256_cmpgt_epi32(
+        _mm256_set1_epi32(w_out as i32),
+        _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
+    );
+    _mm256_maskstore_epi32(out as *mut i32, mask, p);
    _mm_cvtsi128_si32(_mm256_castsi256_si128(p)) as u32
 }

@@ -244,38 +274,113 @@ fn repack_index(src: &[u8], dst: &mut [u32]) {
    repack_index_scalar(src, dst);
 }

+/// A raw `*mut u32` shared across rayon workers for the parallel scatter below.
+/// Sound only because the workers write provably disjoint position sets.
+#[derive(Clone, Copy)]
+struct OrderPtr(*mut u32);
+unsafe impl Send for OrderPtr {}
+unsafe impl Sync for OrderPtr {}
+
 /// Partition the `n` entries into `TOP_BUCKETS` runs by the high `TOP_BITS` of
-/// their (dense) leading block. Returns `(starts, order)`, where partition `v`
-/// owns the input indices `order[starts[v]..starts[v + 1]]`. The histogram
-/// passes stream over `keys[]` (4 bytes/entry) instead of striding the slots.
-fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>) {
+/// their (dense) leading block. Returns `(starts, order, keys_part)`: partition
+/// `v` owns the input indices `order[starts[v]..starts[v + 1]]`, and `keys_part`
+/// is the entries' leading words in the same partition-contiguous order
+/// (`keys_part[p] == keys[order[p]]`). Carrying `keys_part` lets the per-partition
+/// `count_pairs`/`low_group` sweeps read keys sequentially instead of gathering
+/// `keys[order[..]]` over the whole array — at the cost of one extra 4-byte/entry
+/// scatter here, folded into the (already parallel) phase-3 pass.
+///
+/// Parallel counting sort: the input is split into one contiguous chunk per
+/// rayon worker. Each chunk histograms its slice (phase 1), a small serial pass
+/// turns those into per-chunk base offsets within each bucket's output region
+/// (phase 2), and each chunk scatters its entries into `order`/`keys_part`
+/// (phase 3). Chunk `c`'s bucket-`b` writes land in `[off[c][b], off[c+1][b])`,
+/// disjoint from every other chunk and bucket, so the concurrent writes never
+/// alias. Entries within a bucket end up chunk-major rather than index-major;
+/// that reordering is immaterial — `count_pairs`/`low_group` depend only on the
+/// multiset of low keys, and final solutions are canonicalised, de-duplicated,
+/// and verified.
+fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>, Vec<u32>) {
    let mut starts = vec![0u32; TOP_BUCKETS + 1];
-    for k in 0..n {
-        starts[(keys[k] >> LOW_BITS) as usize + 1] += 1;
-    }
-    for i in 0..TOP_BUCKETS {
-        starts[i + 1] += starts[i];
-    }
    let mut order = vec![0u32; n];
-    let mut cur = starts.clone(); // small: TOP_BUCKETS + 1 entries
-    for k in 0..n {
-        let b = (keys[k] >> LOW_BITS) as usize;
-        order[cur[b] as usize] = k as u32;
+    let mut keys_part = vec![0u32; n];
+    if n == 0 {
+        return (starts, order, keys_part);
+    }
+
+    let nthreads = rayon::current_num_threads().max(1);
+    let chunk = n.div_ceil(nthreads);
+    let nchunks = n.div_ceil(chunk);
+
+    // Phase 1 (parallel): per-chunk top-bucket histograms.
+    let local_hists: Vec<Vec<u32>> = (0..nchunks)
+        .into_par_iter()
+        .map(|c| {
+            let lo = c * chunk;
+            let hi = ((c + 1) * chunk).min(n);
+            let mut h = vec![0u32; TOP_BUCKETS];
+            for &key in &keys[lo..hi] {
+                h[(key >> LOW_BITS) as usize] += 1;
+            }
+            h
+        })
+        .collect();
+
+    // Phase 2 (serial, nchunks * TOP_BUCKETS work): bucket starts, then each
+    // chunk's per-bucket base offset within its bucket's output region.
+    let mut totals = vec![0u32; TOP_BUCKETS];
+    for h in &local_hists {
+        for b in 0..TOP_BUCKETS {
+            totals[b] += h[b];
+        }
+    }
+    for b in 0..TOP_BUCKETS {
+        starts[b + 1] = starts[b] + totals[b];
+    }
+    let mut offsets = vec![vec![0u32; TOP_BUCKETS]; nchunks];
+    let mut running = starts[..TOP_BUCKETS].to_vec(); // running[b] starts at starts[b]
+    for c in 0..nchunks {
+        for b in 0..TOP_BUCKETS {
+            offsets[c][b] = running[b];
+            running[b] += local_hists[c][b];
+        }
+    }
+
+    // Phase 3 (parallel): each chunk scatters into its disjoint sub-ranges.
+    let optr = OrderPtr(order.as_mut_ptr());
+    let kptr = OrderPtr(keys_part.as_mut_ptr());
+    offsets.into_par_iter().enumerate().for_each(|(c, mut cur)| {
+        let (optr, kptr) = (optr, kptr); // capture whole (Sync) wrappers
+        let lo = c * chunk;
+        let hi = ((c + 1) * chunk).min(n);
+        let (obase, kbase) = (optr.0, kptr.0);
+        for k in lo..hi {
+            let key = keys[k];
+            let b = (key >> LOW_BITS) as usize;
+            let pos = cur[b] as usize;
+            // SAFETY: `pos` ranges over `[off[c][b], off[c+1][b])`, a range owned
+            // exclusively by chunk `c` and within `order`/`keys_part` bounds.
+            unsafe {
+                *obase.add(pos) = k as u32;
+                *kbase.add(pos) = key;
+            }
            cur[b] += 1;
        }
-    (starts, order)
+    });
+
+    (starts, order, keys_part)
 }

 /// Count the colliding pairs a partition will emit, from the low-bit histogram
-/// alone (no reordering). `clamp` caps each exact-collision group, matching the
-/// emit pass so the output offsets line up. `hist` is reusable `LOW_BUCKETS`
-/// scratch.
-fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
+/// of its (partition-contiguous) leading keys `keys_run`. `clamp` caps each
+/// exact-collision group, matching the emit pass so the output offsets line up.
+/// `hist` is reusable `LOW_BUCKETS` scratch.
+fn count_pairs(keys_run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
    for h in hist.iter_mut() {
        *h = 0;
    }
-    for &k in run {
-        hist[(keys[k as usize] & LOW_MASK) as usize] += 1;
+    for &key in keys_run {
+        hist[(key & LOW_MASK) as usize] += 1;
    }
    let mut pairs = 0usize;
    for i in 0..LOW_BUCKETS {
@@ -285,31 +390,44 @@ fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usi
    pairs
 }

-/// Within one partition, group `run`'s entries by the low bits of their leading
-/// block, writing the grouped indices into `sorted`. `hist` is reusable
-/// `LOW_BUCKETS + 1` scratch. After this call `sorted` lists the run's indices
-/// with equal low keys contiguous, so callers recover each exact-collision
-/// group by walking adjacent equal keys.
-fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>) {
-    let m = run.len();
+/// Within one partition, group its entries by the low bits of their leading
+/// block. Inputs are the partition-contiguous slabs `keys_run` (leading keys) and
+/// `order_run` (matching global indices). Outputs, in low-key-sorted order:
+/// `sorted` (the global indices, for the emit slot gather + back-refs) and
+/// `keys_sorted` (the leading keys, so the emit group walk streams a dense local
+/// array instead of gathering `keys[sorted[i]]`). `hist` is reusable
+/// `LOW_BUCKETS + 1` scratch. Both reads are sequential over the slabs.
+fn low_group(
+    keys_run: &[u32],
+    order_run: &[u32],
+    hist: &mut [u32],
+    sorted: &mut Vec<u32>,
+    keys_sorted: &mut Vec<u32>,
+) {
+    let m = order_run.len();
    sorted.clear();
+    keys_sorted.clear();
    if m == 0 {
        return;
    }
    for h in hist.iter_mut() {
        *h = 0;
    }
-    for &k in run {
-        hist[(keys[k as usize] & LOW_MASK) as usize + 1] += 1;
+    for &key in keys_run {
+        hist[(key & LOW_MASK) as usize + 1] += 1;
    }
    for i in 0..LOW_BUCKETS {
        hist[i + 1] += hist[i];
    }
    sorted.resize(m, 0);
+    keys_sorted.resize(m, 0);
    // hist[low] now holds the run-start offset; reuse it as the live cursor.
-    for &k in run {
-        let low = (keys[k as usize] & LOW_MASK) as usize;
-        sorted[hist[low] as usize] = k;
+    for i in 0..m {
+        let key = keys_run[i];
+        let low = (key & LOW_MASK) as usize;
+        let pos = hist[low] as usize;
+        sorted[pos] = order_run[i];
+        keys_sorted[pos] = key;
        hist[low] += 1;
    }
 }
@@ -320,26 +438,42 @@ fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>)
 /// Monomorphised over the XOR producer so the AVX2 intrinsics inline cleanly
 /// inside a `target_feature` wrapper while sharing one source of truth.
 macro_rules! emit_bucket_body {
-    ($keys:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $xor:path) => {{
+    ($keys_sorted:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $w_in:expr, $w_out:expr, $prefetch:expr, $xor:path) => {{
        let s = $sorted;
+        let ks = $keys_sorted; // leading keys in `s`-order; group walk streams it
        let m = s.len();
+        let w_in = $w_in;
+        let w_out = $w_out;
        let mut w = 0usize;
        let mut i = 0;
        while i < m {
-            let key = $keys[s[i] as usize] & LOW_MASK;
+            let key = ks[i] & LOW_MASK;
            let mut j = i + 1;
-            while j < m && ($keys[s[j] as usize] & LOW_MASK) == key {
+            while j < m && (ks[j] & LOW_MASK) == key {
                j += 1;
            }
            let hi = j.min(i.saturating_add($clamp));
+            // The pair loops below read `slots[s[a]]` / `slots[s[b]]` at random
+            // global positions in the ~1 GB slot arena — each a likely cache
+            // miss. Prefetch this group's member slots up front so the misses
+            // overlap with the (L1-resident) pair XORs that follow.
+            #[cfg(target_arch = "x86_64")]
+            if $prefetch {
+                for a in i..hi {
+                    core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T0 }>(
+                        $slots.as_ptr().add(s[a] as usize * w_in) as *const i8,
+                    );
+                }
+            }
            for a in i..hi {
                let l = s[a] as usize;
                for b in (a + 1)..hi {
                    let mr = s[b] as usize;
                    let nk = $xor(
-                        $sout.as_mut_ptr().add(w * SLOT),
-                        $slots.as_ptr().add(l * SLOT),
-                        $slots.as_ptr().add(mr * SLOT),
+                        $sout.as_mut_ptr().add(w * w_out),
+                        $slots.as_ptr().add(l * w_in),
+                        $slots.as_ptr().add(mr * w_in),
+                        w_out,
                    );
                    $kout[w] = nk;
                    $pout[w] = ((l as u64) << 32) | mr as u64;
@@ -353,69 +487,101 @@ macro_rules! emit_bucket_body {
 }

 unsafe fn emit_bucket_scalar(
-    keys: &[u32],
+    keys_sorted: &[u32],
    slots: &[u32],
    sorted: &[u32],
    kout: &mut [u32],
    sout: &mut [u32],
    pout: &mut [u64],
    clamp: usize,
+    w_in: usize,
+    w_out: usize,
+    prefetch: bool,
 ) -> usize {
-    emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_scalar)
+    emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_scalar)
 }

 #[cfg(target_arch = "x86_64")]
 #[target_feature(enable = "avx2")]
 unsafe fn emit_bucket_avx2(
-    keys: &[u32],
+    keys_sorted: &[u32],
    slots: &[u32],
    sorted: &[u32],
    kout: &mut [u32],
    sout: &mut [u32],
    pout: &mut [u64],
    clamp: usize,
+    w_in: usize,
+    w_out: usize,
+    prefetch: bool,
 ) -> usize {
-    emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_avx2)
+    emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_avx2)
 }

 /// Emit a partition's children, dispatching to the AVX2 producer when available.
+/// `w_in`/`w_out` are the input/output slot pitches (`w_out == w_in - 1`).
+/// `prefetch` software-prefetches each group's randomly-gathered member slots.
 unsafe fn emit_bucket(
-    keys: &[u32],
+    keys_sorted: &[u32],
    slots: &[u32],
    sorted: &[u32],
    kout: &mut [u32],
    sout: &mut [u32],
    pout: &mut [u64],
    clamp: usize,
+    w_in: usize,
+    w_out: usize,
+    prefetch: bool,
 ) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
-            return emit_bucket_avx2(keys, slots, sorted, kout, sout, pout, clamp);
+            return emit_bucket_avx2(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch);
        }
    }
-    emit_bucket_scalar(keys, slots, sorted, kout, sout, pout, clamp)
+    emit_bucket_scalar(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch)
 }

 /// Group `n` entries by their leading block, then emit one child per colliding
-/// pair: the XOR of the residual blocks (rotated into a fresh 8-word slot) plus
-/// a packed `(l << 32) | mr` back-reference. Two passes — count, then emit
-/// directly into one pre-sized arena — so there is no per-partition allocation
-/// or final concatenation copy. Returns `(keys_out, slots_out, parents)`.
-fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
-    let (starts, order) = partition_top(keys, n);
+/// pair: the XOR of the residual blocks (rotated into a fresh slot) plus a packed
+/// `(l << 32) | mr` back-reference. Two passes — count, then emit directly into
+/// one pre-sized arena — so there is no per-partition allocation or final
+/// concatenation copy. Returns `(keys_out, slots_out, parents)`.
+///
+/// `w_in` is the input slot pitch; the round consumes one block, so the output is
+/// packed at `w_out = w_in - 1` words/slot. `slots_out` carries a `SLOT_SLACK`
+/// trailing pad so the next round's 256-bit over-read stays in bounds.
+fn collide(
+    keys: &[u32],
+    slots: &[u32],
+    n: usize,
+    clamp: usize,
+    w_in: usize,
+) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
+    let w_out = w_in - 1;
+    // Sub-phase timing, gated on `EQ_PROFILE`. Prints partition / count / alloc /
+    // emit splits so we can see which part of the round dominates.
+    let prof = std::env::var_os("EQ_PROFILE").is_some();
+    // Software-prefetch the emit gather (on by default; `EQ_NO_PREFETCH` disables
+    // it for A/B benchmarking). Read once, not per group.
+    let do_prefetch = std::env::var_os("EQ_NO_PREFETCH").is_none();
+    let t0 = std::time::Instant::now();

-    // Pass 1: per-partition child counts (histogram-derived, no reordering).
+    let (starts, order, keys_part) = partition_top(keys, n);
+    let t_part = std::time::Instant::now();
+
+    // Pass 1: per-partition child counts from each partition's contiguous keys.
    let counts: Vec<usize> = (0..TOP_BUCKETS)
        .into_par_iter()
        .map_init(
            || vec![0u32; LOW_BUCKETS],
            |hist, v| {
-                let run = &order[starts[v] as usize..starts[v + 1] as usize];
-                count_pairs(keys, run, hist, clamp)
+                let keys_run = &keys_part[starts[v] as usize..starts[v + 1] as usize];
+                count_pairs(keys_run, hist, clamp)
            },
        )
        .collect();
+    let t_count = std::time::Instant::now();

    let mut out_starts = vec![0usize; TOP_BUCKETS + 1];
    for v in 0..TOP_BUCKETS {
@@ -423,7 +589,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
    }
    let total = out_starts[TOP_BUCKETS];
    let mut keys_out = vec![0u32; total];
-    let mut slots_out = vec![0u32; total * SLOT];
+    // Packed at `w_out` words/slot, plus a slack pad for the next round's over-read.
+    let mut slots_out = vec![0u32; total * w_out + SLOT_SLACK];
    let mut parents = vec![0u64; total];

    // Carve the output arena into disjoint per-partition sub-slices so workers
@@ -435,7 +602,7 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
        let (mut kr, mut sr, mut pr) = (&mut keys_out[..], &mut slots_out[..], &mut parents[..]);
        for &c in &counts {
            let (kh, kt) = kr.split_at_mut(c);
-            let (sh, st) = sr.split_at_mut(c * SLOT);
+            let (sh, st) = sr.split_at_mut(c * w_out);
            let (ph, pt) = pr.split_at_mut(c);
            kparts.push(kh);
            sparts.push(sh);
@@ -446,6 +613,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
        }
    }

+    let t_alloc = std::time::Instant::now();
+
    // Pass 2: group each partition and emit its colliding pairs in place.
    kparts
        .into_par_iter()
@@ -453,37 +622,53 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
        .zip(pparts)
        .enumerate()
        .for_each(|(v, ((kout, sout), pout))| {
-            let run = &order[starts[v] as usize..starts[v + 1] as usize];
+            let lo = starts[v] as usize;
+            let hi = starts[v + 1] as usize;
            let mut hist = vec![0u32; LOW_BUCKETS + 1];
            let mut sorted = Vec::new();
-            low_group(keys, run, &mut hist, &mut sorted);
-            let w = unsafe { emit_bucket(keys, slots, &sorted, kout, sout, pout, clamp) };
+            let mut keys_sorted = Vec::new();
+            low_group(&keys_part[lo..hi], &order[lo..hi], &mut hist, &mut sorted, &mut keys_sorted);
+            let w = unsafe {
+                emit_bucket(&keys_sorted, slots, &sorted, kout, sout, pout, clamp, w_in, w_out, do_prefetch)
+            };
            debug_assert_eq!(w, kout.len());
        });

+    if prof {
+        let ms = |a: std::time::Instant, b: std::time::Instant| (b - a).as_secs_f64() * 1000.0;
+        eprintln!(
+            "    [collide n={n}] partition {:6.1}  count {:6.1}  alloc {:6.1}  emit(group+xor) {:6.1} ms",
+            ms(t0, t_part),
+            ms(t_part, t_count),
+            ms(t_count, t_alloc),
+            ms(t_alloc, std::time::Instant::now()),
+        );
+    }
+
    (keys_out, slots_out, parents)
 }

-/// Final round (slots hold `[w0, w1, …]`): among entries sharing leading block
-/// `w0`, a pair whose `w1` also matches XORs the last two blocks to zero — a
-/// candidate. Returns the `(l, mr)` parents of each candidate.
-fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u32, u32)> {
-    let (starts, order) = partition_top(keys, n);
+/// Final round (slots hold `[w0, w1, …]` at pitch `w_in`): among entries sharing
+/// leading block `w0`, a pair whose `w1` also matches XORs the last two blocks to
+/// zero — a candidate. Returns the `(l, mr)` parents of each candidate.
+fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize, w_in: usize) -> Vec<(u32, u32)> {
+    let (starts, order, keys_part) = partition_top(keys, n);

    (0..TOP_BUCKETS)
        .into_par_iter()
        .map_init(
-            || (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new()),
-            |(hist, sorted), v| {
-                let run = &order[starts[v] as usize..starts[v + 1] as usize];
-                low_group(keys, run, hist, sorted);
+            || (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new(), Vec::<u32>::new()),
+            |(hist, sorted, keys_sorted), v| {
+                let lo = starts[v] as usize;
+                let hi = starts[v + 1] as usize;
+                low_group(&keys_part[lo..hi], &order[lo..hi], hist, sorted, keys_sorted);
                let m = sorted.len();
                let mut local = Vec::new();
                let mut i = 0;
                while i < m {
-                    let key = keys[sorted[i] as usize] & LOW_MASK;
+                    let key = keys_sorted[i] & LOW_MASK;
                    let mut j = i + 1;
-                    while j < m && (keys[sorted[j] as usize] & LOW_MASK) == key {
+                    while j < m && (keys_sorted[j] & LOW_MASK) == key {
                        j += 1;
                    }
                    let hi = j.min(i.saturating_add(clamp));
@@ -491,7 +676,7 @@ fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u3
                        let l = sorted[a] as usize;
                        for b in (a + 1)..hi {
                            let mr = sorted[b] as usize;
-                            if slots[l * SLOT + 1] == slots[mr * SLOT + 1] {
+                            if slots[l * w_in + 1] == slots[mr * w_in + 1] {
                                local.push((l as u32, mr as u32));
                            }
                        }
@@ -532,6 +717,22 @@ pub fn solve(header: &[u8]) -> Vec<Vec<u32>> {
 pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
    let clamp = clamp.unwrap_or(usize::MAX);

+    // Optional per-phase timing, gated on `EQ_PROFILE` (any value). Zero cost
+    // when unset — the only overhead is one env lookup. Each `phase(label, n)`
+    // prints the wall time since the previous call and the live entry count.
+    let prof = std::env::var_os("EQ_PROFILE").is_some();
+    let mut t_last = std::time::Instant::now();
+    let mut phase = |label: &str, n: usize| {
+        if prof {
+            let now = std::time::Instant::now();
+            eprintln!(
+                "  [profile] {label:<13} {:8.1} ms  (n={n})",
+                (now - t_last).as_secs_f64() * 1000.0
+            );
+            t_last = now;
+        }
+    };
+
    // ---- round 0: hash every index into NBLK0 big-endian 24-bit blocks, stored
    // in padded 8-word slots with the leading block mirrored into `keys`. Entry
    // k corresponds to leaf index k (the gen order), so no leaf table is needed.
@@ -541,18 +742,29 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
    let hasher = BatchHasher::new(header);
    let mut keys: Vec<u32> = vec![0u32; n0];
    let mut slots: Vec<u32> = vec![0u32; n0 * SLOT];
-    let kgroup = 4 * INDICES_PER_HASH_OUTPUT; // eight entries
+    // Hash eight `g` values per chunk via `hash8`, which uses one AVX-512
+    // compression where available and falls back to two AVX2 `hash4`s (or scalar)
+    // otherwise — so this single path covers every CPU.
+    let kgroup = 8 * INDICES_PER_HASH_OUTPUT; // sixteen entries
    let sgroup = kgroup * SLOT;
-    debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 4-g groups");
+    debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 8-g groups");
+    // `EQ_NO_AVX512` forces the AVX2 fallback (two hash4) for A/B benchmarking.
+    let use_avx512 = std::env::var_os("EQ_NO_AVX512").is_none();
    slots
        .par_chunks_mut(sgroup)
        .zip(keys.par_chunks_mut(kgroup))
        .enumerate()
        .for_each(|(c, (schunk, kchunk))| {
-            let g0 = (c * 4) as u32;
-            let mut hs = [[0u8; HASH_OUTPUT]; 4];
-            hasher.hash4(g0, &mut hs);
-            for j in 0..4 {
+            let g0 = (c * 8) as u32;
+            let mut hs = [[0u8; HASH_OUTPUT]; 8];
+            if use_avx512 {
+                hasher.hash8(g0, &mut hs);
+            } else {
+                let (lo, hi) = hs.split_at_mut(4);
+                hasher.hash4(g0, (&mut lo[..4]).try_into().unwrap());
+                hasher.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
+            }
+            for j in 0..8 {
                for i in 0..INDICES_PER_HASH_OUTPUT {
                    let e = j * INDICES_PER_HASH_OUTPUT + i;
                    let src = &hs[j][i * HASH_BYTES..i * HASH_BYTES + HASH_BYTES];
@@ -567,20 +779,27 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
    // parents[t] (t = 0..K-2) maps a round-(t+1) entry to its two round-t parents.
    // `keys`/`slots` ping-pong between rounds (the previous buffers are freed as
    // the new ones replace them).
+    phase("round0-hash", n0);
    let mut parents: Vec<Vec<u64>> = Vec::with_capacity(K - 1);
    let mut n = n0;
-    for _ in 0..(K - 1) {
-        let (ok, os, op) = collide(&keys, &slots, n, clamp);
+    // Round-0 slots carry all SLOT words; each collision round consumes the
+    // leading block, so the next round's slots are one word narrower.
+    let mut width = SLOT;
+    for r in 0..(K - 1) {
+        let (ok, os, op) = collide(&keys, &slots, n, clamp, width);
        n = op.len();
        parents.push(op);
        keys = ok;
        slots = os;
+        width -= 1;
+        phase(&format!("collide r{}", r + 1), n);
        if n == 0 {
            return Vec::new();
        }
    }

-    let candidates = collide_final(&keys, &slots, n, clamp);
+    let candidates = collide_final(&keys, &slots, n, clamp, width);
+    phase("collide-final", candidates.len());
    if candidates.is_empty() {
        return Vec::new();
    }
@@ -605,8 +824,11 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
        // refs are now round-0 indices == leaf indices.
        recovered.extend_from_slice(&refs);
    }
+    phase("recover", recovered.len() / SOLUTION_INDICES);

-    filter_candidates(&base_state(header), &recovered)
+    let result = filter_candidates(&base_state(header), &recovered);
+    phase("filter+verify", result.len());
+    result
 }

 /// Re-order recovered leaf indices into the canonical solution ordering: at
@@ -741,26 +963,34 @@ mod tests {
        assert_eq!(a[0], (src[0] as u32) << 16 | (src[1] as u32) << 8 | src[2] as u32);
    }

-    // AVX2 XOR-child (xor + rotate-left-one-lane) must match the scalar version.
+    // The XOR-child producers (scalar + AVX2) must agree for every output width,
+    // write exactly `w_out` words (rotate-left-one of the lane-wise XOR), and
+    // leave the rest of the packed buffer untouched (the AVX2 path masked-stores).
    #[test]
    fn xor_child_matches_scalar() {
+        const SENT: u32 = 0xDEAD_BEEF;
        let pa: [u32; SLOT] = [9, 8, 7, 6, 5, 4, 3, 2];
        let pb: [u32; SLOT] = [1, 2, 3, 4, 5, 6, 7, 8];
-        let mut o1 = [0u32; SLOT];
-        let mut o2 = [0u32; SLOT];
-        let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
-        // Reference: rotate-left-one of the lane-wise XOR; new key = lane 1.
        let x: Vec<u32> = (0..SLOT).map(|i| pa[i] ^ pb[i]).collect();
-        let expect: Vec<u32> = (0..SLOT).map(|i| x[(i + 1) % SLOT]).collect();
-        assert_eq!(&o1[..], &expect[..]);
+        for w_out in 1..SLOT {
+            let mut o1 = [SENT; SLOT];
+            let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
+            for t in 0..w_out {
+                assert_eq!(o1[t], x[(t + 1) % SLOT], "scalar word {t} (w_out={w_out})");
+            }
+            for t in w_out..SLOT {
+                assert_eq!(o1[t], SENT, "scalar wrote past w_out at {t} (w_out={w_out})");
+            }
            assert_eq!(k1, x[1]);
            #[cfg(target_arch = "x86_64")]
            if is_x86_feature_detected!("avx2") {
-            let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
-            assert_eq!(o1, o2, "avx2 xor_child != scalar");
+                let mut o2 = [SENT; SLOT];
+                let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
+                assert_eq!(o1, o2, "avx2 xor_child != scalar (w_out={w_out})");
                assert_eq!(k1, k2);
            }
        }
+    }

    // Clamped solve over a fixed header must find valid solutions (Equihash
    // 192,7 yields ~2 per nonce). The clamp bounds the naive algorithm's
@@ -780,4 +1010,97 @@ mod tests {
            }
        }
    }
+
+    /// A deterministic dense 140-byte header for benchmarking, filled by an
+    /// xorshift64* stream keyed on `seed` (the same scheme as `main::pseudo_header`)
+    /// so each nonce exercises a realistic, fully-populated header rather than the
+    /// degenerate all-`0x42` one.
+    #[cfg(test)]
+    fn bench_header(seed: u64) -> Vec<u8> {
+        let mut header = vec![0u8; HEADER_LEN];
+        let mut x = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
+        for b in header.iter_mut() {
+            x ^= x >> 12;
+            x ^= x << 25;
+            x ^= x >> 27;
+            *b = (x.wrapping_mul(0x2545F4914F6CDD1D) >> 33) as u8;
+        }
+        header
+    }
+
+    /// Steady-state throughput baseline for the CPU solver. Ignored by default;
+    /// run it in release (the only meaningful configuration — `debug_assert`s in
+    /// the collision hot path are off and `opt-level=3`/LTO is on) with output
+    /// visible:
+    ///
+    /// ```text
+    /// cargo test --release full_solve_baseline -- --ignored --nocapture
+    /// ```
+    ///
+    /// Tunables via env:
+    ///   * `EQ_BENCH_ITERS`  — timed solves per clamp (default 8).
+    ///   * `EQ_BENCH_CLAMPS` — comma-separated clamp sweep (default `32`, the
+    ///     value the miner and `--selftest` use).
+    ///
+    /// Each timed iteration uses a distinct nonce so the figures average over
+    /// per-header variance; one warm-up solve per clamp (pages in the ~GB of
+    /// round buffers, warms caches) is excluded. This measures one solve at a
+    /// time across the whole rayon pool — i.e. solver latency and its memory
+    /// behaviour, which is exactly what the planned radix-scatter change targets
+    /// — not the aggregate multi-group mining rate `--bench` reports.
+    #[test]
+    #[ignore]
+    fn full_solve_baseline() {
+        use std::time::Instant;
+
+        let iters: usize = std::env::var("EQ_BENCH_ITERS")
+            .ok()
+            .and_then(|v| v.parse().ok())
+            .unwrap_or(8);
+        let clamps: Vec<usize> = std::env::var("EQ_BENCH_CLAMPS")
+            .ok()
+            .map(|v| v.split(',').filter_map(|s| s.trim().parse().ok()).collect())
+            .filter(|v: &Vec<usize>| !v.is_empty())
+            .unwrap_or_else(|| vec![32]);
+
+        eprintln!(
+            "CPU solver baseline — {iters} timed solve(s)/clamp, clamps={clamps:?}, \
+             rayon threads={}",
+            rayon::current_num_threads()
+        );
+
+        for &clamp in &clamps {
+            // Warm-up (excluded): fault in buffers, warm caches.
+            let _ = solve_with(&bench_header(0), Some(clamp));
+
+            let mut times = Vec::with_capacity(iters);
+            let mut total_sols = 0usize;
+            for i in 0..iters {
+                let header = bench_header(i as u64 + 1);
+                let base = base_state(&header);
+                let t = Instant::now();
+                let sols = solve_with(&header, Some(clamp));
+                times.push(t.elapsed().as_secs_f64());
+                for s in &sols {
+                    assert!(is_valid_solution(&base, s), "solver returned an invalid solution");
+                }
+                total_sols += sols.len();
+            }
+
+            times.sort_by(|a, b| a.partial_cmp(b).unwrap());
+            let n = times.len();
+            let sum: f64 = times.iter().sum();
+            let mean = 1000.0 * sum / n as f64;
+            let median = 1000.0 * times[n / 2];
+            let min = 1000.0 * times[0];
+            let max = 1000.0 * times[n - 1];
+            let solves_per_s = n as f64 / sum;
+            eprintln!(
+                "clamp={clamp:>3}: {solves_per_s:6.2} solve/s | mean {mean:6.0} ms  \
+                 median {median:6.0} ms  min {min:6.0} ms  max {max:6.0} ms | \
+                 {total_sols} sol over {n} solve(s) ({:.2}/solve)",
+                total_sols as f64 / n as f64,
+            );
+        }
+    }
 }
@@ -411,6 +411,9 @@ impl LegacySolver {
 /// (`equihash.cl`) everywhere else. Forceable with `ZCL_OPENCL_KERNEL=amd|legacy`.
 pub struct GpuSolver {
    inner: SolverInner,
+    /// Per-card telemetry handle (AMD amdgpu sysfs / NVML), matched to the device
+    /// by PCI bus id. `None` when no telemetry backend matches.
+    tuner: Option<Box<dyn crate::gpu_tune::GpuTuner>>,
 }

 enum SolverInner {
@@ -423,13 +426,16 @@ impl GpuSolver {
    /// device vendor (AMD → `equihash192_7.cl`).
    pub fn new(device_index: usize) -> Result<Self> {
        let (platform, device) = pick_device(device_index)?;
+        // Resolve a telemetry handle (AMD sysfs / NVML) from the device's PCI bus
+        // before `device` is consumed by the inner solver.
+        let tuner = device_pci_bus_id(&device).and_then(|bus| crate::gpu_tune::open(&bus));
        let inner = if use_amd_kernel(&device) {
            log::info!("OpenCL: AMD device — using the equihash192_7 kernel");
            SolverInner::Amd(crate::gpu_amd::AmdSolver::new(platform, device)?)
        } else {
            SolverInner::Legacy(LegacySolver::new(platform, device)?)
        };
-        Ok(Self { inner })
+        Ok(Self { inner, tuner })
    }

    /// This device's product name, if available.
@@ -440,6 +446,34 @@ impl GpuSolver {
        }
    }

+    /// Current board power draw in watts (telemetry handle), if available.
+    pub fn power_watts(&self) -> Option<f64> {
+        self.tuner.as_ref().and_then(|t| t.watts())
+    }
+
+    /// Current GPU temperature in °C, if available.
+    pub fn temperature_c(&self) -> Option<u32> {
+        self.tuner.as_ref().and_then(|t| t.temperature_c())
+    }
+
+    /// Currently enforced power limit in watts, if available.
+    pub fn current_power_limit_w(&self) -> Option<u32> {
+        self.tuner.as_ref().and_then(|t| t.current_power_limit_w())
+    }
+
+    /// (min, max) settable power limit in watts, if available.
+    pub fn power_limit_range_w(&self) -> Option<(u32, u32)> {
+        self.tuner.as_ref().and_then(|t| t.power_limit_range_w())
+    }
+
+    /// Current (core, memory) clocks in MHz, each `None` if unavailable.
+    pub fn current_clocks_mhz(&self) -> (Option<u32>, Option<u32>) {
+        match &self.tuner {
+            Some(t) => (t.core_clock_mhz(), t.mem_clock_mhz()),
+            None => (None, None),
+        }
+    }
+
    /// Solve the puzzle for `header` (140 bytes).
    pub fn solve(&self, header: &[u8]) -> Result<Vec<Vec<u32>>> {
        match &self.inner {
@@ -475,11 +509,32 @@ impl GpuSolver {
    pub fn hash_all(&self, header: &[u8]) -> Result<Vec<u8>> {
        match &self.inner {
            SolverInner::Legacy(s) => s.hash_all(header),
-            SolverInner::Amd(_) => {
-                Err(anyhow!("hash_all is not supported by the AMD kernel"))
+            SolverInner::Amd(_) => Err(anyhow!("hash_all is not supported by the AMD kernel")),
        }
    }
 }
+
+/// The device's PCI address as `"DDDD:BB:DD.F"` (lowercase), for matching the
+/// physical card to a telemetry backend. Prefers `cl_khr_pci_bus_info`; falls
+/// back to `cl_device_topology_amd`. `None` if the device exposes neither.
+fn device_pci_bus_id(device: &ocl::Device) -> Option<String> {
+    const CL_DEVICE_PCI_BUS_INFO_KHR: u32 = 0x10F2;
+    const CL_DEVICE_TOPOLOGY_AMD: u32 = 0x4037;
+    // cl_device_pci_bus_info_khr = { u32 pci_domain, pci_bus, pci_device, pci_function }.
+    if let Ok(b) = device.info_raw(CL_DEVICE_PCI_BUS_INFO_KHR) {
+        if b.len() >= 16 {
+            let rd = |i: usize| u32::from_ne_bytes([b[i], b[i + 1], b[i + 2], b[i + 3]]);
+            return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", rd(0), rd(4), rd(8), rd(12)));
+        }
+    }
+    // cl_device_topology_amd (PCIE branch): 24-byte struct, bus/device/function
+    // are the last three bytes; domain isn't exposed (assume 0000).
+    if let Ok(b) = device.info_raw(CL_DEVICE_TOPOLOGY_AMD) {
+        if b.len() >= 24 {
+            return Some(format!("0000:{:02x}:{:02x}.{:x}", b[21], b[22], b[23]));
+        }
+    }
+    None
 }

 /// Decide whether to drive `device` with the AMD `equihash192_7.cl` kernel.
@@ -501,19 +556,81 @@ fn use_amd_kernel(device: &ocl::Device) -> bool {
    }
 }

+/// A cross-platform key identifying the physical GPU, as a canonical PCI address
+/// `"DDDD:BB:DD.F"`. Vendor runtimes (ROCm/NVIDIA) expose `cl_khr_pci_bus_info`;
+/// rusticl/Mesa doesn't, but its `cl_khr_device_uuid` *encodes* the PCI address
+/// ({u32 domain LE, u8 bus, u8 device, u8 function, ...}), so the same physical
+/// card yields the same key on both platforms. `None` if neither is available
+/// (then the device is never deduped — safe).
+fn device_dedup_key(device: &ocl::Device) -> Option<String> {
+    if let Some(pci) = device_pci_bus_id(device) {
+        return Some(pci);
+    }
+    const CL_DEVICE_UUID_KHR: u32 = 0x106A;
+    if let Ok(b) = device.info_raw(CL_DEVICE_UUID_KHR) {
+        if b.len() >= 7 {
+            let domain = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
+            return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", domain, b[4], b[5], b[6]));
+        }
+    }
+    None
+}
+
+/// Lower = preferred when the same physical GPU is exposed by multiple OpenCL
+/// platforms. De-prioritise the Mesa Gallium drivers (rusticl/clover) relative to
+/// the vendor runtimes (ROCm / NVIDIA / Intel), which are faster and complete.
+fn platform_rank(p: &ocl::Platform) -> u8 {
+    let name = p.name().unwrap_or_default().to_ascii_lowercase();
+    if name.contains("rusticl") || name.contains("clover") || name.contains("mesa") {
+        1
+    } else {
+        0
+    }
+}
+
+/// All usable OpenCL `(platform, device)` pairs in a stable flat order, with each
+/// physical GPU de-duplicated across platforms by PCI bus id — a card exposed by
+/// both ROCm and rusticl appears once (the vendor runtime wins over Mesa), so
+/// mining "all" doesn't run the same card twice. This is the single source of
+/// truth for the flat device index used by `--devices`, `--list-devices`, and
+/// [`pick_device`]. Devices without a PCI bus id (CPU / PoCL) are never deduped.
+fn enumerate_devices() -> Vec<(ocl::Platform, ocl::Device)> {
+    use ocl::{Device, Platform};
+    let mut out: Vec<(Platform, Device)> = Vec::new();
+    let mut by_pci: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
+    for platform in Platform::list() {
+        for device in Device::list_all(platform).unwrap_or_default() {
+            match device_dedup_key(&device) {
+                Some(pci) => match by_pci.get(&pci).copied() {
+                    // Same physical GPU already listed: keep the preferred platform.
+                    Some(existing) => {
+                        if platform_rank(&platform) < platform_rank(&out[existing].0) {
+                            out[existing] = (platform, device);
+                        }
+                    }
+                    None => {
+                        by_pci.insert(pci, out.len());
+                        out.push((platform, device));
+                    }
+                },
+                None => out.push((platform, device)), // no PCI id → can't dedup
+            }
+        }
+    }
+    out
+}
+
 /// List `(platform, device)` names so the user can choose `--device`.
 pub fn list_devices() -> Result<Vec<String>> {
-    use ocl::{Device, Platform};
-    let mut names = Vec::new();
-    let mut idx = 0;
-    for platform in Platform::list() {
+    let names = enumerate_devices()
+        .into_iter()
+        .enumerate()
+        .map(|(idx, (platform, device))| {
            let pname = platform.name().unwrap_or_else(|_| "?".into());
-        for device in Device::list_all(platform).unwrap_or_default() {
            let dname = device.name().unwrap_or_else(|_| "?".into());
-            names.push(format!("[{idx}] {pname} / {dname}"));
-            idx += 1;
-        }
-    }
+            format!("[{idx}] {pname} / {dname}")
+        })
+        .collect();
    Ok(names)
 }

@@ -522,18 +639,15 @@ pub fn list_devices() -> Result<Vec<String>> {
 /// hand NVIDIA cards to CUDA (and mine only the non-NVIDIA OpenCL devices).
 pub fn device_is_nvidia() -> Vec<bool> {
    use ocl::enums::{DeviceInfo, DeviceInfoResult};
-    use ocl::{Device, Platform};
-    let mut out = Vec::new();
-    for platform in Platform::list() {
-        for device in Device::list_all(platform).unwrap_or_default() {
-            let is_nv = matches!(
+    enumerate_devices()
+        .into_iter()
+        .map(|(_, device)| {
+            matches!(
                device.info(DeviceInfo::Vendor),
                Ok(DeviceInfoResult::Vendor(v)) if v.to_ascii_lowercase().contains("nvidia")
-            );
-            out.push(is_nv);
-        }
-    }
-    out
+            )
+        })
+        .collect()
 }

 /// The flat OpenCL device index of the first CPU-type device (e.g. PoCL), if any.
@@ -541,36 +655,20 @@ pub fn device_is_nvidia() -> Vec<bool> {
 /// [`list_devices`] / `--devices`.
 pub fn cpu_device_index() -> Option<usize> {
    use ocl::enums::{DeviceInfo, DeviceInfoResult};
-    use ocl::{Device, Platform};
-    let mut idx = 0;
-    for platform in Platform::list() {
-        for device in Device::list_all(platform).unwrap_or_default() {
-            let is_cpu = matches!(
+    enumerate_devices().into_iter().position(|(_, device)| {
+        matches!(
            device.info(DeviceInfo::Type).ok(),
            Some(DeviceInfoResult::Type(t)) if t.contains(ocl::flags::DeviceType::CPU)
-            );
-            if is_cpu {
-                return Some(idx);
-            }
-            idx += 1;
-        }
-    }
-    None
+        )
+    })
 }

-/// Resolve a flat device index across all platforms, returning the device along
-/// with the platform it belongs to (needed to build the context against the
-/// right platform).
-fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
-    use ocl::{Device, Platform};
-    let mut idx = 0;
-    for platform in Platform::list() {
-        for device in Device::list_all(platform).unwrap_or_default() {
-            if idx == index {
-                return Ok((platform, device));
-            }
-            idx += 1;
-        }
-    }
-    Err(anyhow!("no OpenCL device with index {index}"))
+/// Resolve a flat device index (into the de-duplicated [`enumerate_devices`]
+/// list), returning the device along with the platform it belongs to (needed to
+/// build the context against the right platform).
+pub(crate) fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
+    enumerate_devices()
+        .into_iter()
+        .nth(index)
+        .ok_or_else(|| anyhow!("no OpenCL device with index {index}"))
 }
@@ -128,19 +128,25 @@ pub trait GpuTuner: Send {
 /// Open a control handle for the GPU at `pci_bus_id` (matches the physical card
 /// regardless of CUDA-vs-driver index ordering). `None` if unavailable.
 ///
-/// NVML is the backend on both Linux (`libnvidia-ml`) and Windows (`nvml.dll`);
-/// the C API is identical, so the same [`crate::nvml`] code serves both.
+/// Tries the NVIDIA backend first (NVML, `libnvidia-ml`/`nvml.dll`), then the AMD
+/// backend ([`crate::amd_smi`], Linux amdgpu sysfs). A non-matching bus id makes
+/// each backend return `None`, so probing both is safe on mixed-vendor hosts.
 pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
-    #[cfg(any(unix, windows))]
+    #[cfg(feature = "cuda")]
    {
-        crate::nvml::open(pci_bus_id)
+        if let Some(t) = crate::nvml::open(pci_bus_id) {
+            return Some(t);
        }
-    #[cfg(not(any(unix, windows)))]
+    }
+    #[cfg(feature = "gpu")]
    {
+        if let Some(t) = crate::amd_smi::open(pci_bus_id) {
+            return Some(t);
+        }
+    }
    let _ = pci_bus_id;
    None
 }
-}

 static WARNED_PRIVS: AtomicBool = AtomicBool::new(false);

@@ -18,6 +18,10 @@ mod gpu;
 #[cfg(feature = "gpu")]
 mod gpu_amd;

+// AMD GPU telemetry via Linux amdgpu sysfs (a `gpu_tune::GpuTuner` backend).
+#[cfg(feature = "gpu")]
+mod amd_smi;
+
 // Runtime dynamic-library loader (dlopen) for the CUDA driver + NVML.
 #[cfg(feature = "cuda")]
 mod dylib;
@@ -28,7 +32,9 @@ mod cuda;
 #[cfg(feature = "cuda")]
 mod nvml;

-#[cfg(feature = "cuda")]
+// Platform-agnostic GPU tuning/telemetry surface. The trait + policy compile for
+// either GPU backend; NVML (cuda) and amd_smi (gpu) are the implementations.
+#[cfg(any(feature = "cuda", feature = "gpu"))]
 mod gpu_tune;

 use std::io::IsTerminal;
@@ -125,11 +131,12 @@ struct Args {
    #[arg(long, value_name = "SPEC")]
    cpu_cores: Option<String>,

-    /// Cores per CPU mining row (default 4). Each row runs one shared solve
-    /// across its cores; larger groups cut memory sharply: total RAM is ~4 GB ×
-    /// (enabled cores / this size). Use 1 for one row (and one solve) per core.
-    /// Rows are aligned to core-index blocks of this size, so a row never
-    /// straddles a boundary. Cycle it live in the dashboard with 'g'.
+    /// Cores per CPU mining row. Each row runs one shared solve across its
+    /// cores; larger groups cut memory sharply: total RAM is ~4 GB × (enabled
+    /// cores / this size). Rows align to core-index blocks of this size. Capped
+    /// by core count so the row count stays manageable — ≤4 cores toggle
+    /// individually (1), 5-8 cores in groups of ≤2, more than 8 in groups of ≤4
+    /// — and the default is that cap. Cycle it live (within the cap) with 'g'.
    #[arg(long, value_name = "N", default_value_t = 4)]
    cpu_group_size: usize,

@@ -203,6 +210,13 @@ struct Args {
    #[arg(long)]
    auto_tune: bool,

+    /// Sustained-Sol/s governor: hold each GPU at/below this edge temperature (°C)
+    /// by pacing the solve cadence (no hardware writes, no root). Trades a little
+    /// throughput for lower temp/power; off by default (runs flat-out). Needs a
+    /// backend that reports temperature (AMD amdgpu / NVIDIA).
+    #[arg(long, value_name = "CELSIUS")]
+    target_temp: Option<u32>,
+
    /// Efficiency: cap each GPU's power limit in watts (default: card max).
    /// Lower power trades a little hashrate for much better Sol/W.
    #[arg(long, value_name = "WATTS")]
@@ -619,6 +633,8 @@ fn main() -> Result<()> {
        args.power_limit.unwrap_or(0),
        args.unlock_controls,
    );
+    // Software temp governor target (paces solve cadence; no hardware writes).
+    miner::set_target_temp(args.target_temp);
    miner::run(client, specs, running, job_timeout, tui, format!("{host}:{port}"), controls, cpu_mining, cpu_clamp, args.control_port)
 }

@@ -1076,10 +1092,21 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
    use std::time::Instant;
    info!("benchmarking {runs} solve(s) per worker across {} worker(s)", specs.len());

+    /// Per-worker benchmark result, including a steady-state telemetry snapshot
+    /// (sampled right after the timed loop, while the card is warm).
+    struct WorkerResult {
+        sols: usize,
+        dt: f64,
+        watts: Option<f64>,
+        temp_c: Option<u32>,
+        core_mhz: Option<u32>,
+        mem_mhz: Option<u32>,
+    }
+
    let start = Instant::now();
    let mut handles = Vec::new();
    for (id, spec) in specs.into_iter().enumerate() {
-        handles.push(std::thread::spawn(move || -> Result<(usize, f64)> {
+        handles.push(std::thread::spawn(move || -> Result<WorkerResult> {
            let backend = spec.build()?;
            backend.solve(&pseudo_header(id as u64))?; // warm up (excluded)
            let t = Instant::now();
@@ -1089,7 +1116,17 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
                let seed = ((id as u64) << 40) | (i as u64 + 1);
                sols += backend.solve(&pseudo_header(seed))?.len();
            }
-            Ok((sols, t.elapsed().as_secs_f64()))
+            let dt = t.elapsed().as_secs_f64();
+            // Snapshot telemetry while the card is still under load.
+            let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
+            Ok(WorkerResult {
+                sols,
+                dt,
+                watts: backend.power_watts(),
+                temp_c: backend.temperature_c(),
+                core_mhz,
+                mem_mhz,
+            })
        }));
    }

@@ -1098,11 +1135,26 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
    let mut workers = 0usize;
    for h in handles {
        match h.join().unwrap() {
-            Ok((sols, dt)) => {
-                let sol_s = sols as f64 / dt;
+            Ok(r) => {
+                let sol_s = r.sols as f64 / r.dt;
+                // Optional telemetry tail: " | 142 W, 41.7 Sol/W, 68°C, 2700/2500 MHz".
+                let mut tail = String::new();
+                if let Some(w) = r.watts {
+                    tail.push_str(&format!(" | {w:.0} W"));
+                    if w > 0.0 {
+                        tail.push_str(&format!(", {:.2} Sol/W", sol_s / w));
+                    }
+                }
+                if let Some(t) = r.temp_c {
+                    tail.push_str(&format!(", {t}°C"));
+                }
+                if let (Some(c), m) = (r.core_mhz, r.mem_mhz) {
+                    tail.push_str(&format!(", {c}/{} MHz", m.map(|m| m.to_string()).unwrap_or_else(|| "?".into())));
+                }
                info!(
-                    "  worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {sols} solutions",
-                    1000.0 * dt / runs as f64
+                    "  worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {} solutions{tail}",
+                    1000.0 * r.dt / runs as f64,
+                    r.sols
                );
                agg_sols += sol_s;
                workers += 1;
@@ -14,6 +14,22 @@ use crate::equihash;
 use crate::params::{HEADER_LEN, SOLUTION_BYTES};
 use crate::stratum::{StratumClient, Work};

+/// Process-wide target edge temperature (°C) for the software solve-cadence
+/// governor; `None` ⇒ run flat-out. Set once at startup from `--target-temp`.
+static TARGET_TEMP_C: OnceLock<Option<u32>> = OnceLock::new();
+
+/// Install the governor's target temperature (call once, before workers start).
+pub fn set_target_temp(c: Option<u32>) {
+    if let Some(t) = c {
+        info!("temperature governor enabled: holding GPUs ≤{t}°C (paced cadence)");
+    }
+    let _ = TARGET_TEMP_C.set(c);
+}
+
+fn target_temp_c() -> Option<u32> {
+    TARGET_TEMP_C.get().copied().flatten()
+}
+
 /// Double SHA-256, as used for the Zcash/ZClassic block PoW hash.
 fn sha256d(data: &[u8]) -> [u8; 32] {
    let first = Sha256::digest(data);
@@ -136,6 +152,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.power_watts(),
+            #[cfg(feature = "gpu")]
+            Backend::Gpu(solver) => solver.power_watts(),
            _ => None,
        }
    }
@@ -145,6 +163,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.temperature_c(),
+            #[cfg(feature = "gpu")]
+            Backend::Gpu(solver) => solver.temperature_c(),
            _ => None,
        }
    }
@@ -154,6 +174,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.current_power_limit_w(),
+            #[cfg(feature = "gpu")]
+            Backend::Gpu(solver) => solver.current_power_limit_w(),
            _ => None,
        }
    }
@@ -163,6 +185,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.power_limit_range_w(),
+            #[cfg(feature = "gpu")]
+            Backend::Gpu(solver) => solver.power_limit_range_w(),
            _ => None,
        }
    }
@@ -190,6 +214,8 @@ impl Backend {
        match self {
            #[cfg(feature = "cuda")]
            Backend::Cuda(solver) => solver.current_clocks_mhz(),
+            #[cfg(feature = "gpu")]
+            Backend::Gpu(solver) => solver.current_clocks_mhz(),
            _ => (None, None),
        }
    }
@@ -513,6 +539,14 @@ fn worker(
    let mut last_job = Instant::now();
    let mut paused = false;
    let mut disabled_pause = false;
+    // Software temperature governor: pace the solve cadence to hold edge temp at
+    // or below `--target-temp` (no hardware writes). `gov_sleep` is the per-pass
+    // pause, nudged from the periodic temperature sample.
+    let gov_target = target_temp_c();
+    let mut gov_sleep = Duration::ZERO;
+    if let Some(target) = gov_target {
+        info!("worker {id}: temperature governor active — pacing cadence to hold ≤{target}°C");
+    }

    while running.load(Ordering::Relaxed) {
        if work_handle.epoch() != current.epoch {
@@ -583,6 +617,9 @@ fn worker(
            }
            if let Some(t) = backend.temperature_c() {
                stats.workers[id].temp_c.store(t, Ordering::Relaxed);
+                if let Some(target) = gov_target {
+                    gov_sleep = govern_cadence(gov_sleep, t, target);
+                }
            }
            let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
            if let Some(c) = core_mhz {
@@ -614,12 +651,34 @@ fn worker(
            let ctx = inflight.pop_front().unwrap();
            process_results(id, &client, &ctx, &solutions, &stats);
        }
+
+        // Temperature governor: pace the cadence (held off entirely when flat-out).
+        if !gov_sleep.is_zero() {
+            std::thread::sleep(gov_sleep);
+        }
    }

    drain_pipeline(id, &mut backend, &mut inflight, &client, &stats)?;
    Ok(())
 }

+/// Nudge the per-pass governor pause to hold edge temperature near `target` (°C).
+/// Over target → lengthen the pause (proportional to the overshoot); comfortably
+/// under → shorten it. Called from the periodic temperature sample (~1 Hz), so
+/// this is a slow integral controller — fine for the card's thermal time constant.
+fn govern_cadence(cur: Duration, temp: u32, target: u32) -> Duration {
+    const MAX_SLEEP: Duration = Duration::from_millis(100);
+    let (t, target) = (temp as i64, target as i64);
+    if t > target {
+        cur.saturating_add(Duration::from_micros(400 * (t - target) as u64))
+            .min(MAX_SLEEP)
+    } else if t < target - 1 {
+        cur.saturating_sub(Duration::from_millis(1))
+    } else {
+        cur
+    }
+}
+
 /// Supervise CPU mining: spawn one worker per group for the current group size,
 /// and whenever the dashboard cycles the size, stop those workers (`gen_running`),
 /// rebuild the grouping, and respawn. Runs until `running` is cleared.
@@ -843,3 +902,49 @@ fn build_nonce(nonce1: &[u8], counter: u64) -> Result<[u8; 32]> {
    tail[..n].copy_from_slice(&counter.to_le_bytes()[..n]);
    Ok(nonce)
 }
+
+#[cfg(test)]
+mod governor_tests {
+    use super::*;
+
+    #[test]
+    fn governor_idles_below_target() {
+        // At/under target → no pause is introduced.
+        assert_eq!(govern_cadence(Duration::ZERO, 60, 70), Duration::ZERO);
+        assert_eq!(govern_cadence(Duration::ZERO, 70, 70), Duration::ZERO);
+    }
+
+    #[test]
+    fn governor_lengthens_over_target_and_clamps() {
+        // Over target → pause grows proportional to the overshoot.
+        let s = govern_cadence(Duration::ZERO, 75, 70); // +5°C
+        assert_eq!(s, Duration::from_micros(400 * 5));
+        // Far over target saturates at the cap, never beyond.
+        let hot = govern_cadence(Duration::from_millis(99), 200, 70);
+        assert_eq!(hot, Duration::from_millis(100));
+    }
+
+    #[test]
+    fn governor_eases_off_when_cool() {
+        // 2°C+ under target → pause shrinks; floors at zero (no underflow).
+        let s = govern_cadence(Duration::from_millis(3), 67, 70);
+        assert_eq!(s, Duration::from_millis(2));
+        assert_eq!(govern_cadence(Duration::ZERO, 50, 70), Duration::ZERO);
+    }
+
+    #[test]
+    fn governor_converges_to_hold_target() {
+        // Simulate a card whose temp rises with duty: more pause → cooler. The
+        // controller should settle near the target without runaway.
+        let target = 70u32;
+        let mut sleep = Duration::ZERO;
+        let mut temp = 85i32; // starts hot
+        for _ in 0..200 {
+            sleep = govern_cadence(sleep, temp as u32, target);
+            // crude plant: each ms of pause sheds ~1.5°C off a 90°C flat-out temp.
+            let modeled = 90.0 - 1.5 * sleep.as_millis() as f64;
+            temp = modeled.round() as i32;
+        }
+        assert!((target as i32 - 2..=target as i32 + 2).contains(&temp), "settled at {temp}°C, want ~{target}");
+    }
+}
@@ -0,0 +1,124 @@
+; OpenCL import library definition for jackpotminer's Windows build.
+; Generated from the cl-sys 0.4.3 exports so a vendor-neutral OpenCL.lib /
+; libOpenCL.a can be produced at build time without an OpenCL SDK. The actual
+; OpenCL.dll (ICD loader) is provided at runtime by the GPU driver.
+LIBRARY OpenCL
+EXPORTS
+    clBuildProgram
+    clCloneKernel
+    clCompileProgram
+    clCreateBuffer
+    clCreateCommandQueue
+    clCreateCommandQueueWithProperties
+    clCreateContext
+    clCreateContextFromType
+    clCreateFromGLBuffer
+    clCreateFromGLRenderbuffer
+    clCreateFromGLTexture
+    clCreateFromGLTexture2D
+    clCreateFromGLTexture3D
+    clCreateImage
+    clCreateImage2D
+    clCreateImage3D
+    clCreateKernel
+    clCreateKernelsInProgram
+    clCreatePipe
+    clCreateProgramWithBinary
+    clCreateProgramWithBuiltInKernels
+    clCreateProgramWithIL
+    clCreateProgramWithSource
+    clCreateSampler
+    clCreateSamplerWithProperties
+    clCreateSubBuffer
+    clCreateSubDevices
+    clCreateUserEvent
+    clEnqueueAcquireGLObjects
+    clEnqueueBarrier
+    clEnqueueBarrierWithWaitList
+    clEnqueueCopyBuffer
+    clEnqueueCopyBufferRect
+    clEnqueueCopyBufferToImage
+    clEnqueueCopyImage
+    clEnqueueCopyImageToBuffer
+    clEnqueueFillBuffer
+    clEnqueueFillImage
+    clEnqueueMapBuffer
+    clEnqueueMapImage
+    clEnqueueMarker
+    clEnqueueMarkerWithWaitList
+    clEnqueueMigrateMemObjects
+    clEnqueueNativeKernel
+    clEnqueueNDRangeKernel
+    clEnqueueReadBuffer
+    clEnqueueReadBufferRect
+    clEnqueueReadImage
+    clEnqueueReleaseGLObjects
+    clEnqueueSVMFree
+    clEnqueueSVMMap
+    clEnqueueSVMMemcpy
+    clEnqueueSVMMemFill
+    clEnqueueSVMMigrateMem
+    clEnqueueSVMUnmap
+    clEnqueueTask
+    clEnqueueUnmapMemObject
+    clEnqueueWaitForEvents
+    clEnqueueWriteBuffer
+    clEnqueueWriteBufferRect
+    clEnqueueWriteImage
+    clFinish
+    clFlush
+    clGetCommandQueueInfo
+    clGetContextInfo
+    clGetDeviceAndHostTimer
+    clGetDeviceIDs
+    clGetDeviceInfo
+    clGetEventInfo
+    clGetEventProfilingInfo
+    clGetExtensionFunctionAddress
+    clGetExtensionFunctionAddressForPlatform
+    clGetGLContextInfoKHR
+    clGetGLObjectInfo
+    clGetGLTextureInfo
+    clGetHostTimer
+    clGetImageInfo
+    clGetKernelArgInfo
+    clGetKernelInfo
+    clGetKernelSubGroupInfo
+    clGetKernelWorkGroupInfo
+    clGetMemObjectInfo
+    clGetPipeInfo
+    clGetPlatformIDs
+    clGetPlatformInfo
+    clGetProgramBuildInfo
+    clGetProgramInfo
+    clGetSamplerInfo
+    clGetSupportedImageFormats
+    clLinkProgram
+    clReleaseCommandQueue
+    clReleaseContext
+    clReleaseDevice
+    clReleaseEvent
+    clReleaseKernel
+    clReleaseMemObject
+    clReleaseProgram
+    clReleaseSampler
+    clRetainCommandQueue
+    clRetainContext
+    clRetainDevice
+    clRetainEvent
+    clRetainKernel
+    clRetainMemObject
+    clRetainProgram
+    clRetainSampler
+    clSetDefaultDeviceCommandQueue
+    clSetEventCallback
+    clSetKernelArg
+    clSetKernelArgSVMPointer
+    clSetKernelExecInfo
+    clSetMemObjectDestructorCallback
+    clSetUserEventStatus
+    clSVMAlloc
+    clSVMFree
+    clUnloadCompiler
+    clUnloadPlatformCompiler
+    clWaitForEvents
Author	SHA1	Message	Date
jackpotincorporated	09725cf674	OpenCL: de-duplicate the same physical GPU across platforms A GPU exposed by both a vendor runtime (ROCm) and rusticl/Mesa appeared twice in the device list, so mining 'all' ran each card twice (pure contention). Add a single canonical enumerate_devices() — used by list_devices, device_is_nvidia, cpu_device_index and pick_device — that dedupes by physical GPU and prefers the vendor runtime over Mesa. Dedup key is the PCI address: ROCm/NVIDIA expose it via cl_khr_pci_bus_info; rusticl doesn't, but its cl_khr_device_uuid encodes the PCI BDF, so the same card yields the same key on both. Devices without either (CPU/PoCL) are never deduped. No behavior change on single-platform hosts (nothing to dedup); here the list drops 4->2 (both physical GPUs on ROCm, ~38 Sol/s) and device indices are unchanged for the kept devices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 20:57:19 -04:00
jackpotincorporated	41db98af69	AMD kernel: drop fp64 dependency (build on rusticl/Mesa OpenCL) equihash192_7.cl's decompress/decompress2 used double + sqrt behind cl_khr_fp64, so the kernel failed to build on OpenCL stacks without fp64 (notably rusticl/Mesa) — those workers died with 'use of type double requires cl_khr_fp64'. Replace round(sqrt(2*x+1)) with an exact integer square root (single-precision estimate corrected to the integer floor, then rounded; inputs are triangular indices < ~2^26). No fp64, no behavior change on ROCm (verified bit-identical: 77 solutions/40 nonces, same as before), and rusticl devices now build and solve correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 20:39:45 -04:00
jackpotincorporated	31aa85733e	AMD GPU telemetry + --target-temp governor Brings AMD cards to parity with NVIDIA for monitoring/control surface, which was NVML-only. New src/amd_smi.rs is a gpu_tune::GpuTuner backed by Linux amdgpu sysfs (power1_average, temp1_input edge, freq1_input sclk, pp_dpm_sclk/mclk), matched to the device by PCI bus id from OpenCL cl_khr_pci_bus_info. gpu_tune is un-gated to compile under the gpu feature; open() probes NVML then amd_smi. GpuSolver carries the tuner and Backend::Gpu dispatches power/temp/clocks, so the TUI and --benchmark now show power, temperature, clocks and Sol/W for AMD. Telemetry-only — setters are Unsupported (amdgpu control nodes are root-only). --target-temp <C> adds an opt-in software governor (miner::govern_cadence) that paces solve cadence to hold edge temperature, no hardware writes/root. With small thermal throttle it won't beat flat-out on raw Sol/s; it's a temp/efficiency lever. Unit-tested controller; flag/plumbing verified live. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 20:17:59 -04:00
jackpotincorporated	0002e90451	TUI: scale CPU core-toggle granularity by core count Cap the cores-per-row group size (and the dashboard 'g' cycle) by total core count so the toggleable-row count stays sensible and small machines get finer control: ≤4 cores toggle individually (size 1), 5-8 cores in groups of up to 2, and more than 8 in groups of up to 4. The cap is also the default — the prior fixed default of 4 now clamps to the tier (1/2/4), and an explicit --cpu-group-size is clamped to the cap too. Add max_group_size() in cpu_groups; update the help text and the cpu_groups / control tests (the cycle test now uses 16 cores so it can exercise sizes 4/2/1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 18:34:10 -04:00
jackpotincorporated	afd56bee1b	CPU solver: AVX-512 round-0 hashing + prefetch the emit gather Branch the CPU solver onto modern x86 extensions, runtime-dispatched with the existing AVX2/scalar fallbacks: - BatchHasher::hash8 — an 8-lane AVX-512 BLAKE2b final-block compression (native _mm512_ror_epi64 rotates), falling back to two AVX2 hash4s or scalar. Round 0 now hashes eight g-values per chunk. round0-hash drops ~1.45x on AVX-512 CPUs (≈225→155 ms here, AMD Zen4). - emit_bucket software-prefetches each collision group's randomly-gathered member slots (the ~1 GB slot arena is the round's cache-miss bottleneck), shaving a few percent off the dominant emit phase. Controlled A/B on this Zen4 box (same thermal state): ~4-5% faster overall. The collision rounds are memory-bandwidth bound, so SIMD width is not the limiter — the modern-ISA win is modest by nature. EQ_NO_AVX512 / EQ_NO_PREFETCH opt out per-CPU (e.g. parts where AVX-512 downclocks) and back the A/B harness. hash8 is validated against the scalar reference (batch_matches_reference) and full solves still find valid solutions in every dispatch configuration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 13:12:57 -04:00
jackpotincorporated	00531fb591	Windows build: generate the OpenCL import library, drop the SDK dependency Building the `gpu` (OpenCL) backend on Windows no longer needs a vendor OpenCL SDK. `cl-sys` links `OpenCL` (#[link(name = "OpenCL")]); instead of requiring an SDK-provided OpenCL.lib, build.rs now generates a vendor-neutral import library at build time from windows/OpenCL.def (all 118 cl-sys exports) — lib.exe for MSVC (located via the cc crate), dlltool for MinGW — and puts it on the link search path. The real OpenCL.dll is supplied at runtime by the GPU driver. build.rs no-ops on non-Windows targets and when the gpu feature is off, and warns rather than panics if the toolchain tool is absent so `cargo check` still works. Combined with the runtime-loaded (dlopen) CUDA/NVML, a Windows build now needs zero external GPU libraries. Add BUILD-windows.md (toolchain, build, crt-static packaging, runtime deps, CI) and link it from the README. Verified the whole crate compiles for x86_64-pc-windows-gnu (default features and gpu,cuda); Linux is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 12:38:15 -04:00
jackpotincorporated	8a9d98a21d	Refresh solver architecture comment for the optimized design The module header still described a fixed 32-byte slot, a serial partition, and a "not yet taken" radix scatter. Update it to reflect the current memory-traffic- oriented design: parallel counting-sort partition_top, keys_part/keys_sorted to make the key reads sequential, narrowing per-round packed slots with a masked store, and the remaining slot-gather floor (with the note that the full payload radix scatter was evaluated and loses on the wide early rounds). Comment-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:48:29 -04:00
jackpotincorporated	1b4a2a4dd9	Scatter partition-contiguous keys to kill the per-round key gathers count_pairs, low_group, and the emit group-walk all read each entry's leading key via `keys[order[..]]` — a random gather over the whole ~128 MB keys array, three times per round. partition_top now also produces `keys_part` (the leading keys in partition order, keys_part[p] == keys[order[p]]), written by the same parallel, disjoint phase-3 scatter at 4 bytes/entry. count_pairs and low_group then stream their partition's keys sequentially, and low_group emits a `keys_sorted` array so the emit group walk streams a dense local copy instead of gathering keys[sorted[i]]. The only remaining DRAM-random access in the rounds is the unavoidable slot gather. Measured (16 threads, clamp 16/32): count ~160 -> ~10 ms/round, emit ~770 -> ~550 ms/round, partition +~80 ms (the added 128 MB scatter); full solve ~8.4 -> ~7.04 s (~16%). Cumulative across the three CPU-solver changes: ~13.4 -> ~7.04 s (-47%), 0.07 -> 0.14 solve/s. Identical solution yield; cross-clamp validity and full_solve_baseline pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:23:10 -04:00
jackpotincorporated	966ce3e262	Pack collision-round slots at shrinking per-round width Each collision round consumes the leading 24-bit block, so a round-r output entry only carries 8-r meaningful words — but every round previously stored a fixed 8-word (32 B) slot, moving dead trailing words. Since the collision rounds are memory-bandwidth bound, that wasted DRAM traffic is wasted time. Pack each round's slots at w_out = w_in - 1 words. The XOR child producer still loads a full 256-bit register (over-reading up to SLOT words past a narrow tail slot — buffers carry a SLOT_SLACK pad so this stays in bounds) but masked-stores only the w_out live lanes so packed neighbours aren't clobbered. The over-read garbage only ever lands in non-stored lanes: storing out[0..w_out] = x[1..w_in] uses exclusively meaningful input words. Width is threaded through collide/collide_final/emit_bucket from solve_with (round 0 stays full 8-word). Measured (16 threads, clamp 16/32): ~9.2 -> ~8.4 s/solve (~9%); per-round time now shrinks r1~1230 -> r6~1045 ms. Cumulative with the parallel-partition change: ~13.4 -> ~8.4 s (-37%). Identical solution yield; xor_child_matches_scalar now covers every width + the masked-store no-clobber property; cross-clamp validity and full_solve_baseline pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 11:11:05 -04:00
jackpotincorporated	501527d3cb	Parallelize partition_top; add solver benchmark + phase profiling partition_top was the only serial stage in the otherwise rayon-parallel collision rounds — plain `for k in 0..n` count and scatter loops that left 15/16 cores idle for ~35% of every round (~650 ms). Replace it with a parallel counting sort: per-chunk top-bucket histograms, a small serial pass to per-chunk base offsets, then a disjoint-region scatter through a shared raw pointer (each chunk writes a provably non-overlapping set of positions). Entries within a bucket become chunk-major rather than index-major, which is immaterial: count_pairs/low_group depend only on the low-key multiset, and solutions are canonicalized, de-duplicated, and verified downstream. Measured (16 threads): partition_top ~650 -> ~100 ms/round (6.5x), collide-final ~1.18 -> ~0.59 s, full solve ~13.4 -> ~9.2 s (-31%, 0.07 -> 0.11 solve/s), with identical solution yield and all validity tests passing. Also add (gated/ignored, no production-path behavior change): - full_solve_baseline: an #[ignore] throughput benchmark over realistic dense headers (EQ_BENCH_ITERS / EQ_BENCH_CLAMPS). - EQ_PROFILE-gated per-phase and per-collide-sub-phase timing in solve_with. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 10:42:54 -04:00