Compare commits
10
Commits
6a753b62fa
...
09725cf674
| Author | SHA1 | Date | |
|---|---|---|---|
| 09725cf674 | |||
| 41db98af69 | |||
| 31aa85733e | |||
| 0002e90451 | |||
| afd56bee1b | |||
| 00531fb591 | |||
| 8a9d98a21d | |||
| 1b4a2a4dd9 | |||
| 966ce3e262 | |||
| 501527d3cb |
@@ -0,0 +1,119 @@
|
|||||||
|
# Building jackpotminer on Windows
|
||||||
|
|
||||||
|
The miner builds on Windows with **no external GPU SDKs**. The CUDA driver and
|
||||||
|
NVML are loaded at runtime (`dlopen`/`LoadLibrary`), and the OpenCL import
|
||||||
|
library is generated at build time from `windows/OpenCL.def` — so you don't need
|
||||||
|
the CUDA Toolkit, an OpenCL SDK, or any vendor libraries to compile. The whole
|
||||||
|
codebase (miner + the `jackpotminer-config` GUI) is verified to compile for
|
||||||
|
`x86_64-pc-windows-*`.
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
- **Rust** (https://rustup.rs). The default target is `x86_64-pc-windows-msvc`.
|
||||||
|
- **MSVC build tools** — Visual Studio 2019/2022 or the standalone *Build Tools
|
||||||
|
for Visual Studio* with the **“Desktop development with C++”** workload (gives
|
||||||
|
`link.exe`, `lib.exe`, and the Windows SDK).
|
||||||
|
- Build from a **“x64 Native Tools Command Prompt for VS”** (or any shell where
|
||||||
|
the MSVC environment is active) so `lib.exe` is found. `cargo` also locates
|
||||||
|
it automatically via the registry in most cases.
|
||||||
|
- *(Alternative toolchain)* the GNU target `x86_64-pc-windows-gnu` works too,
|
||||||
|
with **MinGW-w64** on `PATH` (provides `dlltool` and the linker).
|
||||||
|
|
||||||
|
No CUDA Toolkit and **no OpenCL SDK** are required.
|
||||||
|
|
||||||
|
## Build
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
:: Default: OpenCL + CUDA backends + the GUI config tool
|
||||||
|
cargo build --release
|
||||||
|
|
||||||
|
:: Miner only (no GUI), both GPU backends
|
||||||
|
cargo build --release --no-default-features --features gpu,cuda
|
||||||
|
|
||||||
|
:: OpenCL only (AMD / Intel / NVIDIA)
|
||||||
|
cargo build --release --no-default-features --features gpu
|
||||||
|
|
||||||
|
:: NVIDIA only — needs nothing external at build time (CUDA is dlopen'd)
|
||||||
|
cargo build --release --no-default-features --features cuda
|
||||||
|
```
|
||||||
|
|
||||||
|
Outputs: `target\release\jackpotminer.exe` (and `jackpotminer-config.exe` with
|
||||||
|
the default features).
|
||||||
|
|
||||||
|
### How the OpenCL build dependency is avoided
|
||||||
|
|
||||||
|
`ocl`/`cl-sys` link `OpenCL` (`#[link(name = "OpenCL")]`), which normally needs
|
||||||
|
an `OpenCL.lib` from a vendor SDK. Instead, `build.rs` generates a
|
||||||
|
vendor-neutral import library from `windows/OpenCL.def`:
|
||||||
|
|
||||||
|
- **MSVC:** `lib.exe /def:windows\OpenCL.def /out:OpenCL.lib /machine:X64`
|
||||||
|
- **GNU:** `dlltool -d windows/OpenCL.def -l libOpenCL.a -m i386:x86-64`
|
||||||
|
|
||||||
|
and puts it on the link search path. The import library only forwards to
|
||||||
|
`OpenCL.dll`, which the GPU driver provides at runtime. If the toolchain tool
|
||||||
|
isn’t on `PATH`, `build.rs` prints a warning and the link step fails with a clear
|
||||||
|
message (compilation/`cargo check` still works).
|
||||||
|
|
||||||
|
## Distribution
|
||||||
|
|
||||||
|
Statically link the MSVC C runtime so users don’t need the VC++ redistributable:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
set RUSTFLAGS=-C target-feature=+crt-static
|
||||||
|
cargo build --release
|
||||||
|
```
|
||||||
|
|
||||||
|
## Runtime dependencies (on the mining machine)
|
||||||
|
|
||||||
|
- **`OpenCL.dll`** — the ICD loader, installed with any GPU driver (AMD/NVIDIA/
|
||||||
|
Intel). Required for the OpenCL backend.
|
||||||
|
- **`nvcuda.dll`** + **`nvml.dll`** — installed with the NVIDIA driver. Loaded
|
||||||
|
on demand for the CUDA backend; absent on AMD-only machines, where the miner
|
||||||
|
simply reports no CUDA devices.
|
||||||
|
- The **VC++ runtime**, unless you built with `+crt-static`.
|
||||||
|
|
||||||
|
A `cuda`-enabled binary still starts on a machine with no NVIDIA driver.
|
||||||
|
|
||||||
|
## Building Windows binaries without a Windows machine
|
||||||
|
|
||||||
|
### GitHub Actions (recommended)
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
name: windows
|
||||||
|
on: [push, workflow_dispatch]
|
||||||
|
jobs:
|
||||||
|
build:
|
||||||
|
runs-on: windows-latest # MSVC + lib.exe already on PATH
|
||||||
|
steps:
|
||||||
|
- uses: actions/checkout@v4
|
||||||
|
- uses: dtolnay/rust-toolchain@stable
|
||||||
|
- run: cargo build --release --features gpu,cuda
|
||||||
|
env:
|
||||||
|
RUSTFLAGS: -C target-feature=+crt-static
|
||||||
|
- uses: actions/upload-artifact@v4
|
||||||
|
with:
|
||||||
|
name: jackpotminer-windows
|
||||||
|
path: target/release/*.exe
|
||||||
|
```
|
||||||
|
|
||||||
|
### Cross-compile from Linux (GNU target)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
rustup target add x86_64-pc-windows-gnu
|
||||||
|
sudo pacman -S mingw-w64 # or your distro's mingw-w64 (gives dlltool + linker)
|
||||||
|
cargo build --release --target x86_64-pc-windows-gnu --no-default-features --features gpu,cuda
|
||||||
|
```
|
||||||
|
|
||||||
|
The MSVC target can’t be linked from Linux. The GUI config tool (`eframe`) is
|
||||||
|
easiest to build natively on Windows.
|
||||||
|
|
||||||
|
## Status / caveats
|
||||||
|
|
||||||
|
- **Compilation for Windows is verified** here via `cargo check
|
||||||
|
--target x86_64-pc-windows-gnu` (default features, and `gpu,cuda`).
|
||||||
|
- The OpenCL **import-library linking** uses the standard `lib.exe`/`dlltool`
|
||||||
|
technique; validate it with an actual Windows (or MinGW cross) build, which
|
||||||
|
needs those tools present.
|
||||||
|
- `relaunch_in_terminal` (reopen-in-a-terminal on GUI launch) is Linux-only;
|
||||||
|
harmless on Windows, where double-clicking a console binary already opens a
|
||||||
|
console.
|
||||||
Generated
+1
@@ -2230,6 +2230,7 @@ version = "0.1.0"
|
|||||||
dependencies = [
|
dependencies = [
|
||||||
"anyhow",
|
"anyhow",
|
||||||
"blake2b_simd",
|
"blake2b_simd",
|
||||||
|
"cc",
|
||||||
"clap",
|
"clap",
|
||||||
"core_affinity",
|
"core_affinity",
|
||||||
"ctrlc",
|
"ctrlc",
|
||||||
|
|||||||
@@ -45,6 +45,11 @@ name = "jackpotminer-config"
|
|||||||
path = "src/config_gui.rs"
|
path = "src/config_gui.rs"
|
||||||
required-features = ["config-gui"]
|
required-features = ["config-gui"]
|
||||||
|
|
||||||
|
[build-dependencies]
|
||||||
|
# Locates the MSVC `lib.exe` (to generate the OpenCL import library on Windows);
|
||||||
|
# unused on other platforms. See build.rs.
|
||||||
|
cc = "1"
|
||||||
|
|
||||||
[profile.release]
|
[profile.release]
|
||||||
opt-level = 3
|
opt-level = 3
|
||||||
lto = true
|
lto = true
|
||||||
|
|||||||
@@ -114,6 +114,13 @@ cargo build --release --no-default-features --features cuda # CUDA only
|
|||||||
cargo build --release --no-default-features # CPU-only (no GPU)
|
cargo build --release --no-default-features # CPU-only (no GPU)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Windows
|
||||||
|
|
||||||
|
Builds with no external GPU SDKs — CUDA/NVML are loaded at runtime and the
|
||||||
|
OpenCL import library is generated at build time. See
|
||||||
|
[BUILD-windows.md](BUILD-windows.md) for the toolchain, build, and packaging
|
||||||
|
steps.
|
||||||
|
|
||||||
### Portable / distributable builds
|
### Portable / distributable builds
|
||||||
|
|
||||||
The miner's only runtime dependencies are the C library and the OpenCL ICD loader
|
The miner's only runtime dependencies are the C library and the OpenCL ICD loader
|
||||||
|
|||||||
@@ -0,0 +1,79 @@
|
|||||||
|
//! Build script — Windows OpenCL import library only.
|
||||||
|
//!
|
||||||
|
//! On Windows, the `ocl`/`cl-sys` crates link `OpenCL` (`#[link(name =
|
||||||
|
//! "OpenCL")]`), which normally requires an `OpenCL.lib` import library from a
|
||||||
|
//! vendor OpenCL SDK. To avoid that build dependency, we generate a
|
||||||
|
//! vendor-neutral import library ourselves from `windows/OpenCL.def` (the list
|
||||||
|
//! of OpenCL exports) using the toolchain's own tools — `lib.exe` for MSVC,
|
||||||
|
//! `dlltool` for the GNU (MinGW) toolchain — and put it on the link search path.
|
||||||
|
//! The real `OpenCL.dll` (the ICD loader) is supplied at runtime by the GPU
|
||||||
|
//! driver, exactly like `libOpenCL.so.1` on Linux.
|
||||||
|
//!
|
||||||
|
//! Nothing here is needed on non-Windows targets (the system `libOpenCL` is used
|
||||||
|
//! directly) or when the `gpu` (OpenCL) feature is off — the script no-ops. The
|
||||||
|
//! CUDA driver / NVML are loaded at runtime via dlopen (see `src/dylib.rs`), so
|
||||||
|
//! they need no build-script support.
|
||||||
|
|
||||||
|
use std::env;
|
||||||
|
use std::path::PathBuf;
|
||||||
|
use std::process::Command;
|
||||||
|
|
||||||
|
fn main() {
|
||||||
|
println!("cargo:rerun-if-changed=build.rs");
|
||||||
|
println!("cargo:rerun-if-changed=windows/OpenCL.def");
|
||||||
|
|
||||||
|
// Only Windows targets with the OpenCL backend need an import library.
|
||||||
|
if env::var("CARGO_CFG_TARGET_OS").as_deref() != Ok("windows") {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if env::var_os("CARGO_FEATURE_GPU").is_none() {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
|
||||||
|
let def = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap()).join("windows/OpenCL.def");
|
||||||
|
let target = env::var("TARGET").unwrap_or_default();
|
||||||
|
let is_msvc = env::var("CARGO_CFG_TARGET_ENV").as_deref() == Ok("msvc");
|
||||||
|
|
||||||
|
let (mut cmd, tool) = if is_msvc {
|
||||||
|
// lib.exe /def:OpenCL.def /out:OpenCL.lib /machine:X64
|
||||||
|
let mut c = cc::windows_registry::find(&target, "lib.exe")
|
||||||
|
.unwrap_or_else(|| Command::new("lib.exe"));
|
||||||
|
c.arg("/nologo")
|
||||||
|
.arg(format!("/def:{}", def.display()))
|
||||||
|
.arg(format!("/out:{}", out_dir.join("OpenCL.lib").display()))
|
||||||
|
.arg("/machine:X64");
|
||||||
|
(c, "lib.exe")
|
||||||
|
} else {
|
||||||
|
// GNU/MinGW: dlltool -d OpenCL.def -l libOpenCL.a -m i386:x86-64
|
||||||
|
let dlltool = ["x86_64-w64-mingw32-dlltool", "dlltool"]
|
||||||
|
.into_iter()
|
||||||
|
.find(|t| Command::new(t).arg("--version").output().is_ok())
|
||||||
|
.unwrap_or("dlltool");
|
||||||
|
let mut c = Command::new(dlltool);
|
||||||
|
c.arg("-d")
|
||||||
|
.arg(&def)
|
||||||
|
.arg("-l")
|
||||||
|
.arg(out_dir.join("libOpenCL.a"))
|
||||||
|
.arg("-m")
|
||||||
|
.arg("i386:x86-64");
|
||||||
|
(c, "dlltool")
|
||||||
|
};
|
||||||
|
|
||||||
|
match cmd.status() {
|
||||||
|
Ok(s) if s.success() => {
|
||||||
|
// Resolve `#[link(name = "OpenCL")]` against the generated import lib.
|
||||||
|
println!("cargo:rustc-link-search=native={}", out_dir.display());
|
||||||
|
}
|
||||||
|
Ok(s) => panic!("{tool} failed ({s}) generating the OpenCL import library from {}", def.display()),
|
||||||
|
Err(e) => {
|
||||||
|
// Tool not found: let `cargo check` (which doesn't link) still succeed;
|
||||||
|
// a real build will fail at link with a clear "cannot find OpenCL".
|
||||||
|
println!(
|
||||||
|
"cargo:warning=could not run {tool} to generate the OpenCL import library ({e}); \
|
||||||
|
ensure the toolchain tools are on PATH (MSVC dev prompt, or mingw-w64). \
|
||||||
|
Linking the `gpu` feature will fail until then."
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
+15
-14
@@ -2,8 +2,6 @@
|
|||||||
|
|
||||||
//#define PRINT 1
|
//#define PRINT 1
|
||||||
|
|
||||||
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
|
|
||||||
|
|
||||||
__constant ulong blake_iv[] =
|
__constant ulong blake_iv[] =
|
||||||
{
|
{
|
||||||
0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
|
0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
|
||||||
@@ -507,14 +505,22 @@ uint compress2(uint in0, uint in1) {
|
|||||||
return tmp;
|
return tmp;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Exact round(sqrt(x)) without fp64: a single-precision estimate corrected to the
|
||||||
|
// exact integer floor, then rounded. Lets the kernel build on OpenCL stacks that
|
||||||
|
// lack cl_khr_fp64 (e.g. rusticl/Mesa) while staying bit-identical on ROCm. Inputs
|
||||||
|
// are triangular indices (< ~2^26), well within range for the float estimate +
|
||||||
|
// integer correction.
|
||||||
|
inline uint isqrt_round(ulong x) {
|
||||||
|
long m = (long) sqrt((float) x);
|
||||||
|
while (m > 0 && (ulong)(m * m) > x) m--; // correct down to floor(sqrt)
|
||||||
|
while ((ulong)((m + 1) * (m + 1)) <= x) m++; // correct up to floor(sqrt)
|
||||||
|
// round to nearest: round up iff the fractional part is >= 0.5.
|
||||||
|
return (uint)(((x - (ulong)(m * m)) > (ulong) m) ? (m + 1) : m);
|
||||||
|
}
|
||||||
|
|
||||||
uint2 decompress(uint2 in) {
|
uint2 decompress(uint2 in) {
|
||||||
double inFl = (double) (in.s0 >> 6);
|
|
||||||
|
|
||||||
inFl *= 2.0;
|
|
||||||
inFl += 1.0;
|
|
||||||
|
|
||||||
uint2 res;
|
uint2 res;
|
||||||
res.s0 = (uint) round(sqrt(inFl));
|
res.s0 = isqrt_round(2ul * (ulong)(in.s0 >> 6) + 1ul);
|
||||||
|
|
||||||
|
|
||||||
uint tmp = res.s0 * (res.s0-1);
|
uint tmp = res.s0 * (res.s0-1);
|
||||||
@@ -535,13 +541,8 @@ uint2 decompress(uint2 in) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
uint2 decompress2(uint in) {
|
uint2 decompress2(uint in) {
|
||||||
double inFl = (double) in;
|
|
||||||
|
|
||||||
inFl *= 2.0;
|
|
||||||
inFl += 1.0;
|
|
||||||
|
|
||||||
uint2 res;
|
uint2 res;
|
||||||
res.s0 = (uint) round(sqrt(inFl));
|
res.s0 = isqrt_round(2ul * (ulong) in + 1ul);
|
||||||
|
|
||||||
|
|
||||||
uint tmp = res.s0 * (res.s0-1);
|
uint tmp = res.s0 * (res.s0-1);
|
||||||
|
|||||||
+204
@@ -0,0 +1,204 @@
|
|||||||
|
//! AMD GPU telemetry for [`crate::gpu_tune::GpuTuner`] via the Linux amdgpu
|
||||||
|
//! sysfs interface (`/sys/class/drm/cardN/device/...`).
|
||||||
|
//!
|
||||||
|
//! Telemetry only: board power, edge temperature, and core/memory clocks come
|
||||||
|
//! from the world-readable hwmon + DPM nodes, so it works unprivileged. The
|
||||||
|
//! control nodes (`pp_od_clk_voltage`, power cap, performance level) are
|
||||||
|
//! root-write-only and carry GPU-hang risk, so every setter returns
|
||||||
|
//! [`SetOutcome::Unsupported`] — this backend never writes. The handle is matched
|
||||||
|
//! to the physical card by PCI bus id (e.g. from OpenCL's `cl_khr_pci_bus_info`),
|
||||||
|
//! so it lines up with whichever device the solver actually opened.
|
||||||
|
|
||||||
|
use std::path::{Path, PathBuf};
|
||||||
|
|
||||||
|
use crate::gpu_tune::{GpuTuner, SetOutcome};
|
||||||
|
|
||||||
|
/// One amdgpu card's sysfs telemetry handle.
|
||||||
|
pub struct AmdTuner {
|
||||||
|
/// Canonical `/sys/.../<PCI BDF>` device directory (holds `pp_dpm_*`).
|
||||||
|
device_dir: PathBuf,
|
||||||
|
/// `device_dir/hwmon/hwmonM` (the index M is not stable — resolved by glob).
|
||||||
|
hwmon_dir: PathBuf,
|
||||||
|
name: String,
|
||||||
|
}
|
||||||
|
|
||||||
|
// Only `PathBuf`/`String` — `AmdTuner` is `Send` automatically; no `unsafe impl`.
|
||||||
|
|
||||||
|
/// Open a telemetry handle for the amdgpu card at `pci_bus_id` (e.g.
|
||||||
|
/// "0000:03:00.0"). Returns `None` off Linux, or when no amdgpu card with
|
||||||
|
/// telemetry nodes matches the bus id.
|
||||||
|
pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
|
||||||
|
#[cfg(not(target_os = "linux"))]
|
||||||
|
{
|
||||||
|
let _ = pci_bus_id;
|
||||||
|
None
|
||||||
|
}
|
||||||
|
#[cfg(target_os = "linux")]
|
||||||
|
{
|
||||||
|
let want = bdf_tail(pci_bus_id)?;
|
||||||
|
for entry in std::fs::read_dir("/sys/class/drm").ok()?.flatten() {
|
||||||
|
let fname = entry.file_name();
|
||||||
|
let fname = fname.to_string_lossy();
|
||||||
|
// Match the GPU nodes ("card0", "card1", …), not the per-connector
|
||||||
|
// dirs ("card1-DP-1") or render nodes.
|
||||||
|
if !fname.starts_with("card") || fname.contains('-') {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
// `cardN/device` symlinks to the PCI device dir `…/<BDF>`.
|
||||||
|
let Ok(device_dir) = std::fs::canonicalize(entry.path().join("device")) else {
|
||||||
|
continue;
|
||||||
|
};
|
||||||
|
let matches = device_dir
|
||||||
|
.file_name()
|
||||||
|
.and_then(|s| s.to_str())
|
||||||
|
.and_then(bdf_tail)
|
||||||
|
.map(|t| t == want)
|
||||||
|
.unwrap_or(false);
|
||||||
|
if !matches {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
let Some(hwmon_dir) = find_hwmon(&device_dir) else { continue };
|
||||||
|
// Require a telemetry node so we don't attach to a card without
|
||||||
|
// sensors (e.g. some virtual/headless devices).
|
||||||
|
if !hwmon_dir.join("temp1_input").exists()
|
||||||
|
&& !hwmon_dir.join("power1_average").exists()
|
||||||
|
{
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
let name = read_name(&device_dir);
|
||||||
|
return Some(Box::new(AmdTuner { device_dir, hwmon_dir, name }));
|
||||||
|
}
|
||||||
|
None
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// First `hwmon/hwmon*` subdir of a device dir (the index isn't stable).
|
||||||
|
fn find_hwmon(device_dir: &Path) -> Option<PathBuf> {
|
||||||
|
for e in std::fs::read_dir(device_dir.join("hwmon")).ok()?.flatten() {
|
||||||
|
if e.file_name().to_string_lossy().starts_with("hwmon") {
|
||||||
|
return Some(e.path());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None
|
||||||
|
}
|
||||||
|
|
||||||
|
/// A display name for the card; amdgpu sysfs rarely exposes a marketing name, so
|
||||||
|
/// fall back to a generic label.
|
||||||
|
fn read_name(device_dir: &Path) -> String {
|
||||||
|
read_trim(&device_dir.join("product_name"))
|
||||||
|
.filter(|s| !s.is_empty())
|
||||||
|
.unwrap_or_else(|| "AMD GPU".to_string())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Normalize a PCI BDF to its `bus:device.function` tail (drop the domain), so
|
||||||
|
/// e.g. "0000:03:00.0" and "03:00.0" compare equal.
|
||||||
|
fn bdf_tail(bdf: &str) -> Option<String> {
|
||||||
|
let bdf = bdf.trim().to_ascii_lowercase();
|
||||||
|
match bdf.split(':').collect::<Vec<_>>().as_slice() {
|
||||||
|
[_domain, bus, devfunc] => Some(format!("{bus}:{devfunc}")),
|
||||||
|
[bus, devfunc] => Some(format!("{bus}:{devfunc}")),
|
||||||
|
_ => None,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn read_trim(p: &Path) -> Option<String> {
|
||||||
|
std::fs::read_to_string(p).ok().map(|s| s.trim().to_string())
|
||||||
|
}
|
||||||
|
|
||||||
|
fn read_u64(p: &Path) -> Option<u64> {
|
||||||
|
read_trim(p)?.parse().ok()
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Parse the MHz value out of a `pp_dpm_*` line like `"2: 2700Mhz *"`.
|
||||||
|
fn parse_mhz(line: &str) -> Option<u32> {
|
||||||
|
let after_colon = line.split(':').nth(1)?;
|
||||||
|
let token = after_colon.trim().split_whitespace().next()?;
|
||||||
|
let digits: String = token.chars().take_while(|c| c.is_ascii_digit()).collect();
|
||||||
|
digits.parse().ok()
|
||||||
|
}
|
||||||
|
|
||||||
|
impl AmdTuner {
|
||||||
|
/// MHz of the currently active DPM level (the line marked `*`) in `file`.
|
||||||
|
fn dpm_active_mhz(&self, file: &str) -> Option<u32> {
|
||||||
|
let s = read_trim(&self.device_dir.join(file))?;
|
||||||
|
s.lines().find(|l| l.contains('*')).and_then(parse_mhz)
|
||||||
|
}
|
||||||
|
/// Highest DPM level (MHz) listed in `file`.
|
||||||
|
fn dpm_max_mhz(&self, file: &str) -> Option<u32> {
|
||||||
|
let s = read_trim(&self.device_dir.join(file))?;
|
||||||
|
s.lines().filter_map(parse_mhz).max()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
impl GpuTuner for AmdTuner {
|
||||||
|
fn name(&self) -> String {
|
||||||
|
self.name.clone()
|
||||||
|
}
|
||||||
|
|
||||||
|
fn watts(&self) -> Option<f64> {
|
||||||
|
// power1_average is µW; fall back to the instantaneous power1_input.
|
||||||
|
let uw = read_u64(&self.hwmon_dir.join("power1_average"))
|
||||||
|
.or_else(|| read_u64(&self.hwmon_dir.join("power1_input")))?;
|
||||||
|
Some(uw as f64 / 1_000_000.0)
|
||||||
|
}
|
||||||
|
|
||||||
|
fn temperature_c(&self) -> Option<u32> {
|
||||||
|
// temp1 = edge (m°C).
|
||||||
|
let mc = read_u64(&self.hwmon_dir.join("temp1_input"))?;
|
||||||
|
Some(((mc + 500) / 1000) as u32)
|
||||||
|
}
|
||||||
|
|
||||||
|
fn current_power_limit_w(&self) -> Option<u32> {
|
||||||
|
// Absent on Navi 44 (RX 9060 XT); best-effort for cards that expose it.
|
||||||
|
read_u64(&self.hwmon_dir.join("power1_cap")).map(|uw| (uw / 1_000_000) as u32)
|
||||||
|
}
|
||||||
|
|
||||||
|
fn core_clock_mhz(&self) -> Option<u32> {
|
||||||
|
// freq1_input (Hz) is the live sclk; the DPM active level is the fallback.
|
||||||
|
if let Some(hz) = read_u64(&self.hwmon_dir.join("freq1_input")) {
|
||||||
|
if hz > 0 {
|
||||||
|
return Some((hz / 1_000_000) as u32);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
self.dpm_active_mhz("pp_dpm_sclk")
|
||||||
|
}
|
||||||
|
|
||||||
|
fn mem_clock_mhz(&self) -> Option<u32> {
|
||||||
|
self.dpm_active_mhz("pp_dpm_mclk")
|
||||||
|
}
|
||||||
|
|
||||||
|
fn max_core_clock_mhz(&self) -> Option<u32> {
|
||||||
|
self.dpm_max_mhz("pp_dpm_sclk")
|
||||||
|
}
|
||||||
|
|
||||||
|
fn max_mem_clock_mhz(&self) -> Option<u32> {
|
||||||
|
self.dpm_max_mhz("pp_dpm_mclk")
|
||||||
|
}
|
||||||
|
|
||||||
|
fn power_limit_range_w(&self) -> Option<(u32, u32)> {
|
||||||
|
let mn = read_u64(&self.hwmon_dir.join("power1_cap_min"))?;
|
||||||
|
let mx = read_u64(&self.hwmon_dir.join("power1_cap_max"))?;
|
||||||
|
Some(((mn / 1_000_000) as u32, (mx / 1_000_000) as u32))
|
||||||
|
}
|
||||||
|
|
||||||
|
// Telemetry-only backend: never writes the root-only control nodes.
|
||||||
|
fn set_persistence(&self, _on: bool) -> SetOutcome {
|
||||||
|
SetOutcome::Unsupported
|
||||||
|
}
|
||||||
|
fn lock_core_clock_mhz(&self, _mhz: u32) -> SetOutcome {
|
||||||
|
SetOutcome::Unsupported
|
||||||
|
}
|
||||||
|
fn lock_mem_clock_mhz(&self, _mhz: u32) -> SetOutcome {
|
||||||
|
SetOutcome::Unsupported
|
||||||
|
}
|
||||||
|
fn set_power_limit_w(&self, _watts: u32) -> SetOutcome {
|
||||||
|
SetOutcome::Unsupported
|
||||||
|
}
|
||||||
|
fn set_core_offset_mhz(&self, _mhz: i32) -> SetOutcome {
|
||||||
|
SetOutcome::Unsupported
|
||||||
|
}
|
||||||
|
fn set_mem_offset_mhz(&self, _mhz: i32) -> SetOutcome {
|
||||||
|
SetOutcome::Unsupported
|
||||||
|
}
|
||||||
|
fn reset(&self) {}
|
||||||
|
}
|
||||||
@@ -215,6 +215,92 @@ impl BatchHasher {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Hash eight consecutive indices `g0..g0+8`, writing each 48-byte digest
|
||||||
|
/// into `out[0..8]`. Uses AVX-512 (one 8-wide BLAKE2b compression) when
|
||||||
|
/// available, else two AVX2 `hash4`s. Modern AMD (Zen4+) and Intel CPUs run
|
||||||
|
/// AVX-512 without the clock penalty older Intel parts had.
|
||||||
|
#[inline]
|
||||||
|
pub fn hash8(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
|
||||||
|
#[cfg(target_arch = "x86_64")]
|
||||||
|
{
|
||||||
|
if is_x86_feature_detected!("avx512f") {
|
||||||
|
unsafe { self.hash8_avx512(g0, out) };
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
// Fallback: two 4-lane batches into the two halves of `out`.
|
||||||
|
let (lo, hi) = out.split_at_mut(4);
|
||||||
|
self.hash4(g0, (&mut lo[..4]).try_into().unwrap());
|
||||||
|
self.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Eight-lane BLAKE2b final-block compression (AVX-512). Same structure as
|
||||||
|
/// [`Self::hash4_avx2`] but 512-bit lanes and native 64-bit rotates.
|
||||||
|
#[cfg(target_arch = "x86_64")]
|
||||||
|
#[target_feature(enable = "avx512f")]
|
||||||
|
unsafe fn hash8_avx512(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
|
||||||
|
use core::arch::x86_64::*;
|
||||||
|
|
||||||
|
#[inline(always)]
|
||||||
|
unsafe fn g8(
|
||||||
|
v: &mut [__m512i; 16],
|
||||||
|
a: usize, b: usize, c: usize, d: usize,
|
||||||
|
x: __m512i, y: __m512i,
|
||||||
|
) {
|
||||||
|
v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), x);
|
||||||
|
v[d] = _mm512_ror_epi64::<32>(_mm512_xor_si512(v[d], v[a]));
|
||||||
|
v[c] = _mm512_add_epi64(v[c], v[d]);
|
||||||
|
v[b] = _mm512_ror_epi64::<24>(_mm512_xor_si512(v[b], v[c]));
|
||||||
|
v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), y);
|
||||||
|
v[d] = _mm512_ror_epi64::<16>(_mm512_xor_si512(v[d], v[a]));
|
||||||
|
v[c] = _mm512_add_epi64(v[c], v[d]);
|
||||||
|
v[b] = _mm512_ror_epi64::<63>(_mm512_xor_si512(v[b], v[c]));
|
||||||
|
}
|
||||||
|
|
||||||
|
// Only m0 and m1 are nonzero; m1's high 32 bits hold the per-lane `g`.
|
||||||
|
let tail0 = u64::from_le_bytes(self.tail[0..8].try_into().unwrap());
|
||||||
|
let tail_hi = u32::from_le_bytes(self.tail[8..12].try_into().unwrap()) as u64;
|
||||||
|
let m1 = |g: u32| (tail_hi | ((g as u64) << 32)) as i64;
|
||||||
|
let mut m = [_mm512_setzero_si512(); 16];
|
||||||
|
m[0] = _mm512_set1_epi64(tail0 as i64);
|
||||||
|
m[1] = _mm512_set_epi64(
|
||||||
|
m1(g0 + 7), m1(g0 + 6), m1(g0 + 5), m1(g0 + 4),
|
||||||
|
m1(g0 + 3), m1(g0 + 2), m1(g0 + 1), m1(g0),
|
||||||
|
);
|
||||||
|
|
||||||
|
let mut v = [_mm512_setzero_si512(); 16];
|
||||||
|
for i in 0..8 {
|
||||||
|
v[i] = _mm512_set1_epi64(self.mid[i] as i64);
|
||||||
|
v[i + 8] = _mm512_set1_epi64(IV[i] as i64);
|
||||||
|
}
|
||||||
|
v[12] = _mm512_xor_si512(v[12], _mm512_set1_epi64(FINAL_COUNT as i64));
|
||||||
|
v[14] = _mm512_xor_si512(v[14], _mm512_set1_epi64(-1)); // last-block flag
|
||||||
|
|
||||||
|
for s in &SIGMA {
|
||||||
|
g8(&mut v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
|
||||||
|
g8(&mut v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
|
||||||
|
g8(&mut v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
|
||||||
|
g8(&mut v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
|
||||||
|
g8(&mut v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
|
||||||
|
g8(&mut v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
|
||||||
|
g8(&mut v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
|
||||||
|
g8(&mut v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
|
||||||
|
}
|
||||||
|
|
||||||
|
// h[i] = mid[i] ^ v[i] ^ v[i+8]; first HASH_OUTPUT/8 words per lane, LE.
|
||||||
|
let mut tmp = [0u64; 8];
|
||||||
|
for i in 0..HASH_OUTPUT / 8 {
|
||||||
|
let o = _mm512_xor_si512(
|
||||||
|
_mm512_xor_si512(_mm512_set1_epi64(self.mid[i] as i64), v[i]),
|
||||||
|
v[i + 8],
|
||||||
|
);
|
||||||
|
_mm512_storeu_si512(tmp.as_mut_ptr() as *mut _, o);
|
||||||
|
for l in 0..8 {
|
||||||
|
out[l][i * 8..i * 8 + 8].copy_from_slice(&tmp[l].to_le_bytes());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
#[cfg(target_arch = "x86_64")]
|
#[cfg(target_arch = "x86_64")]
|
||||||
#[target_feature(enable = "avx2")]
|
#[target_feature(enable = "avx2")]
|
||||||
unsafe fn hash4_avx2(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 4]) {
|
unsafe fn hash4_avx2(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 4]) {
|
||||||
@@ -316,5 +402,14 @@ mod tests {
|
|||||||
assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash4 mismatch at g={}", base_g + l as u32);
|
assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash4 mismatch at g={}", base_g + l as u32);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// hash8 (AVX-512 or hash4 fallback) must match for several batches.
|
||||||
|
for base_g in [0u32, 8, 64, 1_000_000] {
|
||||||
|
let mut out = [[0u8; HASH_OUTPUT]; 8];
|
||||||
|
hasher.hash8(base_g, &mut out);
|
||||||
|
for l in 0..8 {
|
||||||
|
assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash8 mismatch at g={}", base_g + l as u32);
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
+3
-3
@@ -175,15 +175,15 @@ mod tests {
|
|||||||
#[test]
|
#[test]
|
||||||
fn protocol_get_and_set() {
|
fn protocol_get_and_set() {
|
||||||
let controls = Controls::new(1, 0, 0, 0, 0, true);
|
let controls = Controls::new(1, 0, 0, 0, 0, true);
|
||||||
let cpu = CpuMining::new((0..8).collect(), 4, false); // sizes [1,2,4,8]
|
let cpu = CpuMining::new((0..16).collect(), 4, false); // 16 cores -> tier cap 4, sizes [1,2,4]
|
||||||
let stats = Stats::for_test(vec!["GPU 0".into()]);
|
let stats = Stats::for_test(vec!["GPU 0".into()]);
|
||||||
|
|
||||||
// get: one device, cpu rows reflect the grouping (8 cores / 4 = 2 rows).
|
// get: one device, cpu rows reflect the grouping (16 cores / 4 = 4 rows).
|
||||||
let v = process("{\"op\":\"get\"}", &controls, &cpu, &stats);
|
let v = process("{\"op\":\"get\"}", &controls, &cpu, &stats);
|
||||||
assert_eq!(v["ok"], true);
|
assert_eq!(v["ok"], true);
|
||||||
assert_eq!(v["devices"].as_array().unwrap().len(), 1);
|
assert_eq!(v["devices"].as_array().unwrap().len(), 1);
|
||||||
assert_eq!(v["cpu"]["group_size"], 4);
|
assert_eq!(v["cpu"]["group_size"], 4);
|
||||||
assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 2);
|
assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 4);
|
||||||
|
|
||||||
// set device 0 disabled.
|
// set device 0 disabled.
|
||||||
assert!(controls.device(0).enabled());
|
assert!(controls.device(0).enabled());
|
||||||
|
|||||||
+70
-25
@@ -165,6 +165,21 @@ fn grouped(cores: &[usize], size: usize, enabled: &BTreeSet<usize>) -> Arc<CpuGr
|
|||||||
groups
|
groups
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Largest cores-per-row the dashboard allows, by total core count. Finer
|
||||||
|
/// control on small machines, coarser on big ones so the toggleable-row count
|
||||||
|
/// stays manageable: **≤4 cores** toggle individually (size 1), **5-8 cores** in
|
||||||
|
/// groups of up to 2, and **more than 8** in groups of up to 4. This caps both
|
||||||
|
/// the starting group size and the 'g' cycle.
|
||||||
|
fn max_group_size(cores: usize) -> usize {
|
||||||
|
if cores <= 4 {
|
||||||
|
1
|
||||||
|
} else if cores <= 8 {
|
||||||
|
2
|
||||||
|
} else {
|
||||||
|
4
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/// Live controller for CPU mining. Owns the selected cores plus the current
|
/// Live controller for CPU mining. Owns the selected cores plus the current
|
||||||
/// group size, and rebuilds the [`CpuGroups`] when the size is cycled from the
|
/// group size, and rebuilds the [`CpuGroups`] when the size is cycled from the
|
||||||
/// dashboard (the worker supervisor watches `group_size()` and respawns to
|
/// dashboard (the worker supervisor watches `group_size()` and respawns to
|
||||||
@@ -188,14 +203,16 @@ impl CpuMining {
|
|||||||
/// the starting group size (`--cpu-group-size`), and `start_enabled` whether
|
/// the starting group size (`--cpu-group-size`), and `start_enabled` whether
|
||||||
/// mining begins on (`--cpu-mining`).
|
/// mining begins on (`--cpu-mining`).
|
||||||
pub fn new(cores: Vec<usize>, initial_size: usize, start_enabled: bool) -> Arc<Self> {
|
pub fn new(cores: Vec<usize>, initial_size: usize, start_enabled: bool) -> Arc<Self> {
|
||||||
let initial_size = initial_size.max(1);
|
// Toggle granularity scales with core count so the row count stays
|
||||||
|
// manageable (see [`max_group_size`]): the cap bounds both the starting
|
||||||
|
// size and the dashboard 'g' cycle.
|
||||||
|
let cap = max_group_size(cores.len());
|
||||||
|
let initial_size = initial_size.clamp(1, cap);
|
||||||
let enabled: BTreeSet<usize> = if start_enabled { cores.iter().copied().collect() } else { BTreeSet::new() };
|
let enabled: BTreeSet<usize> = if start_enabled { cores.iter().copied().collect() } else { BTreeSet::new() };
|
||||||
let groups = grouped(&cores, initial_size, &enabled);
|
let groups = grouped(&cores, initial_size, &enabled);
|
||||||
|
|
||||||
// Cycle list: the usual powers of two plus the requested size, capped so a
|
// Cycle list: powers of two from 1 up to the cap, plus the (clamped)
|
||||||
// group never exceeds the core count (unless the user explicitly asked for
|
// requested size, sorted and de-duplicated.
|
||||||
// a larger size), sorted and de-duplicated.
|
|
||||||
let cap = cores.len().max(initial_size).max(1);
|
|
||||||
let mut sizes: Vec<usize> = [1usize, 2, 4, 8]
|
let mut sizes: Vec<usize> = [1usize, 2, 4, 8]
|
||||||
.into_iter()
|
.into_iter()
|
||||||
.chain([initial_size])
|
.chain([initial_size])
|
||||||
@@ -346,39 +363,67 @@ mod tests {
|
|||||||
}
|
}
|
||||||
|
|
||||||
#[test]
|
#[test]
|
||||||
fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
|
fn group_size_tier_by_core_count() {
|
||||||
// 8 cores, start at size 4 fully enabled -> two groups [0-3],[4-7], both on.
|
// ≤4 cores: individual cores only — size 1, no larger option.
|
||||||
let m = CpuMining::new((0..8).collect(), 4, true);
|
let m = CpuMining::new((0..4).collect(), 4, false); // requested 4 clamps to 1
|
||||||
|
assert_eq!(m.group_size(), 1);
|
||||||
|
assert_eq!(m.groups().len(), 4);
|
||||||
|
m.cycle_group_size();
|
||||||
|
assert_eq!(m.group_size(), 1); // only one size, cycle is a no-op
|
||||||
|
|
||||||
|
// 5-8 cores: groups of up to 2 (requested 4 clamps to 2).
|
||||||
|
let m = CpuMining::new((0..8).collect(), 4, false);
|
||||||
|
assert_eq!(m.group_size(), 2);
|
||||||
|
assert_eq!(m.groups().len(), 4);
|
||||||
|
m.cycle_group_size();
|
||||||
|
assert_eq!(m.group_size(), 1); // cycle covers {1, 2}
|
||||||
|
|
||||||
|
// >8 cores: groups of up to 4 (the default).
|
||||||
|
let m = CpuMining::new((0..16).collect(), 4, false);
|
||||||
assert_eq!(m.group_size(), 4);
|
assert_eq!(m.group_size(), 4);
|
||||||
assert_eq!(m.groups().len(), 2);
|
assert_eq!(m.groups().len(), 4);
|
||||||
|
// A larger explicit request is still capped at the tier max.
|
||||||
|
let m = CpuMining::new((0..16).collect(), 8, false);
|
||||||
|
assert_eq!(m.group_size(), 4);
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
|
||||||
|
// 16 cores (tier cap 4): start at size 4 fully enabled -> four groups,
|
||||||
|
// all on.
|
||||||
|
let m = CpuMining::new((0..16).collect(), 4, true);
|
||||||
|
assert_eq!(m.group_size(), 4);
|
||||||
|
assert_eq!(m.groups().len(), 4);
|
||||||
assert!(m.groups().iter().all(|g| g.enabled()));
|
assert!(m.groups().iter().all(|g| g.enabled()));
|
||||||
|
|
||||||
// Disable the second group (cores 4-7): now only cores 0-3 are enabled.
|
// Disable the last group (cores 12-15): now only cores 0-11 are enabled.
|
||||||
m.toggle_group(1);
|
m.toggle_group(3);
|
||||||
assert!(m.groups().group(0).enabled());
|
assert!(!m.groups().group(3).enabled());
|
||||||
assert!(!m.groups().group(1).enabled());
|
|
||||||
|
|
||||||
// Cycle to size 8 and rebuild: the single [0-7] group is off, because not
|
// Cycle to size 1 (individual cores) and rebuild: 16 rows; cores 0-11 on,
|
||||||
// all of its cores were enabled.
|
// 12-15 off.
|
||||||
while m.group_size() != 8 {
|
while m.group_size() != 1 {
|
||||||
m.cycle_group_size();
|
m.cycle_group_size();
|
||||||
}
|
}
|
||||||
let g = m.rebuild();
|
let g = m.rebuild();
|
||||||
assert_eq!(g.len(), 1);
|
assert_eq!(g.len(), 16);
|
||||||
assert!(!g.group(0).enabled());
|
assert!(g.group(0).enabled()); // core 0
|
||||||
|
assert!(g.group(11).enabled()); // core 11
|
||||||
|
assert!(!g.group(12).enabled()); // core 12
|
||||||
|
assert!(!g.group(15).enabled()); // core 15
|
||||||
|
|
||||||
// Cycle to size 2 and rebuild: cores 0-3 are still tracked as enabled, so
|
// Cycle to size 2 and rebuild: cores 0-11 are still tracked as enabled, so
|
||||||
// [0,1] and [2,3] come back on while [4,5] and [6,7] stay off — the choice
|
// [0,1]..[10,11] come back on while [12,13],[14,15] stay off — the choice
|
||||||
// survived two regroups (not derived from the all-off size-8 grouping).
|
// survived two regroups.
|
||||||
while m.group_size() != 2 {
|
while m.group_size() != 2 {
|
||||||
m.cycle_group_size();
|
m.cycle_group_size();
|
||||||
}
|
}
|
||||||
let g = m.rebuild();
|
let g = m.rebuild();
|
||||||
assert_eq!(g.len(), 4);
|
assert_eq!(g.len(), 8);
|
||||||
assert!(g.group(0).enabled()); // [0,1]
|
assert!(g.group(0).enabled()); // [0,1]
|
||||||
assert!(g.group(1).enabled()); // [2,3]
|
assert!(g.group(5).enabled()); // [10,11]
|
||||||
assert!(!g.group(2).enabled()); // [4,5]
|
assert!(!g.group(6).enabled()); // [12,13]
|
||||||
assert!(!g.group(3).enabled()); // [6,7]
|
assert!(!g.group(7).enabled()); // [14,15]
|
||||||
assert!(m.generation() >= 2);
|
assert!(m.generation() >= 2);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
+442
-119
@@ -131,23 +131,31 @@ fn leaf_row(base: &State, idx: u32) -> Row {
|
|||||||
// histogram that has to be zeroed, scatter-filled, and cloned every round). Two
|
// histogram that has to be zeroed, scatter-filled, and cloned every round). Two
|
||||||
// entries can collide only when their *whole* leading block matches, so they
|
// entries can collide only when their *whole* leading block matches, so they
|
||||||
// always land in the same partition — making the partitions independent
|
// always land in the same partition — making the partitions independent
|
||||||
// collision domains that are processed in parallel across the rayon pool the
|
// collision domains processed in parallel across the rayon pool.
|
||||||
// single CPU worker otherwise leaves idle.
|
|
||||||
//
|
//
|
||||||
// The data layout follows xenoncat's packed slots and targets AVX2 throughput:
|
// The data layout targets AVX2 throughput; the round is memory-bandwidth bound,
|
||||||
// * Entries live in fixed 32-byte `SLOT`s (one `__m256i`); producing a child
|
// so every choice below exists to cut DRAM traffic / random access:
|
||||||
// is a single load/xor/permute/store (`xor_child_avx2`).
|
// * `partition_top` is a parallel counting sort (per-chunk histograms ->
|
||||||
// * A dense parallel `keys[]` array mirrors each entry's leading block, so the
|
// per-chunk base offsets -> disjoint-region scatter), not a serial pass.
|
||||||
// `partition_top`/`count_pairs`/`low_group` histogram sweeps stream over
|
// * It also emits `keys_part` (the leading keys in partition order) so the
|
||||||
// 4 bytes/entry instead of striding the slots.
|
// per-partition `count_pairs`/`low_group` sweeps read keys sequentially
|
||||||
|
// instead of gathering `keys[order[..]]`; `low_group` likewise emits
|
||||||
|
// `keys_sorted` so the emit group walk streams a dense local copy.
|
||||||
|
// * Slots hold residual hash words in a *narrowing* packed slot: round 0 is
|
||||||
|
// `SLOT` words; each round consumes the leading block, so round r packs at
|
||||||
|
// `SLOT - r` words. The XOR child is one 256-bit load/xor/permute that
|
||||||
|
// masked-stores the live lanes (`xor_child_avx2`); buffers carry a
|
||||||
|
// `SLOT_SLACK` pad for the 256-bit over-read of a narrow tail slot.
|
||||||
// * `collide` runs two passes — count pairs, then emit children directly into
|
// * `collide` runs two passes — count pairs, then emit children directly into
|
||||||
// one pre-sized arena at per-partition offsets — so there is no per-bucket
|
// one pre-sized arena at per-partition offsets — so there is no per-bucket
|
||||||
// allocation and no final concatenation copy; the `keys`/`slots` buffers
|
// allocation and no final concatenation copy; the `keys`/`slots` buffers
|
||||||
// ping-pong between rounds.
|
// ping-pong between rounds.
|
||||||
// * Round 0 repacks BLAKE2b digests into 24-bit blocks with `pshufb`.
|
// * Round 0 repacks BLAKE2b digests into 24-bit blocks with `pshufb`.
|
||||||
// A further bandwidth step (not yet taken) is xenoncat's full radix scatter of
|
// The remaining bottleneck is the random slot gather in `emit_bucket` over the
|
||||||
// the payload into bucket-contiguous storage, turning the inner-loop slot
|
// ~1 GiB round buffer — the Equihash memory-hardness floor. A full radix scatter
|
||||||
// gathers into sequential reads.
|
// of the payload into bucket-contiguous storage (to make that gather sequential)
|
||||||
|
// was evaluated and loses: the wide-slot scatter costs more than the gather it
|
||||||
|
// saves on the early, wide rounds.
|
||||||
// ---------------------------------------------------------------------------
|
// ---------------------------------------------------------------------------
|
||||||
|
|
||||||
/// 24-bit collision blocks carried by a round-0 entry (= 8 for 192,7).
|
/// 24-bit collision blocks carried by a round-0 entry (= 8 for 192,7).
|
||||||
@@ -164,36 +172,53 @@ const LOW_BUCKETS: usize = 1 << LOW_BITS;
|
|||||||
/// Mask isolating the low resolved bits.
|
/// Mask isolating the low resolved bits.
|
||||||
const LOW_MASK: u32 = (LOW_BUCKETS - 1) as u32;
|
const LOW_MASK: u32 = (LOW_BUCKETS - 1) as u32;
|
||||||
|
|
||||||
/// Number of u32 words in a padded entry slot (32 bytes = one AVX2 register).
|
/// Number of u32 words in a round-0 entry slot (32 bytes = one AVX2 register).
|
||||||
/// Every round stores its residual hash words in a fixed 8-word slot so the XOR
|
/// Round 0 stores all eight 24-bit blocks; the XOR that produces a child is a
|
||||||
/// that produces a child is a single 256-bit load/xor/permute/store and every
|
/// single 256-bit load/xor/permute — xenoncat's packed-slot trick (and the same
|
||||||
/// slot access is naturally aligned — xenoncat's packed-slot trick (and the same
|
/// `uint4`-aligned-slot idea the CUDA backend uses), on the CPU. The leading
|
||||||
/// `uint4`-aligned-slot idea the CUDA backend already uses), on the CPU. The
|
/// collision word lives in lane 0 and is mirrored into a dense parallel `keys[]`
|
||||||
/// leading collision word lives in lane 0 and is mirrored into a dense parallel
|
/// array so the histogram passes stream over 4 bytes/entry instead of striding
|
||||||
/// `keys[]` array so the histogram passes stream over 4 bytes/entry instead of
|
/// the slots.
|
||||||
/// striding the 32-byte slots.
|
///
|
||||||
|
/// Later rounds use a *narrower* pitch: each collision round consumes the leading
|
||||||
|
/// block, so a round-`r` output entry only carries `8 - r` meaningful words. The
|
||||||
|
/// solver packs each round's slots at that width (see `collide`'s `w_out`),
|
||||||
|
/// cutting the per-round slot-buffer DRAM traffic that bounds the collision rounds.
|
||||||
|
/// The XOR producer still loads a full 256-bit register (over-reading up to
|
||||||
|
/// `SLOT` words past a narrow tail slot, hence the `SLOT_SLACK` pad) but
|
||||||
|
/// masked-stores only the `w_out` live lanes so packed neighbours aren't touched.
|
||||||
const SLOT: usize = 8;
|
const SLOT: usize = 8;
|
||||||
|
|
||||||
/// Scalar child producer: `out[0..8] = (a XOR b)` rotated left one lane; returns
|
/// Trailing pad (in u32 words) on every slot buffer so the XOR producer's 256-bit
|
||||||
/// the child's new leading word (lane 1 of the XOR). Lane 0 of the XOR is the
|
/// load over a narrow tail slot stays in bounds. The over-read reaches at most
|
||||||
/// just-collided block (always zero) and is rotated out.
|
/// `(n-1)*w + SLOT` words for pitch `w`, i.e. `SLOT - w ≤ SLOT` words past the
|
||||||
|
/// `n*w` payload; `SLOT` words always covers it.
|
||||||
|
const SLOT_SLACK: usize = SLOT;
|
||||||
|
|
||||||
|
/// Scalar child producer: writes the `w_out` live words of `(a XOR b)` rotated
|
||||||
|
/// left one lane into `out`, and returns the child's new leading word (lane 1 of
|
||||||
|
/// the XOR). Lane 0 of the XOR is the just-collided block (always zero) and is
|
||||||
|
/// rotated out. `a`/`b` are read a full `SLOT` words wide (the caller pads each
|
||||||
|
/// slot buffer by `SLOT_SLACK`); only `out[0..w_out]` is written.
|
||||||
#[inline]
|
#[inline]
|
||||||
unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
|
unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
|
||||||
let mut x = [0u32; SLOT];
|
let mut x = [0u32; SLOT];
|
||||||
for t in 0..SLOT {
|
for t in 0..SLOT {
|
||||||
x[t] = *a.add(t) ^ *b.add(t);
|
x[t] = *a.add(t) ^ *b.add(t);
|
||||||
}
|
}
|
||||||
for t in 0..SLOT {
|
// out[t] = x[(t + 1) % SLOT]; for t < w_out <= SLOT-1 the modulo is a no-op.
|
||||||
*out.add(t) = x[(t + 1) % SLOT];
|
for t in 0..w_out {
|
||||||
|
*out.add(t) = x[t + 1];
|
||||||
}
|
}
|
||||||
x[1]
|
x[1]
|
||||||
}
|
}
|
||||||
|
|
||||||
/// AVX2 child producer: one `vpxor` + one `vpermd` (rotate the 8 lanes left by
|
/// AVX2 child producer: one `vpxor` + one `vpermd` (rotate the 8 lanes left by
|
||||||
/// one) + one store. Replaces the per-word scalar XOR loop.
|
/// one), then a masked store of the low `w_out` lanes so the packed, `w_out`-
|
||||||
|
/// pitched output never clobbers the next slot.
|
||||||
#[cfg(target_arch = "x86_64")]
|
#[cfg(target_arch = "x86_64")]
|
||||||
#[target_feature(enable = "avx2")]
|
#[target_feature(enable = "avx2")]
|
||||||
unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
|
unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
|
||||||
use core::arch::x86_64::*;
|
use core::arch::x86_64::*;
|
||||||
let x = _mm256_xor_si256(
|
let x = _mm256_xor_si256(
|
||||||
_mm256_loadu_si256(a as *const __m256i),
|
_mm256_loadu_si256(a as *const __m256i),
|
||||||
@@ -201,7 +226,12 @@ unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
|
|||||||
);
|
);
|
||||||
// rotate left by one 32-bit lane: out[i] = x[(i + 1) % 8]
|
// rotate left by one 32-bit lane: out[i] = x[(i + 1) % 8]
|
||||||
let p = _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0));
|
let p = _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0));
|
||||||
_mm256_storeu_si256(out as *mut __m256i, p);
|
// mask = lanes [0, w_out) -> all-ones; maskstore writes only those.
|
||||||
|
let mask = _mm256_cmpgt_epi32(
|
||||||
|
_mm256_set1_epi32(w_out as i32),
|
||||||
|
_mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
|
||||||
|
);
|
||||||
|
_mm256_maskstore_epi32(out as *mut i32, mask, p);
|
||||||
_mm_cvtsi128_si32(_mm256_castsi256_si128(p)) as u32
|
_mm_cvtsi128_si32(_mm256_castsi256_si128(p)) as u32
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -244,38 +274,113 @@ fn repack_index(src: &[u8], dst: &mut [u32]) {
|
|||||||
repack_index_scalar(src, dst);
|
repack_index_scalar(src, dst);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// A raw `*mut u32` shared across rayon workers for the parallel scatter below.
|
||||||
|
/// Sound only because the workers write provably disjoint position sets.
|
||||||
|
#[derive(Clone, Copy)]
|
||||||
|
struct OrderPtr(*mut u32);
|
||||||
|
unsafe impl Send for OrderPtr {}
|
||||||
|
unsafe impl Sync for OrderPtr {}
|
||||||
|
|
||||||
/// Partition the `n` entries into `TOP_BUCKETS` runs by the high `TOP_BITS` of
|
/// Partition the `n` entries into `TOP_BUCKETS` runs by the high `TOP_BITS` of
|
||||||
/// their (dense) leading block. Returns `(starts, order)`, where partition `v`
|
/// their (dense) leading block. Returns `(starts, order, keys_part)`: partition
|
||||||
/// owns the input indices `order[starts[v]..starts[v + 1]]`. The histogram
|
/// `v` owns the input indices `order[starts[v]..starts[v + 1]]`, and `keys_part`
|
||||||
/// passes stream over `keys[]` (4 bytes/entry) instead of striding the slots.
|
/// is the entries' leading words in the same partition-contiguous order
|
||||||
fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>) {
|
/// (`keys_part[p] == keys[order[p]]`). Carrying `keys_part` lets the per-partition
|
||||||
|
/// `count_pairs`/`low_group` sweeps read keys sequentially instead of gathering
|
||||||
|
/// `keys[order[..]]` over the whole array — at the cost of one extra 4-byte/entry
|
||||||
|
/// scatter here, folded into the (already parallel) phase-3 pass.
|
||||||
|
///
|
||||||
|
/// Parallel counting sort: the input is split into one contiguous chunk per
|
||||||
|
/// rayon worker. Each chunk histograms its slice (phase 1), a small serial pass
|
||||||
|
/// turns those into per-chunk base offsets within each bucket's output region
|
||||||
|
/// (phase 2), and each chunk scatters its entries into `order`/`keys_part`
|
||||||
|
/// (phase 3). Chunk `c`'s bucket-`b` writes land in `[off[c][b], off[c+1][b])`,
|
||||||
|
/// disjoint from every other chunk and bucket, so the concurrent writes never
|
||||||
|
/// alias. Entries within a bucket end up chunk-major rather than index-major;
|
||||||
|
/// that reordering is immaterial — `count_pairs`/`low_group` depend only on the
|
||||||
|
/// multiset of low keys, and final solutions are canonicalised, de-duplicated,
|
||||||
|
/// and verified.
|
||||||
|
fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>, Vec<u32>) {
|
||||||
let mut starts = vec![0u32; TOP_BUCKETS + 1];
|
let mut starts = vec![0u32; TOP_BUCKETS + 1];
|
||||||
for k in 0..n {
|
|
||||||
starts[(keys[k] >> LOW_BITS) as usize + 1] += 1;
|
|
||||||
}
|
|
||||||
for i in 0..TOP_BUCKETS {
|
|
||||||
starts[i + 1] += starts[i];
|
|
||||||
}
|
|
||||||
let mut order = vec![0u32; n];
|
let mut order = vec![0u32; n];
|
||||||
let mut cur = starts.clone(); // small: TOP_BUCKETS + 1 entries
|
let mut keys_part = vec![0u32; n];
|
||||||
for k in 0..n {
|
if n == 0 {
|
||||||
let b = (keys[k] >> LOW_BITS) as usize;
|
return (starts, order, keys_part);
|
||||||
order[cur[b] as usize] = k as u32;
|
}
|
||||||
|
|
||||||
|
let nthreads = rayon::current_num_threads().max(1);
|
||||||
|
let chunk = n.div_ceil(nthreads);
|
||||||
|
let nchunks = n.div_ceil(chunk);
|
||||||
|
|
||||||
|
// Phase 1 (parallel): per-chunk top-bucket histograms.
|
||||||
|
let local_hists: Vec<Vec<u32>> = (0..nchunks)
|
||||||
|
.into_par_iter()
|
||||||
|
.map(|c| {
|
||||||
|
let lo = c * chunk;
|
||||||
|
let hi = ((c + 1) * chunk).min(n);
|
||||||
|
let mut h = vec![0u32; TOP_BUCKETS];
|
||||||
|
for &key in &keys[lo..hi] {
|
||||||
|
h[(key >> LOW_BITS) as usize] += 1;
|
||||||
|
}
|
||||||
|
h
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
// Phase 2 (serial, nchunks * TOP_BUCKETS work): bucket starts, then each
|
||||||
|
// chunk's per-bucket base offset within its bucket's output region.
|
||||||
|
let mut totals = vec![0u32; TOP_BUCKETS];
|
||||||
|
for h in &local_hists {
|
||||||
|
for b in 0..TOP_BUCKETS {
|
||||||
|
totals[b] += h[b];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for b in 0..TOP_BUCKETS {
|
||||||
|
starts[b + 1] = starts[b] + totals[b];
|
||||||
|
}
|
||||||
|
let mut offsets = vec![vec![0u32; TOP_BUCKETS]; nchunks];
|
||||||
|
let mut running = starts[..TOP_BUCKETS].to_vec(); // running[b] starts at starts[b]
|
||||||
|
for c in 0..nchunks {
|
||||||
|
for b in 0..TOP_BUCKETS {
|
||||||
|
offsets[c][b] = running[b];
|
||||||
|
running[b] += local_hists[c][b];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Phase 3 (parallel): each chunk scatters into its disjoint sub-ranges.
|
||||||
|
let optr = OrderPtr(order.as_mut_ptr());
|
||||||
|
let kptr = OrderPtr(keys_part.as_mut_ptr());
|
||||||
|
offsets.into_par_iter().enumerate().for_each(|(c, mut cur)| {
|
||||||
|
let (optr, kptr) = (optr, kptr); // capture whole (Sync) wrappers
|
||||||
|
let lo = c * chunk;
|
||||||
|
let hi = ((c + 1) * chunk).min(n);
|
||||||
|
let (obase, kbase) = (optr.0, kptr.0);
|
||||||
|
for k in lo..hi {
|
||||||
|
let key = keys[k];
|
||||||
|
let b = (key >> LOW_BITS) as usize;
|
||||||
|
let pos = cur[b] as usize;
|
||||||
|
// SAFETY: `pos` ranges over `[off[c][b], off[c+1][b])`, a range owned
|
||||||
|
// exclusively by chunk `c` and within `order`/`keys_part` bounds.
|
||||||
|
unsafe {
|
||||||
|
*obase.add(pos) = k as u32;
|
||||||
|
*kbase.add(pos) = key;
|
||||||
|
}
|
||||||
cur[b] += 1;
|
cur[b] += 1;
|
||||||
}
|
}
|
||||||
(starts, order)
|
});
|
||||||
|
|
||||||
|
(starts, order, keys_part)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Count the colliding pairs a partition will emit, from the low-bit histogram
|
/// Count the colliding pairs a partition will emit, from the low-bit histogram
|
||||||
/// alone (no reordering). `clamp` caps each exact-collision group, matching the
|
/// of its (partition-contiguous) leading keys `keys_run`. `clamp` caps each
|
||||||
/// emit pass so the output offsets line up. `hist` is reusable `LOW_BUCKETS`
|
/// exact-collision group, matching the emit pass so the output offsets line up.
|
||||||
/// scratch.
|
/// `hist` is reusable `LOW_BUCKETS` scratch.
|
||||||
fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
|
fn count_pairs(keys_run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
|
||||||
for h in hist.iter_mut() {
|
for h in hist.iter_mut() {
|
||||||
*h = 0;
|
*h = 0;
|
||||||
}
|
}
|
||||||
for &k in run {
|
for &key in keys_run {
|
||||||
hist[(keys[k as usize] & LOW_MASK) as usize] += 1;
|
hist[(key & LOW_MASK) as usize] += 1;
|
||||||
}
|
}
|
||||||
let mut pairs = 0usize;
|
let mut pairs = 0usize;
|
||||||
for i in 0..LOW_BUCKETS {
|
for i in 0..LOW_BUCKETS {
|
||||||
@@ -285,31 +390,44 @@ fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usi
|
|||||||
pairs
|
pairs
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Within one partition, group `run`'s entries by the low bits of their leading
|
/// Within one partition, group its entries by the low bits of their leading
|
||||||
/// block, writing the grouped indices into `sorted`. `hist` is reusable
|
/// block. Inputs are the partition-contiguous slabs `keys_run` (leading keys) and
|
||||||
/// `LOW_BUCKETS + 1` scratch. After this call `sorted` lists the run's indices
|
/// `order_run` (matching global indices). Outputs, in low-key-sorted order:
|
||||||
/// with equal low keys contiguous, so callers recover each exact-collision
|
/// `sorted` (the global indices, for the emit slot gather + back-refs) and
|
||||||
/// group by walking adjacent equal keys.
|
/// `keys_sorted` (the leading keys, so the emit group walk streams a dense local
|
||||||
fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>) {
|
/// array instead of gathering `keys[sorted[i]]`). `hist` is reusable
|
||||||
let m = run.len();
|
/// `LOW_BUCKETS + 1` scratch. Both reads are sequential over the slabs.
|
||||||
|
fn low_group(
|
||||||
|
keys_run: &[u32],
|
||||||
|
order_run: &[u32],
|
||||||
|
hist: &mut [u32],
|
||||||
|
sorted: &mut Vec<u32>,
|
||||||
|
keys_sorted: &mut Vec<u32>,
|
||||||
|
) {
|
||||||
|
let m = order_run.len();
|
||||||
sorted.clear();
|
sorted.clear();
|
||||||
|
keys_sorted.clear();
|
||||||
if m == 0 {
|
if m == 0 {
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
for h in hist.iter_mut() {
|
for h in hist.iter_mut() {
|
||||||
*h = 0;
|
*h = 0;
|
||||||
}
|
}
|
||||||
for &k in run {
|
for &key in keys_run {
|
||||||
hist[(keys[k as usize] & LOW_MASK) as usize + 1] += 1;
|
hist[(key & LOW_MASK) as usize + 1] += 1;
|
||||||
}
|
}
|
||||||
for i in 0..LOW_BUCKETS {
|
for i in 0..LOW_BUCKETS {
|
||||||
hist[i + 1] += hist[i];
|
hist[i + 1] += hist[i];
|
||||||
}
|
}
|
||||||
sorted.resize(m, 0);
|
sorted.resize(m, 0);
|
||||||
|
keys_sorted.resize(m, 0);
|
||||||
// hist[low] now holds the run-start offset; reuse it as the live cursor.
|
// hist[low] now holds the run-start offset; reuse it as the live cursor.
|
||||||
for &k in run {
|
for i in 0..m {
|
||||||
let low = (keys[k as usize] & LOW_MASK) as usize;
|
let key = keys_run[i];
|
||||||
sorted[hist[low] as usize] = k;
|
let low = (key & LOW_MASK) as usize;
|
||||||
|
let pos = hist[low] as usize;
|
||||||
|
sorted[pos] = order_run[i];
|
||||||
|
keys_sorted[pos] = key;
|
||||||
hist[low] += 1;
|
hist[low] += 1;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -320,26 +438,42 @@ fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>)
|
|||||||
/// Monomorphised over the XOR producer so the AVX2 intrinsics inline cleanly
|
/// Monomorphised over the XOR producer so the AVX2 intrinsics inline cleanly
|
||||||
/// inside a `target_feature` wrapper while sharing one source of truth.
|
/// inside a `target_feature` wrapper while sharing one source of truth.
|
||||||
macro_rules! emit_bucket_body {
|
macro_rules! emit_bucket_body {
|
||||||
($keys:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $xor:path) => {{
|
($keys_sorted:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $w_in:expr, $w_out:expr, $prefetch:expr, $xor:path) => {{
|
||||||
let s = $sorted;
|
let s = $sorted;
|
||||||
|
let ks = $keys_sorted; // leading keys in `s`-order; group walk streams it
|
||||||
let m = s.len();
|
let m = s.len();
|
||||||
|
let w_in = $w_in;
|
||||||
|
let w_out = $w_out;
|
||||||
let mut w = 0usize;
|
let mut w = 0usize;
|
||||||
let mut i = 0;
|
let mut i = 0;
|
||||||
while i < m {
|
while i < m {
|
||||||
let key = $keys[s[i] as usize] & LOW_MASK;
|
let key = ks[i] & LOW_MASK;
|
||||||
let mut j = i + 1;
|
let mut j = i + 1;
|
||||||
while j < m && ($keys[s[j] as usize] & LOW_MASK) == key {
|
while j < m && (ks[j] & LOW_MASK) == key {
|
||||||
j += 1;
|
j += 1;
|
||||||
}
|
}
|
||||||
let hi = j.min(i.saturating_add($clamp));
|
let hi = j.min(i.saturating_add($clamp));
|
||||||
|
// The pair loops below read `slots[s[a]]` / `slots[s[b]]` at random
|
||||||
|
// global positions in the ~1 GB slot arena — each a likely cache
|
||||||
|
// miss. Prefetch this group's member slots up front so the misses
|
||||||
|
// overlap with the (L1-resident) pair XORs that follow.
|
||||||
|
#[cfg(target_arch = "x86_64")]
|
||||||
|
if $prefetch {
|
||||||
|
for a in i..hi {
|
||||||
|
core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T0 }>(
|
||||||
|
$slots.as_ptr().add(s[a] as usize * w_in) as *const i8,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
for a in i..hi {
|
for a in i..hi {
|
||||||
let l = s[a] as usize;
|
let l = s[a] as usize;
|
||||||
for b in (a + 1)..hi {
|
for b in (a + 1)..hi {
|
||||||
let mr = s[b] as usize;
|
let mr = s[b] as usize;
|
||||||
let nk = $xor(
|
let nk = $xor(
|
||||||
$sout.as_mut_ptr().add(w * SLOT),
|
$sout.as_mut_ptr().add(w * w_out),
|
||||||
$slots.as_ptr().add(l * SLOT),
|
$slots.as_ptr().add(l * w_in),
|
||||||
$slots.as_ptr().add(mr * SLOT),
|
$slots.as_ptr().add(mr * w_in),
|
||||||
|
w_out,
|
||||||
);
|
);
|
||||||
$kout[w] = nk;
|
$kout[w] = nk;
|
||||||
$pout[w] = ((l as u64) << 32) | mr as u64;
|
$pout[w] = ((l as u64) << 32) | mr as u64;
|
||||||
@@ -353,69 +487,101 @@ macro_rules! emit_bucket_body {
|
|||||||
}
|
}
|
||||||
|
|
||||||
unsafe fn emit_bucket_scalar(
|
unsafe fn emit_bucket_scalar(
|
||||||
keys: &[u32],
|
keys_sorted: &[u32],
|
||||||
slots: &[u32],
|
slots: &[u32],
|
||||||
sorted: &[u32],
|
sorted: &[u32],
|
||||||
kout: &mut [u32],
|
kout: &mut [u32],
|
||||||
sout: &mut [u32],
|
sout: &mut [u32],
|
||||||
pout: &mut [u64],
|
pout: &mut [u64],
|
||||||
clamp: usize,
|
clamp: usize,
|
||||||
|
w_in: usize,
|
||||||
|
w_out: usize,
|
||||||
|
prefetch: bool,
|
||||||
) -> usize {
|
) -> usize {
|
||||||
emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_scalar)
|
emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_scalar)
|
||||||
}
|
}
|
||||||
|
|
||||||
#[cfg(target_arch = "x86_64")]
|
#[cfg(target_arch = "x86_64")]
|
||||||
#[target_feature(enable = "avx2")]
|
#[target_feature(enable = "avx2")]
|
||||||
unsafe fn emit_bucket_avx2(
|
unsafe fn emit_bucket_avx2(
|
||||||
keys: &[u32],
|
keys_sorted: &[u32],
|
||||||
slots: &[u32],
|
slots: &[u32],
|
||||||
sorted: &[u32],
|
sorted: &[u32],
|
||||||
kout: &mut [u32],
|
kout: &mut [u32],
|
||||||
sout: &mut [u32],
|
sout: &mut [u32],
|
||||||
pout: &mut [u64],
|
pout: &mut [u64],
|
||||||
clamp: usize,
|
clamp: usize,
|
||||||
|
w_in: usize,
|
||||||
|
w_out: usize,
|
||||||
|
prefetch: bool,
|
||||||
) -> usize {
|
) -> usize {
|
||||||
emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_avx2)
|
emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_avx2)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Emit a partition's children, dispatching to the AVX2 producer when available.
|
/// Emit a partition's children, dispatching to the AVX2 producer when available.
|
||||||
|
/// `w_in`/`w_out` are the input/output slot pitches (`w_out == w_in - 1`).
|
||||||
|
/// `prefetch` software-prefetches each group's randomly-gathered member slots.
|
||||||
unsafe fn emit_bucket(
|
unsafe fn emit_bucket(
|
||||||
keys: &[u32],
|
keys_sorted: &[u32],
|
||||||
slots: &[u32],
|
slots: &[u32],
|
||||||
sorted: &[u32],
|
sorted: &[u32],
|
||||||
kout: &mut [u32],
|
kout: &mut [u32],
|
||||||
sout: &mut [u32],
|
sout: &mut [u32],
|
||||||
pout: &mut [u64],
|
pout: &mut [u64],
|
||||||
clamp: usize,
|
clamp: usize,
|
||||||
|
w_in: usize,
|
||||||
|
w_out: usize,
|
||||||
|
prefetch: bool,
|
||||||
) -> usize {
|
) -> usize {
|
||||||
#[cfg(target_arch = "x86_64")]
|
#[cfg(target_arch = "x86_64")]
|
||||||
{
|
{
|
||||||
if is_x86_feature_detected!("avx2") {
|
if is_x86_feature_detected!("avx2") {
|
||||||
return emit_bucket_avx2(keys, slots, sorted, kout, sout, pout, clamp);
|
return emit_bucket_avx2(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
emit_bucket_scalar(keys, slots, sorted, kout, sout, pout, clamp)
|
emit_bucket_scalar(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Group `n` entries by their leading block, then emit one child per colliding
|
/// Group `n` entries by their leading block, then emit one child per colliding
|
||||||
/// pair: the XOR of the residual blocks (rotated into a fresh 8-word slot) plus
|
/// pair: the XOR of the residual blocks (rotated into a fresh slot) plus a packed
|
||||||
/// a packed `(l << 32) | mr` back-reference. Two passes — count, then emit
|
/// `(l << 32) | mr` back-reference. Two passes — count, then emit directly into
|
||||||
/// directly into one pre-sized arena — so there is no per-partition allocation
|
/// one pre-sized arena — so there is no per-partition allocation or final
|
||||||
/// or final concatenation copy. Returns `(keys_out, slots_out, parents)`.
|
/// concatenation copy. Returns `(keys_out, slots_out, parents)`.
|
||||||
fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
|
///
|
||||||
let (starts, order) = partition_top(keys, n);
|
/// `w_in` is the input slot pitch; the round consumes one block, so the output is
|
||||||
|
/// packed at `w_out = w_in - 1` words/slot. `slots_out` carries a `SLOT_SLACK`
|
||||||
|
/// trailing pad so the next round's 256-bit over-read stays in bounds.
|
||||||
|
fn collide(
|
||||||
|
keys: &[u32],
|
||||||
|
slots: &[u32],
|
||||||
|
n: usize,
|
||||||
|
clamp: usize,
|
||||||
|
w_in: usize,
|
||||||
|
) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
|
||||||
|
let w_out = w_in - 1;
|
||||||
|
// Sub-phase timing, gated on `EQ_PROFILE`. Prints partition / count / alloc /
|
||||||
|
// emit splits so we can see which part of the round dominates.
|
||||||
|
let prof = std::env::var_os("EQ_PROFILE").is_some();
|
||||||
|
// Software-prefetch the emit gather (on by default; `EQ_NO_PREFETCH` disables
|
||||||
|
// it for A/B benchmarking). Read once, not per group.
|
||||||
|
let do_prefetch = std::env::var_os("EQ_NO_PREFETCH").is_none();
|
||||||
|
let t0 = std::time::Instant::now();
|
||||||
|
|
||||||
// Pass 1: per-partition child counts (histogram-derived, no reordering).
|
let (starts, order, keys_part) = partition_top(keys, n);
|
||||||
|
let t_part = std::time::Instant::now();
|
||||||
|
|
||||||
|
// Pass 1: per-partition child counts from each partition's contiguous keys.
|
||||||
let counts: Vec<usize> = (0..TOP_BUCKETS)
|
let counts: Vec<usize> = (0..TOP_BUCKETS)
|
||||||
.into_par_iter()
|
.into_par_iter()
|
||||||
.map_init(
|
.map_init(
|
||||||
|| vec![0u32; LOW_BUCKETS],
|
|| vec![0u32; LOW_BUCKETS],
|
||||||
|hist, v| {
|
|hist, v| {
|
||||||
let run = &order[starts[v] as usize..starts[v + 1] as usize];
|
let keys_run = &keys_part[starts[v] as usize..starts[v + 1] as usize];
|
||||||
count_pairs(keys, run, hist, clamp)
|
count_pairs(keys_run, hist, clamp)
|
||||||
},
|
},
|
||||||
)
|
)
|
||||||
.collect();
|
.collect();
|
||||||
|
let t_count = std::time::Instant::now();
|
||||||
|
|
||||||
let mut out_starts = vec![0usize; TOP_BUCKETS + 1];
|
let mut out_starts = vec![0usize; TOP_BUCKETS + 1];
|
||||||
for v in 0..TOP_BUCKETS {
|
for v in 0..TOP_BUCKETS {
|
||||||
@@ -423,7 +589,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
|||||||
}
|
}
|
||||||
let total = out_starts[TOP_BUCKETS];
|
let total = out_starts[TOP_BUCKETS];
|
||||||
let mut keys_out = vec![0u32; total];
|
let mut keys_out = vec![0u32; total];
|
||||||
let mut slots_out = vec![0u32; total * SLOT];
|
// Packed at `w_out` words/slot, plus a slack pad for the next round's over-read.
|
||||||
|
let mut slots_out = vec![0u32; total * w_out + SLOT_SLACK];
|
||||||
let mut parents = vec![0u64; total];
|
let mut parents = vec![0u64; total];
|
||||||
|
|
||||||
// Carve the output arena into disjoint per-partition sub-slices so workers
|
// Carve the output arena into disjoint per-partition sub-slices so workers
|
||||||
@@ -435,7 +602,7 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
|||||||
let (mut kr, mut sr, mut pr) = (&mut keys_out[..], &mut slots_out[..], &mut parents[..]);
|
let (mut kr, mut sr, mut pr) = (&mut keys_out[..], &mut slots_out[..], &mut parents[..]);
|
||||||
for &c in &counts {
|
for &c in &counts {
|
||||||
let (kh, kt) = kr.split_at_mut(c);
|
let (kh, kt) = kr.split_at_mut(c);
|
||||||
let (sh, st) = sr.split_at_mut(c * SLOT);
|
let (sh, st) = sr.split_at_mut(c * w_out);
|
||||||
let (ph, pt) = pr.split_at_mut(c);
|
let (ph, pt) = pr.split_at_mut(c);
|
||||||
kparts.push(kh);
|
kparts.push(kh);
|
||||||
sparts.push(sh);
|
sparts.push(sh);
|
||||||
@@ -446,6 +613,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
let t_alloc = std::time::Instant::now();
|
||||||
|
|
||||||
// Pass 2: group each partition and emit its colliding pairs in place.
|
// Pass 2: group each partition and emit its colliding pairs in place.
|
||||||
kparts
|
kparts
|
||||||
.into_par_iter()
|
.into_par_iter()
|
||||||
@@ -453,37 +622,53 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
|||||||
.zip(pparts)
|
.zip(pparts)
|
||||||
.enumerate()
|
.enumerate()
|
||||||
.for_each(|(v, ((kout, sout), pout))| {
|
.for_each(|(v, ((kout, sout), pout))| {
|
||||||
let run = &order[starts[v] as usize..starts[v + 1] as usize];
|
let lo = starts[v] as usize;
|
||||||
|
let hi = starts[v + 1] as usize;
|
||||||
let mut hist = vec![0u32; LOW_BUCKETS + 1];
|
let mut hist = vec![0u32; LOW_BUCKETS + 1];
|
||||||
let mut sorted = Vec::new();
|
let mut sorted = Vec::new();
|
||||||
low_group(keys, run, &mut hist, &mut sorted);
|
let mut keys_sorted = Vec::new();
|
||||||
let w = unsafe { emit_bucket(keys, slots, &sorted, kout, sout, pout, clamp) };
|
low_group(&keys_part[lo..hi], &order[lo..hi], &mut hist, &mut sorted, &mut keys_sorted);
|
||||||
|
let w = unsafe {
|
||||||
|
emit_bucket(&keys_sorted, slots, &sorted, kout, sout, pout, clamp, w_in, w_out, do_prefetch)
|
||||||
|
};
|
||||||
debug_assert_eq!(w, kout.len());
|
debug_assert_eq!(w, kout.len());
|
||||||
});
|
});
|
||||||
|
|
||||||
|
if prof {
|
||||||
|
let ms = |a: std::time::Instant, b: std::time::Instant| (b - a).as_secs_f64() * 1000.0;
|
||||||
|
eprintln!(
|
||||||
|
" [collide n={n}] partition {:6.1} count {:6.1} alloc {:6.1} emit(group+xor) {:6.1} ms",
|
||||||
|
ms(t0, t_part),
|
||||||
|
ms(t_part, t_count),
|
||||||
|
ms(t_count, t_alloc),
|
||||||
|
ms(t_alloc, std::time::Instant::now()),
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
(keys_out, slots_out, parents)
|
(keys_out, slots_out, parents)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Final round (slots hold `[w0, w1, …]`): among entries sharing leading block
|
/// Final round (slots hold `[w0, w1, …]` at pitch `w_in`): among entries sharing
|
||||||
/// `w0`, a pair whose `w1` also matches XORs the last two blocks to zero — a
|
/// leading block `w0`, a pair whose `w1` also matches XORs the last two blocks to
|
||||||
/// candidate. Returns the `(l, mr)` parents of each candidate.
|
/// zero — a candidate. Returns the `(l, mr)` parents of each candidate.
|
||||||
fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u32, u32)> {
|
fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize, w_in: usize) -> Vec<(u32, u32)> {
|
||||||
let (starts, order) = partition_top(keys, n);
|
let (starts, order, keys_part) = partition_top(keys, n);
|
||||||
|
|
||||||
(0..TOP_BUCKETS)
|
(0..TOP_BUCKETS)
|
||||||
.into_par_iter()
|
.into_par_iter()
|
||||||
.map_init(
|
.map_init(
|
||||||
|| (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new()),
|
|| (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new(), Vec::<u32>::new()),
|
||||||
|(hist, sorted), v| {
|
|(hist, sorted, keys_sorted), v| {
|
||||||
let run = &order[starts[v] as usize..starts[v + 1] as usize];
|
let lo = starts[v] as usize;
|
||||||
low_group(keys, run, hist, sorted);
|
let hi = starts[v + 1] as usize;
|
||||||
|
low_group(&keys_part[lo..hi], &order[lo..hi], hist, sorted, keys_sorted);
|
||||||
let m = sorted.len();
|
let m = sorted.len();
|
||||||
let mut local = Vec::new();
|
let mut local = Vec::new();
|
||||||
let mut i = 0;
|
let mut i = 0;
|
||||||
while i < m {
|
while i < m {
|
||||||
let key = keys[sorted[i] as usize] & LOW_MASK;
|
let key = keys_sorted[i] & LOW_MASK;
|
||||||
let mut j = i + 1;
|
let mut j = i + 1;
|
||||||
while j < m && (keys[sorted[j] as usize] & LOW_MASK) == key {
|
while j < m && (keys_sorted[j] & LOW_MASK) == key {
|
||||||
j += 1;
|
j += 1;
|
||||||
}
|
}
|
||||||
let hi = j.min(i.saturating_add(clamp));
|
let hi = j.min(i.saturating_add(clamp));
|
||||||
@@ -491,7 +676,7 @@ fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u3
|
|||||||
let l = sorted[a] as usize;
|
let l = sorted[a] as usize;
|
||||||
for b in (a + 1)..hi {
|
for b in (a + 1)..hi {
|
||||||
let mr = sorted[b] as usize;
|
let mr = sorted[b] as usize;
|
||||||
if slots[l * SLOT + 1] == slots[mr * SLOT + 1] {
|
if slots[l * w_in + 1] == slots[mr * w_in + 1] {
|
||||||
local.push((l as u32, mr as u32));
|
local.push((l as u32, mr as u32));
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -532,6 +717,22 @@ pub fn solve(header: &[u8]) -> Vec<Vec<u32>> {
|
|||||||
pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
||||||
let clamp = clamp.unwrap_or(usize::MAX);
|
let clamp = clamp.unwrap_or(usize::MAX);
|
||||||
|
|
||||||
|
// Optional per-phase timing, gated on `EQ_PROFILE` (any value). Zero cost
|
||||||
|
// when unset — the only overhead is one env lookup. Each `phase(label, n)`
|
||||||
|
// prints the wall time since the previous call and the live entry count.
|
||||||
|
let prof = std::env::var_os("EQ_PROFILE").is_some();
|
||||||
|
let mut t_last = std::time::Instant::now();
|
||||||
|
let mut phase = |label: &str, n: usize| {
|
||||||
|
if prof {
|
||||||
|
let now = std::time::Instant::now();
|
||||||
|
eprintln!(
|
||||||
|
" [profile] {label:<13} {:8.1} ms (n={n})",
|
||||||
|
(now - t_last).as_secs_f64() * 1000.0
|
||||||
|
);
|
||||||
|
t_last = now;
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
// ---- round 0: hash every index into NBLK0 big-endian 24-bit blocks, stored
|
// ---- round 0: hash every index into NBLK0 big-endian 24-bit blocks, stored
|
||||||
// in padded 8-word slots with the leading block mirrored into `keys`. Entry
|
// in padded 8-word slots with the leading block mirrored into `keys`. Entry
|
||||||
// k corresponds to leaf index k (the gen order), so no leaf table is needed.
|
// k corresponds to leaf index k (the gen order), so no leaf table is needed.
|
||||||
@@ -541,18 +742,29 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
|||||||
let hasher = BatchHasher::new(header);
|
let hasher = BatchHasher::new(header);
|
||||||
let mut keys: Vec<u32> = vec![0u32; n0];
|
let mut keys: Vec<u32> = vec![0u32; n0];
|
||||||
let mut slots: Vec<u32> = vec![0u32; n0 * SLOT];
|
let mut slots: Vec<u32> = vec![0u32; n0 * SLOT];
|
||||||
let kgroup = 4 * INDICES_PER_HASH_OUTPUT; // eight entries
|
// Hash eight `g` values per chunk via `hash8`, which uses one AVX-512
|
||||||
|
// compression where available and falls back to two AVX2 `hash4`s (or scalar)
|
||||||
|
// otherwise — so this single path covers every CPU.
|
||||||
|
let kgroup = 8 * INDICES_PER_HASH_OUTPUT; // sixteen entries
|
||||||
let sgroup = kgroup * SLOT;
|
let sgroup = kgroup * SLOT;
|
||||||
debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 4-g groups");
|
debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 8-g groups");
|
||||||
|
// `EQ_NO_AVX512` forces the AVX2 fallback (two hash4) for A/B benchmarking.
|
||||||
|
let use_avx512 = std::env::var_os("EQ_NO_AVX512").is_none();
|
||||||
slots
|
slots
|
||||||
.par_chunks_mut(sgroup)
|
.par_chunks_mut(sgroup)
|
||||||
.zip(keys.par_chunks_mut(kgroup))
|
.zip(keys.par_chunks_mut(kgroup))
|
||||||
.enumerate()
|
.enumerate()
|
||||||
.for_each(|(c, (schunk, kchunk))| {
|
.for_each(|(c, (schunk, kchunk))| {
|
||||||
let g0 = (c * 4) as u32;
|
let g0 = (c * 8) as u32;
|
||||||
let mut hs = [[0u8; HASH_OUTPUT]; 4];
|
let mut hs = [[0u8; HASH_OUTPUT]; 8];
|
||||||
hasher.hash4(g0, &mut hs);
|
if use_avx512 {
|
||||||
for j in 0..4 {
|
hasher.hash8(g0, &mut hs);
|
||||||
|
} else {
|
||||||
|
let (lo, hi) = hs.split_at_mut(4);
|
||||||
|
hasher.hash4(g0, (&mut lo[..4]).try_into().unwrap());
|
||||||
|
hasher.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
|
||||||
|
}
|
||||||
|
for j in 0..8 {
|
||||||
for i in 0..INDICES_PER_HASH_OUTPUT {
|
for i in 0..INDICES_PER_HASH_OUTPUT {
|
||||||
let e = j * INDICES_PER_HASH_OUTPUT + i;
|
let e = j * INDICES_PER_HASH_OUTPUT + i;
|
||||||
let src = &hs[j][i * HASH_BYTES..i * HASH_BYTES + HASH_BYTES];
|
let src = &hs[j][i * HASH_BYTES..i * HASH_BYTES + HASH_BYTES];
|
||||||
@@ -567,20 +779,27 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
|||||||
// parents[t] (t = 0..K-2) maps a round-(t+1) entry to its two round-t parents.
|
// parents[t] (t = 0..K-2) maps a round-(t+1) entry to its two round-t parents.
|
||||||
// `keys`/`slots` ping-pong between rounds (the previous buffers are freed as
|
// `keys`/`slots` ping-pong between rounds (the previous buffers are freed as
|
||||||
// the new ones replace them).
|
// the new ones replace them).
|
||||||
|
phase("round0-hash", n0);
|
||||||
let mut parents: Vec<Vec<u64>> = Vec::with_capacity(K - 1);
|
let mut parents: Vec<Vec<u64>> = Vec::with_capacity(K - 1);
|
||||||
let mut n = n0;
|
let mut n = n0;
|
||||||
for _ in 0..(K - 1) {
|
// Round-0 slots carry all SLOT words; each collision round consumes the
|
||||||
let (ok, os, op) = collide(&keys, &slots, n, clamp);
|
// leading block, so the next round's slots are one word narrower.
|
||||||
|
let mut width = SLOT;
|
||||||
|
for r in 0..(K - 1) {
|
||||||
|
let (ok, os, op) = collide(&keys, &slots, n, clamp, width);
|
||||||
n = op.len();
|
n = op.len();
|
||||||
parents.push(op);
|
parents.push(op);
|
||||||
keys = ok;
|
keys = ok;
|
||||||
slots = os;
|
slots = os;
|
||||||
|
width -= 1;
|
||||||
|
phase(&format!("collide r{}", r + 1), n);
|
||||||
if n == 0 {
|
if n == 0 {
|
||||||
return Vec::new();
|
return Vec::new();
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
let candidates = collide_final(&keys, &slots, n, clamp);
|
let candidates = collide_final(&keys, &slots, n, clamp, width);
|
||||||
|
phase("collide-final", candidates.len());
|
||||||
if candidates.is_empty() {
|
if candidates.is_empty() {
|
||||||
return Vec::new();
|
return Vec::new();
|
||||||
}
|
}
|
||||||
@@ -605,8 +824,11 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
|||||||
// refs are now round-0 indices == leaf indices.
|
// refs are now round-0 indices == leaf indices.
|
||||||
recovered.extend_from_slice(&refs);
|
recovered.extend_from_slice(&refs);
|
||||||
}
|
}
|
||||||
|
phase("recover", recovered.len() / SOLUTION_INDICES);
|
||||||
|
|
||||||
filter_candidates(&base_state(header), &recovered)
|
let result = filter_candidates(&base_state(header), &recovered);
|
||||||
|
phase("filter+verify", result.len());
|
||||||
|
result
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Re-order recovered leaf indices into the canonical solution ordering: at
|
/// Re-order recovered leaf indices into the canonical solution ordering: at
|
||||||
@@ -741,26 +963,34 @@ mod tests {
|
|||||||
assert_eq!(a[0], (src[0] as u32) << 16 | (src[1] as u32) << 8 | src[2] as u32);
|
assert_eq!(a[0], (src[0] as u32) << 16 | (src[1] as u32) << 8 | src[2] as u32);
|
||||||
}
|
}
|
||||||
|
|
||||||
// AVX2 XOR-child (xor + rotate-left-one-lane) must match the scalar version.
|
// The XOR-child producers (scalar + AVX2) must agree for every output width,
|
||||||
|
// write exactly `w_out` words (rotate-left-one of the lane-wise XOR), and
|
||||||
|
// leave the rest of the packed buffer untouched (the AVX2 path masked-stores).
|
||||||
#[test]
|
#[test]
|
||||||
fn xor_child_matches_scalar() {
|
fn xor_child_matches_scalar() {
|
||||||
|
const SENT: u32 = 0xDEAD_BEEF;
|
||||||
let pa: [u32; SLOT] = [9, 8, 7, 6, 5, 4, 3, 2];
|
let pa: [u32; SLOT] = [9, 8, 7, 6, 5, 4, 3, 2];
|
||||||
let pb: [u32; SLOT] = [1, 2, 3, 4, 5, 6, 7, 8];
|
let pb: [u32; SLOT] = [1, 2, 3, 4, 5, 6, 7, 8];
|
||||||
let mut o1 = [0u32; SLOT];
|
|
||||||
let mut o2 = [0u32; SLOT];
|
|
||||||
let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
|
|
||||||
// Reference: rotate-left-one of the lane-wise XOR; new key = lane 1.
|
|
||||||
let x: Vec<u32> = (0..SLOT).map(|i| pa[i] ^ pb[i]).collect();
|
let x: Vec<u32> = (0..SLOT).map(|i| pa[i] ^ pb[i]).collect();
|
||||||
let expect: Vec<u32> = (0..SLOT).map(|i| x[(i + 1) % SLOT]).collect();
|
for w_out in 1..SLOT {
|
||||||
assert_eq!(&o1[..], &expect[..]);
|
let mut o1 = [SENT; SLOT];
|
||||||
|
let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
|
||||||
|
for t in 0..w_out {
|
||||||
|
assert_eq!(o1[t], x[(t + 1) % SLOT], "scalar word {t} (w_out={w_out})");
|
||||||
|
}
|
||||||
|
for t in w_out..SLOT {
|
||||||
|
assert_eq!(o1[t], SENT, "scalar wrote past w_out at {t} (w_out={w_out})");
|
||||||
|
}
|
||||||
assert_eq!(k1, x[1]);
|
assert_eq!(k1, x[1]);
|
||||||
#[cfg(target_arch = "x86_64")]
|
#[cfg(target_arch = "x86_64")]
|
||||||
if is_x86_feature_detected!("avx2") {
|
if is_x86_feature_detected!("avx2") {
|
||||||
let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
|
let mut o2 = [SENT; SLOT];
|
||||||
assert_eq!(o1, o2, "avx2 xor_child != scalar");
|
let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
|
||||||
|
assert_eq!(o1, o2, "avx2 xor_child != scalar (w_out={w_out})");
|
||||||
assert_eq!(k1, k2);
|
assert_eq!(k1, k2);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// Clamped solve over a fixed header must find valid solutions (Equihash
|
// Clamped solve over a fixed header must find valid solutions (Equihash
|
||||||
// 192,7 yields ~2 per nonce). The clamp bounds the naive algorithm's
|
// 192,7 yields ~2 per nonce). The clamp bounds the naive algorithm's
|
||||||
@@ -780,4 +1010,97 @@ mod tests {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// A deterministic dense 140-byte header for benchmarking, filled by an
|
||||||
|
/// xorshift64* stream keyed on `seed` (the same scheme as `main::pseudo_header`)
|
||||||
|
/// so each nonce exercises a realistic, fully-populated header rather than the
|
||||||
|
/// degenerate all-`0x42` one.
|
||||||
|
#[cfg(test)]
|
||||||
|
fn bench_header(seed: u64) -> Vec<u8> {
|
||||||
|
let mut header = vec![0u8; HEADER_LEN];
|
||||||
|
let mut x = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
|
||||||
|
for b in header.iter_mut() {
|
||||||
|
x ^= x >> 12;
|
||||||
|
x ^= x << 25;
|
||||||
|
x ^= x >> 27;
|
||||||
|
*b = (x.wrapping_mul(0x2545F4914F6CDD1D) >> 33) as u8;
|
||||||
|
}
|
||||||
|
header
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Steady-state throughput baseline for the CPU solver. Ignored by default;
|
||||||
|
/// run it in release (the only meaningful configuration — `debug_assert`s in
|
||||||
|
/// the collision hot path are off and `opt-level=3`/LTO is on) with output
|
||||||
|
/// visible:
|
||||||
|
///
|
||||||
|
/// ```text
|
||||||
|
/// cargo test --release full_solve_baseline -- --ignored --nocapture
|
||||||
|
/// ```
|
||||||
|
///
|
||||||
|
/// Tunables via env:
|
||||||
|
/// * `EQ_BENCH_ITERS` — timed solves per clamp (default 8).
|
||||||
|
/// * `EQ_BENCH_CLAMPS` — comma-separated clamp sweep (default `32`, the
|
||||||
|
/// value the miner and `--selftest` use).
|
||||||
|
///
|
||||||
|
/// Each timed iteration uses a distinct nonce so the figures average over
|
||||||
|
/// per-header variance; one warm-up solve per clamp (pages in the ~GB of
|
||||||
|
/// round buffers, warms caches) is excluded. This measures one solve at a
|
||||||
|
/// time across the whole rayon pool — i.e. solver latency and its memory
|
||||||
|
/// behaviour, which is exactly what the planned radix-scatter change targets
|
||||||
|
/// — not the aggregate multi-group mining rate `--bench` reports.
|
||||||
|
#[test]
|
||||||
|
#[ignore]
|
||||||
|
fn full_solve_baseline() {
|
||||||
|
use std::time::Instant;
|
||||||
|
|
||||||
|
let iters: usize = std::env::var("EQ_BENCH_ITERS")
|
||||||
|
.ok()
|
||||||
|
.and_then(|v| v.parse().ok())
|
||||||
|
.unwrap_or(8);
|
||||||
|
let clamps: Vec<usize> = std::env::var("EQ_BENCH_CLAMPS")
|
||||||
|
.ok()
|
||||||
|
.map(|v| v.split(',').filter_map(|s| s.trim().parse().ok()).collect())
|
||||||
|
.filter(|v: &Vec<usize>| !v.is_empty())
|
||||||
|
.unwrap_or_else(|| vec![32]);
|
||||||
|
|
||||||
|
eprintln!(
|
||||||
|
"CPU solver baseline — {iters} timed solve(s)/clamp, clamps={clamps:?}, \
|
||||||
|
rayon threads={}",
|
||||||
|
rayon::current_num_threads()
|
||||||
|
);
|
||||||
|
|
||||||
|
for &clamp in &clamps {
|
||||||
|
// Warm-up (excluded): fault in buffers, warm caches.
|
||||||
|
let _ = solve_with(&bench_header(0), Some(clamp));
|
||||||
|
|
||||||
|
let mut times = Vec::with_capacity(iters);
|
||||||
|
let mut total_sols = 0usize;
|
||||||
|
for i in 0..iters {
|
||||||
|
let header = bench_header(i as u64 + 1);
|
||||||
|
let base = base_state(&header);
|
||||||
|
let t = Instant::now();
|
||||||
|
let sols = solve_with(&header, Some(clamp));
|
||||||
|
times.push(t.elapsed().as_secs_f64());
|
||||||
|
for s in &sols {
|
||||||
|
assert!(is_valid_solution(&base, s), "solver returned an invalid solution");
|
||||||
|
}
|
||||||
|
total_sols += sols.len();
|
||||||
|
}
|
||||||
|
|
||||||
|
times.sort_by(|a, b| a.partial_cmp(b).unwrap());
|
||||||
|
let n = times.len();
|
||||||
|
let sum: f64 = times.iter().sum();
|
||||||
|
let mean = 1000.0 * sum / n as f64;
|
||||||
|
let median = 1000.0 * times[n / 2];
|
||||||
|
let min = 1000.0 * times[0];
|
||||||
|
let max = 1000.0 * times[n - 1];
|
||||||
|
let solves_per_s = n as f64 / sum;
|
||||||
|
eprintln!(
|
||||||
|
"clamp={clamp:>3}: {solves_per_s:6.2} solve/s | mean {mean:6.0} ms \
|
||||||
|
median {median:6.0} ms min {min:6.0} ms max {max:6.0} ms | \
|
||||||
|
{total_sols} sol over {n} solve(s) ({:.2}/solve)",
|
||||||
|
total_sols as f64 / n as f64,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
+148
-50
@@ -411,6 +411,9 @@ impl LegacySolver {
|
|||||||
/// (`equihash.cl`) everywhere else. Forceable with `ZCL_OPENCL_KERNEL=amd|legacy`.
|
/// (`equihash.cl`) everywhere else. Forceable with `ZCL_OPENCL_KERNEL=amd|legacy`.
|
||||||
pub struct GpuSolver {
|
pub struct GpuSolver {
|
||||||
inner: SolverInner,
|
inner: SolverInner,
|
||||||
|
/// Per-card telemetry handle (AMD amdgpu sysfs / NVML), matched to the device
|
||||||
|
/// by PCI bus id. `None` when no telemetry backend matches.
|
||||||
|
tuner: Option<Box<dyn crate::gpu_tune::GpuTuner>>,
|
||||||
}
|
}
|
||||||
|
|
||||||
enum SolverInner {
|
enum SolverInner {
|
||||||
@@ -423,13 +426,16 @@ impl GpuSolver {
|
|||||||
/// device vendor (AMD → `equihash192_7.cl`).
|
/// device vendor (AMD → `equihash192_7.cl`).
|
||||||
pub fn new(device_index: usize) -> Result<Self> {
|
pub fn new(device_index: usize) -> Result<Self> {
|
||||||
let (platform, device) = pick_device(device_index)?;
|
let (platform, device) = pick_device(device_index)?;
|
||||||
|
// Resolve a telemetry handle (AMD sysfs / NVML) from the device's PCI bus
|
||||||
|
// before `device` is consumed by the inner solver.
|
||||||
|
let tuner = device_pci_bus_id(&device).and_then(|bus| crate::gpu_tune::open(&bus));
|
||||||
let inner = if use_amd_kernel(&device) {
|
let inner = if use_amd_kernel(&device) {
|
||||||
log::info!("OpenCL: AMD device — using the equihash192_7 kernel");
|
log::info!("OpenCL: AMD device — using the equihash192_7 kernel");
|
||||||
SolverInner::Amd(crate::gpu_amd::AmdSolver::new(platform, device)?)
|
SolverInner::Amd(crate::gpu_amd::AmdSolver::new(platform, device)?)
|
||||||
} else {
|
} else {
|
||||||
SolverInner::Legacy(LegacySolver::new(platform, device)?)
|
SolverInner::Legacy(LegacySolver::new(platform, device)?)
|
||||||
};
|
};
|
||||||
Ok(Self { inner })
|
Ok(Self { inner, tuner })
|
||||||
}
|
}
|
||||||
|
|
||||||
/// This device's product name, if available.
|
/// This device's product name, if available.
|
||||||
@@ -440,6 +446,34 @@ impl GpuSolver {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Current board power draw in watts (telemetry handle), if available.
|
||||||
|
pub fn power_watts(&self) -> Option<f64> {
|
||||||
|
self.tuner.as_ref().and_then(|t| t.watts())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Current GPU temperature in °C, if available.
|
||||||
|
pub fn temperature_c(&self) -> Option<u32> {
|
||||||
|
self.tuner.as_ref().and_then(|t| t.temperature_c())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Currently enforced power limit in watts, if available.
|
||||||
|
pub fn current_power_limit_w(&self) -> Option<u32> {
|
||||||
|
self.tuner.as_ref().and_then(|t| t.current_power_limit_w())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// (min, max) settable power limit in watts, if available.
|
||||||
|
pub fn power_limit_range_w(&self) -> Option<(u32, u32)> {
|
||||||
|
self.tuner.as_ref().and_then(|t| t.power_limit_range_w())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Current (core, memory) clocks in MHz, each `None` if unavailable.
|
||||||
|
pub fn current_clocks_mhz(&self) -> (Option<u32>, Option<u32>) {
|
||||||
|
match &self.tuner {
|
||||||
|
Some(t) => (t.core_clock_mhz(), t.mem_clock_mhz()),
|
||||||
|
None => (None, None),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/// Solve the puzzle for `header` (140 bytes).
|
/// Solve the puzzle for `header` (140 bytes).
|
||||||
pub fn solve(&self, header: &[u8]) -> Result<Vec<Vec<u32>>> {
|
pub fn solve(&self, header: &[u8]) -> Result<Vec<Vec<u32>>> {
|
||||||
match &self.inner {
|
match &self.inner {
|
||||||
@@ -475,11 +509,32 @@ impl GpuSolver {
|
|||||||
pub fn hash_all(&self, header: &[u8]) -> Result<Vec<u8>> {
|
pub fn hash_all(&self, header: &[u8]) -> Result<Vec<u8>> {
|
||||||
match &self.inner {
|
match &self.inner {
|
||||||
SolverInner::Legacy(s) => s.hash_all(header),
|
SolverInner::Legacy(s) => s.hash_all(header),
|
||||||
SolverInner::Amd(_) => {
|
SolverInner::Amd(_) => Err(anyhow!("hash_all is not supported by the AMD kernel")),
|
||||||
Err(anyhow!("hash_all is not supported by the AMD kernel"))
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// The device's PCI address as `"DDDD:BB:DD.F"` (lowercase), for matching the
|
||||||
|
/// physical card to a telemetry backend. Prefers `cl_khr_pci_bus_info`; falls
|
||||||
|
/// back to `cl_device_topology_amd`. `None` if the device exposes neither.
|
||||||
|
fn device_pci_bus_id(device: &ocl::Device) -> Option<String> {
|
||||||
|
const CL_DEVICE_PCI_BUS_INFO_KHR: u32 = 0x10F2;
|
||||||
|
const CL_DEVICE_TOPOLOGY_AMD: u32 = 0x4037;
|
||||||
|
// cl_device_pci_bus_info_khr = { u32 pci_domain, pci_bus, pci_device, pci_function }.
|
||||||
|
if let Ok(b) = device.info_raw(CL_DEVICE_PCI_BUS_INFO_KHR) {
|
||||||
|
if b.len() >= 16 {
|
||||||
|
let rd = |i: usize| u32::from_ne_bytes([b[i], b[i + 1], b[i + 2], b[i + 3]]);
|
||||||
|
return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", rd(0), rd(4), rd(8), rd(12)));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
// cl_device_topology_amd (PCIE branch): 24-byte struct, bus/device/function
|
||||||
|
// are the last three bytes; domain isn't exposed (assume 0000).
|
||||||
|
if let Ok(b) = device.info_raw(CL_DEVICE_TOPOLOGY_AMD) {
|
||||||
|
if b.len() >= 24 {
|
||||||
|
return Some(format!("0000:{:02x}:{:02x}.{:x}", b[21], b[22], b[23]));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Decide whether to drive `device` with the AMD `equihash192_7.cl` kernel.
|
/// Decide whether to drive `device` with the AMD `equihash192_7.cl` kernel.
|
||||||
@@ -501,19 +556,81 @@ fn use_amd_kernel(device: &ocl::Device) -> bool {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// A cross-platform key identifying the physical GPU, as a canonical PCI address
|
||||||
|
/// `"DDDD:BB:DD.F"`. Vendor runtimes (ROCm/NVIDIA) expose `cl_khr_pci_bus_info`;
|
||||||
|
/// rusticl/Mesa doesn't, but its `cl_khr_device_uuid` *encodes* the PCI address
|
||||||
|
/// ({u32 domain LE, u8 bus, u8 device, u8 function, ...}), so the same physical
|
||||||
|
/// card yields the same key on both platforms. `None` if neither is available
|
||||||
|
/// (then the device is never deduped — safe).
|
||||||
|
fn device_dedup_key(device: &ocl::Device) -> Option<String> {
|
||||||
|
if let Some(pci) = device_pci_bus_id(device) {
|
||||||
|
return Some(pci);
|
||||||
|
}
|
||||||
|
const CL_DEVICE_UUID_KHR: u32 = 0x106A;
|
||||||
|
if let Ok(b) = device.info_raw(CL_DEVICE_UUID_KHR) {
|
||||||
|
if b.len() >= 7 {
|
||||||
|
let domain = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
|
||||||
|
return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", domain, b[4], b[5], b[6]));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Lower = preferred when the same physical GPU is exposed by multiple OpenCL
|
||||||
|
/// platforms. De-prioritise the Mesa Gallium drivers (rusticl/clover) relative to
|
||||||
|
/// the vendor runtimes (ROCm / NVIDIA / Intel), which are faster and complete.
|
||||||
|
fn platform_rank(p: &ocl::Platform) -> u8 {
|
||||||
|
let name = p.name().unwrap_or_default().to_ascii_lowercase();
|
||||||
|
if name.contains("rusticl") || name.contains("clover") || name.contains("mesa") {
|
||||||
|
1
|
||||||
|
} else {
|
||||||
|
0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// All usable OpenCL `(platform, device)` pairs in a stable flat order, with each
|
||||||
|
/// physical GPU de-duplicated across platforms by PCI bus id — a card exposed by
|
||||||
|
/// both ROCm and rusticl appears once (the vendor runtime wins over Mesa), so
|
||||||
|
/// mining "all" doesn't run the same card twice. This is the single source of
|
||||||
|
/// truth for the flat device index used by `--devices`, `--list-devices`, and
|
||||||
|
/// [`pick_device`]. Devices without a PCI bus id (CPU / PoCL) are never deduped.
|
||||||
|
fn enumerate_devices() -> Vec<(ocl::Platform, ocl::Device)> {
|
||||||
|
use ocl::{Device, Platform};
|
||||||
|
let mut out: Vec<(Platform, Device)> = Vec::new();
|
||||||
|
let mut by_pci: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
|
||||||
|
for platform in Platform::list() {
|
||||||
|
for device in Device::list_all(platform).unwrap_or_default() {
|
||||||
|
match device_dedup_key(&device) {
|
||||||
|
Some(pci) => match by_pci.get(&pci).copied() {
|
||||||
|
// Same physical GPU already listed: keep the preferred platform.
|
||||||
|
Some(existing) => {
|
||||||
|
if platform_rank(&platform) < platform_rank(&out[existing].0) {
|
||||||
|
out[existing] = (platform, device);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None => {
|
||||||
|
by_pci.insert(pci, out.len());
|
||||||
|
out.push((platform, device));
|
||||||
|
}
|
||||||
|
},
|
||||||
|
None => out.push((platform, device)), // no PCI id → can't dedup
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
out
|
||||||
|
}
|
||||||
|
|
||||||
/// List `(platform, device)` names so the user can choose `--device`.
|
/// List `(platform, device)` names so the user can choose `--device`.
|
||||||
pub fn list_devices() -> Result<Vec<String>> {
|
pub fn list_devices() -> Result<Vec<String>> {
|
||||||
use ocl::{Device, Platform};
|
let names = enumerate_devices()
|
||||||
let mut names = Vec::new();
|
.into_iter()
|
||||||
let mut idx = 0;
|
.enumerate()
|
||||||
for platform in Platform::list() {
|
.map(|(idx, (platform, device))| {
|
||||||
let pname = platform.name().unwrap_or_else(|_| "?".into());
|
let pname = platform.name().unwrap_or_else(|_| "?".into());
|
||||||
for device in Device::list_all(platform).unwrap_or_default() {
|
|
||||||
let dname = device.name().unwrap_or_else(|_| "?".into());
|
let dname = device.name().unwrap_or_else(|_| "?".into());
|
||||||
names.push(format!("[{idx}] {pname} / {dname}"));
|
format!("[{idx}] {pname} / {dname}")
|
||||||
idx += 1;
|
})
|
||||||
}
|
.collect();
|
||||||
}
|
|
||||||
Ok(names)
|
Ok(names)
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -522,18 +639,15 @@ pub fn list_devices() -> Result<Vec<String>> {
|
|||||||
/// hand NVIDIA cards to CUDA (and mine only the non-NVIDIA OpenCL devices).
|
/// hand NVIDIA cards to CUDA (and mine only the non-NVIDIA OpenCL devices).
|
||||||
pub fn device_is_nvidia() -> Vec<bool> {
|
pub fn device_is_nvidia() -> Vec<bool> {
|
||||||
use ocl::enums::{DeviceInfo, DeviceInfoResult};
|
use ocl::enums::{DeviceInfo, DeviceInfoResult};
|
||||||
use ocl::{Device, Platform};
|
enumerate_devices()
|
||||||
let mut out = Vec::new();
|
.into_iter()
|
||||||
for platform in Platform::list() {
|
.map(|(_, device)| {
|
||||||
for device in Device::list_all(platform).unwrap_or_default() {
|
matches!(
|
||||||
let is_nv = matches!(
|
|
||||||
device.info(DeviceInfo::Vendor),
|
device.info(DeviceInfo::Vendor),
|
||||||
Ok(DeviceInfoResult::Vendor(v)) if v.to_ascii_lowercase().contains("nvidia")
|
Ok(DeviceInfoResult::Vendor(v)) if v.to_ascii_lowercase().contains("nvidia")
|
||||||
);
|
)
|
||||||
out.push(is_nv);
|
})
|
||||||
}
|
.collect()
|
||||||
}
|
|
||||||
out
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// The flat OpenCL device index of the first CPU-type device (e.g. PoCL), if any.
|
/// The flat OpenCL device index of the first CPU-type device (e.g. PoCL), if any.
|
||||||
@@ -541,36 +655,20 @@ pub fn device_is_nvidia() -> Vec<bool> {
|
|||||||
/// [`list_devices`] / `--devices`.
|
/// [`list_devices`] / `--devices`.
|
||||||
pub fn cpu_device_index() -> Option<usize> {
|
pub fn cpu_device_index() -> Option<usize> {
|
||||||
use ocl::enums::{DeviceInfo, DeviceInfoResult};
|
use ocl::enums::{DeviceInfo, DeviceInfoResult};
|
||||||
use ocl::{Device, Platform};
|
enumerate_devices().into_iter().position(|(_, device)| {
|
||||||
let mut idx = 0;
|
matches!(
|
||||||
for platform in Platform::list() {
|
|
||||||
for device in Device::list_all(platform).unwrap_or_default() {
|
|
||||||
let is_cpu = matches!(
|
|
||||||
device.info(DeviceInfo::Type).ok(),
|
device.info(DeviceInfo::Type).ok(),
|
||||||
Some(DeviceInfoResult::Type(t)) if t.contains(ocl::flags::DeviceType::CPU)
|
Some(DeviceInfoResult::Type(t)) if t.contains(ocl::flags::DeviceType::CPU)
|
||||||
);
|
)
|
||||||
if is_cpu {
|
})
|
||||||
return Some(idx);
|
|
||||||
}
|
|
||||||
idx += 1;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
None
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Resolve a flat device index across all platforms, returning the device along
|
/// Resolve a flat device index (into the de-duplicated [`enumerate_devices`]
|
||||||
/// with the platform it belongs to (needed to build the context against the
|
/// list), returning the device along with the platform it belongs to (needed to
|
||||||
/// right platform).
|
/// build the context against the right platform).
|
||||||
fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
|
pub(crate) fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
|
||||||
use ocl::{Device, Platform};
|
enumerate_devices()
|
||||||
let mut idx = 0;
|
.into_iter()
|
||||||
for platform in Platform::list() {
|
.nth(index)
|
||||||
for device in Device::list_all(platform).unwrap_or_default() {
|
.ok_or_else(|| anyhow!("no OpenCL device with index {index}"))
|
||||||
if idx == index {
|
|
||||||
return Ok((platform, device));
|
|
||||||
}
|
|
||||||
idx += 1;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
Err(anyhow!("no OpenCL device with index {index}"))
|
|
||||||
}
|
}
|
||||||
|
|||||||
+12
-6
@@ -128,19 +128,25 @@ pub trait GpuTuner: Send {
|
|||||||
/// Open a control handle for the GPU at `pci_bus_id` (matches the physical card
|
/// Open a control handle for the GPU at `pci_bus_id` (matches the physical card
|
||||||
/// regardless of CUDA-vs-driver index ordering). `None` if unavailable.
|
/// regardless of CUDA-vs-driver index ordering). `None` if unavailable.
|
||||||
///
|
///
|
||||||
/// NVML is the backend on both Linux (`libnvidia-ml`) and Windows (`nvml.dll`);
|
/// Tries the NVIDIA backend first (NVML, `libnvidia-ml`/`nvml.dll`), then the AMD
|
||||||
/// the C API is identical, so the same [`crate::nvml`] code serves both.
|
/// backend ([`crate::amd_smi`], Linux amdgpu sysfs). A non-matching bus id makes
|
||||||
|
/// each backend return `None`, so probing both is safe on mixed-vendor hosts.
|
||||||
pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
|
pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
|
||||||
#[cfg(any(unix, windows))]
|
#[cfg(feature = "cuda")]
|
||||||
{
|
{
|
||||||
crate::nvml::open(pci_bus_id)
|
if let Some(t) = crate::nvml::open(pci_bus_id) {
|
||||||
|
return Some(t);
|
||||||
}
|
}
|
||||||
#[cfg(not(any(unix, windows)))]
|
}
|
||||||
|
#[cfg(feature = "gpu")]
|
||||||
{
|
{
|
||||||
|
if let Some(t) = crate::amd_smi::open(pci_bus_id) {
|
||||||
|
return Some(t);
|
||||||
|
}
|
||||||
|
}
|
||||||
let _ = pci_bus_id;
|
let _ = pci_bus_id;
|
||||||
None
|
None
|
||||||
}
|
}
|
||||||
}
|
|
||||||
|
|
||||||
static WARNED_PRIVS: AtomicBool = AtomicBool::new(false);
|
static WARNED_PRIVS: AtomicBool = AtomicBool::new(false);
|
||||||
|
|
||||||
|
|||||||
+64
-12
@@ -18,6 +18,10 @@ mod gpu;
|
|||||||
#[cfg(feature = "gpu")]
|
#[cfg(feature = "gpu")]
|
||||||
mod gpu_amd;
|
mod gpu_amd;
|
||||||
|
|
||||||
|
// AMD GPU telemetry via Linux amdgpu sysfs (a `gpu_tune::GpuTuner` backend).
|
||||||
|
#[cfg(feature = "gpu")]
|
||||||
|
mod amd_smi;
|
||||||
|
|
||||||
// Runtime dynamic-library loader (dlopen) for the CUDA driver + NVML.
|
// Runtime dynamic-library loader (dlopen) for the CUDA driver + NVML.
|
||||||
#[cfg(feature = "cuda")]
|
#[cfg(feature = "cuda")]
|
||||||
mod dylib;
|
mod dylib;
|
||||||
@@ -28,7 +32,9 @@ mod cuda;
|
|||||||
#[cfg(feature = "cuda")]
|
#[cfg(feature = "cuda")]
|
||||||
mod nvml;
|
mod nvml;
|
||||||
|
|
||||||
#[cfg(feature = "cuda")]
|
// Platform-agnostic GPU tuning/telemetry surface. The trait + policy compile for
|
||||||
|
// either GPU backend; NVML (cuda) and amd_smi (gpu) are the implementations.
|
||||||
|
#[cfg(any(feature = "cuda", feature = "gpu"))]
|
||||||
mod gpu_tune;
|
mod gpu_tune;
|
||||||
|
|
||||||
use std::io::IsTerminal;
|
use std::io::IsTerminal;
|
||||||
@@ -125,11 +131,12 @@ struct Args {
|
|||||||
#[arg(long, value_name = "SPEC")]
|
#[arg(long, value_name = "SPEC")]
|
||||||
cpu_cores: Option<String>,
|
cpu_cores: Option<String>,
|
||||||
|
|
||||||
/// Cores per CPU mining row (default 4). Each row runs one shared solve
|
/// Cores per CPU mining row. Each row runs one shared solve across its
|
||||||
/// across its cores; larger groups cut memory sharply: total RAM is ~4 GB ×
|
/// cores; larger groups cut memory sharply: total RAM is ~4 GB × (enabled
|
||||||
/// (enabled cores / this size). Use 1 for one row (and one solve) per core.
|
/// cores / this size). Rows align to core-index blocks of this size. Capped
|
||||||
/// Rows are aligned to core-index blocks of this size, so a row never
|
/// by core count so the row count stays manageable — ≤4 cores toggle
|
||||||
/// straddles a boundary. Cycle it live in the dashboard with 'g'.
|
/// individually (1), 5-8 cores in groups of ≤2, more than 8 in groups of ≤4
|
||||||
|
/// — and the default is that cap. Cycle it live (within the cap) with 'g'.
|
||||||
#[arg(long, value_name = "N", default_value_t = 4)]
|
#[arg(long, value_name = "N", default_value_t = 4)]
|
||||||
cpu_group_size: usize,
|
cpu_group_size: usize,
|
||||||
|
|
||||||
@@ -203,6 +210,13 @@ struct Args {
|
|||||||
#[arg(long)]
|
#[arg(long)]
|
||||||
auto_tune: bool,
|
auto_tune: bool,
|
||||||
|
|
||||||
|
/// Sustained-Sol/s governor: hold each GPU at/below this edge temperature (°C)
|
||||||
|
/// by pacing the solve cadence (no hardware writes, no root). Trades a little
|
||||||
|
/// throughput for lower temp/power; off by default (runs flat-out). Needs a
|
||||||
|
/// backend that reports temperature (AMD amdgpu / NVIDIA).
|
||||||
|
#[arg(long, value_name = "CELSIUS")]
|
||||||
|
target_temp: Option<u32>,
|
||||||
|
|
||||||
/// Efficiency: cap each GPU's power limit in watts (default: card max).
|
/// Efficiency: cap each GPU's power limit in watts (default: card max).
|
||||||
/// Lower power trades a little hashrate for much better Sol/W.
|
/// Lower power trades a little hashrate for much better Sol/W.
|
||||||
#[arg(long, value_name = "WATTS")]
|
#[arg(long, value_name = "WATTS")]
|
||||||
@@ -619,6 +633,8 @@ fn main() -> Result<()> {
|
|||||||
args.power_limit.unwrap_or(0),
|
args.power_limit.unwrap_or(0),
|
||||||
args.unlock_controls,
|
args.unlock_controls,
|
||||||
);
|
);
|
||||||
|
// Software temp governor target (paces solve cadence; no hardware writes).
|
||||||
|
miner::set_target_temp(args.target_temp);
|
||||||
miner::run(client, specs, running, job_timeout, tui, format!("{host}:{port}"), controls, cpu_mining, cpu_clamp, args.control_port)
|
miner::run(client, specs, running, job_timeout, tui, format!("{host}:{port}"), controls, cpu_mining, cpu_clamp, args.control_port)
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -1076,10 +1092,21 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
|
|||||||
use std::time::Instant;
|
use std::time::Instant;
|
||||||
info!("benchmarking {runs} solve(s) per worker across {} worker(s)", specs.len());
|
info!("benchmarking {runs} solve(s) per worker across {} worker(s)", specs.len());
|
||||||
|
|
||||||
|
/// Per-worker benchmark result, including a steady-state telemetry snapshot
|
||||||
|
/// (sampled right after the timed loop, while the card is warm).
|
||||||
|
struct WorkerResult {
|
||||||
|
sols: usize,
|
||||||
|
dt: f64,
|
||||||
|
watts: Option<f64>,
|
||||||
|
temp_c: Option<u32>,
|
||||||
|
core_mhz: Option<u32>,
|
||||||
|
mem_mhz: Option<u32>,
|
||||||
|
}
|
||||||
|
|
||||||
let start = Instant::now();
|
let start = Instant::now();
|
||||||
let mut handles = Vec::new();
|
let mut handles = Vec::new();
|
||||||
for (id, spec) in specs.into_iter().enumerate() {
|
for (id, spec) in specs.into_iter().enumerate() {
|
||||||
handles.push(std::thread::spawn(move || -> Result<(usize, f64)> {
|
handles.push(std::thread::spawn(move || -> Result<WorkerResult> {
|
||||||
let backend = spec.build()?;
|
let backend = spec.build()?;
|
||||||
backend.solve(&pseudo_header(id as u64))?; // warm up (excluded)
|
backend.solve(&pseudo_header(id as u64))?; // warm up (excluded)
|
||||||
let t = Instant::now();
|
let t = Instant::now();
|
||||||
@@ -1089,7 +1116,17 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
|
|||||||
let seed = ((id as u64) << 40) | (i as u64 + 1);
|
let seed = ((id as u64) << 40) | (i as u64 + 1);
|
||||||
sols += backend.solve(&pseudo_header(seed))?.len();
|
sols += backend.solve(&pseudo_header(seed))?.len();
|
||||||
}
|
}
|
||||||
Ok((sols, t.elapsed().as_secs_f64()))
|
let dt = t.elapsed().as_secs_f64();
|
||||||
|
// Snapshot telemetry while the card is still under load.
|
||||||
|
let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
|
||||||
|
Ok(WorkerResult {
|
||||||
|
sols,
|
||||||
|
dt,
|
||||||
|
watts: backend.power_watts(),
|
||||||
|
temp_c: backend.temperature_c(),
|
||||||
|
core_mhz,
|
||||||
|
mem_mhz,
|
||||||
|
})
|
||||||
}));
|
}));
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -1098,11 +1135,26 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
|
|||||||
let mut workers = 0usize;
|
let mut workers = 0usize;
|
||||||
for h in handles {
|
for h in handles {
|
||||||
match h.join().unwrap() {
|
match h.join().unwrap() {
|
||||||
Ok((sols, dt)) => {
|
Ok(r) => {
|
||||||
let sol_s = sols as f64 / dt;
|
let sol_s = r.sols as f64 / r.dt;
|
||||||
|
// Optional telemetry tail: " | 142 W, 41.7 Sol/W, 68°C, 2700/2500 MHz".
|
||||||
|
let mut tail = String::new();
|
||||||
|
if let Some(w) = r.watts {
|
||||||
|
tail.push_str(&format!(" | {w:.0} W"));
|
||||||
|
if w > 0.0 {
|
||||||
|
tail.push_str(&format!(", {:.2} Sol/W", sol_s / w));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if let Some(t) = r.temp_c {
|
||||||
|
tail.push_str(&format!(", {t}°C"));
|
||||||
|
}
|
||||||
|
if let (Some(c), m) = (r.core_mhz, r.mem_mhz) {
|
||||||
|
tail.push_str(&format!(", {c}/{} MHz", m.map(|m| m.to_string()).unwrap_or_else(|| "?".into())));
|
||||||
|
}
|
||||||
info!(
|
info!(
|
||||||
" worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {sols} solutions",
|
" worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {} solutions{tail}",
|
||||||
1000.0 * dt / runs as f64
|
1000.0 * r.dt / runs as f64,
|
||||||
|
r.sols
|
||||||
);
|
);
|
||||||
agg_sols += sol_s;
|
agg_sols += sol_s;
|
||||||
workers += 1;
|
workers += 1;
|
||||||
|
|||||||
+105
@@ -14,6 +14,22 @@ use crate::equihash;
|
|||||||
use crate::params::{HEADER_LEN, SOLUTION_BYTES};
|
use crate::params::{HEADER_LEN, SOLUTION_BYTES};
|
||||||
use crate::stratum::{StratumClient, Work};
|
use crate::stratum::{StratumClient, Work};
|
||||||
|
|
||||||
|
/// Process-wide target edge temperature (°C) for the software solve-cadence
|
||||||
|
/// governor; `None` ⇒ run flat-out. Set once at startup from `--target-temp`.
|
||||||
|
static TARGET_TEMP_C: OnceLock<Option<u32>> = OnceLock::new();
|
||||||
|
|
||||||
|
/// Install the governor's target temperature (call once, before workers start).
|
||||||
|
pub fn set_target_temp(c: Option<u32>) {
|
||||||
|
if let Some(t) = c {
|
||||||
|
info!("temperature governor enabled: holding GPUs ≤{t}°C (paced cadence)");
|
||||||
|
}
|
||||||
|
let _ = TARGET_TEMP_C.set(c);
|
||||||
|
}
|
||||||
|
|
||||||
|
fn target_temp_c() -> Option<u32> {
|
||||||
|
TARGET_TEMP_C.get().copied().flatten()
|
||||||
|
}
|
||||||
|
|
||||||
/// Double SHA-256, as used for the Zcash/ZClassic block PoW hash.
|
/// Double SHA-256, as used for the Zcash/ZClassic block PoW hash.
|
||||||
fn sha256d(data: &[u8]) -> [u8; 32] {
|
fn sha256d(data: &[u8]) -> [u8; 32] {
|
||||||
let first = Sha256::digest(data);
|
let first = Sha256::digest(data);
|
||||||
@@ -136,6 +152,8 @@ impl Backend {
|
|||||||
match self {
|
match self {
|
||||||
#[cfg(feature = "cuda")]
|
#[cfg(feature = "cuda")]
|
||||||
Backend::Cuda(solver) => solver.power_watts(),
|
Backend::Cuda(solver) => solver.power_watts(),
|
||||||
|
#[cfg(feature = "gpu")]
|
||||||
|
Backend::Gpu(solver) => solver.power_watts(),
|
||||||
_ => None,
|
_ => None,
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -145,6 +163,8 @@ impl Backend {
|
|||||||
match self {
|
match self {
|
||||||
#[cfg(feature = "cuda")]
|
#[cfg(feature = "cuda")]
|
||||||
Backend::Cuda(solver) => solver.temperature_c(),
|
Backend::Cuda(solver) => solver.temperature_c(),
|
||||||
|
#[cfg(feature = "gpu")]
|
||||||
|
Backend::Gpu(solver) => solver.temperature_c(),
|
||||||
_ => None,
|
_ => None,
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -154,6 +174,8 @@ impl Backend {
|
|||||||
match self {
|
match self {
|
||||||
#[cfg(feature = "cuda")]
|
#[cfg(feature = "cuda")]
|
||||||
Backend::Cuda(solver) => solver.current_power_limit_w(),
|
Backend::Cuda(solver) => solver.current_power_limit_w(),
|
||||||
|
#[cfg(feature = "gpu")]
|
||||||
|
Backend::Gpu(solver) => solver.current_power_limit_w(),
|
||||||
_ => None,
|
_ => None,
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -163,6 +185,8 @@ impl Backend {
|
|||||||
match self {
|
match self {
|
||||||
#[cfg(feature = "cuda")]
|
#[cfg(feature = "cuda")]
|
||||||
Backend::Cuda(solver) => solver.power_limit_range_w(),
|
Backend::Cuda(solver) => solver.power_limit_range_w(),
|
||||||
|
#[cfg(feature = "gpu")]
|
||||||
|
Backend::Gpu(solver) => solver.power_limit_range_w(),
|
||||||
_ => None,
|
_ => None,
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -190,6 +214,8 @@ impl Backend {
|
|||||||
match self {
|
match self {
|
||||||
#[cfg(feature = "cuda")]
|
#[cfg(feature = "cuda")]
|
||||||
Backend::Cuda(solver) => solver.current_clocks_mhz(),
|
Backend::Cuda(solver) => solver.current_clocks_mhz(),
|
||||||
|
#[cfg(feature = "gpu")]
|
||||||
|
Backend::Gpu(solver) => solver.current_clocks_mhz(),
|
||||||
_ => (None, None),
|
_ => (None, None),
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -513,6 +539,14 @@ fn worker(
|
|||||||
let mut last_job = Instant::now();
|
let mut last_job = Instant::now();
|
||||||
let mut paused = false;
|
let mut paused = false;
|
||||||
let mut disabled_pause = false;
|
let mut disabled_pause = false;
|
||||||
|
// Software temperature governor: pace the solve cadence to hold edge temp at
|
||||||
|
// or below `--target-temp` (no hardware writes). `gov_sleep` is the per-pass
|
||||||
|
// pause, nudged from the periodic temperature sample.
|
||||||
|
let gov_target = target_temp_c();
|
||||||
|
let mut gov_sleep = Duration::ZERO;
|
||||||
|
if let Some(target) = gov_target {
|
||||||
|
info!("worker {id}: temperature governor active — pacing cadence to hold ≤{target}°C");
|
||||||
|
}
|
||||||
|
|
||||||
while running.load(Ordering::Relaxed) {
|
while running.load(Ordering::Relaxed) {
|
||||||
if work_handle.epoch() != current.epoch {
|
if work_handle.epoch() != current.epoch {
|
||||||
@@ -583,6 +617,9 @@ fn worker(
|
|||||||
}
|
}
|
||||||
if let Some(t) = backend.temperature_c() {
|
if let Some(t) = backend.temperature_c() {
|
||||||
stats.workers[id].temp_c.store(t, Ordering::Relaxed);
|
stats.workers[id].temp_c.store(t, Ordering::Relaxed);
|
||||||
|
if let Some(target) = gov_target {
|
||||||
|
gov_sleep = govern_cadence(gov_sleep, t, target);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
|
let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
|
||||||
if let Some(c) = core_mhz {
|
if let Some(c) = core_mhz {
|
||||||
@@ -614,12 +651,34 @@ fn worker(
|
|||||||
let ctx = inflight.pop_front().unwrap();
|
let ctx = inflight.pop_front().unwrap();
|
||||||
process_results(id, &client, &ctx, &solutions, &stats);
|
process_results(id, &client, &ctx, &solutions, &stats);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Temperature governor: pace the cadence (held off entirely when flat-out).
|
||||||
|
if !gov_sleep.is_zero() {
|
||||||
|
std::thread::sleep(gov_sleep);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
drain_pipeline(id, &mut backend, &mut inflight, &client, &stats)?;
|
drain_pipeline(id, &mut backend, &mut inflight, &client, &stats)?;
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Nudge the per-pass governor pause to hold edge temperature near `target` (°C).
|
||||||
|
/// Over target → lengthen the pause (proportional to the overshoot); comfortably
|
||||||
|
/// under → shorten it. Called from the periodic temperature sample (~1 Hz), so
|
||||||
|
/// this is a slow integral controller — fine for the card's thermal time constant.
|
||||||
|
fn govern_cadence(cur: Duration, temp: u32, target: u32) -> Duration {
|
||||||
|
const MAX_SLEEP: Duration = Duration::from_millis(100);
|
||||||
|
let (t, target) = (temp as i64, target as i64);
|
||||||
|
if t > target {
|
||||||
|
cur.saturating_add(Duration::from_micros(400 * (t - target) as u64))
|
||||||
|
.min(MAX_SLEEP)
|
||||||
|
} else if t < target - 1 {
|
||||||
|
cur.saturating_sub(Duration::from_millis(1))
|
||||||
|
} else {
|
||||||
|
cur
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/// Supervise CPU mining: spawn one worker per group for the current group size,
|
/// Supervise CPU mining: spawn one worker per group for the current group size,
|
||||||
/// and whenever the dashboard cycles the size, stop those workers (`gen_running`),
|
/// and whenever the dashboard cycles the size, stop those workers (`gen_running`),
|
||||||
/// rebuild the grouping, and respawn. Runs until `running` is cleared.
|
/// rebuild the grouping, and respawn. Runs until `running` is cleared.
|
||||||
@@ -843,3 +902,49 @@ fn build_nonce(nonce1: &[u8], counter: u64) -> Result<[u8; 32]> {
|
|||||||
tail[..n].copy_from_slice(&counter.to_le_bytes()[..n]);
|
tail[..n].copy_from_slice(&counter.to_le_bytes()[..n]);
|
||||||
Ok(nonce)
|
Ok(nonce)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#[cfg(test)]
|
||||||
|
mod governor_tests {
|
||||||
|
use super::*;
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn governor_idles_below_target() {
|
||||||
|
// At/under target → no pause is introduced.
|
||||||
|
assert_eq!(govern_cadence(Duration::ZERO, 60, 70), Duration::ZERO);
|
||||||
|
assert_eq!(govern_cadence(Duration::ZERO, 70, 70), Duration::ZERO);
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn governor_lengthens_over_target_and_clamps() {
|
||||||
|
// Over target → pause grows proportional to the overshoot.
|
||||||
|
let s = govern_cadence(Duration::ZERO, 75, 70); // +5°C
|
||||||
|
assert_eq!(s, Duration::from_micros(400 * 5));
|
||||||
|
// Far over target saturates at the cap, never beyond.
|
||||||
|
let hot = govern_cadence(Duration::from_millis(99), 200, 70);
|
||||||
|
assert_eq!(hot, Duration::from_millis(100));
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn governor_eases_off_when_cool() {
|
||||||
|
// 2°C+ under target → pause shrinks; floors at zero (no underflow).
|
||||||
|
let s = govern_cadence(Duration::from_millis(3), 67, 70);
|
||||||
|
assert_eq!(s, Duration::from_millis(2));
|
||||||
|
assert_eq!(govern_cadence(Duration::ZERO, 50, 70), Duration::ZERO);
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn governor_converges_to_hold_target() {
|
||||||
|
// Simulate a card whose temp rises with duty: more pause → cooler. The
|
||||||
|
// controller should settle near the target without runaway.
|
||||||
|
let target = 70u32;
|
||||||
|
let mut sleep = Duration::ZERO;
|
||||||
|
let mut temp = 85i32; // starts hot
|
||||||
|
for _ in 0..200 {
|
||||||
|
sleep = govern_cadence(sleep, temp as u32, target);
|
||||||
|
// crude plant: each ms of pause sheds ~1.5°C off a 90°C flat-out temp.
|
||||||
|
let modeled = 90.0 - 1.5 * sleep.as_millis() as f64;
|
||||||
|
temp = modeled.round() as i32;
|
||||||
|
}
|
||||||
|
assert!((target as i32 - 2..=target as i32 + 2).contains(&temp), "settled at {temp}°C, want ~{target}");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|||||||
@@ -0,0 +1,124 @@
|
|||||||
|
; OpenCL import library definition for jackpotminer's Windows build.
|
||||||
|
; Generated from the cl-sys 0.4.3 exports so a vendor-neutral OpenCL.lib /
|
||||||
|
; libOpenCL.a can be produced at build time without an OpenCL SDK. The actual
|
||||||
|
; OpenCL.dll (ICD loader) is provided at runtime by the GPU driver.
|
||||||
|
LIBRARY OpenCL
|
||||||
|
EXPORTS
|
||||||
|
clBuildProgram
|
||||||
|
clCloneKernel
|
||||||
|
clCompileProgram
|
||||||
|
clCreateBuffer
|
||||||
|
clCreateCommandQueue
|
||||||
|
clCreateCommandQueueWithProperties
|
||||||
|
clCreateContext
|
||||||
|
clCreateContextFromType
|
||||||
|
clCreateFromGLBuffer
|
||||||
|
clCreateFromGLRenderbuffer
|
||||||
|
clCreateFromGLTexture
|
||||||
|
clCreateFromGLTexture2D
|
||||||
|
clCreateFromGLTexture3D
|
||||||
|
clCreateImage
|
||||||
|
clCreateImage2D
|
||||||
|
clCreateImage3D
|
||||||
|
clCreateKernel
|
||||||
|
clCreateKernelsInProgram
|
||||||
|
clCreatePipe
|
||||||
|
clCreateProgramWithBinary
|
||||||
|
clCreateProgramWithBuiltInKernels
|
||||||
|
clCreateProgramWithIL
|
||||||
|
clCreateProgramWithSource
|
||||||
|
clCreateSampler
|
||||||
|
clCreateSamplerWithProperties
|
||||||
|
clCreateSubBuffer
|
||||||
|
clCreateSubDevices
|
||||||
|
clCreateUserEvent
|
||||||
|
clEnqueueAcquireGLObjects
|
||||||
|
clEnqueueBarrier
|
||||||
|
clEnqueueBarrierWithWaitList
|
||||||
|
clEnqueueCopyBuffer
|
||||||
|
clEnqueueCopyBufferRect
|
||||||
|
clEnqueueCopyBufferToImage
|
||||||
|
clEnqueueCopyImage
|
||||||
|
clEnqueueCopyImageToBuffer
|
||||||
|
clEnqueueFillBuffer
|
||||||
|
clEnqueueFillImage
|
||||||
|
clEnqueueMapBuffer
|
||||||
|
clEnqueueMapImage
|
||||||
|
clEnqueueMarker
|
||||||
|
clEnqueueMarkerWithWaitList
|
||||||
|
clEnqueueMigrateMemObjects
|
||||||
|
clEnqueueNativeKernel
|
||||||
|
clEnqueueNDRangeKernel
|
||||||
|
clEnqueueReadBuffer
|
||||||
|
clEnqueueReadBufferRect
|
||||||
|
clEnqueueReadImage
|
||||||
|
clEnqueueReleaseGLObjects
|
||||||
|
clEnqueueSVMFree
|
||||||
|
clEnqueueSVMMap
|
||||||
|
clEnqueueSVMMemcpy
|
||||||
|
clEnqueueSVMMemFill
|
||||||
|
clEnqueueSVMMigrateMem
|
||||||
|
clEnqueueSVMUnmap
|
||||||
|
clEnqueueTask
|
||||||
|
clEnqueueUnmapMemObject
|
||||||
|
clEnqueueWaitForEvents
|
||||||
|
clEnqueueWriteBuffer
|
||||||
|
clEnqueueWriteBufferRect
|
||||||
|
clEnqueueWriteImage
|
||||||
|
clFinish
|
||||||
|
clFlush
|
||||||
|
clGetCommandQueueInfo
|
||||||
|
clGetContextInfo
|
||||||
|
clGetDeviceAndHostTimer
|
||||||
|
clGetDeviceIDs
|
||||||
|
clGetDeviceInfo
|
||||||
|
clGetEventInfo
|
||||||
|
clGetEventProfilingInfo
|
||||||
|
clGetExtensionFunctionAddress
|
||||||
|
clGetExtensionFunctionAddressForPlatform
|
||||||
|
clGetGLContextInfoKHR
|
||||||
|
clGetGLObjectInfo
|
||||||
|
clGetGLTextureInfo
|
||||||
|
clGetHostTimer
|
||||||
|
clGetImageInfo
|
||||||
|
clGetKernelArgInfo
|
||||||
|
clGetKernelInfo
|
||||||
|
clGetKernelSubGroupInfo
|
||||||
|
clGetKernelWorkGroupInfo
|
||||||
|
clGetMemObjectInfo
|
||||||
|
clGetPipeInfo
|
||||||
|
clGetPlatformIDs
|
||||||
|
clGetPlatformInfo
|
||||||
|
clGetProgramBuildInfo
|
||||||
|
clGetProgramInfo
|
||||||
|
clGetSamplerInfo
|
||||||
|
clGetSupportedImageFormats
|
||||||
|
clLinkProgram
|
||||||
|
clReleaseCommandQueue
|
||||||
|
clReleaseContext
|
||||||
|
clReleaseDevice
|
||||||
|
clReleaseEvent
|
||||||
|
clReleaseKernel
|
||||||
|
clReleaseMemObject
|
||||||
|
clReleaseProgram
|
||||||
|
clReleaseSampler
|
||||||
|
clRetainCommandQueue
|
||||||
|
clRetainContext
|
||||||
|
clRetainDevice
|
||||||
|
clRetainEvent
|
||||||
|
clRetainKernel
|
||||||
|
clRetainMemObject
|
||||||
|
clRetainProgram
|
||||||
|
clRetainSampler
|
||||||
|
clSetDefaultDeviceCommandQueue
|
||||||
|
clSetEventCallback
|
||||||
|
clSetKernelArg
|
||||||
|
clSetKernelArgSVMPointer
|
||||||
|
clSetKernelExecInfo
|
||||||
|
clSetMemObjectDestructorCallback
|
||||||
|
clSetUserEventStatus
|
||||||
|
clSVMAlloc
|
||||||
|
clSVMFree
|
||||||
|
clUnloadCompiler
|
||||||
|
clUnloadPlatformCompiler
|
||||||
|
clWaitForEvents
|
||||||
Reference in New Issue
Block a user