Compare commits

..
10 Commits
Author SHA1 Message Date
jackpotincorporated 09725cf674 OpenCL: de-duplicate the same physical GPU across platforms
A GPU exposed by both a vendor runtime (ROCm) and rusticl/Mesa appeared twice in
the device list, so mining 'all' ran each card twice (pure contention). Add a
single canonical enumerate_devices() — used by list_devices, device_is_nvidia,
cpu_device_index and pick_device — that dedupes by physical GPU and prefers the
vendor runtime over Mesa.

Dedup key is the PCI address: ROCm/NVIDIA expose it via cl_khr_pci_bus_info;
rusticl doesn't, but its cl_khr_device_uuid encodes the PCI BDF, so the same card
yields the same key on both. Devices without either (CPU/PoCL) are never deduped.
No behavior change on single-platform hosts (nothing to dedup); here the list
drops 4->2 (both physical GPUs on ROCm, ~38 Sol/s) and device indices are
unchanged for the kept devices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 20:57:19 -04:00
jackpotincorporated 41db98af69 AMD kernel: drop fp64 dependency (build on rusticl/Mesa OpenCL)
equihash192_7.cl's decompress/decompress2 used double + sqrt behind
cl_khr_fp64, so the kernel failed to build on OpenCL stacks without fp64
(notably rusticl/Mesa) — those workers died with 'use of type double requires
cl_khr_fp64'. Replace round(sqrt(2*x+1)) with an exact integer square root
(single-precision estimate corrected to the integer floor, then rounded; inputs
are triangular indices < ~2^26). No fp64, no behavior change on ROCm (verified
bit-identical: 77 solutions/40 nonces, same as before), and rusticl devices now
build and solve correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 20:39:45 -04:00
jackpotincorporated 31aa85733e AMD GPU telemetry + --target-temp governor
Brings AMD cards to parity with NVIDIA for monitoring/control surface, which was
NVML-only. New src/amd_smi.rs is a gpu_tune::GpuTuner backed by Linux amdgpu
sysfs (power1_average, temp1_input edge, freq1_input sclk, pp_dpm_sclk/mclk),
matched to the device by PCI bus id from OpenCL cl_khr_pci_bus_info. gpu_tune is
un-gated to compile under the gpu feature; open() probes NVML then amd_smi.
GpuSolver carries the tuner and Backend::Gpu dispatches power/temp/clocks, so the
TUI and --benchmark now show power, temperature, clocks and Sol/W for AMD.
Telemetry-only — setters are Unsupported (amdgpu control nodes are root-only).

--target-temp <C> adds an opt-in software governor (miner::govern_cadence) that
paces solve cadence to hold edge temperature, no hardware writes/root. With small
thermal throttle it won't beat flat-out on raw Sol/s; it's a temp/efficiency
lever. Unit-tested controller; flag/plumbing verified live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 20:17:59 -04:00
jackpotincorporated 0002e90451 TUI: scale CPU core-toggle granularity by core count
Cap the cores-per-row group size (and the dashboard 'g' cycle) by total core
count so the toggleable-row count stays sensible and small machines get finer
control: ≤4 cores toggle individually (size 1), 5-8 cores in groups of up to 2,
and more than 8 in groups of up to 4. The cap is also the default — the prior
fixed default of 4 now clamps to the tier (1/2/4), and an explicit
--cpu-group-size is clamped to the cap too.

Add max_group_size() in cpu_groups; update the help text and the cpu_groups /
control tests (the cycle test now uses 16 cores so it can exercise sizes 4/2/1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 18:34:10 -04:00
jackpotincorporated afd56bee1b CPU solver: AVX-512 round-0 hashing + prefetch the emit gather
Branch the CPU solver onto modern x86 extensions, runtime-dispatched with the
existing AVX2/scalar fallbacks:

- BatchHasher::hash8 — an 8-lane AVX-512 BLAKE2b final-block compression (native
  _mm512_ror_epi64 rotates), falling back to two AVX2 hash4s or scalar. Round 0
  now hashes eight g-values per chunk. round0-hash drops ~1.45x on AVX-512 CPUs
  (≈225→155 ms here, AMD Zen4).
- emit_bucket software-prefetches each collision group's randomly-gathered
  member slots (the ~1 GB slot arena is the round's cache-miss bottleneck),
  shaving a few percent off the dominant emit phase.

Controlled A/B on this Zen4 box (same thermal state): ~4-5% faster overall.
The collision rounds are memory-bandwidth bound, so SIMD width is not the
limiter — the modern-ISA win is modest by nature. EQ_NO_AVX512 / EQ_NO_PREFETCH
opt out per-CPU (e.g. parts where AVX-512 downclocks) and back the A/B harness.

hash8 is validated against the scalar reference (batch_matches_reference) and
full solves still find valid solutions in every dispatch configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 13:12:57 -04:00
jackpotincorporated 00531fb591 Windows build: generate the OpenCL import library, drop the SDK dependency
Building the `gpu` (OpenCL) backend on Windows no longer needs a vendor OpenCL
SDK. `cl-sys` links `OpenCL` (#[link(name = "OpenCL")]); instead of requiring an
SDK-provided OpenCL.lib, build.rs now generates a vendor-neutral import library
at build time from windows/OpenCL.def (all 118 cl-sys exports) — lib.exe for
MSVC (located via the cc crate), dlltool for MinGW — and puts it on the link
search path. The real OpenCL.dll is supplied at runtime by the GPU driver.

build.rs no-ops on non-Windows targets and when the gpu feature is off, and
warns rather than panics if the toolchain tool is absent so `cargo check` still
works. Combined with the runtime-loaded (dlopen) CUDA/NVML, a Windows build now
needs zero external GPU libraries.

Add BUILD-windows.md (toolchain, build, crt-static packaging, runtime deps, CI)
and link it from the README. Verified the whole crate compiles for
x86_64-pc-windows-gnu (default features and gpu,cuda); Linux is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 12:38:15 -04:00
jackpotincorporated 8a9d98a21d Refresh solver architecture comment for the optimized design
The module header still described a fixed 32-byte slot, a serial partition, and
a "not yet taken" radix scatter. Update it to reflect the current memory-traffic-
oriented design: parallel counting-sort partition_top, keys_part/keys_sorted to
make the key reads sequential, narrowing per-round packed slots with a masked
store, and the remaining slot-gather floor (with the note that the full payload
radix scatter was evaluated and loses on the wide early rounds). Comment-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 11:48:29 -04:00
jackpotincorporated 1b4a2a4dd9 Scatter partition-contiguous keys to kill the per-round key gathers
count_pairs, low_group, and the emit group-walk all read each entry's leading
key via `keys[order[..]]` — a random gather over the whole ~128 MB keys array,
three times per round. partition_top now also produces `keys_part` (the leading
keys in partition order, keys_part[p] == keys[order[p]]), written by the same
parallel, disjoint phase-3 scatter at 4 bytes/entry. count_pairs and low_group
then stream their partition's keys sequentially, and low_group emits a
`keys_sorted` array so the emit group walk streams a dense local copy instead of
gathering keys[sorted[i]]. The only remaining DRAM-random access in the rounds is
the unavoidable slot gather.

Measured (16 threads, clamp 16/32): count ~160 -> ~10 ms/round, emit ~770 -> ~550
ms/round, partition +~80 ms (the added 128 MB scatter); full solve ~8.4 -> ~7.04 s
(~16%). Cumulative across the three CPU-solver changes: ~13.4 -> ~7.04 s (-47%),
0.07 -> 0.14 solve/s. Identical solution yield; cross-clamp validity and
full_solve_baseline pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 11:23:10 -04:00
jackpotincorporated 966ce3e262 Pack collision-round slots at shrinking per-round width
Each collision round consumes the leading 24-bit block, so a round-r output
entry only carries 8-r meaningful words — but every round previously stored a
fixed 8-word (32 B) slot, moving dead trailing words. Since the collision
rounds are memory-bandwidth bound, that wasted DRAM traffic is wasted time.

Pack each round's slots at w_out = w_in - 1 words. The XOR child producer still
loads a full 256-bit register (over-reading up to SLOT words past a narrow tail
slot — buffers carry a SLOT_SLACK pad so this stays in bounds) but masked-stores
only the w_out live lanes so packed neighbours aren't clobbered. The over-read
garbage only ever lands in non-stored lanes: storing out[0..w_out] = x[1..w_in]
uses exclusively meaningful input words. Width is threaded through
collide/collide_final/emit_bucket from solve_with (round 0 stays full 8-word).

Measured (16 threads, clamp 16/32): ~9.2 -> ~8.4 s/solve (~9%); per-round time
now shrinks r1~1230 -> r6~1045 ms. Cumulative with the parallel-partition change:
~13.4 -> ~8.4 s (-37%). Identical solution yield; xor_child_matches_scalar now
covers every width + the masked-store no-clobber property; cross-clamp validity
and full_solve_baseline pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 11:11:05 -04:00
jackpotincorporated 501527d3cb Parallelize partition_top; add solver benchmark + phase profiling
partition_top was the only serial stage in the otherwise rayon-parallel
collision rounds — plain `for k in 0..n` count and scatter loops that left
15/16 cores idle for ~35% of every round (~650 ms). Replace it with a
parallel counting sort: per-chunk top-bucket histograms, a small serial
pass to per-chunk base offsets, then a disjoint-region scatter through a
shared raw pointer (each chunk writes a provably non-overlapping set of
positions). Entries within a bucket become chunk-major rather than
index-major, which is immaterial: count_pairs/low_group depend only on the
low-key multiset, and solutions are canonicalized, de-duplicated, and
verified downstream.

Measured (16 threads): partition_top ~650 -> ~100 ms/round (6.5x),
collide-final ~1.18 -> ~0.59 s, full solve ~13.4 -> ~9.2 s (-31%,
0.07 -> 0.11 solve/s), with identical solution yield and all validity
tests passing.

Also add (gated/ignored, no production-path behavior change):
- full_solve_baseline: an #[ignore] throughput benchmark over realistic
  dense headers (EQ_BENCH_ITERS / EQ_BENCH_CLAMPS).
- EQ_PROFILE-gated per-phase and per-collide-sub-phase timing in solve_with.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 10:42:54 -04:00
16 changed files with 1506 additions and 242 deletions
+119
View File
@@ -0,0 +1,119 @@
# Building jackpotminer on Windows
The miner builds on Windows with **no external GPU SDKs**. The CUDA driver and
NVML are loaded at runtime (`dlopen`/`LoadLibrary`), and the OpenCL import
library is generated at build time from `windows/OpenCL.def` — so you don't need
the CUDA Toolkit, an OpenCL SDK, or any vendor libraries to compile. The whole
codebase (miner + the `jackpotminer-config` GUI) is verified to compile for
`x86_64-pc-windows-*`.
## Prerequisites
- **Rust** (https://rustup.rs). The default target is `x86_64-pc-windows-msvc`.
- **MSVC build tools** — Visual Studio 2019/2022 or the standalone *Build Tools
for Visual Studio* with the **“Desktop development with C++”** workload (gives
`link.exe`, `lib.exe`, and the Windows SDK).
- Build from a **“x64 Native Tools Command Prompt for VS”** (or any shell where
the MSVC environment is active) so `lib.exe` is found. `cargo` also locates
it automatically via the registry in most cases.
- *(Alternative toolchain)* the GNU target `x86_64-pc-windows-gnu` works too,
with **MinGW-w64** on `PATH` (provides `dlltool` and the linker).
No CUDA Toolkit and **no OpenCL SDK** are required.
## Build
```powershell
:: Default: OpenCL + CUDA backends + the GUI config tool
cargo build --release
:: Miner only (no GUI), both GPU backends
cargo build --release --no-default-features --features gpu,cuda
:: OpenCL only (AMD / Intel / NVIDIA)
cargo build --release --no-default-features --features gpu
:: NVIDIA only needs nothing external at build time (CUDA is dlopen'd)
cargo build --release --no-default-features --features cuda
```
Outputs: `target\release\jackpotminer.exe` (and `jackpotminer-config.exe` with
the default features).
### How the OpenCL build dependency is avoided
`ocl`/`cl-sys` link `OpenCL` (`#[link(name = "OpenCL")]`), which normally needs
an `OpenCL.lib` from a vendor SDK. Instead, `build.rs` generates a
vendor-neutral import library from `windows/OpenCL.def`:
- **MSVC:** `lib.exe /def:windows\OpenCL.def /out:OpenCL.lib /machine:X64`
- **GNU:** `dlltool -d windows/OpenCL.def -l libOpenCL.a -m i386:x86-64`
and puts it on the link search path. The import library only forwards to
`OpenCL.dll`, which the GPU driver provides at runtime. If the toolchain tool
isnt on `PATH`, `build.rs` prints a warning and the link step fails with a clear
message (compilation/`cargo check` still works).
## Distribution
Statically link the MSVC C runtime so users dont need the VC++ redistributable:
```powershell
set RUSTFLAGS=-C target-feature=+crt-static
cargo build --release
```
## Runtime dependencies (on the mining machine)
- **`OpenCL.dll`** — the ICD loader, installed with any GPU driver (AMD/NVIDIA/
Intel). Required for the OpenCL backend.
- **`nvcuda.dll`** + **`nvml.dll`** — installed with the NVIDIA driver. Loaded
on demand for the CUDA backend; absent on AMD-only machines, where the miner
simply reports no CUDA devices.
- The **VC++ runtime**, unless you built with `+crt-static`.
A `cuda`-enabled binary still starts on a machine with no NVIDIA driver.
## Building Windows binaries without a Windows machine
### GitHub Actions (recommended)
```yaml
name: windows
on: [push, workflow_dispatch]
jobs:
build:
runs-on: windows-latest # MSVC + lib.exe already on PATH
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- run: cargo build --release --features gpu,cuda
env:
RUSTFLAGS: -C target-feature=+crt-static
- uses: actions/upload-artifact@v4
with:
name: jackpotminer-windows
path: target/release/*.exe
```
### Cross-compile from Linux (GNU target)
```bash
rustup target add x86_64-pc-windows-gnu
sudo pacman -S mingw-w64 # or your distro's mingw-w64 (gives dlltool + linker)
cargo build --release --target x86_64-pc-windows-gnu --no-default-features --features gpu,cuda
```
The MSVC target cant be linked from Linux. The GUI config tool (`eframe`) is
easiest to build natively on Windows.
## Status / caveats
- **Compilation for Windows is verified** here via `cargo check
--target x86_64-pc-windows-gnu` (default features, and `gpu,cuda`).
- The OpenCL **import-library linking** uses the standard `lib.exe`/`dlltool`
technique; validate it with an actual Windows (or MinGW cross) build, which
needs those tools present.
- `relaunch_in_terminal` (reopen-in-a-terminal on GUI launch) is Linux-only;
harmless on Windows, where double-clicking a console binary already opens a
console.
Generated
+1
View File
@@ -2230,6 +2230,7 @@ version = "0.1.0"
dependencies = [ dependencies = [
"anyhow", "anyhow",
"blake2b_simd", "blake2b_simd",
"cc",
"clap", "clap",
"core_affinity", "core_affinity",
"ctrlc", "ctrlc",
+5
View File
@@ -45,6 +45,11 @@ name = "jackpotminer-config"
path = "src/config_gui.rs" path = "src/config_gui.rs"
required-features = ["config-gui"] required-features = ["config-gui"]
[build-dependencies]
# Locates the MSVC `lib.exe` (to generate the OpenCL import library on Windows);
# unused on other platforms. See build.rs.
cc = "1"
[profile.release] [profile.release]
opt-level = 3 opt-level = 3
lto = true lto = true
+7
View File
@@ -114,6 +114,13 @@ cargo build --release --no-default-features --features cuda # CUDA only
cargo build --release --no-default-features # CPU-only (no GPU) cargo build --release --no-default-features # CPU-only (no GPU)
``` ```
### Windows
Builds with no external GPU SDKs — CUDA/NVML are loaded at runtime and the
OpenCL import library is generated at build time. See
[BUILD-windows.md](BUILD-windows.md) for the toolchain, build, and packaging
steps.
### Portable / distributable builds ### Portable / distributable builds
The miner's only runtime dependencies are the C library and the OpenCL ICD loader The miner's only runtime dependencies are the C library and the OpenCL ICD loader
+79
View File
@@ -0,0 +1,79 @@
//! Build script — Windows OpenCL import library only.
//!
//! On Windows, the `ocl`/`cl-sys` crates link `OpenCL` (`#[link(name =
//! "OpenCL")]`), which normally requires an `OpenCL.lib` import library from a
//! vendor OpenCL SDK. To avoid that build dependency, we generate a
//! vendor-neutral import library ourselves from `windows/OpenCL.def` (the list
//! of OpenCL exports) using the toolchain's own tools — `lib.exe` for MSVC,
//! `dlltool` for the GNU (MinGW) toolchain — and put it on the link search path.
//! The real `OpenCL.dll` (the ICD loader) is supplied at runtime by the GPU
//! driver, exactly like `libOpenCL.so.1` on Linux.
//!
//! Nothing here is needed on non-Windows targets (the system `libOpenCL` is used
//! directly) or when the `gpu` (OpenCL) feature is off — the script no-ops. The
//! CUDA driver / NVML are loaded at runtime via dlopen (see `src/dylib.rs`), so
//! they need no build-script support.
use std::env;
use std::path::PathBuf;
use std::process::Command;
fn main() {
println!("cargo:rerun-if-changed=build.rs");
println!("cargo:rerun-if-changed=windows/OpenCL.def");
// Only Windows targets with the OpenCL backend need an import library.
if env::var("CARGO_CFG_TARGET_OS").as_deref() != Ok("windows") {
return;
}
if env::var_os("CARGO_FEATURE_GPU").is_none() {
return;
}
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
let def = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap()).join("windows/OpenCL.def");
let target = env::var("TARGET").unwrap_or_default();
let is_msvc = env::var("CARGO_CFG_TARGET_ENV").as_deref() == Ok("msvc");
let (mut cmd, tool) = if is_msvc {
// lib.exe /def:OpenCL.def /out:OpenCL.lib /machine:X64
let mut c = cc::windows_registry::find(&target, "lib.exe")
.unwrap_or_else(|| Command::new("lib.exe"));
c.arg("/nologo")
.arg(format!("/def:{}", def.display()))
.arg(format!("/out:{}", out_dir.join("OpenCL.lib").display()))
.arg("/machine:X64");
(c, "lib.exe")
} else {
// GNU/MinGW: dlltool -d OpenCL.def -l libOpenCL.a -m i386:x86-64
let dlltool = ["x86_64-w64-mingw32-dlltool", "dlltool"]
.into_iter()
.find(|t| Command::new(t).arg("--version").output().is_ok())
.unwrap_or("dlltool");
let mut c = Command::new(dlltool);
c.arg("-d")
.arg(&def)
.arg("-l")
.arg(out_dir.join("libOpenCL.a"))
.arg("-m")
.arg("i386:x86-64");
(c, "dlltool")
};
match cmd.status() {
Ok(s) if s.success() => {
// Resolve `#[link(name = "OpenCL")]` against the generated import lib.
println!("cargo:rustc-link-search=native={}", out_dir.display());
}
Ok(s) => panic!("{tool} failed ({s}) generating the OpenCL import library from {}", def.display()),
Err(e) => {
// Tool not found: let `cargo check` (which doesn't link) still succeed;
// a real build will fail at link with a clear "cannot find OpenCL".
println!(
"cargo:warning=could not run {tool} to generate the OpenCL import library ({e}); \
ensure the toolchain tools are on PATH (MSVC dev prompt, or mingw-w64). \
Linking the `gpu` feature will fail until then."
);
}
}
}
+15 -14
View File
@@ -2,8 +2,6 @@
//#define PRINT 1 //#define PRINT 1
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__constant ulong blake_iv[] = __constant ulong blake_iv[] =
{ {
0x6a09e667f3bcc908, 0xbb67ae8584caa73b, 0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
@@ -507,14 +505,22 @@ uint compress2(uint in0, uint in1) {
return tmp; return tmp;
} }
// Exact round(sqrt(x)) without fp64: a single-precision estimate corrected to the
// exact integer floor, then rounded. Lets the kernel build on OpenCL stacks that
// lack cl_khr_fp64 (e.g. rusticl/Mesa) while staying bit-identical on ROCm. Inputs
// are triangular indices (< ~2^26), well within range for the float estimate +
// integer correction.
inline uint isqrt_round(ulong x) {
long m = (long) sqrt((float) x);
while (m > 0 && (ulong)(m * m) > x) m--; // correct down to floor(sqrt)
while ((ulong)((m + 1) * (m + 1)) <= x) m++; // correct up to floor(sqrt)
// round to nearest: round up iff the fractional part is >= 0.5.
return (uint)(((x - (ulong)(m * m)) > (ulong) m) ? (m + 1) : m);
}
uint2 decompress(uint2 in) { uint2 decompress(uint2 in) {
double inFl = (double) (in.s0 >> 6);
inFl *= 2.0;
inFl += 1.0;
uint2 res; uint2 res;
res.s0 = (uint) round(sqrt(inFl)); res.s0 = isqrt_round(2ul * (ulong)(in.s0 >> 6) + 1ul);
uint tmp = res.s0 * (res.s0-1); uint tmp = res.s0 * (res.s0-1);
@@ -535,13 +541,8 @@ uint2 decompress(uint2 in) {
} }
uint2 decompress2(uint in) { uint2 decompress2(uint in) {
double inFl = (double) in;
inFl *= 2.0;
inFl += 1.0;
uint2 res; uint2 res;
res.s0 = (uint) round(sqrt(inFl)); res.s0 = isqrt_round(2ul * (ulong) in + 1ul);
uint tmp = res.s0 * (res.s0-1); uint tmp = res.s0 * (res.s0-1);
+204
View File
@@ -0,0 +1,204 @@
//! AMD GPU telemetry for [`crate::gpu_tune::GpuTuner`] via the Linux amdgpu
//! sysfs interface (`/sys/class/drm/cardN/device/...`).
//!
//! Telemetry only: board power, edge temperature, and core/memory clocks come
//! from the world-readable hwmon + DPM nodes, so it works unprivileged. The
//! control nodes (`pp_od_clk_voltage`, power cap, performance level) are
//! root-write-only and carry GPU-hang risk, so every setter returns
//! [`SetOutcome::Unsupported`] — this backend never writes. The handle is matched
//! to the physical card by PCI bus id (e.g. from OpenCL's `cl_khr_pci_bus_info`),
//! so it lines up with whichever device the solver actually opened.
use std::path::{Path, PathBuf};
use crate::gpu_tune::{GpuTuner, SetOutcome};
/// One amdgpu card's sysfs telemetry handle.
pub struct AmdTuner {
/// Canonical `/sys/.../<PCI BDF>` device directory (holds `pp_dpm_*`).
device_dir: PathBuf,
/// `device_dir/hwmon/hwmonM` (the index M is not stable — resolved by glob).
hwmon_dir: PathBuf,
name: String,
}
// Only `PathBuf`/`String` — `AmdTuner` is `Send` automatically; no `unsafe impl`.
/// Open a telemetry handle for the amdgpu card at `pci_bus_id` (e.g.
/// "0000:03:00.0"). Returns `None` off Linux, or when no amdgpu card with
/// telemetry nodes matches the bus id.
pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
#[cfg(not(target_os = "linux"))]
{
let _ = pci_bus_id;
None
}
#[cfg(target_os = "linux")]
{
let want = bdf_tail(pci_bus_id)?;
for entry in std::fs::read_dir("/sys/class/drm").ok()?.flatten() {
let fname = entry.file_name();
let fname = fname.to_string_lossy();
// Match the GPU nodes ("card0", "card1", …), not the per-connector
// dirs ("card1-DP-1") or render nodes.
if !fname.starts_with("card") || fname.contains('-') {
continue;
}
// `cardN/device` symlinks to the PCI device dir `…/<BDF>`.
let Ok(device_dir) = std::fs::canonicalize(entry.path().join("device")) else {
continue;
};
let matches = device_dir
.file_name()
.and_then(|s| s.to_str())
.and_then(bdf_tail)
.map(|t| t == want)
.unwrap_or(false);
if !matches {
continue;
}
let Some(hwmon_dir) = find_hwmon(&device_dir) else { continue };
// Require a telemetry node so we don't attach to a card without
// sensors (e.g. some virtual/headless devices).
if !hwmon_dir.join("temp1_input").exists()
&& !hwmon_dir.join("power1_average").exists()
{
continue;
}
let name = read_name(&device_dir);
return Some(Box::new(AmdTuner { device_dir, hwmon_dir, name }));
}
None
}
}
/// First `hwmon/hwmon*` subdir of a device dir (the index isn't stable).
fn find_hwmon(device_dir: &Path) -> Option<PathBuf> {
for e in std::fs::read_dir(device_dir.join("hwmon")).ok()?.flatten() {
if e.file_name().to_string_lossy().starts_with("hwmon") {
return Some(e.path());
}
}
None
}
/// A display name for the card; amdgpu sysfs rarely exposes a marketing name, so
/// fall back to a generic label.
fn read_name(device_dir: &Path) -> String {
read_trim(&device_dir.join("product_name"))
.filter(|s| !s.is_empty())
.unwrap_or_else(|| "AMD GPU".to_string())
}
/// Normalize a PCI BDF to its `bus:device.function` tail (drop the domain), so
/// e.g. "0000:03:00.0" and "03:00.0" compare equal.
fn bdf_tail(bdf: &str) -> Option<String> {
let bdf = bdf.trim().to_ascii_lowercase();
match bdf.split(':').collect::<Vec<_>>().as_slice() {
[_domain, bus, devfunc] => Some(format!("{bus}:{devfunc}")),
[bus, devfunc] => Some(format!("{bus}:{devfunc}")),
_ => None,
}
}
fn read_trim(p: &Path) -> Option<String> {
std::fs::read_to_string(p).ok().map(|s| s.trim().to_string())
}
fn read_u64(p: &Path) -> Option<u64> {
read_trim(p)?.parse().ok()
}
/// Parse the MHz value out of a `pp_dpm_*` line like `"2: 2700Mhz *"`.
fn parse_mhz(line: &str) -> Option<u32> {
let after_colon = line.split(':').nth(1)?;
let token = after_colon.trim().split_whitespace().next()?;
let digits: String = token.chars().take_while(|c| c.is_ascii_digit()).collect();
digits.parse().ok()
}
impl AmdTuner {
/// MHz of the currently active DPM level (the line marked `*`) in `file`.
fn dpm_active_mhz(&self, file: &str) -> Option<u32> {
let s = read_trim(&self.device_dir.join(file))?;
s.lines().find(|l| l.contains('*')).and_then(parse_mhz)
}
/// Highest DPM level (MHz) listed in `file`.
fn dpm_max_mhz(&self, file: &str) -> Option<u32> {
let s = read_trim(&self.device_dir.join(file))?;
s.lines().filter_map(parse_mhz).max()
}
}
impl GpuTuner for AmdTuner {
fn name(&self) -> String {
self.name.clone()
}
fn watts(&self) -> Option<f64> {
// power1_average is µW; fall back to the instantaneous power1_input.
let uw = read_u64(&self.hwmon_dir.join("power1_average"))
.or_else(|| read_u64(&self.hwmon_dir.join("power1_input")))?;
Some(uw as f64 / 1_000_000.0)
}
fn temperature_c(&self) -> Option<u32> {
// temp1 = edge (m°C).
let mc = read_u64(&self.hwmon_dir.join("temp1_input"))?;
Some(((mc + 500) / 1000) as u32)
}
fn current_power_limit_w(&self) -> Option<u32> {
// Absent on Navi 44 (RX 9060 XT); best-effort for cards that expose it.
read_u64(&self.hwmon_dir.join("power1_cap")).map(|uw| (uw / 1_000_000) as u32)
}
fn core_clock_mhz(&self) -> Option<u32> {
// freq1_input (Hz) is the live sclk; the DPM active level is the fallback.
if let Some(hz) = read_u64(&self.hwmon_dir.join("freq1_input")) {
if hz > 0 {
return Some((hz / 1_000_000) as u32);
}
}
self.dpm_active_mhz("pp_dpm_sclk")
}
fn mem_clock_mhz(&self) -> Option<u32> {
self.dpm_active_mhz("pp_dpm_mclk")
}
fn max_core_clock_mhz(&self) -> Option<u32> {
self.dpm_max_mhz("pp_dpm_sclk")
}
fn max_mem_clock_mhz(&self) -> Option<u32> {
self.dpm_max_mhz("pp_dpm_mclk")
}
fn power_limit_range_w(&self) -> Option<(u32, u32)> {
let mn = read_u64(&self.hwmon_dir.join("power1_cap_min"))?;
let mx = read_u64(&self.hwmon_dir.join("power1_cap_max"))?;
Some(((mn / 1_000_000) as u32, (mx / 1_000_000) as u32))
}
// Telemetry-only backend: never writes the root-only control nodes.
fn set_persistence(&self, _on: bool) -> SetOutcome {
SetOutcome::Unsupported
}
fn lock_core_clock_mhz(&self, _mhz: u32) -> SetOutcome {
SetOutcome::Unsupported
}
fn lock_mem_clock_mhz(&self, _mhz: u32) -> SetOutcome {
SetOutcome::Unsupported
}
fn set_power_limit_w(&self, _watts: u32) -> SetOutcome {
SetOutcome::Unsupported
}
fn set_core_offset_mhz(&self, _mhz: i32) -> SetOutcome {
SetOutcome::Unsupported
}
fn set_mem_offset_mhz(&self, _mhz: i32) -> SetOutcome {
SetOutcome::Unsupported
}
fn reset(&self) {}
}
+95
View File
@@ -215,6 +215,92 @@ impl BatchHasher {
} }
} }
/// Hash eight consecutive indices `g0..g0+8`, writing each 48-byte digest
/// into `out[0..8]`. Uses AVX-512 (one 8-wide BLAKE2b compression) when
/// available, else two AVX2 `hash4`s. Modern AMD (Zen4+) and Intel CPUs run
/// AVX-512 without the clock penalty older Intel parts had.
#[inline]
pub fn hash8(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
#[cfg(target_arch = "x86_64")]
{
if is_x86_feature_detected!("avx512f") {
unsafe { self.hash8_avx512(g0, out) };
return;
}
}
// Fallback: two 4-lane batches into the two halves of `out`.
let (lo, hi) = out.split_at_mut(4);
self.hash4(g0, (&mut lo[..4]).try_into().unwrap());
self.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
}
/// Eight-lane BLAKE2b final-block compression (AVX-512). Same structure as
/// [`Self::hash4_avx2`] but 512-bit lanes and native 64-bit rotates.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn hash8_avx512(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
use core::arch::x86_64::*;
#[inline(always)]
unsafe fn g8(
v: &mut [__m512i; 16],
a: usize, b: usize, c: usize, d: usize,
x: __m512i, y: __m512i,
) {
v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), x);
v[d] = _mm512_ror_epi64::<32>(_mm512_xor_si512(v[d], v[a]));
v[c] = _mm512_add_epi64(v[c], v[d]);
v[b] = _mm512_ror_epi64::<24>(_mm512_xor_si512(v[b], v[c]));
v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), y);
v[d] = _mm512_ror_epi64::<16>(_mm512_xor_si512(v[d], v[a]));
v[c] = _mm512_add_epi64(v[c], v[d]);
v[b] = _mm512_ror_epi64::<63>(_mm512_xor_si512(v[b], v[c]));
}
// Only m0 and m1 are nonzero; m1's high 32 bits hold the per-lane `g`.
let tail0 = u64::from_le_bytes(self.tail[0..8].try_into().unwrap());
let tail_hi = u32::from_le_bytes(self.tail[8..12].try_into().unwrap()) as u64;
let m1 = |g: u32| (tail_hi | ((g as u64) << 32)) as i64;
let mut m = [_mm512_setzero_si512(); 16];
m[0] = _mm512_set1_epi64(tail0 as i64);
m[1] = _mm512_set_epi64(
m1(g0 + 7), m1(g0 + 6), m1(g0 + 5), m1(g0 + 4),
m1(g0 + 3), m1(g0 + 2), m1(g0 + 1), m1(g0),
);
let mut v = [_mm512_setzero_si512(); 16];
for i in 0..8 {
v[i] = _mm512_set1_epi64(self.mid[i] as i64);
v[i + 8] = _mm512_set1_epi64(IV[i] as i64);
}
v[12] = _mm512_xor_si512(v[12], _mm512_set1_epi64(FINAL_COUNT as i64));
v[14] = _mm512_xor_si512(v[14], _mm512_set1_epi64(-1)); // last-block flag
for s in &SIGMA {
g8(&mut v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
g8(&mut v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
g8(&mut v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
g8(&mut v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
g8(&mut v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
g8(&mut v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
g8(&mut v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
g8(&mut v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
}
// h[i] = mid[i] ^ v[i] ^ v[i+8]; first HASH_OUTPUT/8 words per lane, LE.
let mut tmp = [0u64; 8];
for i in 0..HASH_OUTPUT / 8 {
let o = _mm512_xor_si512(
_mm512_xor_si512(_mm512_set1_epi64(self.mid[i] as i64), v[i]),
v[i + 8],
);
_mm512_storeu_si512(tmp.as_mut_ptr() as *mut _, o);
for l in 0..8 {
out[l][i * 8..i * 8 + 8].copy_from_slice(&tmp[l].to_le_bytes());
}
}
}
#[cfg(target_arch = "x86_64")] #[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")] #[target_feature(enable = "avx2")]
unsafe fn hash4_avx2(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 4]) { unsafe fn hash4_avx2(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 4]) {
@@ -316,5 +402,14 @@ mod tests {
assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash4 mismatch at g={}", base_g + l as u32); assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash4 mismatch at g={}", base_g + l as u32);
} }
} }
// hash8 (AVX-512 or hash4 fallback) must match for several batches.
for base_g in [0u32, 8, 64, 1_000_000] {
let mut out = [[0u8; HASH_OUTPUT]; 8];
hasher.hash8(base_g, &mut out);
for l in 0..8 {
assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash8 mismatch at g={}", base_g + l as u32);
}
}
} }
} }
+3 -3
View File
@@ -175,15 +175,15 @@ mod tests {
#[test] #[test]
fn protocol_get_and_set() { fn protocol_get_and_set() {
let controls = Controls::new(1, 0, 0, 0, 0, true); let controls = Controls::new(1, 0, 0, 0, 0, true);
let cpu = CpuMining::new((0..8).collect(), 4, false); // sizes [1,2,4,8] let cpu = CpuMining::new((0..16).collect(), 4, false); // 16 cores -> tier cap 4, sizes [1,2,4]
let stats = Stats::for_test(vec!["GPU 0".into()]); let stats = Stats::for_test(vec!["GPU 0".into()]);
// get: one device, cpu rows reflect the grouping (8 cores / 4 = 2 rows). // get: one device, cpu rows reflect the grouping (16 cores / 4 = 4 rows).
let v = process("{\"op\":\"get\"}", &controls, &cpu, &stats); let v = process("{\"op\":\"get\"}", &controls, &cpu, &stats);
assert_eq!(v["ok"], true); assert_eq!(v["ok"], true);
assert_eq!(v["devices"].as_array().unwrap().len(), 1); assert_eq!(v["devices"].as_array().unwrap().len(), 1);
assert_eq!(v["cpu"]["group_size"], 4); assert_eq!(v["cpu"]["group_size"], 4);
assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 2); assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 4);
// set device 0 disabled. // set device 0 disabled.
assert!(controls.device(0).enabled()); assert!(controls.device(0).enabled());
+70 -25
View File
@@ -165,6 +165,21 @@ fn grouped(cores: &[usize], size: usize, enabled: &BTreeSet<usize>) -> Arc<CpuGr
groups groups
} }
/// Largest cores-per-row the dashboard allows, by total core count. Finer
/// control on small machines, coarser on big ones so the toggleable-row count
/// stays manageable: **≤4 cores** toggle individually (size 1), **5-8 cores** in
/// groups of up to 2, and **more than 8** in groups of up to 4. This caps both
/// the starting group size and the 'g' cycle.
fn max_group_size(cores: usize) -> usize {
if cores <= 4 {
1
} else if cores <= 8 {
2
} else {
4
}
}
/// Live controller for CPU mining. Owns the selected cores plus the current /// Live controller for CPU mining. Owns the selected cores plus the current
/// group size, and rebuilds the [`CpuGroups`] when the size is cycled from the /// group size, and rebuilds the [`CpuGroups`] when the size is cycled from the
/// dashboard (the worker supervisor watches `group_size()` and respawns to /// dashboard (the worker supervisor watches `group_size()` and respawns to
@@ -188,14 +203,16 @@ impl CpuMining {
/// the starting group size (`--cpu-group-size`), and `start_enabled` whether /// the starting group size (`--cpu-group-size`), and `start_enabled` whether
/// mining begins on (`--cpu-mining`). /// mining begins on (`--cpu-mining`).
pub fn new(cores: Vec<usize>, initial_size: usize, start_enabled: bool) -> Arc<Self> { pub fn new(cores: Vec<usize>, initial_size: usize, start_enabled: bool) -> Arc<Self> {
let initial_size = initial_size.max(1); // Toggle granularity scales with core count so the row count stays
// manageable (see [`max_group_size`]): the cap bounds both the starting
// size and the dashboard 'g' cycle.
let cap = max_group_size(cores.len());
let initial_size = initial_size.clamp(1, cap);
let enabled: BTreeSet<usize> = if start_enabled { cores.iter().copied().collect() } else { BTreeSet::new() }; let enabled: BTreeSet<usize> = if start_enabled { cores.iter().copied().collect() } else { BTreeSet::new() };
let groups = grouped(&cores, initial_size, &enabled); let groups = grouped(&cores, initial_size, &enabled);
// Cycle list: the usual powers of two plus the requested size, capped so a // Cycle list: powers of two from 1 up to the cap, plus the (clamped)
// group never exceeds the core count (unless the user explicitly asked for // requested size, sorted and de-duplicated.
// a larger size), sorted and de-duplicated.
let cap = cores.len().max(initial_size).max(1);
let mut sizes: Vec<usize> = [1usize, 2, 4, 8] let mut sizes: Vec<usize> = [1usize, 2, 4, 8]
.into_iter() .into_iter()
.chain([initial_size]) .chain([initial_size])
@@ -346,39 +363,67 @@ mod tests {
} }
#[test] #[test]
fn cpu_mining_cycles_size_and_preserves_enabled_cores() { fn group_size_tier_by_core_count() {
// 8 cores, start at size 4 fully enabled -> two groups [0-3],[4-7], both on. // ≤4 cores: individual cores only — size 1, no larger option.
let m = CpuMining::new((0..8).collect(), 4, true); let m = CpuMining::new((0..4).collect(), 4, false); // requested 4 clamps to 1
assert_eq!(m.group_size(), 1);
assert_eq!(m.groups().len(), 4);
m.cycle_group_size();
assert_eq!(m.group_size(), 1); // only one size, cycle is a no-op
// 5-8 cores: groups of up to 2 (requested 4 clamps to 2).
let m = CpuMining::new((0..8).collect(), 4, false);
assert_eq!(m.group_size(), 2);
assert_eq!(m.groups().len(), 4);
m.cycle_group_size();
assert_eq!(m.group_size(), 1); // cycle covers {1, 2}
// >8 cores: groups of up to 4 (the default).
let m = CpuMining::new((0..16).collect(), 4, false);
assert_eq!(m.group_size(), 4); assert_eq!(m.group_size(), 4);
assert_eq!(m.groups().len(), 2); assert_eq!(m.groups().len(), 4);
// A larger explicit request is still capped at the tier max.
let m = CpuMining::new((0..16).collect(), 8, false);
assert_eq!(m.group_size(), 4);
}
#[test]
fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
// 16 cores (tier cap 4): start at size 4 fully enabled -> four groups,
// all on.
let m = CpuMining::new((0..16).collect(), 4, true);
assert_eq!(m.group_size(), 4);
assert_eq!(m.groups().len(), 4);
assert!(m.groups().iter().all(|g| g.enabled())); assert!(m.groups().iter().all(|g| g.enabled()));
// Disable the second group (cores 4-7): now only cores 0-3 are enabled. // Disable the last group (cores 12-15): now only cores 0-11 are enabled.
m.toggle_group(1); m.toggle_group(3);
assert!(m.groups().group(0).enabled()); assert!(!m.groups().group(3).enabled());
assert!(!m.groups().group(1).enabled());
// Cycle to size 8 and rebuild: the single [0-7] group is off, because not // Cycle to size 1 (individual cores) and rebuild: 16 rows; cores 0-11 on,
// all of its cores were enabled. // 12-15 off.
while m.group_size() != 8 { while m.group_size() != 1 {
m.cycle_group_size(); m.cycle_group_size();
} }
let g = m.rebuild(); let g = m.rebuild();
assert_eq!(g.len(), 1); assert_eq!(g.len(), 16);
assert!(!g.group(0).enabled()); assert!(g.group(0).enabled()); // core 0
assert!(g.group(11).enabled()); // core 11
assert!(!g.group(12).enabled()); // core 12
assert!(!g.group(15).enabled()); // core 15
// Cycle to size 2 and rebuild: cores 0-3 are still tracked as enabled, so // Cycle to size 2 and rebuild: cores 0-11 are still tracked as enabled, so
// [0,1] and [2,3] come back on while [4,5] and [6,7] stay off — the choice // [0,1]..[10,11] come back on while [12,13],[14,15] stay off — the choice
// survived two regroups (not derived from the all-off size-8 grouping). // survived two regroups.
while m.group_size() != 2 { while m.group_size() != 2 {
m.cycle_group_size(); m.cycle_group_size();
} }
let g = m.rebuild(); let g = m.rebuild();
assert_eq!(g.len(), 4); assert_eq!(g.len(), 8);
assert!(g.group(0).enabled()); // [0,1] assert!(g.group(0).enabled()); // [0,1]
assert!(g.group(1).enabled()); // [2,3] assert!(g.group(5).enabled()); // [10,11]
assert!(!g.group(2).enabled()); // [4,5] assert!(!g.group(6).enabled()); // [12,13]
assert!(!g.group(3).enabled()); // [6,7] assert!(!g.group(7).enabled()); // [14,15]
assert!(m.generation() >= 2); assert!(m.generation() >= 2);
} }
+442 -119
View File
@@ -131,23 +131,31 @@ fn leaf_row(base: &State, idx: u32) -> Row {
// histogram that has to be zeroed, scatter-filled, and cloned every round). Two // histogram that has to be zeroed, scatter-filled, and cloned every round). Two
// entries can collide only when their *whole* leading block matches, so they // entries can collide only when their *whole* leading block matches, so they
// always land in the same partition — making the partitions independent // always land in the same partition — making the partitions independent
// collision domains that are processed in parallel across the rayon pool the // collision domains processed in parallel across the rayon pool.
// single CPU worker otherwise leaves idle.
// //
// The data layout follows xenoncat's packed slots and targets AVX2 throughput: // The data layout targets AVX2 throughput; the round is memory-bandwidth bound,
// * Entries live in fixed 32-byte `SLOT`s (one `__m256i`); producing a child // so every choice below exists to cut DRAM traffic / random access:
// is a single load/xor/permute/store (`xor_child_avx2`). // * `partition_top` is a parallel counting sort (per-chunk histograms ->
// * A dense parallel `keys[]` array mirrors each entry's leading block, so the // per-chunk base offsets -> disjoint-region scatter), not a serial pass.
// `partition_top`/`count_pairs`/`low_group` histogram sweeps stream over // * It also emits `keys_part` (the leading keys in partition order) so the
// 4 bytes/entry instead of striding the slots. // per-partition `count_pairs`/`low_group` sweeps read keys sequentially
// instead of gathering `keys[order[..]]`; `low_group` likewise emits
// `keys_sorted` so the emit group walk streams a dense local copy.
// * Slots hold residual hash words in a *narrowing* packed slot: round 0 is
// `SLOT` words; each round consumes the leading block, so round r packs at
// `SLOT - r` words. The XOR child is one 256-bit load/xor/permute that
// masked-stores the live lanes (`xor_child_avx2`); buffers carry a
// `SLOT_SLACK` pad for the 256-bit over-read of a narrow tail slot.
// * `collide` runs two passes — count pairs, then emit children directly into // * `collide` runs two passes — count pairs, then emit children directly into
// one pre-sized arena at per-partition offsets — so there is no per-bucket // one pre-sized arena at per-partition offsets — so there is no per-bucket
// allocation and no final concatenation copy; the `keys`/`slots` buffers // allocation and no final concatenation copy; the `keys`/`slots` buffers
// ping-pong between rounds. // ping-pong between rounds.
// * Round 0 repacks BLAKE2b digests into 24-bit blocks with `pshufb`. // * Round 0 repacks BLAKE2b digests into 24-bit blocks with `pshufb`.
// A further bandwidth step (not yet taken) is xenoncat's full radix scatter of // The remaining bottleneck is the random slot gather in `emit_bucket` over the
// the payload into bucket-contiguous storage, turning the inner-loop slot // ~1 GiB round buffer — the Equihash memory-hardness floor. A full radix scatter
// gathers into sequential reads. // of the payload into bucket-contiguous storage (to make that gather sequential)
// was evaluated and loses: the wide-slot scatter costs more than the gather it
// saves on the early, wide rounds.
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
/// 24-bit collision blocks carried by a round-0 entry (= 8 for 192,7). /// 24-bit collision blocks carried by a round-0 entry (= 8 for 192,7).
@@ -164,36 +172,53 @@ const LOW_BUCKETS: usize = 1 << LOW_BITS;
/// Mask isolating the low resolved bits. /// Mask isolating the low resolved bits.
const LOW_MASK: u32 = (LOW_BUCKETS - 1) as u32; const LOW_MASK: u32 = (LOW_BUCKETS - 1) as u32;
/// Number of u32 words in a padded entry slot (32 bytes = one AVX2 register). /// Number of u32 words in a round-0 entry slot (32 bytes = one AVX2 register).
/// Every round stores its residual hash words in a fixed 8-word slot so the XOR /// Round 0 stores all eight 24-bit blocks; the XOR that produces a child is a
/// that produces a child is a single 256-bit load/xor/permute/store and every /// single 256-bit load/xor/permute — xenoncat's packed-slot trick (and the same
/// slot access is naturally aligned — xenoncat's packed-slot trick (and the same /// `uint4`-aligned-slot idea the CUDA backend uses), on the CPU. The leading
/// `uint4`-aligned-slot idea the CUDA backend already uses), on the CPU. The /// collision word lives in lane 0 and is mirrored into a dense parallel `keys[]`
/// leading collision word lives in lane 0 and is mirrored into a dense parallel /// array so the histogram passes stream over 4 bytes/entry instead of striding
/// `keys[]` array so the histogram passes stream over 4 bytes/entry instead of /// the slots.
/// striding the 32-byte slots. ///
/// Later rounds use a *narrower* pitch: each collision round consumes the leading
/// block, so a round-`r` output entry only carries `8 - r` meaningful words. The
/// solver packs each round's slots at that width (see `collide`'s `w_out`),
/// cutting the per-round slot-buffer DRAM traffic that bounds the collision rounds.
/// The XOR producer still loads a full 256-bit register (over-reading up to
/// `SLOT` words past a narrow tail slot, hence the `SLOT_SLACK` pad) but
/// masked-stores only the `w_out` live lanes so packed neighbours aren't touched.
const SLOT: usize = 8; const SLOT: usize = 8;
/// Scalar child producer: `out[0..8] = (a XOR b)` rotated left one lane; returns /// Trailing pad (in u32 words) on every slot buffer so the XOR producer's 256-bit
/// the child's new leading word (lane 1 of the XOR). Lane 0 of the XOR is the /// load over a narrow tail slot stays in bounds. The over-read reaches at most
/// just-collided block (always zero) and is rotated out. /// `(n-1)*w + SLOT` words for pitch `w`, i.e. `SLOT - w ≤ SLOT` words past the
/// `n*w` payload; `SLOT` words always covers it.
const SLOT_SLACK: usize = SLOT;
/// Scalar child producer: writes the `w_out` live words of `(a XOR b)` rotated
/// left one lane into `out`, and returns the child's new leading word (lane 1 of
/// the XOR). Lane 0 of the XOR is the just-collided block (always zero) and is
/// rotated out. `a`/`b` are read a full `SLOT` words wide (the caller pads each
/// slot buffer by `SLOT_SLACK`); only `out[0..w_out]` is written.
#[inline] #[inline]
unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32) -> u32 { unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
let mut x = [0u32; SLOT]; let mut x = [0u32; SLOT];
for t in 0..SLOT { for t in 0..SLOT {
x[t] = *a.add(t) ^ *b.add(t); x[t] = *a.add(t) ^ *b.add(t);
} }
for t in 0..SLOT { // out[t] = x[(t + 1) % SLOT]; for t < w_out <= SLOT-1 the modulo is a no-op.
*out.add(t) = x[(t + 1) % SLOT]; for t in 0..w_out {
*out.add(t) = x[t + 1];
} }
x[1] x[1]
} }
/// AVX2 child producer: one `vpxor` + one `vpermd` (rotate the 8 lanes left by /// AVX2 child producer: one `vpxor` + one `vpermd` (rotate the 8 lanes left by
/// one) + one store. Replaces the per-word scalar XOR loop. /// one), then a masked store of the low `w_out` lanes so the packed, `w_out`-
/// pitched output never clobbers the next slot.
#[cfg(target_arch = "x86_64")] #[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")] #[target_feature(enable = "avx2")]
unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 { unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
use core::arch::x86_64::*; use core::arch::x86_64::*;
let x = _mm256_xor_si256( let x = _mm256_xor_si256(
_mm256_loadu_si256(a as *const __m256i), _mm256_loadu_si256(a as *const __m256i),
@@ -201,7 +226,12 @@ unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
); );
// rotate left by one 32-bit lane: out[i] = x[(i + 1) % 8] // rotate left by one 32-bit lane: out[i] = x[(i + 1) % 8]
let p = _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0)); let p = _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0));
_mm256_storeu_si256(out as *mut __m256i, p); // mask = lanes [0, w_out) -> all-ones; maskstore writes only those.
let mask = _mm256_cmpgt_epi32(
_mm256_set1_epi32(w_out as i32),
_mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
);
_mm256_maskstore_epi32(out as *mut i32, mask, p);
_mm_cvtsi128_si32(_mm256_castsi256_si128(p)) as u32 _mm_cvtsi128_si32(_mm256_castsi256_si128(p)) as u32
} }
@@ -244,38 +274,113 @@ fn repack_index(src: &[u8], dst: &mut [u32]) {
repack_index_scalar(src, dst); repack_index_scalar(src, dst);
} }
/// A raw `*mut u32` shared across rayon workers for the parallel scatter below.
/// Sound only because the workers write provably disjoint position sets.
#[derive(Clone, Copy)]
struct OrderPtr(*mut u32);
unsafe impl Send for OrderPtr {}
unsafe impl Sync for OrderPtr {}
/// Partition the `n` entries into `TOP_BUCKETS` runs by the high `TOP_BITS` of /// Partition the `n` entries into `TOP_BUCKETS` runs by the high `TOP_BITS` of
/// their (dense) leading block. Returns `(starts, order)`, where partition `v` /// their (dense) leading block. Returns `(starts, order, keys_part)`: partition
/// owns the input indices `order[starts[v]..starts[v + 1]]`. The histogram /// `v` owns the input indices `order[starts[v]..starts[v + 1]]`, and `keys_part`
/// passes stream over `keys[]` (4 bytes/entry) instead of striding the slots. /// is the entries' leading words in the same partition-contiguous order
fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>) { /// (`keys_part[p] == keys[order[p]]`). Carrying `keys_part` lets the per-partition
/// `count_pairs`/`low_group` sweeps read keys sequentially instead of gathering
/// `keys[order[..]]` over the whole array — at the cost of one extra 4-byte/entry
/// scatter here, folded into the (already parallel) phase-3 pass.
///
/// Parallel counting sort: the input is split into one contiguous chunk per
/// rayon worker. Each chunk histograms its slice (phase 1), a small serial pass
/// turns those into per-chunk base offsets within each bucket's output region
/// (phase 2), and each chunk scatters its entries into `order`/`keys_part`
/// (phase 3). Chunk `c`'s bucket-`b` writes land in `[off[c][b], off[c+1][b])`,
/// disjoint from every other chunk and bucket, so the concurrent writes never
/// alias. Entries within a bucket end up chunk-major rather than index-major;
/// that reordering is immaterial — `count_pairs`/`low_group` depend only on the
/// multiset of low keys, and final solutions are canonicalised, de-duplicated,
/// and verified.
fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>, Vec<u32>) {
let mut starts = vec![0u32; TOP_BUCKETS + 1]; let mut starts = vec![0u32; TOP_BUCKETS + 1];
for k in 0..n {
starts[(keys[k] >> LOW_BITS) as usize + 1] += 1;
}
for i in 0..TOP_BUCKETS {
starts[i + 1] += starts[i];
}
let mut order = vec![0u32; n]; let mut order = vec![0u32; n];
let mut cur = starts.clone(); // small: TOP_BUCKETS + 1 entries let mut keys_part = vec![0u32; n];
for k in 0..n { if n == 0 {
let b = (keys[k] >> LOW_BITS) as usize; return (starts, order, keys_part);
order[cur[b] as usize] = k as u32; }
let nthreads = rayon::current_num_threads().max(1);
let chunk = n.div_ceil(nthreads);
let nchunks = n.div_ceil(chunk);
// Phase 1 (parallel): per-chunk top-bucket histograms.
let local_hists: Vec<Vec<u32>> = (0..nchunks)
.into_par_iter()
.map(|c| {
let lo = c * chunk;
let hi = ((c + 1) * chunk).min(n);
let mut h = vec![0u32; TOP_BUCKETS];
for &key in &keys[lo..hi] {
h[(key >> LOW_BITS) as usize] += 1;
}
h
})
.collect();
// Phase 2 (serial, nchunks * TOP_BUCKETS work): bucket starts, then each
// chunk's per-bucket base offset within its bucket's output region.
let mut totals = vec![0u32; TOP_BUCKETS];
for h in &local_hists {
for b in 0..TOP_BUCKETS {
totals[b] += h[b];
}
}
for b in 0..TOP_BUCKETS {
starts[b + 1] = starts[b] + totals[b];
}
let mut offsets = vec![vec![0u32; TOP_BUCKETS]; nchunks];
let mut running = starts[..TOP_BUCKETS].to_vec(); // running[b] starts at starts[b]
for c in 0..nchunks {
for b in 0..TOP_BUCKETS {
offsets[c][b] = running[b];
running[b] += local_hists[c][b];
}
}
// Phase 3 (parallel): each chunk scatters into its disjoint sub-ranges.
let optr = OrderPtr(order.as_mut_ptr());
let kptr = OrderPtr(keys_part.as_mut_ptr());
offsets.into_par_iter().enumerate().for_each(|(c, mut cur)| {
let (optr, kptr) = (optr, kptr); // capture whole (Sync) wrappers
let lo = c * chunk;
let hi = ((c + 1) * chunk).min(n);
let (obase, kbase) = (optr.0, kptr.0);
for k in lo..hi {
let key = keys[k];
let b = (key >> LOW_BITS) as usize;
let pos = cur[b] as usize;
// SAFETY: `pos` ranges over `[off[c][b], off[c+1][b])`, a range owned
// exclusively by chunk `c` and within `order`/`keys_part` bounds.
unsafe {
*obase.add(pos) = k as u32;
*kbase.add(pos) = key;
}
cur[b] += 1; cur[b] += 1;
} }
(starts, order) });
(starts, order, keys_part)
} }
/// Count the colliding pairs a partition will emit, from the low-bit histogram /// Count the colliding pairs a partition will emit, from the low-bit histogram
/// alone (no reordering). `clamp` caps each exact-collision group, matching the /// of its (partition-contiguous) leading keys `keys_run`. `clamp` caps each
/// emit pass so the output offsets line up. `hist` is reusable `LOW_BUCKETS` /// exact-collision group, matching the emit pass so the output offsets line up.
/// scratch. /// `hist` is reusable `LOW_BUCKETS` scratch.
fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usize { fn count_pairs(keys_run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
for h in hist.iter_mut() { for h in hist.iter_mut() {
*h = 0; *h = 0;
} }
for &k in run { for &key in keys_run {
hist[(keys[k as usize] & LOW_MASK) as usize] += 1; hist[(key & LOW_MASK) as usize] += 1;
} }
let mut pairs = 0usize; let mut pairs = 0usize;
for i in 0..LOW_BUCKETS { for i in 0..LOW_BUCKETS {
@@ -285,31 +390,44 @@ fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usi
pairs pairs
} }
/// Within one partition, group `run`'s entries by the low bits of their leading /// Within one partition, group its entries by the low bits of their leading
/// block, writing the grouped indices into `sorted`. `hist` is reusable /// block. Inputs are the partition-contiguous slabs `keys_run` (leading keys) and
/// `LOW_BUCKETS + 1` scratch. After this call `sorted` lists the run's indices /// `order_run` (matching global indices). Outputs, in low-key-sorted order:
/// with equal low keys contiguous, so callers recover each exact-collision /// `sorted` (the global indices, for the emit slot gather + back-refs) and
/// group by walking adjacent equal keys. /// `keys_sorted` (the leading keys, so the emit group walk streams a dense local
fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>) { /// array instead of gathering `keys[sorted[i]]`). `hist` is reusable
let m = run.len(); /// `LOW_BUCKETS + 1` scratch. Both reads are sequential over the slabs.
fn low_group(
keys_run: &[u32],
order_run: &[u32],
hist: &mut [u32],
sorted: &mut Vec<u32>,
keys_sorted: &mut Vec<u32>,
) {
let m = order_run.len();
sorted.clear(); sorted.clear();
keys_sorted.clear();
if m == 0 { if m == 0 {
return; return;
} }
for h in hist.iter_mut() { for h in hist.iter_mut() {
*h = 0; *h = 0;
} }
for &k in run { for &key in keys_run {
hist[(keys[k as usize] & LOW_MASK) as usize + 1] += 1; hist[(key & LOW_MASK) as usize + 1] += 1;
} }
for i in 0..LOW_BUCKETS { for i in 0..LOW_BUCKETS {
hist[i + 1] += hist[i]; hist[i + 1] += hist[i];
} }
sorted.resize(m, 0); sorted.resize(m, 0);
keys_sorted.resize(m, 0);
// hist[low] now holds the run-start offset; reuse it as the live cursor. // hist[low] now holds the run-start offset; reuse it as the live cursor.
for &k in run { for i in 0..m {
let low = (keys[k as usize] & LOW_MASK) as usize; let key = keys_run[i];
sorted[hist[low] as usize] = k; let low = (key & LOW_MASK) as usize;
let pos = hist[low] as usize;
sorted[pos] = order_run[i];
keys_sorted[pos] = key;
hist[low] += 1; hist[low] += 1;
} }
} }
@@ -320,26 +438,42 @@ fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>)
/// Monomorphised over the XOR producer so the AVX2 intrinsics inline cleanly /// Monomorphised over the XOR producer so the AVX2 intrinsics inline cleanly
/// inside a `target_feature` wrapper while sharing one source of truth. /// inside a `target_feature` wrapper while sharing one source of truth.
macro_rules! emit_bucket_body { macro_rules! emit_bucket_body {
($keys:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $xor:path) => {{ ($keys_sorted:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $w_in:expr, $w_out:expr, $prefetch:expr, $xor:path) => {{
let s = $sorted; let s = $sorted;
let ks = $keys_sorted; // leading keys in `s`-order; group walk streams it
let m = s.len(); let m = s.len();
let w_in = $w_in;
let w_out = $w_out;
let mut w = 0usize; let mut w = 0usize;
let mut i = 0; let mut i = 0;
while i < m { while i < m {
let key = $keys[s[i] as usize] & LOW_MASK; let key = ks[i] & LOW_MASK;
let mut j = i + 1; let mut j = i + 1;
while j < m && ($keys[s[j] as usize] & LOW_MASK) == key { while j < m && (ks[j] & LOW_MASK) == key {
j += 1; j += 1;
} }
let hi = j.min(i.saturating_add($clamp)); let hi = j.min(i.saturating_add($clamp));
// The pair loops below read `slots[s[a]]` / `slots[s[b]]` at random
// global positions in the ~1 GB slot arena — each a likely cache
// miss. Prefetch this group's member slots up front so the misses
// overlap with the (L1-resident) pair XORs that follow.
#[cfg(target_arch = "x86_64")]
if $prefetch {
for a in i..hi {
core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T0 }>(
$slots.as_ptr().add(s[a] as usize * w_in) as *const i8,
);
}
}
for a in i..hi { for a in i..hi {
let l = s[a] as usize; let l = s[a] as usize;
for b in (a + 1)..hi { for b in (a + 1)..hi {
let mr = s[b] as usize; let mr = s[b] as usize;
let nk = $xor( let nk = $xor(
$sout.as_mut_ptr().add(w * SLOT), $sout.as_mut_ptr().add(w * w_out),
$slots.as_ptr().add(l * SLOT), $slots.as_ptr().add(l * w_in),
$slots.as_ptr().add(mr * SLOT), $slots.as_ptr().add(mr * w_in),
w_out,
); );
$kout[w] = nk; $kout[w] = nk;
$pout[w] = ((l as u64) << 32) | mr as u64; $pout[w] = ((l as u64) << 32) | mr as u64;
@@ -353,69 +487,101 @@ macro_rules! emit_bucket_body {
} }
unsafe fn emit_bucket_scalar( unsafe fn emit_bucket_scalar(
keys: &[u32], keys_sorted: &[u32],
slots: &[u32], slots: &[u32],
sorted: &[u32], sorted: &[u32],
kout: &mut [u32], kout: &mut [u32],
sout: &mut [u32], sout: &mut [u32],
pout: &mut [u64], pout: &mut [u64],
clamp: usize, clamp: usize,
w_in: usize,
w_out: usize,
prefetch: bool,
) -> usize { ) -> usize {
emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_scalar) emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_scalar)
} }
#[cfg(target_arch = "x86_64")] #[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")] #[target_feature(enable = "avx2")]
unsafe fn emit_bucket_avx2( unsafe fn emit_bucket_avx2(
keys: &[u32], keys_sorted: &[u32],
slots: &[u32], slots: &[u32],
sorted: &[u32], sorted: &[u32],
kout: &mut [u32], kout: &mut [u32],
sout: &mut [u32], sout: &mut [u32],
pout: &mut [u64], pout: &mut [u64],
clamp: usize, clamp: usize,
w_in: usize,
w_out: usize,
prefetch: bool,
) -> usize { ) -> usize {
emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_avx2) emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_avx2)
} }
/// Emit a partition's children, dispatching to the AVX2 producer when available. /// Emit a partition's children, dispatching to the AVX2 producer when available.
/// `w_in`/`w_out` are the input/output slot pitches (`w_out == w_in - 1`).
/// `prefetch` software-prefetches each group's randomly-gathered member slots.
unsafe fn emit_bucket( unsafe fn emit_bucket(
keys: &[u32], keys_sorted: &[u32],
slots: &[u32], slots: &[u32],
sorted: &[u32], sorted: &[u32],
kout: &mut [u32], kout: &mut [u32],
sout: &mut [u32], sout: &mut [u32],
pout: &mut [u64], pout: &mut [u64],
clamp: usize, clamp: usize,
w_in: usize,
w_out: usize,
prefetch: bool,
) -> usize { ) -> usize {
#[cfg(target_arch = "x86_64")] #[cfg(target_arch = "x86_64")]
{ {
if is_x86_feature_detected!("avx2") { if is_x86_feature_detected!("avx2") {
return emit_bucket_avx2(keys, slots, sorted, kout, sout, pout, clamp); return emit_bucket_avx2(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch);
} }
} }
emit_bucket_scalar(keys, slots, sorted, kout, sout, pout, clamp) emit_bucket_scalar(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch)
} }
/// Group `n` entries by their leading block, then emit one child per colliding /// Group `n` entries by their leading block, then emit one child per colliding
/// pair: the XOR of the residual blocks (rotated into a fresh 8-word slot) plus /// pair: the XOR of the residual blocks (rotated into a fresh slot) plus a packed
/// a packed `(l << 32) | mr` back-reference. Two passes — count, then emit /// `(l << 32) | mr` back-reference. Two passes — count, then emit directly into
/// directly into one pre-sized arena — so there is no per-partition allocation /// one pre-sized arena — so there is no per-partition allocation or final
/// or final concatenation copy. Returns `(keys_out, slots_out, parents)`. /// concatenation copy. Returns `(keys_out, slots_out, parents)`.
fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Vec<u32>, Vec<u64>) { ///
let (starts, order) = partition_top(keys, n); /// `w_in` is the input slot pitch; the round consumes one block, so the output is
/// packed at `w_out = w_in - 1` words/slot. `slots_out` carries a `SLOT_SLACK`
/// trailing pad so the next round's 256-bit over-read stays in bounds.
fn collide(
keys: &[u32],
slots: &[u32],
n: usize,
clamp: usize,
w_in: usize,
) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
let w_out = w_in - 1;
// Sub-phase timing, gated on `EQ_PROFILE`. Prints partition / count / alloc /
// emit splits so we can see which part of the round dominates.
let prof = std::env::var_os("EQ_PROFILE").is_some();
// Software-prefetch the emit gather (on by default; `EQ_NO_PREFETCH` disables
// it for A/B benchmarking). Read once, not per group.
let do_prefetch = std::env::var_os("EQ_NO_PREFETCH").is_none();
let t0 = std::time::Instant::now();
// Pass 1: per-partition child counts (histogram-derived, no reordering). let (starts, order, keys_part) = partition_top(keys, n);
let t_part = std::time::Instant::now();
// Pass 1: per-partition child counts from each partition's contiguous keys.
let counts: Vec<usize> = (0..TOP_BUCKETS) let counts: Vec<usize> = (0..TOP_BUCKETS)
.into_par_iter() .into_par_iter()
.map_init( .map_init(
|| vec![0u32; LOW_BUCKETS], || vec![0u32; LOW_BUCKETS],
|hist, v| { |hist, v| {
let run = &order[starts[v] as usize..starts[v + 1] as usize]; let keys_run = &keys_part[starts[v] as usize..starts[v + 1] as usize];
count_pairs(keys, run, hist, clamp) count_pairs(keys_run, hist, clamp)
}, },
) )
.collect(); .collect();
let t_count = std::time::Instant::now();
let mut out_starts = vec![0usize; TOP_BUCKETS + 1]; let mut out_starts = vec![0usize; TOP_BUCKETS + 1];
for v in 0..TOP_BUCKETS { for v in 0..TOP_BUCKETS {
@@ -423,7 +589,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
} }
let total = out_starts[TOP_BUCKETS]; let total = out_starts[TOP_BUCKETS];
let mut keys_out = vec![0u32; total]; let mut keys_out = vec![0u32; total];
let mut slots_out = vec![0u32; total * SLOT]; // Packed at `w_out` words/slot, plus a slack pad for the next round's over-read.
let mut slots_out = vec![0u32; total * w_out + SLOT_SLACK];
let mut parents = vec![0u64; total]; let mut parents = vec![0u64; total];
// Carve the output arena into disjoint per-partition sub-slices so workers // Carve the output arena into disjoint per-partition sub-slices so workers
@@ -435,7 +602,7 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
let (mut kr, mut sr, mut pr) = (&mut keys_out[..], &mut slots_out[..], &mut parents[..]); let (mut kr, mut sr, mut pr) = (&mut keys_out[..], &mut slots_out[..], &mut parents[..]);
for &c in &counts { for &c in &counts {
let (kh, kt) = kr.split_at_mut(c); let (kh, kt) = kr.split_at_mut(c);
let (sh, st) = sr.split_at_mut(c * SLOT); let (sh, st) = sr.split_at_mut(c * w_out);
let (ph, pt) = pr.split_at_mut(c); let (ph, pt) = pr.split_at_mut(c);
kparts.push(kh); kparts.push(kh);
sparts.push(sh); sparts.push(sh);
@@ -446,6 +613,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
} }
} }
let t_alloc = std::time::Instant::now();
// Pass 2: group each partition and emit its colliding pairs in place. // Pass 2: group each partition and emit its colliding pairs in place.
kparts kparts
.into_par_iter() .into_par_iter()
@@ -453,37 +622,53 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
.zip(pparts) .zip(pparts)
.enumerate() .enumerate()
.for_each(|(v, ((kout, sout), pout))| { .for_each(|(v, ((kout, sout), pout))| {
let run = &order[starts[v] as usize..starts[v + 1] as usize]; let lo = starts[v] as usize;
let hi = starts[v + 1] as usize;
let mut hist = vec![0u32; LOW_BUCKETS + 1]; let mut hist = vec![0u32; LOW_BUCKETS + 1];
let mut sorted = Vec::new(); let mut sorted = Vec::new();
low_group(keys, run, &mut hist, &mut sorted); let mut keys_sorted = Vec::new();
let w = unsafe { emit_bucket(keys, slots, &sorted, kout, sout, pout, clamp) }; low_group(&keys_part[lo..hi], &order[lo..hi], &mut hist, &mut sorted, &mut keys_sorted);
let w = unsafe {
emit_bucket(&keys_sorted, slots, &sorted, kout, sout, pout, clamp, w_in, w_out, do_prefetch)
};
debug_assert_eq!(w, kout.len()); debug_assert_eq!(w, kout.len());
}); });
if prof {
let ms = |a: std::time::Instant, b: std::time::Instant| (b - a).as_secs_f64() * 1000.0;
eprintln!(
" [collide n={n}] partition {:6.1} count {:6.1} alloc {:6.1} emit(group+xor) {:6.1} ms",
ms(t0, t_part),
ms(t_part, t_count),
ms(t_count, t_alloc),
ms(t_alloc, std::time::Instant::now()),
);
}
(keys_out, slots_out, parents) (keys_out, slots_out, parents)
} }
/// Final round (slots hold `[w0, w1, …]`): among entries sharing leading block /// Final round (slots hold `[w0, w1, …]` at pitch `w_in`): among entries sharing
/// `w0`, a pair whose `w1` also matches XORs the last two blocks to zero — a /// leading block `w0`, a pair whose `w1` also matches XORs the last two blocks to
/// candidate. Returns the `(l, mr)` parents of each candidate. /// zero — a candidate. Returns the `(l, mr)` parents of each candidate.
fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u32, u32)> { fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize, w_in: usize) -> Vec<(u32, u32)> {
let (starts, order) = partition_top(keys, n); let (starts, order, keys_part) = partition_top(keys, n);
(0..TOP_BUCKETS) (0..TOP_BUCKETS)
.into_par_iter() .into_par_iter()
.map_init( .map_init(
|| (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new()), || (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new(), Vec::<u32>::new()),
|(hist, sorted), v| { |(hist, sorted, keys_sorted), v| {
let run = &order[starts[v] as usize..starts[v + 1] as usize]; let lo = starts[v] as usize;
low_group(keys, run, hist, sorted); let hi = starts[v + 1] as usize;
low_group(&keys_part[lo..hi], &order[lo..hi], hist, sorted, keys_sorted);
let m = sorted.len(); let m = sorted.len();
let mut local = Vec::new(); let mut local = Vec::new();
let mut i = 0; let mut i = 0;
while i < m { while i < m {
let key = keys[sorted[i] as usize] & LOW_MASK; let key = keys_sorted[i] & LOW_MASK;
let mut j = i + 1; let mut j = i + 1;
while j < m && (keys[sorted[j] as usize] & LOW_MASK) == key { while j < m && (keys_sorted[j] & LOW_MASK) == key {
j += 1; j += 1;
} }
let hi = j.min(i.saturating_add(clamp)); let hi = j.min(i.saturating_add(clamp));
@@ -491,7 +676,7 @@ fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u3
let l = sorted[a] as usize; let l = sorted[a] as usize;
for b in (a + 1)..hi { for b in (a + 1)..hi {
let mr = sorted[b] as usize; let mr = sorted[b] as usize;
if slots[l * SLOT + 1] == slots[mr * SLOT + 1] { if slots[l * w_in + 1] == slots[mr * w_in + 1] {
local.push((l as u32, mr as u32)); local.push((l as u32, mr as u32));
} }
} }
@@ -532,6 +717,22 @@ pub fn solve(header: &[u8]) -> Vec<Vec<u32>> {
pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> { pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
let clamp = clamp.unwrap_or(usize::MAX); let clamp = clamp.unwrap_or(usize::MAX);
// Optional per-phase timing, gated on `EQ_PROFILE` (any value). Zero cost
// when unset — the only overhead is one env lookup. Each `phase(label, n)`
// prints the wall time since the previous call and the live entry count.
let prof = std::env::var_os("EQ_PROFILE").is_some();
let mut t_last = std::time::Instant::now();
let mut phase = |label: &str, n: usize| {
if prof {
let now = std::time::Instant::now();
eprintln!(
" [profile] {label:<13} {:8.1} ms (n={n})",
(now - t_last).as_secs_f64() * 1000.0
);
t_last = now;
}
};
// ---- round 0: hash every index into NBLK0 big-endian 24-bit blocks, stored // ---- round 0: hash every index into NBLK0 big-endian 24-bit blocks, stored
// in padded 8-word slots with the leading block mirrored into `keys`. Entry // in padded 8-word slots with the leading block mirrored into `keys`. Entry
// k corresponds to leaf index k (the gen order), so no leaf table is needed. // k corresponds to leaf index k (the gen order), so no leaf table is needed.
@@ -541,18 +742,29 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
let hasher = BatchHasher::new(header); let hasher = BatchHasher::new(header);
let mut keys: Vec<u32> = vec![0u32; n0]; let mut keys: Vec<u32> = vec![0u32; n0];
let mut slots: Vec<u32> = vec![0u32; n0 * SLOT]; let mut slots: Vec<u32> = vec![0u32; n0 * SLOT];
let kgroup = 4 * INDICES_PER_HASH_OUTPUT; // eight entries // Hash eight `g` values per chunk via `hash8`, which uses one AVX-512
// compression where available and falls back to two AVX2 `hash4`s (or scalar)
// otherwise — so this single path covers every CPU.
let kgroup = 8 * INDICES_PER_HASH_OUTPUT; // sixteen entries
let sgroup = kgroup * SLOT; let sgroup = kgroup * SLOT;
debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 4-g groups"); debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 8-g groups");
// `EQ_NO_AVX512` forces the AVX2 fallback (two hash4) for A/B benchmarking.
let use_avx512 = std::env::var_os("EQ_NO_AVX512").is_none();
slots slots
.par_chunks_mut(sgroup) .par_chunks_mut(sgroup)
.zip(keys.par_chunks_mut(kgroup)) .zip(keys.par_chunks_mut(kgroup))
.enumerate() .enumerate()
.for_each(|(c, (schunk, kchunk))| { .for_each(|(c, (schunk, kchunk))| {
let g0 = (c * 4) as u32; let g0 = (c * 8) as u32;
let mut hs = [[0u8; HASH_OUTPUT]; 4]; let mut hs = [[0u8; HASH_OUTPUT]; 8];
hasher.hash4(g0, &mut hs); if use_avx512 {
for j in 0..4 { hasher.hash8(g0, &mut hs);
} else {
let (lo, hi) = hs.split_at_mut(4);
hasher.hash4(g0, (&mut lo[..4]).try_into().unwrap());
hasher.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
}
for j in 0..8 {
for i in 0..INDICES_PER_HASH_OUTPUT { for i in 0..INDICES_PER_HASH_OUTPUT {
let e = j * INDICES_PER_HASH_OUTPUT + i; let e = j * INDICES_PER_HASH_OUTPUT + i;
let src = &hs[j][i * HASH_BYTES..i * HASH_BYTES + HASH_BYTES]; let src = &hs[j][i * HASH_BYTES..i * HASH_BYTES + HASH_BYTES];
@@ -567,20 +779,27 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
// parents[t] (t = 0..K-2) maps a round-(t+1) entry to its two round-t parents. // parents[t] (t = 0..K-2) maps a round-(t+1) entry to its two round-t parents.
// `keys`/`slots` ping-pong between rounds (the previous buffers are freed as // `keys`/`slots` ping-pong between rounds (the previous buffers are freed as
// the new ones replace them). // the new ones replace them).
phase("round0-hash", n0);
let mut parents: Vec<Vec<u64>> = Vec::with_capacity(K - 1); let mut parents: Vec<Vec<u64>> = Vec::with_capacity(K - 1);
let mut n = n0; let mut n = n0;
for _ in 0..(K - 1) { // Round-0 slots carry all SLOT words; each collision round consumes the
let (ok, os, op) = collide(&keys, &slots, n, clamp); // leading block, so the next round's slots are one word narrower.
let mut width = SLOT;
for r in 0..(K - 1) {
let (ok, os, op) = collide(&keys, &slots, n, clamp, width);
n = op.len(); n = op.len();
parents.push(op); parents.push(op);
keys = ok; keys = ok;
slots = os; slots = os;
width -= 1;
phase(&format!("collide r{}", r + 1), n);
if n == 0 { if n == 0 {
return Vec::new(); return Vec::new();
} }
} }
let candidates = collide_final(&keys, &slots, n, clamp); let candidates = collide_final(&keys, &slots, n, clamp, width);
phase("collide-final", candidates.len());
if candidates.is_empty() { if candidates.is_empty() {
return Vec::new(); return Vec::new();
} }
@@ -605,8 +824,11 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
// refs are now round-0 indices == leaf indices. // refs are now round-0 indices == leaf indices.
recovered.extend_from_slice(&refs); recovered.extend_from_slice(&refs);
} }
phase("recover", recovered.len() / SOLUTION_INDICES);
filter_candidates(&base_state(header), &recovered) let result = filter_candidates(&base_state(header), &recovered);
phase("filter+verify", result.len());
result
} }
/// Re-order recovered leaf indices into the canonical solution ordering: at /// Re-order recovered leaf indices into the canonical solution ordering: at
@@ -741,26 +963,34 @@ mod tests {
assert_eq!(a[0], (src[0] as u32) << 16 | (src[1] as u32) << 8 | src[2] as u32); assert_eq!(a[0], (src[0] as u32) << 16 | (src[1] as u32) << 8 | src[2] as u32);
} }
// AVX2 XOR-child (xor + rotate-left-one-lane) must match the scalar version. // The XOR-child producers (scalar + AVX2) must agree for every output width,
// write exactly `w_out` words (rotate-left-one of the lane-wise XOR), and
// leave the rest of the packed buffer untouched (the AVX2 path masked-stores).
#[test] #[test]
fn xor_child_matches_scalar() { fn xor_child_matches_scalar() {
const SENT: u32 = 0xDEAD_BEEF;
let pa: [u32; SLOT] = [9, 8, 7, 6, 5, 4, 3, 2]; let pa: [u32; SLOT] = [9, 8, 7, 6, 5, 4, 3, 2];
let pb: [u32; SLOT] = [1, 2, 3, 4, 5, 6, 7, 8]; let pb: [u32; SLOT] = [1, 2, 3, 4, 5, 6, 7, 8];
let mut o1 = [0u32; SLOT];
let mut o2 = [0u32; SLOT];
let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
// Reference: rotate-left-one of the lane-wise XOR; new key = lane 1.
let x: Vec<u32> = (0..SLOT).map(|i| pa[i] ^ pb[i]).collect(); let x: Vec<u32> = (0..SLOT).map(|i| pa[i] ^ pb[i]).collect();
let expect: Vec<u32> = (0..SLOT).map(|i| x[(i + 1) % SLOT]).collect(); for w_out in 1..SLOT {
assert_eq!(&o1[..], &expect[..]); let mut o1 = [SENT; SLOT];
let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
for t in 0..w_out {
assert_eq!(o1[t], x[(t + 1) % SLOT], "scalar word {t} (w_out={w_out})");
}
for t in w_out..SLOT {
assert_eq!(o1[t], SENT, "scalar wrote past w_out at {t} (w_out={w_out})");
}
assert_eq!(k1, x[1]); assert_eq!(k1, x[1]);
#[cfg(target_arch = "x86_64")] #[cfg(target_arch = "x86_64")]
if is_x86_feature_detected!("avx2") { if is_x86_feature_detected!("avx2") {
let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) }; let mut o2 = [SENT; SLOT];
assert_eq!(o1, o2, "avx2 xor_child != scalar"); let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
assert_eq!(o1, o2, "avx2 xor_child != scalar (w_out={w_out})");
assert_eq!(k1, k2); assert_eq!(k1, k2);
} }
} }
}
// Clamped solve over a fixed header must find valid solutions (Equihash // Clamped solve over a fixed header must find valid solutions (Equihash
// 192,7 yields ~2 per nonce). The clamp bounds the naive algorithm's // 192,7 yields ~2 per nonce). The clamp bounds the naive algorithm's
@@ -780,4 +1010,97 @@ mod tests {
} }
} }
} }
/// A deterministic dense 140-byte header for benchmarking, filled by an
/// xorshift64* stream keyed on `seed` (the same scheme as `main::pseudo_header`)
/// so each nonce exercises a realistic, fully-populated header rather than the
/// degenerate all-`0x42` one.
#[cfg(test)]
fn bench_header(seed: u64) -> Vec<u8> {
let mut header = vec![0u8; HEADER_LEN];
let mut x = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
for b in header.iter_mut() {
x ^= x >> 12;
x ^= x << 25;
x ^= x >> 27;
*b = (x.wrapping_mul(0x2545F4914F6CDD1D) >> 33) as u8;
}
header
}
/// Steady-state throughput baseline for the CPU solver. Ignored by default;
/// run it in release (the only meaningful configuration — `debug_assert`s in
/// the collision hot path are off and `opt-level=3`/LTO is on) with output
/// visible:
///
/// ```text
/// cargo test --release full_solve_baseline -- --ignored --nocapture
/// ```
///
/// Tunables via env:
/// * `EQ_BENCH_ITERS` — timed solves per clamp (default 8).
/// * `EQ_BENCH_CLAMPS` — comma-separated clamp sweep (default `32`, the
/// value the miner and `--selftest` use).
///
/// Each timed iteration uses a distinct nonce so the figures average over
/// per-header variance; one warm-up solve per clamp (pages in the ~GB of
/// round buffers, warms caches) is excluded. This measures one solve at a
/// time across the whole rayon pool — i.e. solver latency and its memory
/// behaviour, which is exactly what the planned radix-scatter change targets
/// — not the aggregate multi-group mining rate `--bench` reports.
#[test]
#[ignore]
fn full_solve_baseline() {
use std::time::Instant;
let iters: usize = std::env::var("EQ_BENCH_ITERS")
.ok()
.and_then(|v| v.parse().ok())
.unwrap_or(8);
let clamps: Vec<usize> = std::env::var("EQ_BENCH_CLAMPS")
.ok()
.map(|v| v.split(',').filter_map(|s| s.trim().parse().ok()).collect())
.filter(|v: &Vec<usize>| !v.is_empty())
.unwrap_or_else(|| vec![32]);
eprintln!(
"CPU solver baseline — {iters} timed solve(s)/clamp, clamps={clamps:?}, \
rayon threads={}",
rayon::current_num_threads()
);
for &clamp in &clamps {
// Warm-up (excluded): fault in buffers, warm caches.
let _ = solve_with(&bench_header(0), Some(clamp));
let mut times = Vec::with_capacity(iters);
let mut total_sols = 0usize;
for i in 0..iters {
let header = bench_header(i as u64 + 1);
let base = base_state(&header);
let t = Instant::now();
let sols = solve_with(&header, Some(clamp));
times.push(t.elapsed().as_secs_f64());
for s in &sols {
assert!(is_valid_solution(&base, s), "solver returned an invalid solution");
}
total_sols += sols.len();
}
times.sort_by(|a, b| a.partial_cmp(b).unwrap());
let n = times.len();
let sum: f64 = times.iter().sum();
let mean = 1000.0 * sum / n as f64;
let median = 1000.0 * times[n / 2];
let min = 1000.0 * times[0];
let max = 1000.0 * times[n - 1];
let solves_per_s = n as f64 / sum;
eprintln!(
"clamp={clamp:>3}: {solves_per_s:6.2} solve/s | mean {mean:6.0} ms \
median {median:6.0} ms min {min:6.0} ms max {max:6.0} ms | \
{total_sols} sol over {n} solve(s) ({:.2}/solve)",
total_sols as f64 / n as f64,
);
}
}
} }
+148 -50
View File
@@ -411,6 +411,9 @@ impl LegacySolver {
/// (`equihash.cl`) everywhere else. Forceable with `ZCL_OPENCL_KERNEL=amd|legacy`. /// (`equihash.cl`) everywhere else. Forceable with `ZCL_OPENCL_KERNEL=amd|legacy`.
pub struct GpuSolver { pub struct GpuSolver {
inner: SolverInner, inner: SolverInner,
/// Per-card telemetry handle (AMD amdgpu sysfs / NVML), matched to the device
/// by PCI bus id. `None` when no telemetry backend matches.
tuner: Option<Box<dyn crate::gpu_tune::GpuTuner>>,
} }
enum SolverInner { enum SolverInner {
@@ -423,13 +426,16 @@ impl GpuSolver {
/// device vendor (AMD → `equihash192_7.cl`). /// device vendor (AMD → `equihash192_7.cl`).
pub fn new(device_index: usize) -> Result<Self> { pub fn new(device_index: usize) -> Result<Self> {
let (platform, device) = pick_device(device_index)?; let (platform, device) = pick_device(device_index)?;
// Resolve a telemetry handle (AMD sysfs / NVML) from the device's PCI bus
// before `device` is consumed by the inner solver.
let tuner = device_pci_bus_id(&device).and_then(|bus| crate::gpu_tune::open(&bus));
let inner = if use_amd_kernel(&device) { let inner = if use_amd_kernel(&device) {
log::info!("OpenCL: AMD device — using the equihash192_7 kernel"); log::info!("OpenCL: AMD device — using the equihash192_7 kernel");
SolverInner::Amd(crate::gpu_amd::AmdSolver::new(platform, device)?) SolverInner::Amd(crate::gpu_amd::AmdSolver::new(platform, device)?)
} else { } else {
SolverInner::Legacy(LegacySolver::new(platform, device)?) SolverInner::Legacy(LegacySolver::new(platform, device)?)
}; };
Ok(Self { inner }) Ok(Self { inner, tuner })
} }
/// This device's product name, if available. /// This device's product name, if available.
@@ -440,6 +446,34 @@ impl GpuSolver {
} }
} }
/// Current board power draw in watts (telemetry handle), if available.
pub fn power_watts(&self) -> Option<f64> {
self.tuner.as_ref().and_then(|t| t.watts())
}
/// Current GPU temperature in °C, if available.
pub fn temperature_c(&self) -> Option<u32> {
self.tuner.as_ref().and_then(|t| t.temperature_c())
}
/// Currently enforced power limit in watts, if available.
pub fn current_power_limit_w(&self) -> Option<u32> {
self.tuner.as_ref().and_then(|t| t.current_power_limit_w())
}
/// (min, max) settable power limit in watts, if available.
pub fn power_limit_range_w(&self) -> Option<(u32, u32)> {
self.tuner.as_ref().and_then(|t| t.power_limit_range_w())
}
/// Current (core, memory) clocks in MHz, each `None` if unavailable.
pub fn current_clocks_mhz(&self) -> (Option<u32>, Option<u32>) {
match &self.tuner {
Some(t) => (t.core_clock_mhz(), t.mem_clock_mhz()),
None => (None, None),
}
}
/// Solve the puzzle for `header` (140 bytes). /// Solve the puzzle for `header` (140 bytes).
pub fn solve(&self, header: &[u8]) -> Result<Vec<Vec<u32>>> { pub fn solve(&self, header: &[u8]) -> Result<Vec<Vec<u32>>> {
match &self.inner { match &self.inner {
@@ -475,11 +509,32 @@ impl GpuSolver {
pub fn hash_all(&self, header: &[u8]) -> Result<Vec<u8>> { pub fn hash_all(&self, header: &[u8]) -> Result<Vec<u8>> {
match &self.inner { match &self.inner {
SolverInner::Legacy(s) => s.hash_all(header), SolverInner::Legacy(s) => s.hash_all(header),
SolverInner::Amd(_) => { SolverInner::Amd(_) => Err(anyhow!("hash_all is not supported by the AMD kernel")),
Err(anyhow!("hash_all is not supported by the AMD kernel"))
} }
} }
} }
/// The device's PCI address as `"DDDD:BB:DD.F"` (lowercase), for matching the
/// physical card to a telemetry backend. Prefers `cl_khr_pci_bus_info`; falls
/// back to `cl_device_topology_amd`. `None` if the device exposes neither.
fn device_pci_bus_id(device: &ocl::Device) -> Option<String> {
const CL_DEVICE_PCI_BUS_INFO_KHR: u32 = 0x10F2;
const CL_DEVICE_TOPOLOGY_AMD: u32 = 0x4037;
// cl_device_pci_bus_info_khr = { u32 pci_domain, pci_bus, pci_device, pci_function }.
if let Ok(b) = device.info_raw(CL_DEVICE_PCI_BUS_INFO_KHR) {
if b.len() >= 16 {
let rd = |i: usize| u32::from_ne_bytes([b[i], b[i + 1], b[i + 2], b[i + 3]]);
return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", rd(0), rd(4), rd(8), rd(12)));
}
}
// cl_device_topology_amd (PCIE branch): 24-byte struct, bus/device/function
// are the last three bytes; domain isn't exposed (assume 0000).
if let Ok(b) = device.info_raw(CL_DEVICE_TOPOLOGY_AMD) {
if b.len() >= 24 {
return Some(format!("0000:{:02x}:{:02x}.{:x}", b[21], b[22], b[23]));
}
}
None
} }
/// Decide whether to drive `device` with the AMD `equihash192_7.cl` kernel. /// Decide whether to drive `device` with the AMD `equihash192_7.cl` kernel.
@@ -501,19 +556,81 @@ fn use_amd_kernel(device: &ocl::Device) -> bool {
} }
} }
/// A cross-platform key identifying the physical GPU, as a canonical PCI address
/// `"DDDD:BB:DD.F"`. Vendor runtimes (ROCm/NVIDIA) expose `cl_khr_pci_bus_info`;
/// rusticl/Mesa doesn't, but its `cl_khr_device_uuid` *encodes* the PCI address
/// ({u32 domain LE, u8 bus, u8 device, u8 function, ...}), so the same physical
/// card yields the same key on both platforms. `None` if neither is available
/// (then the device is never deduped — safe).
fn device_dedup_key(device: &ocl::Device) -> Option<String> {
if let Some(pci) = device_pci_bus_id(device) {
return Some(pci);
}
const CL_DEVICE_UUID_KHR: u32 = 0x106A;
if let Ok(b) = device.info_raw(CL_DEVICE_UUID_KHR) {
if b.len() >= 7 {
let domain = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", domain, b[4], b[5], b[6]));
}
}
None
}
/// Lower = preferred when the same physical GPU is exposed by multiple OpenCL
/// platforms. De-prioritise the Mesa Gallium drivers (rusticl/clover) relative to
/// the vendor runtimes (ROCm / NVIDIA / Intel), which are faster and complete.
fn platform_rank(p: &ocl::Platform) -> u8 {
let name = p.name().unwrap_or_default().to_ascii_lowercase();
if name.contains("rusticl") || name.contains("clover") || name.contains("mesa") {
1
} else {
0
}
}
/// All usable OpenCL `(platform, device)` pairs in a stable flat order, with each
/// physical GPU de-duplicated across platforms by PCI bus id — a card exposed by
/// both ROCm and rusticl appears once (the vendor runtime wins over Mesa), so
/// mining "all" doesn't run the same card twice. This is the single source of
/// truth for the flat device index used by `--devices`, `--list-devices`, and
/// [`pick_device`]. Devices without a PCI bus id (CPU / PoCL) are never deduped.
fn enumerate_devices() -> Vec<(ocl::Platform, ocl::Device)> {
use ocl::{Device, Platform};
let mut out: Vec<(Platform, Device)> = Vec::new();
let mut by_pci: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
for platform in Platform::list() {
for device in Device::list_all(platform).unwrap_or_default() {
match device_dedup_key(&device) {
Some(pci) => match by_pci.get(&pci).copied() {
// Same physical GPU already listed: keep the preferred platform.
Some(existing) => {
if platform_rank(&platform) < platform_rank(&out[existing].0) {
out[existing] = (platform, device);
}
}
None => {
by_pci.insert(pci, out.len());
out.push((platform, device));
}
},
None => out.push((platform, device)), // no PCI id → can't dedup
}
}
}
out
}
/// List `(platform, device)` names so the user can choose `--device`. /// List `(platform, device)` names so the user can choose `--device`.
pub fn list_devices() -> Result<Vec<String>> { pub fn list_devices() -> Result<Vec<String>> {
use ocl::{Device, Platform}; let names = enumerate_devices()
let mut names = Vec::new(); .into_iter()
let mut idx = 0; .enumerate()
for platform in Platform::list() { .map(|(idx, (platform, device))| {
let pname = platform.name().unwrap_or_else(|_| "?".into()); let pname = platform.name().unwrap_or_else(|_| "?".into());
for device in Device::list_all(platform).unwrap_or_default() {
let dname = device.name().unwrap_or_else(|_| "?".into()); let dname = device.name().unwrap_or_else(|_| "?".into());
names.push(format!("[{idx}] {pname} / {dname}")); format!("[{idx}] {pname} / {dname}")
idx += 1; })
} .collect();
}
Ok(names) Ok(names)
} }
@@ -522,18 +639,15 @@ pub fn list_devices() -> Result<Vec<String>> {
/// hand NVIDIA cards to CUDA (and mine only the non-NVIDIA OpenCL devices). /// hand NVIDIA cards to CUDA (and mine only the non-NVIDIA OpenCL devices).
pub fn device_is_nvidia() -> Vec<bool> { pub fn device_is_nvidia() -> Vec<bool> {
use ocl::enums::{DeviceInfo, DeviceInfoResult}; use ocl::enums::{DeviceInfo, DeviceInfoResult};
use ocl::{Device, Platform}; enumerate_devices()
let mut out = Vec::new(); .into_iter()
for platform in Platform::list() { .map(|(_, device)| {
for device in Device::list_all(platform).unwrap_or_default() { matches!(
let is_nv = matches!(
device.info(DeviceInfo::Vendor), device.info(DeviceInfo::Vendor),
Ok(DeviceInfoResult::Vendor(v)) if v.to_ascii_lowercase().contains("nvidia") Ok(DeviceInfoResult::Vendor(v)) if v.to_ascii_lowercase().contains("nvidia")
); )
out.push(is_nv); })
} .collect()
}
out
} }
/// The flat OpenCL device index of the first CPU-type device (e.g. PoCL), if any. /// The flat OpenCL device index of the first CPU-type device (e.g. PoCL), if any.
@@ -541,36 +655,20 @@ pub fn device_is_nvidia() -> Vec<bool> {
/// [`list_devices`] / `--devices`. /// [`list_devices`] / `--devices`.
pub fn cpu_device_index() -> Option<usize> { pub fn cpu_device_index() -> Option<usize> {
use ocl::enums::{DeviceInfo, DeviceInfoResult}; use ocl::enums::{DeviceInfo, DeviceInfoResult};
use ocl::{Device, Platform}; enumerate_devices().into_iter().position(|(_, device)| {
let mut idx = 0; matches!(
for platform in Platform::list() {
for device in Device::list_all(platform).unwrap_or_default() {
let is_cpu = matches!(
device.info(DeviceInfo::Type).ok(), device.info(DeviceInfo::Type).ok(),
Some(DeviceInfoResult::Type(t)) if t.contains(ocl::flags::DeviceType::CPU) Some(DeviceInfoResult::Type(t)) if t.contains(ocl::flags::DeviceType::CPU)
); )
if is_cpu { })
return Some(idx);
}
idx += 1;
}
}
None
} }
/// Resolve a flat device index across all platforms, returning the device along /// Resolve a flat device index (into the de-duplicated [`enumerate_devices`]
/// with the platform it belongs to (needed to build the context against the /// list), returning the device along with the platform it belongs to (needed to
/// right platform). /// build the context against the right platform).
fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> { pub(crate) fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
use ocl::{Device, Platform}; enumerate_devices()
let mut idx = 0; .into_iter()
for platform in Platform::list() { .nth(index)
for device in Device::list_all(platform).unwrap_or_default() { .ok_or_else(|| anyhow!("no OpenCL device with index {index}"))
if idx == index {
return Ok((platform, device));
}
idx += 1;
}
}
Err(anyhow!("no OpenCL device with index {index}"))
} }
+12 -6
View File
@@ -128,19 +128,25 @@ pub trait GpuTuner: Send {
/// Open a control handle for the GPU at `pci_bus_id` (matches the physical card /// Open a control handle for the GPU at `pci_bus_id` (matches the physical card
/// regardless of CUDA-vs-driver index ordering). `None` if unavailable. /// regardless of CUDA-vs-driver index ordering). `None` if unavailable.
/// ///
/// NVML is the backend on both Linux (`libnvidia-ml`) and Windows (`nvml.dll`); /// Tries the NVIDIA backend first (NVML, `libnvidia-ml`/`nvml.dll`), then the AMD
/// the C API is identical, so the same [`crate::nvml`] code serves both. /// backend ([`crate::amd_smi`], Linux amdgpu sysfs). A non-matching bus id makes
/// each backend return `None`, so probing both is safe on mixed-vendor hosts.
pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> { pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
#[cfg(any(unix, windows))] #[cfg(feature = "cuda")]
{ {
crate::nvml::open(pci_bus_id) if let Some(t) = crate::nvml::open(pci_bus_id) {
return Some(t);
} }
#[cfg(not(any(unix, windows)))] }
#[cfg(feature = "gpu")]
{ {
if let Some(t) = crate::amd_smi::open(pci_bus_id) {
return Some(t);
}
}
let _ = pci_bus_id; let _ = pci_bus_id;
None None
} }
}
static WARNED_PRIVS: AtomicBool = AtomicBool::new(false); static WARNED_PRIVS: AtomicBool = AtomicBool::new(false);
+64 -12
View File
@@ -18,6 +18,10 @@ mod gpu;
#[cfg(feature = "gpu")] #[cfg(feature = "gpu")]
mod gpu_amd; mod gpu_amd;
// AMD GPU telemetry via Linux amdgpu sysfs (a `gpu_tune::GpuTuner` backend).
#[cfg(feature = "gpu")]
mod amd_smi;
// Runtime dynamic-library loader (dlopen) for the CUDA driver + NVML. // Runtime dynamic-library loader (dlopen) for the CUDA driver + NVML.
#[cfg(feature = "cuda")] #[cfg(feature = "cuda")]
mod dylib; mod dylib;
@@ -28,7 +32,9 @@ mod cuda;
#[cfg(feature = "cuda")] #[cfg(feature = "cuda")]
mod nvml; mod nvml;
#[cfg(feature = "cuda")] // Platform-agnostic GPU tuning/telemetry surface. The trait + policy compile for
// either GPU backend; NVML (cuda) and amd_smi (gpu) are the implementations.
#[cfg(any(feature = "cuda", feature = "gpu"))]
mod gpu_tune; mod gpu_tune;
use std::io::IsTerminal; use std::io::IsTerminal;
@@ -125,11 +131,12 @@ struct Args {
#[arg(long, value_name = "SPEC")] #[arg(long, value_name = "SPEC")]
cpu_cores: Option<String>, cpu_cores: Option<String>,
/// Cores per CPU mining row (default 4). Each row runs one shared solve /// Cores per CPU mining row. Each row runs one shared solve across its
/// across its cores; larger groups cut memory sharply: total RAM is ~4 GB × /// cores; larger groups cut memory sharply: total RAM is ~4 GB × (enabled
/// (enabled cores / this size). Use 1 for one row (and one solve) per core. /// cores / this size). Rows align to core-index blocks of this size. Capped
/// Rows are aligned to core-index blocks of this size, so a row never /// by core count so the row count stays manageable — ≤4 cores toggle
/// straddles a boundary. Cycle it live in the dashboard with 'g'. /// individually (1), 5-8 cores in groups of ≤2, more than 8 in groups of ≤4
/// — and the default is that cap. Cycle it live (within the cap) with 'g'.
#[arg(long, value_name = "N", default_value_t = 4)] #[arg(long, value_name = "N", default_value_t = 4)]
cpu_group_size: usize, cpu_group_size: usize,
@@ -203,6 +210,13 @@ struct Args {
#[arg(long)] #[arg(long)]
auto_tune: bool, auto_tune: bool,
/// Sustained-Sol/s governor: hold each GPU at/below this edge temperature (°C)
/// by pacing the solve cadence (no hardware writes, no root). Trades a little
/// throughput for lower temp/power; off by default (runs flat-out). Needs a
/// backend that reports temperature (AMD amdgpu / NVIDIA).
#[arg(long, value_name = "CELSIUS")]
target_temp: Option<u32>,
/// Efficiency: cap each GPU's power limit in watts (default: card max). /// Efficiency: cap each GPU's power limit in watts (default: card max).
/// Lower power trades a little hashrate for much better Sol/W. /// Lower power trades a little hashrate for much better Sol/W.
#[arg(long, value_name = "WATTS")] #[arg(long, value_name = "WATTS")]
@@ -619,6 +633,8 @@ fn main() -> Result<()> {
args.power_limit.unwrap_or(0), args.power_limit.unwrap_or(0),
args.unlock_controls, args.unlock_controls,
); );
// Software temp governor target (paces solve cadence; no hardware writes).
miner::set_target_temp(args.target_temp);
miner::run(client, specs, running, job_timeout, tui, format!("{host}:{port}"), controls, cpu_mining, cpu_clamp, args.control_port) miner::run(client, specs, running, job_timeout, tui, format!("{host}:{port}"), controls, cpu_mining, cpu_clamp, args.control_port)
} }
@@ -1076,10 +1092,21 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
use std::time::Instant; use std::time::Instant;
info!("benchmarking {runs} solve(s) per worker across {} worker(s)", specs.len()); info!("benchmarking {runs} solve(s) per worker across {} worker(s)", specs.len());
/// Per-worker benchmark result, including a steady-state telemetry snapshot
/// (sampled right after the timed loop, while the card is warm).
struct WorkerResult {
sols: usize,
dt: f64,
watts: Option<f64>,
temp_c: Option<u32>,
core_mhz: Option<u32>,
mem_mhz: Option<u32>,
}
let start = Instant::now(); let start = Instant::now();
let mut handles = Vec::new(); let mut handles = Vec::new();
for (id, spec) in specs.into_iter().enumerate() { for (id, spec) in specs.into_iter().enumerate() {
handles.push(std::thread::spawn(move || -> Result<(usize, f64)> { handles.push(std::thread::spawn(move || -> Result<WorkerResult> {
let backend = spec.build()?; let backend = spec.build()?;
backend.solve(&pseudo_header(id as u64))?; // warm up (excluded) backend.solve(&pseudo_header(id as u64))?; // warm up (excluded)
let t = Instant::now(); let t = Instant::now();
@@ -1089,7 +1116,17 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
let seed = ((id as u64) << 40) | (i as u64 + 1); let seed = ((id as u64) << 40) | (i as u64 + 1);
sols += backend.solve(&pseudo_header(seed))?.len(); sols += backend.solve(&pseudo_header(seed))?.len();
} }
Ok((sols, t.elapsed().as_secs_f64())) let dt = t.elapsed().as_secs_f64();
// Snapshot telemetry while the card is still under load.
let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
Ok(WorkerResult {
sols,
dt,
watts: backend.power_watts(),
temp_c: backend.temperature_c(),
core_mhz,
mem_mhz,
})
})); }));
} }
@@ -1098,11 +1135,26 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
let mut workers = 0usize; let mut workers = 0usize;
for h in handles { for h in handles {
match h.join().unwrap() { match h.join().unwrap() {
Ok((sols, dt)) => { Ok(r) => {
let sol_s = sols as f64 / dt; let sol_s = r.sols as f64 / r.dt;
// Optional telemetry tail: " | 142 W, 41.7 Sol/W, 68°C, 2700/2500 MHz".
let mut tail = String::new();
if let Some(w) = r.watts {
tail.push_str(&format!(" | {w:.0} W"));
if w > 0.0 {
tail.push_str(&format!(", {:.2} Sol/W", sol_s / w));
}
}
if let Some(t) = r.temp_c {
tail.push_str(&format!(", {t}°C"));
}
if let (Some(c), m) = (r.core_mhz, r.mem_mhz) {
tail.push_str(&format!(", {c}/{} MHz", m.map(|m| m.to_string()).unwrap_or_else(|| "?".into())));
}
info!( info!(
" worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {sols} solutions", " worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {} solutions{tail}",
1000.0 * dt / runs as f64 1000.0 * r.dt / runs as f64,
r.sols
); );
agg_sols += sol_s; agg_sols += sol_s;
workers += 1; workers += 1;
+105
View File
@@ -14,6 +14,22 @@ use crate::equihash;
use crate::params::{HEADER_LEN, SOLUTION_BYTES}; use crate::params::{HEADER_LEN, SOLUTION_BYTES};
use crate::stratum::{StratumClient, Work}; use crate::stratum::{StratumClient, Work};
/// Process-wide target edge temperature (°C) for the software solve-cadence
/// governor; `None` ⇒ run flat-out. Set once at startup from `--target-temp`.
static TARGET_TEMP_C: OnceLock<Option<u32>> = OnceLock::new();
/// Install the governor's target temperature (call once, before workers start).
pub fn set_target_temp(c: Option<u32>) {
if let Some(t) = c {
info!("temperature governor enabled: holding GPUs ≤{t}°C (paced cadence)");
}
let _ = TARGET_TEMP_C.set(c);
}
fn target_temp_c() -> Option<u32> {
TARGET_TEMP_C.get().copied().flatten()
}
/// Double SHA-256, as used for the Zcash/ZClassic block PoW hash. /// Double SHA-256, as used for the Zcash/ZClassic block PoW hash.
fn sha256d(data: &[u8]) -> [u8; 32] { fn sha256d(data: &[u8]) -> [u8; 32] {
let first = Sha256::digest(data); let first = Sha256::digest(data);
@@ -136,6 +152,8 @@ impl Backend {
match self { match self {
#[cfg(feature = "cuda")] #[cfg(feature = "cuda")]
Backend::Cuda(solver) => solver.power_watts(), Backend::Cuda(solver) => solver.power_watts(),
#[cfg(feature = "gpu")]
Backend::Gpu(solver) => solver.power_watts(),
_ => None, _ => None,
} }
} }
@@ -145,6 +163,8 @@ impl Backend {
match self { match self {
#[cfg(feature = "cuda")] #[cfg(feature = "cuda")]
Backend::Cuda(solver) => solver.temperature_c(), Backend::Cuda(solver) => solver.temperature_c(),
#[cfg(feature = "gpu")]
Backend::Gpu(solver) => solver.temperature_c(),
_ => None, _ => None,
} }
} }
@@ -154,6 +174,8 @@ impl Backend {
match self { match self {
#[cfg(feature = "cuda")] #[cfg(feature = "cuda")]
Backend::Cuda(solver) => solver.current_power_limit_w(), Backend::Cuda(solver) => solver.current_power_limit_w(),
#[cfg(feature = "gpu")]
Backend::Gpu(solver) => solver.current_power_limit_w(),
_ => None, _ => None,
} }
} }
@@ -163,6 +185,8 @@ impl Backend {
match self { match self {
#[cfg(feature = "cuda")] #[cfg(feature = "cuda")]
Backend::Cuda(solver) => solver.power_limit_range_w(), Backend::Cuda(solver) => solver.power_limit_range_w(),
#[cfg(feature = "gpu")]
Backend::Gpu(solver) => solver.power_limit_range_w(),
_ => None, _ => None,
} }
} }
@@ -190,6 +214,8 @@ impl Backend {
match self { match self {
#[cfg(feature = "cuda")] #[cfg(feature = "cuda")]
Backend::Cuda(solver) => solver.current_clocks_mhz(), Backend::Cuda(solver) => solver.current_clocks_mhz(),
#[cfg(feature = "gpu")]
Backend::Gpu(solver) => solver.current_clocks_mhz(),
_ => (None, None), _ => (None, None),
} }
} }
@@ -513,6 +539,14 @@ fn worker(
let mut last_job = Instant::now(); let mut last_job = Instant::now();
let mut paused = false; let mut paused = false;
let mut disabled_pause = false; let mut disabled_pause = false;
// Software temperature governor: pace the solve cadence to hold edge temp at
// or below `--target-temp` (no hardware writes). `gov_sleep` is the per-pass
// pause, nudged from the periodic temperature sample.
let gov_target = target_temp_c();
let mut gov_sleep = Duration::ZERO;
if let Some(target) = gov_target {
info!("worker {id}: temperature governor active — pacing cadence to hold ≤{target}°C");
}
while running.load(Ordering::Relaxed) { while running.load(Ordering::Relaxed) {
if work_handle.epoch() != current.epoch { if work_handle.epoch() != current.epoch {
@@ -583,6 +617,9 @@ fn worker(
} }
if let Some(t) = backend.temperature_c() { if let Some(t) = backend.temperature_c() {
stats.workers[id].temp_c.store(t, Ordering::Relaxed); stats.workers[id].temp_c.store(t, Ordering::Relaxed);
if let Some(target) = gov_target {
gov_sleep = govern_cadence(gov_sleep, t, target);
}
} }
let (core_mhz, mem_mhz) = backend.current_clocks_mhz(); let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
if let Some(c) = core_mhz { if let Some(c) = core_mhz {
@@ -614,12 +651,34 @@ fn worker(
let ctx = inflight.pop_front().unwrap(); let ctx = inflight.pop_front().unwrap();
process_results(id, &client, &ctx, &solutions, &stats); process_results(id, &client, &ctx, &solutions, &stats);
} }
// Temperature governor: pace the cadence (held off entirely when flat-out).
if !gov_sleep.is_zero() {
std::thread::sleep(gov_sleep);
}
} }
drain_pipeline(id, &mut backend, &mut inflight, &client, &stats)?; drain_pipeline(id, &mut backend, &mut inflight, &client, &stats)?;
Ok(()) Ok(())
} }
/// Nudge the per-pass governor pause to hold edge temperature near `target` (°C).
/// Over target → lengthen the pause (proportional to the overshoot); comfortably
/// under → shorten it. Called from the periodic temperature sample (~1 Hz), so
/// this is a slow integral controller — fine for the card's thermal time constant.
fn govern_cadence(cur: Duration, temp: u32, target: u32) -> Duration {
const MAX_SLEEP: Duration = Duration::from_millis(100);
let (t, target) = (temp as i64, target as i64);
if t > target {
cur.saturating_add(Duration::from_micros(400 * (t - target) as u64))
.min(MAX_SLEEP)
} else if t < target - 1 {
cur.saturating_sub(Duration::from_millis(1))
} else {
cur
}
}
/// Supervise CPU mining: spawn one worker per group for the current group size, /// Supervise CPU mining: spawn one worker per group for the current group size,
/// and whenever the dashboard cycles the size, stop those workers (`gen_running`), /// and whenever the dashboard cycles the size, stop those workers (`gen_running`),
/// rebuild the grouping, and respawn. Runs until `running` is cleared. /// rebuild the grouping, and respawn. Runs until `running` is cleared.
@@ -843,3 +902,49 @@ fn build_nonce(nonce1: &[u8], counter: u64) -> Result<[u8; 32]> {
tail[..n].copy_from_slice(&counter.to_le_bytes()[..n]); tail[..n].copy_from_slice(&counter.to_le_bytes()[..n]);
Ok(nonce) Ok(nonce)
} }
#[cfg(test)]
mod governor_tests {
use super::*;
#[test]
fn governor_idles_below_target() {
// At/under target → no pause is introduced.
assert_eq!(govern_cadence(Duration::ZERO, 60, 70), Duration::ZERO);
assert_eq!(govern_cadence(Duration::ZERO, 70, 70), Duration::ZERO);
}
#[test]
fn governor_lengthens_over_target_and_clamps() {
// Over target → pause grows proportional to the overshoot.
let s = govern_cadence(Duration::ZERO, 75, 70); // +5°C
assert_eq!(s, Duration::from_micros(400 * 5));
// Far over target saturates at the cap, never beyond.
let hot = govern_cadence(Duration::from_millis(99), 200, 70);
assert_eq!(hot, Duration::from_millis(100));
}
#[test]
fn governor_eases_off_when_cool() {
// 2°C+ under target → pause shrinks; floors at zero (no underflow).
let s = govern_cadence(Duration::from_millis(3), 67, 70);
assert_eq!(s, Duration::from_millis(2));
assert_eq!(govern_cadence(Duration::ZERO, 50, 70), Duration::ZERO);
}
#[test]
fn governor_converges_to_hold_target() {
// Simulate a card whose temp rises with duty: more pause → cooler. The
// controller should settle near the target without runaway.
let target = 70u32;
let mut sleep = Duration::ZERO;
let mut temp = 85i32; // starts hot
for _ in 0..200 {
sleep = govern_cadence(sleep, temp as u32, target);
// crude plant: each ms of pause sheds ~1.5°C off a 90°C flat-out temp.
let modeled = 90.0 - 1.5 * sleep.as_millis() as f64;
temp = modeled.round() as i32;
}
assert!((target as i32 - 2..=target as i32 + 2).contains(&temp), "settled at {temp}°C, want ~{target}");
}
}
+124
View File
@@ -0,0 +1,124 @@
; OpenCL import library definition for jackpotminer's Windows build.
; Generated from the cl-sys 0.4.3 exports so a vendor-neutral OpenCL.lib /
; libOpenCL.a can be produced at build time without an OpenCL SDK. The actual
; OpenCL.dll (ICD loader) is provided at runtime by the GPU driver.
LIBRARY OpenCL
EXPORTS
clBuildProgram
clCloneKernel
clCompileProgram
clCreateBuffer
clCreateCommandQueue
clCreateCommandQueueWithProperties
clCreateContext
clCreateContextFromType
clCreateFromGLBuffer
clCreateFromGLRenderbuffer
clCreateFromGLTexture
clCreateFromGLTexture2D
clCreateFromGLTexture3D
clCreateImage
clCreateImage2D
clCreateImage3D
clCreateKernel
clCreateKernelsInProgram
clCreatePipe
clCreateProgramWithBinary
clCreateProgramWithBuiltInKernels
clCreateProgramWithIL
clCreateProgramWithSource
clCreateSampler
clCreateSamplerWithProperties
clCreateSubBuffer
clCreateSubDevices
clCreateUserEvent
clEnqueueAcquireGLObjects
clEnqueueBarrier
clEnqueueBarrierWithWaitList
clEnqueueCopyBuffer
clEnqueueCopyBufferRect
clEnqueueCopyBufferToImage
clEnqueueCopyImage
clEnqueueCopyImageToBuffer
clEnqueueFillBuffer
clEnqueueFillImage
clEnqueueMapBuffer
clEnqueueMapImage
clEnqueueMarker
clEnqueueMarkerWithWaitList
clEnqueueMigrateMemObjects
clEnqueueNativeKernel
clEnqueueNDRangeKernel
clEnqueueReadBuffer
clEnqueueReadBufferRect
clEnqueueReadImage
clEnqueueReleaseGLObjects
clEnqueueSVMFree
clEnqueueSVMMap
clEnqueueSVMMemcpy
clEnqueueSVMMemFill
clEnqueueSVMMigrateMem
clEnqueueSVMUnmap
clEnqueueTask
clEnqueueUnmapMemObject
clEnqueueWaitForEvents
clEnqueueWriteBuffer
clEnqueueWriteBufferRect
clEnqueueWriteImage
clFinish
clFlush
clGetCommandQueueInfo
clGetContextInfo
clGetDeviceAndHostTimer
clGetDeviceIDs
clGetDeviceInfo
clGetEventInfo
clGetEventProfilingInfo
clGetExtensionFunctionAddress
clGetExtensionFunctionAddressForPlatform
clGetGLContextInfoKHR
clGetGLObjectInfo
clGetGLTextureInfo
clGetHostTimer
clGetImageInfo
clGetKernelArgInfo
clGetKernelInfo
clGetKernelSubGroupInfo
clGetKernelWorkGroupInfo
clGetMemObjectInfo
clGetPipeInfo
clGetPlatformIDs
clGetPlatformInfo
clGetProgramBuildInfo
clGetProgramInfo
clGetSamplerInfo
clGetSupportedImageFormats
clLinkProgram
clReleaseCommandQueue
clReleaseContext
clReleaseDevice
clReleaseEvent
clReleaseKernel
clReleaseMemObject
clReleaseProgram
clReleaseSampler
clRetainCommandQueue
clRetainContext
clRetainDevice
clRetainEvent
clRetainKernel
clRetainMemObject
clRetainProgram
clRetainSampler
clSetDefaultDeviceCommandQueue
clSetEventCallback
clSetKernelArg
clSetKernelArgSVMPointer
clSetKernelExecInfo
clSetMemObjectDestructorCallback
clSetUserEventStatus
clSVMAlloc
clSVMFree
clUnloadCompiler
clUnloadPlatformCompiler
clWaitForEvents