Compare commits
10
Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 09725cf674 | |||
| 41db98af69 | |||
| 31aa85733e | |||
| 0002e90451 | |||
| afd56bee1b | |||
| 00531fb591 | |||
| 8a9d98a21d | |||
| 1b4a2a4dd9 | |||
| 966ce3e262 | |||
| 501527d3cb |
@@ -0,0 +1,119 @@
|
||||
# Building jackpotminer on Windows
|
||||
|
||||
The miner builds on Windows with **no external GPU SDKs**. The CUDA driver and
|
||||
NVML are loaded at runtime (`dlopen`/`LoadLibrary`), and the OpenCL import
|
||||
library is generated at build time from `windows/OpenCL.def` — so you don't need
|
||||
the CUDA Toolkit, an OpenCL SDK, or any vendor libraries to compile. The whole
|
||||
codebase (miner + the `jackpotminer-config` GUI) is verified to compile for
|
||||
`x86_64-pc-windows-*`.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- **Rust** (https://rustup.rs). The default target is `x86_64-pc-windows-msvc`.
|
||||
- **MSVC build tools** — Visual Studio 2019/2022 or the standalone *Build Tools
|
||||
for Visual Studio* with the **“Desktop development with C++”** workload (gives
|
||||
`link.exe`, `lib.exe`, and the Windows SDK).
|
||||
- Build from a **“x64 Native Tools Command Prompt for VS”** (or any shell where
|
||||
the MSVC environment is active) so `lib.exe` is found. `cargo` also locates
|
||||
it automatically via the registry in most cases.
|
||||
- *(Alternative toolchain)* the GNU target `x86_64-pc-windows-gnu` works too,
|
||||
with **MinGW-w64** on `PATH` (provides `dlltool` and the linker).
|
||||
|
||||
No CUDA Toolkit and **no OpenCL SDK** are required.
|
||||
|
||||
## Build
|
||||
|
||||
```powershell
|
||||
:: Default: OpenCL + CUDA backends + the GUI config tool
|
||||
cargo build --release
|
||||
|
||||
:: Miner only (no GUI), both GPU backends
|
||||
cargo build --release --no-default-features --features gpu,cuda
|
||||
|
||||
:: OpenCL only (AMD / Intel / NVIDIA)
|
||||
cargo build --release --no-default-features --features gpu
|
||||
|
||||
:: NVIDIA only — needs nothing external at build time (CUDA is dlopen'd)
|
||||
cargo build --release --no-default-features --features cuda
|
||||
```
|
||||
|
||||
Outputs: `target\release\jackpotminer.exe` (and `jackpotminer-config.exe` with
|
||||
the default features).
|
||||
|
||||
### How the OpenCL build dependency is avoided
|
||||
|
||||
`ocl`/`cl-sys` link `OpenCL` (`#[link(name = "OpenCL")]`), which normally needs
|
||||
an `OpenCL.lib` from a vendor SDK. Instead, `build.rs` generates a
|
||||
vendor-neutral import library from `windows/OpenCL.def`:
|
||||
|
||||
- **MSVC:** `lib.exe /def:windows\OpenCL.def /out:OpenCL.lib /machine:X64`
|
||||
- **GNU:** `dlltool -d windows/OpenCL.def -l libOpenCL.a -m i386:x86-64`
|
||||
|
||||
and puts it on the link search path. The import library only forwards to
|
||||
`OpenCL.dll`, which the GPU driver provides at runtime. If the toolchain tool
|
||||
isn’t on `PATH`, `build.rs` prints a warning and the link step fails with a clear
|
||||
message (compilation/`cargo check` still works).
|
||||
|
||||
## Distribution
|
||||
|
||||
Statically link the MSVC C runtime so users don’t need the VC++ redistributable:
|
||||
|
||||
```powershell
|
||||
set RUSTFLAGS=-C target-feature=+crt-static
|
||||
cargo build --release
|
||||
```
|
||||
|
||||
## Runtime dependencies (on the mining machine)
|
||||
|
||||
- **`OpenCL.dll`** — the ICD loader, installed with any GPU driver (AMD/NVIDIA/
|
||||
Intel). Required for the OpenCL backend.
|
||||
- **`nvcuda.dll`** + **`nvml.dll`** — installed with the NVIDIA driver. Loaded
|
||||
on demand for the CUDA backend; absent on AMD-only machines, where the miner
|
||||
simply reports no CUDA devices.
|
||||
- The **VC++ runtime**, unless you built with `+crt-static`.
|
||||
|
||||
A `cuda`-enabled binary still starts on a machine with no NVIDIA driver.
|
||||
|
||||
## Building Windows binaries without a Windows machine
|
||||
|
||||
### GitHub Actions (recommended)
|
||||
|
||||
```yaml
|
||||
name: windows
|
||||
on: [push, workflow_dispatch]
|
||||
jobs:
|
||||
build:
|
||||
runs-on: windows-latest # MSVC + lib.exe already on PATH
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: dtolnay/rust-toolchain@stable
|
||||
- run: cargo build --release --features gpu,cuda
|
||||
env:
|
||||
RUSTFLAGS: -C target-feature=+crt-static
|
||||
- uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: jackpotminer-windows
|
||||
path: target/release/*.exe
|
||||
```
|
||||
|
||||
### Cross-compile from Linux (GNU target)
|
||||
|
||||
```bash
|
||||
rustup target add x86_64-pc-windows-gnu
|
||||
sudo pacman -S mingw-w64 # or your distro's mingw-w64 (gives dlltool + linker)
|
||||
cargo build --release --target x86_64-pc-windows-gnu --no-default-features --features gpu,cuda
|
||||
```
|
||||
|
||||
The MSVC target can’t be linked from Linux. The GUI config tool (`eframe`) is
|
||||
easiest to build natively on Windows.
|
||||
|
||||
## Status / caveats
|
||||
|
||||
- **Compilation for Windows is verified** here via `cargo check
|
||||
--target x86_64-pc-windows-gnu` (default features, and `gpu,cuda`).
|
||||
- The OpenCL **import-library linking** uses the standard `lib.exe`/`dlltool`
|
||||
technique; validate it with an actual Windows (or MinGW cross) build, which
|
||||
needs those tools present.
|
||||
- `relaunch_in_terminal` (reopen-in-a-terminal on GUI launch) is Linux-only;
|
||||
harmless on Windows, where double-clicking a console binary already opens a
|
||||
console.
|
||||
Generated
+1
@@ -2230,6 +2230,7 @@ version = "0.1.0"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake2b_simd",
|
||||
"cc",
|
||||
"clap",
|
||||
"core_affinity",
|
||||
"ctrlc",
|
||||
|
||||
@@ -45,6 +45,11 @@ name = "jackpotminer-config"
|
||||
path = "src/config_gui.rs"
|
||||
required-features = ["config-gui"]
|
||||
|
||||
[build-dependencies]
|
||||
# Locates the MSVC `lib.exe` (to generate the OpenCL import library on Windows);
|
||||
# unused on other platforms. See build.rs.
|
||||
cc = "1"
|
||||
|
||||
[profile.release]
|
||||
opt-level = 3
|
||||
lto = true
|
||||
|
||||
@@ -114,6 +114,13 @@ cargo build --release --no-default-features --features cuda # CUDA only
|
||||
cargo build --release --no-default-features # CPU-only (no GPU)
|
||||
```
|
||||
|
||||
### Windows
|
||||
|
||||
Builds with no external GPU SDKs — CUDA/NVML are loaded at runtime and the
|
||||
OpenCL import library is generated at build time. See
|
||||
[BUILD-windows.md](BUILD-windows.md) for the toolchain, build, and packaging
|
||||
steps.
|
||||
|
||||
### Portable / distributable builds
|
||||
|
||||
The miner's only runtime dependencies are the C library and the OpenCL ICD loader
|
||||
|
||||
@@ -0,0 +1,79 @@
|
||||
//! Build script — Windows OpenCL import library only.
|
||||
//!
|
||||
//! On Windows, the `ocl`/`cl-sys` crates link `OpenCL` (`#[link(name =
|
||||
//! "OpenCL")]`), which normally requires an `OpenCL.lib` import library from a
|
||||
//! vendor OpenCL SDK. To avoid that build dependency, we generate a
|
||||
//! vendor-neutral import library ourselves from `windows/OpenCL.def` (the list
|
||||
//! of OpenCL exports) using the toolchain's own tools — `lib.exe` for MSVC,
|
||||
//! `dlltool` for the GNU (MinGW) toolchain — and put it on the link search path.
|
||||
//! The real `OpenCL.dll` (the ICD loader) is supplied at runtime by the GPU
|
||||
//! driver, exactly like `libOpenCL.so.1` on Linux.
|
||||
//!
|
||||
//! Nothing here is needed on non-Windows targets (the system `libOpenCL` is used
|
||||
//! directly) or when the `gpu` (OpenCL) feature is off — the script no-ops. The
|
||||
//! CUDA driver / NVML are loaded at runtime via dlopen (see `src/dylib.rs`), so
|
||||
//! they need no build-script support.
|
||||
|
||||
use std::env;
|
||||
use std::path::PathBuf;
|
||||
use std::process::Command;
|
||||
|
||||
fn main() {
|
||||
println!("cargo:rerun-if-changed=build.rs");
|
||||
println!("cargo:rerun-if-changed=windows/OpenCL.def");
|
||||
|
||||
// Only Windows targets with the OpenCL backend need an import library.
|
||||
if env::var("CARGO_CFG_TARGET_OS").as_deref() != Ok("windows") {
|
||||
return;
|
||||
}
|
||||
if env::var_os("CARGO_FEATURE_GPU").is_none() {
|
||||
return;
|
||||
}
|
||||
|
||||
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
|
||||
let def = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap()).join("windows/OpenCL.def");
|
||||
let target = env::var("TARGET").unwrap_or_default();
|
||||
let is_msvc = env::var("CARGO_CFG_TARGET_ENV").as_deref() == Ok("msvc");
|
||||
|
||||
let (mut cmd, tool) = if is_msvc {
|
||||
// lib.exe /def:OpenCL.def /out:OpenCL.lib /machine:X64
|
||||
let mut c = cc::windows_registry::find(&target, "lib.exe")
|
||||
.unwrap_or_else(|| Command::new("lib.exe"));
|
||||
c.arg("/nologo")
|
||||
.arg(format!("/def:{}", def.display()))
|
||||
.arg(format!("/out:{}", out_dir.join("OpenCL.lib").display()))
|
||||
.arg("/machine:X64");
|
||||
(c, "lib.exe")
|
||||
} else {
|
||||
// GNU/MinGW: dlltool -d OpenCL.def -l libOpenCL.a -m i386:x86-64
|
||||
let dlltool = ["x86_64-w64-mingw32-dlltool", "dlltool"]
|
||||
.into_iter()
|
||||
.find(|t| Command::new(t).arg("--version").output().is_ok())
|
||||
.unwrap_or("dlltool");
|
||||
let mut c = Command::new(dlltool);
|
||||
c.arg("-d")
|
||||
.arg(&def)
|
||||
.arg("-l")
|
||||
.arg(out_dir.join("libOpenCL.a"))
|
||||
.arg("-m")
|
||||
.arg("i386:x86-64");
|
||||
(c, "dlltool")
|
||||
};
|
||||
|
||||
match cmd.status() {
|
||||
Ok(s) if s.success() => {
|
||||
// Resolve `#[link(name = "OpenCL")]` against the generated import lib.
|
||||
println!("cargo:rustc-link-search=native={}", out_dir.display());
|
||||
}
|
||||
Ok(s) => panic!("{tool} failed ({s}) generating the OpenCL import library from {}", def.display()),
|
||||
Err(e) => {
|
||||
// Tool not found: let `cargo check` (which doesn't link) still succeed;
|
||||
// a real build will fail at link with a clear "cannot find OpenCL".
|
||||
println!(
|
||||
"cargo:warning=could not run {tool} to generate the OpenCL import library ({e}); \
|
||||
ensure the toolchain tools are on PATH (MSVC dev prompt, or mingw-w64). \
|
||||
Linking the `gpu` feature will fail until then."
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
+15
-14
@@ -2,8 +2,6 @@
|
||||
|
||||
//#define PRINT 1
|
||||
|
||||
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
|
||||
|
||||
__constant ulong blake_iv[] =
|
||||
{
|
||||
0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
|
||||
@@ -507,14 +505,22 @@ uint compress2(uint in0, uint in1) {
|
||||
return tmp;
|
||||
}
|
||||
|
||||
// Exact round(sqrt(x)) without fp64: a single-precision estimate corrected to the
|
||||
// exact integer floor, then rounded. Lets the kernel build on OpenCL stacks that
|
||||
// lack cl_khr_fp64 (e.g. rusticl/Mesa) while staying bit-identical on ROCm. Inputs
|
||||
// are triangular indices (< ~2^26), well within range for the float estimate +
|
||||
// integer correction.
|
||||
inline uint isqrt_round(ulong x) {
|
||||
long m = (long) sqrt((float) x);
|
||||
while (m > 0 && (ulong)(m * m) > x) m--; // correct down to floor(sqrt)
|
||||
while ((ulong)((m + 1) * (m + 1)) <= x) m++; // correct up to floor(sqrt)
|
||||
// round to nearest: round up iff the fractional part is >= 0.5.
|
||||
return (uint)(((x - (ulong)(m * m)) > (ulong) m) ? (m + 1) : m);
|
||||
}
|
||||
|
||||
uint2 decompress(uint2 in) {
|
||||
double inFl = (double) (in.s0 >> 6);
|
||||
|
||||
inFl *= 2.0;
|
||||
inFl += 1.0;
|
||||
|
||||
uint2 res;
|
||||
res.s0 = (uint) round(sqrt(inFl));
|
||||
res.s0 = isqrt_round(2ul * (ulong)(in.s0 >> 6) + 1ul);
|
||||
|
||||
|
||||
uint tmp = res.s0 * (res.s0-1);
|
||||
@@ -535,13 +541,8 @@ uint2 decompress(uint2 in) {
|
||||
}
|
||||
|
||||
uint2 decompress2(uint in) {
|
||||
double inFl = (double) in;
|
||||
|
||||
inFl *= 2.0;
|
||||
inFl += 1.0;
|
||||
|
||||
uint2 res;
|
||||
res.s0 = (uint) round(sqrt(inFl));
|
||||
res.s0 = isqrt_round(2ul * (ulong) in + 1ul);
|
||||
|
||||
|
||||
uint tmp = res.s0 * (res.s0-1);
|
||||
|
||||
+204
@@ -0,0 +1,204 @@
|
||||
//! AMD GPU telemetry for [`crate::gpu_tune::GpuTuner`] via the Linux amdgpu
|
||||
//! sysfs interface (`/sys/class/drm/cardN/device/...`).
|
||||
//!
|
||||
//! Telemetry only: board power, edge temperature, and core/memory clocks come
|
||||
//! from the world-readable hwmon + DPM nodes, so it works unprivileged. The
|
||||
//! control nodes (`pp_od_clk_voltage`, power cap, performance level) are
|
||||
//! root-write-only and carry GPU-hang risk, so every setter returns
|
||||
//! [`SetOutcome::Unsupported`] — this backend never writes. The handle is matched
|
||||
//! to the physical card by PCI bus id (e.g. from OpenCL's `cl_khr_pci_bus_info`),
|
||||
//! so it lines up with whichever device the solver actually opened.
|
||||
|
||||
use std::path::{Path, PathBuf};
|
||||
|
||||
use crate::gpu_tune::{GpuTuner, SetOutcome};
|
||||
|
||||
/// One amdgpu card's sysfs telemetry handle.
|
||||
pub struct AmdTuner {
|
||||
/// Canonical `/sys/.../<PCI BDF>` device directory (holds `pp_dpm_*`).
|
||||
device_dir: PathBuf,
|
||||
/// `device_dir/hwmon/hwmonM` (the index M is not stable — resolved by glob).
|
||||
hwmon_dir: PathBuf,
|
||||
name: String,
|
||||
}
|
||||
|
||||
// Only `PathBuf`/`String` — `AmdTuner` is `Send` automatically; no `unsafe impl`.
|
||||
|
||||
/// Open a telemetry handle for the amdgpu card at `pci_bus_id` (e.g.
|
||||
/// "0000:03:00.0"). Returns `None` off Linux, or when no amdgpu card with
|
||||
/// telemetry nodes matches the bus id.
|
||||
pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
|
||||
#[cfg(not(target_os = "linux"))]
|
||||
{
|
||||
let _ = pci_bus_id;
|
||||
None
|
||||
}
|
||||
#[cfg(target_os = "linux")]
|
||||
{
|
||||
let want = bdf_tail(pci_bus_id)?;
|
||||
for entry in std::fs::read_dir("/sys/class/drm").ok()?.flatten() {
|
||||
let fname = entry.file_name();
|
||||
let fname = fname.to_string_lossy();
|
||||
// Match the GPU nodes ("card0", "card1", …), not the per-connector
|
||||
// dirs ("card1-DP-1") or render nodes.
|
||||
if !fname.starts_with("card") || fname.contains('-') {
|
||||
continue;
|
||||
}
|
||||
// `cardN/device` symlinks to the PCI device dir `…/<BDF>`.
|
||||
let Ok(device_dir) = std::fs::canonicalize(entry.path().join("device")) else {
|
||||
continue;
|
||||
};
|
||||
let matches = device_dir
|
||||
.file_name()
|
||||
.and_then(|s| s.to_str())
|
||||
.and_then(bdf_tail)
|
||||
.map(|t| t == want)
|
||||
.unwrap_or(false);
|
||||
if !matches {
|
||||
continue;
|
||||
}
|
||||
let Some(hwmon_dir) = find_hwmon(&device_dir) else { continue };
|
||||
// Require a telemetry node so we don't attach to a card without
|
||||
// sensors (e.g. some virtual/headless devices).
|
||||
if !hwmon_dir.join("temp1_input").exists()
|
||||
&& !hwmon_dir.join("power1_average").exists()
|
||||
{
|
||||
continue;
|
||||
}
|
||||
let name = read_name(&device_dir);
|
||||
return Some(Box::new(AmdTuner { device_dir, hwmon_dir, name }));
|
||||
}
|
||||
None
|
||||
}
|
||||
}
|
||||
|
||||
/// First `hwmon/hwmon*` subdir of a device dir (the index isn't stable).
|
||||
fn find_hwmon(device_dir: &Path) -> Option<PathBuf> {
|
||||
for e in std::fs::read_dir(device_dir.join("hwmon")).ok()?.flatten() {
|
||||
if e.file_name().to_string_lossy().starts_with("hwmon") {
|
||||
return Some(e.path());
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
/// A display name for the card; amdgpu sysfs rarely exposes a marketing name, so
|
||||
/// fall back to a generic label.
|
||||
fn read_name(device_dir: &Path) -> String {
|
||||
read_trim(&device_dir.join("product_name"))
|
||||
.filter(|s| !s.is_empty())
|
||||
.unwrap_or_else(|| "AMD GPU".to_string())
|
||||
}
|
||||
|
||||
/// Normalize a PCI BDF to its `bus:device.function` tail (drop the domain), so
|
||||
/// e.g. "0000:03:00.0" and "03:00.0" compare equal.
|
||||
fn bdf_tail(bdf: &str) -> Option<String> {
|
||||
let bdf = bdf.trim().to_ascii_lowercase();
|
||||
match bdf.split(':').collect::<Vec<_>>().as_slice() {
|
||||
[_domain, bus, devfunc] => Some(format!("{bus}:{devfunc}")),
|
||||
[bus, devfunc] => Some(format!("{bus}:{devfunc}")),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
fn read_trim(p: &Path) -> Option<String> {
|
||||
std::fs::read_to_string(p).ok().map(|s| s.trim().to_string())
|
||||
}
|
||||
|
||||
fn read_u64(p: &Path) -> Option<u64> {
|
||||
read_trim(p)?.parse().ok()
|
||||
}
|
||||
|
||||
/// Parse the MHz value out of a `pp_dpm_*` line like `"2: 2700Mhz *"`.
|
||||
fn parse_mhz(line: &str) -> Option<u32> {
|
||||
let after_colon = line.split(':').nth(1)?;
|
||||
let token = after_colon.trim().split_whitespace().next()?;
|
||||
let digits: String = token.chars().take_while(|c| c.is_ascii_digit()).collect();
|
||||
digits.parse().ok()
|
||||
}
|
||||
|
||||
impl AmdTuner {
|
||||
/// MHz of the currently active DPM level (the line marked `*`) in `file`.
|
||||
fn dpm_active_mhz(&self, file: &str) -> Option<u32> {
|
||||
let s = read_trim(&self.device_dir.join(file))?;
|
||||
s.lines().find(|l| l.contains('*')).and_then(parse_mhz)
|
||||
}
|
||||
/// Highest DPM level (MHz) listed in `file`.
|
||||
fn dpm_max_mhz(&self, file: &str) -> Option<u32> {
|
||||
let s = read_trim(&self.device_dir.join(file))?;
|
||||
s.lines().filter_map(parse_mhz).max()
|
||||
}
|
||||
}
|
||||
|
||||
impl GpuTuner for AmdTuner {
|
||||
fn name(&self) -> String {
|
||||
self.name.clone()
|
||||
}
|
||||
|
||||
fn watts(&self) -> Option<f64> {
|
||||
// power1_average is µW; fall back to the instantaneous power1_input.
|
||||
let uw = read_u64(&self.hwmon_dir.join("power1_average"))
|
||||
.or_else(|| read_u64(&self.hwmon_dir.join("power1_input")))?;
|
||||
Some(uw as f64 / 1_000_000.0)
|
||||
}
|
||||
|
||||
fn temperature_c(&self) -> Option<u32> {
|
||||
// temp1 = edge (m°C).
|
||||
let mc = read_u64(&self.hwmon_dir.join("temp1_input"))?;
|
||||
Some(((mc + 500) / 1000) as u32)
|
||||
}
|
||||
|
||||
fn current_power_limit_w(&self) -> Option<u32> {
|
||||
// Absent on Navi 44 (RX 9060 XT); best-effort for cards that expose it.
|
||||
read_u64(&self.hwmon_dir.join("power1_cap")).map(|uw| (uw / 1_000_000) as u32)
|
||||
}
|
||||
|
||||
fn core_clock_mhz(&self) -> Option<u32> {
|
||||
// freq1_input (Hz) is the live sclk; the DPM active level is the fallback.
|
||||
if let Some(hz) = read_u64(&self.hwmon_dir.join("freq1_input")) {
|
||||
if hz > 0 {
|
||||
return Some((hz / 1_000_000) as u32);
|
||||
}
|
||||
}
|
||||
self.dpm_active_mhz("pp_dpm_sclk")
|
||||
}
|
||||
|
||||
fn mem_clock_mhz(&self) -> Option<u32> {
|
||||
self.dpm_active_mhz("pp_dpm_mclk")
|
||||
}
|
||||
|
||||
fn max_core_clock_mhz(&self) -> Option<u32> {
|
||||
self.dpm_max_mhz("pp_dpm_sclk")
|
||||
}
|
||||
|
||||
fn max_mem_clock_mhz(&self) -> Option<u32> {
|
||||
self.dpm_max_mhz("pp_dpm_mclk")
|
||||
}
|
||||
|
||||
fn power_limit_range_w(&self) -> Option<(u32, u32)> {
|
||||
let mn = read_u64(&self.hwmon_dir.join("power1_cap_min"))?;
|
||||
let mx = read_u64(&self.hwmon_dir.join("power1_cap_max"))?;
|
||||
Some(((mn / 1_000_000) as u32, (mx / 1_000_000) as u32))
|
||||
}
|
||||
|
||||
// Telemetry-only backend: never writes the root-only control nodes.
|
||||
fn set_persistence(&self, _on: bool) -> SetOutcome {
|
||||
SetOutcome::Unsupported
|
||||
}
|
||||
fn lock_core_clock_mhz(&self, _mhz: u32) -> SetOutcome {
|
||||
SetOutcome::Unsupported
|
||||
}
|
||||
fn lock_mem_clock_mhz(&self, _mhz: u32) -> SetOutcome {
|
||||
SetOutcome::Unsupported
|
||||
}
|
||||
fn set_power_limit_w(&self, _watts: u32) -> SetOutcome {
|
||||
SetOutcome::Unsupported
|
||||
}
|
||||
fn set_core_offset_mhz(&self, _mhz: i32) -> SetOutcome {
|
||||
SetOutcome::Unsupported
|
||||
}
|
||||
fn set_mem_offset_mhz(&self, _mhz: i32) -> SetOutcome {
|
||||
SetOutcome::Unsupported
|
||||
}
|
||||
fn reset(&self) {}
|
||||
}
|
||||
@@ -215,6 +215,92 @@ impl BatchHasher {
|
||||
}
|
||||
}
|
||||
|
||||
/// Hash eight consecutive indices `g0..g0+8`, writing each 48-byte digest
|
||||
/// into `out[0..8]`. Uses AVX-512 (one 8-wide BLAKE2b compression) when
|
||||
/// available, else two AVX2 `hash4`s. Modern AMD (Zen4+) and Intel CPUs run
|
||||
/// AVX-512 without the clock penalty older Intel parts had.
|
||||
#[inline]
|
||||
pub fn hash8(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
{
|
||||
if is_x86_feature_detected!("avx512f") {
|
||||
unsafe { self.hash8_avx512(g0, out) };
|
||||
return;
|
||||
}
|
||||
}
|
||||
// Fallback: two 4-lane batches into the two halves of `out`.
|
||||
let (lo, hi) = out.split_at_mut(4);
|
||||
self.hash4(g0, (&mut lo[..4]).try_into().unwrap());
|
||||
self.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
|
||||
}
|
||||
|
||||
/// Eight-lane BLAKE2b final-block compression (AVX-512). Same structure as
|
||||
/// [`Self::hash4_avx2`] but 512-bit lanes and native 64-bit rotates.
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
#[target_feature(enable = "avx512f")]
|
||||
unsafe fn hash8_avx512(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 8]) {
|
||||
use core::arch::x86_64::*;
|
||||
|
||||
#[inline(always)]
|
||||
unsafe fn g8(
|
||||
v: &mut [__m512i; 16],
|
||||
a: usize, b: usize, c: usize, d: usize,
|
||||
x: __m512i, y: __m512i,
|
||||
) {
|
||||
v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), x);
|
||||
v[d] = _mm512_ror_epi64::<32>(_mm512_xor_si512(v[d], v[a]));
|
||||
v[c] = _mm512_add_epi64(v[c], v[d]);
|
||||
v[b] = _mm512_ror_epi64::<24>(_mm512_xor_si512(v[b], v[c]));
|
||||
v[a] = _mm512_add_epi64(_mm512_add_epi64(v[a], v[b]), y);
|
||||
v[d] = _mm512_ror_epi64::<16>(_mm512_xor_si512(v[d], v[a]));
|
||||
v[c] = _mm512_add_epi64(v[c], v[d]);
|
||||
v[b] = _mm512_ror_epi64::<63>(_mm512_xor_si512(v[b], v[c]));
|
||||
}
|
||||
|
||||
// Only m0 and m1 are nonzero; m1's high 32 bits hold the per-lane `g`.
|
||||
let tail0 = u64::from_le_bytes(self.tail[0..8].try_into().unwrap());
|
||||
let tail_hi = u32::from_le_bytes(self.tail[8..12].try_into().unwrap()) as u64;
|
||||
let m1 = |g: u32| (tail_hi | ((g as u64) << 32)) as i64;
|
||||
let mut m = [_mm512_setzero_si512(); 16];
|
||||
m[0] = _mm512_set1_epi64(tail0 as i64);
|
||||
m[1] = _mm512_set_epi64(
|
||||
m1(g0 + 7), m1(g0 + 6), m1(g0 + 5), m1(g0 + 4),
|
||||
m1(g0 + 3), m1(g0 + 2), m1(g0 + 1), m1(g0),
|
||||
);
|
||||
|
||||
let mut v = [_mm512_setzero_si512(); 16];
|
||||
for i in 0..8 {
|
||||
v[i] = _mm512_set1_epi64(self.mid[i] as i64);
|
||||
v[i + 8] = _mm512_set1_epi64(IV[i] as i64);
|
||||
}
|
||||
v[12] = _mm512_xor_si512(v[12], _mm512_set1_epi64(FINAL_COUNT as i64));
|
||||
v[14] = _mm512_xor_si512(v[14], _mm512_set1_epi64(-1)); // last-block flag
|
||||
|
||||
for s in &SIGMA {
|
||||
g8(&mut v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
|
||||
g8(&mut v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
|
||||
g8(&mut v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
|
||||
g8(&mut v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
|
||||
g8(&mut v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
|
||||
g8(&mut v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
|
||||
g8(&mut v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
|
||||
g8(&mut v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
|
||||
}
|
||||
|
||||
// h[i] = mid[i] ^ v[i] ^ v[i+8]; first HASH_OUTPUT/8 words per lane, LE.
|
||||
let mut tmp = [0u64; 8];
|
||||
for i in 0..HASH_OUTPUT / 8 {
|
||||
let o = _mm512_xor_si512(
|
||||
_mm512_xor_si512(_mm512_set1_epi64(self.mid[i] as i64), v[i]),
|
||||
v[i + 8],
|
||||
);
|
||||
_mm512_storeu_si512(tmp.as_mut_ptr() as *mut _, o);
|
||||
for l in 0..8 {
|
||||
out[l][i * 8..i * 8 + 8].copy_from_slice(&tmp[l].to_le_bytes());
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
#[target_feature(enable = "avx2")]
|
||||
unsafe fn hash4_avx2(&self, g0: u32, out: &mut [[u8; HASH_OUTPUT]; 4]) {
|
||||
@@ -316,5 +402,14 @@ mod tests {
|
||||
assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash4 mismatch at g={}", base_g + l as u32);
|
||||
}
|
||||
}
|
||||
|
||||
// hash8 (AVX-512 or hash4 fallback) must match for several batches.
|
||||
for base_g in [0u32, 8, 64, 1_000_000] {
|
||||
let mut out = [[0u8; HASH_OUTPUT]; 8];
|
||||
hasher.hash8(base_g, &mut out);
|
||||
for l in 0..8 {
|
||||
assert_eq!(out[l], generate_hash(&base, base_g + l as u32), "hash8 mismatch at g={}", base_g + l as u32);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
+3
-3
@@ -175,15 +175,15 @@ mod tests {
|
||||
#[test]
|
||||
fn protocol_get_and_set() {
|
||||
let controls = Controls::new(1, 0, 0, 0, 0, true);
|
||||
let cpu = CpuMining::new((0..8).collect(), 4, false); // sizes [1,2,4,8]
|
||||
let cpu = CpuMining::new((0..16).collect(), 4, false); // 16 cores -> tier cap 4, sizes [1,2,4]
|
||||
let stats = Stats::for_test(vec!["GPU 0".into()]);
|
||||
|
||||
// get: one device, cpu rows reflect the grouping (8 cores / 4 = 2 rows).
|
||||
// get: one device, cpu rows reflect the grouping (16 cores / 4 = 4 rows).
|
||||
let v = process("{\"op\":\"get\"}", &controls, &cpu, &stats);
|
||||
assert_eq!(v["ok"], true);
|
||||
assert_eq!(v["devices"].as_array().unwrap().len(), 1);
|
||||
assert_eq!(v["cpu"]["group_size"], 4);
|
||||
assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 2);
|
||||
assert_eq!(v["cpu"]["rows"].as_array().unwrap().len(), 4);
|
||||
|
||||
// set device 0 disabled.
|
||||
assert!(controls.device(0).enabled());
|
||||
|
||||
+70
-25
@@ -165,6 +165,21 @@ fn grouped(cores: &[usize], size: usize, enabled: &BTreeSet<usize>) -> Arc<CpuGr
|
||||
groups
|
||||
}
|
||||
|
||||
/// Largest cores-per-row the dashboard allows, by total core count. Finer
|
||||
/// control on small machines, coarser on big ones so the toggleable-row count
|
||||
/// stays manageable: **≤4 cores** toggle individually (size 1), **5-8 cores** in
|
||||
/// groups of up to 2, and **more than 8** in groups of up to 4. This caps both
|
||||
/// the starting group size and the 'g' cycle.
|
||||
fn max_group_size(cores: usize) -> usize {
|
||||
if cores <= 4 {
|
||||
1
|
||||
} else if cores <= 8 {
|
||||
2
|
||||
} else {
|
||||
4
|
||||
}
|
||||
}
|
||||
|
||||
/// Live controller for CPU mining. Owns the selected cores plus the current
|
||||
/// group size, and rebuilds the [`CpuGroups`] when the size is cycled from the
|
||||
/// dashboard (the worker supervisor watches `group_size()` and respawns to
|
||||
@@ -188,14 +203,16 @@ impl CpuMining {
|
||||
/// the starting group size (`--cpu-group-size`), and `start_enabled` whether
|
||||
/// mining begins on (`--cpu-mining`).
|
||||
pub fn new(cores: Vec<usize>, initial_size: usize, start_enabled: bool) -> Arc<Self> {
|
||||
let initial_size = initial_size.max(1);
|
||||
// Toggle granularity scales with core count so the row count stays
|
||||
// manageable (see [`max_group_size`]): the cap bounds both the starting
|
||||
// size and the dashboard 'g' cycle.
|
||||
let cap = max_group_size(cores.len());
|
||||
let initial_size = initial_size.clamp(1, cap);
|
||||
let enabled: BTreeSet<usize> = if start_enabled { cores.iter().copied().collect() } else { BTreeSet::new() };
|
||||
let groups = grouped(&cores, initial_size, &enabled);
|
||||
|
||||
// Cycle list: the usual powers of two plus the requested size, capped so a
|
||||
// group never exceeds the core count (unless the user explicitly asked for
|
||||
// a larger size), sorted and de-duplicated.
|
||||
let cap = cores.len().max(initial_size).max(1);
|
||||
// Cycle list: powers of two from 1 up to the cap, plus the (clamped)
|
||||
// requested size, sorted and de-duplicated.
|
||||
let mut sizes: Vec<usize> = [1usize, 2, 4, 8]
|
||||
.into_iter()
|
||||
.chain([initial_size])
|
||||
@@ -346,39 +363,67 @@ mod tests {
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
|
||||
// 8 cores, start at size 4 fully enabled -> two groups [0-3],[4-7], both on.
|
||||
let m = CpuMining::new((0..8).collect(), 4, true);
|
||||
fn group_size_tier_by_core_count() {
|
||||
// ≤4 cores: individual cores only — size 1, no larger option.
|
||||
let m = CpuMining::new((0..4).collect(), 4, false); // requested 4 clamps to 1
|
||||
assert_eq!(m.group_size(), 1);
|
||||
assert_eq!(m.groups().len(), 4);
|
||||
m.cycle_group_size();
|
||||
assert_eq!(m.group_size(), 1); // only one size, cycle is a no-op
|
||||
|
||||
// 5-8 cores: groups of up to 2 (requested 4 clamps to 2).
|
||||
let m = CpuMining::new((0..8).collect(), 4, false);
|
||||
assert_eq!(m.group_size(), 2);
|
||||
assert_eq!(m.groups().len(), 4);
|
||||
m.cycle_group_size();
|
||||
assert_eq!(m.group_size(), 1); // cycle covers {1, 2}
|
||||
|
||||
// >8 cores: groups of up to 4 (the default).
|
||||
let m = CpuMining::new((0..16).collect(), 4, false);
|
||||
assert_eq!(m.group_size(), 4);
|
||||
assert_eq!(m.groups().len(), 2);
|
||||
assert_eq!(m.groups().len(), 4);
|
||||
// A larger explicit request is still capped at the tier max.
|
||||
let m = CpuMining::new((0..16).collect(), 8, false);
|
||||
assert_eq!(m.group_size(), 4);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cpu_mining_cycles_size_and_preserves_enabled_cores() {
|
||||
// 16 cores (tier cap 4): start at size 4 fully enabled -> four groups,
|
||||
// all on.
|
||||
let m = CpuMining::new((0..16).collect(), 4, true);
|
||||
assert_eq!(m.group_size(), 4);
|
||||
assert_eq!(m.groups().len(), 4);
|
||||
assert!(m.groups().iter().all(|g| g.enabled()));
|
||||
|
||||
// Disable the second group (cores 4-7): now only cores 0-3 are enabled.
|
||||
m.toggle_group(1);
|
||||
assert!(m.groups().group(0).enabled());
|
||||
assert!(!m.groups().group(1).enabled());
|
||||
// Disable the last group (cores 12-15): now only cores 0-11 are enabled.
|
||||
m.toggle_group(3);
|
||||
assert!(!m.groups().group(3).enabled());
|
||||
|
||||
// Cycle to size 8 and rebuild: the single [0-7] group is off, because not
|
||||
// all of its cores were enabled.
|
||||
while m.group_size() != 8 {
|
||||
// Cycle to size 1 (individual cores) and rebuild: 16 rows; cores 0-11 on,
|
||||
// 12-15 off.
|
||||
while m.group_size() != 1 {
|
||||
m.cycle_group_size();
|
||||
}
|
||||
let g = m.rebuild();
|
||||
assert_eq!(g.len(), 1);
|
||||
assert!(!g.group(0).enabled());
|
||||
assert_eq!(g.len(), 16);
|
||||
assert!(g.group(0).enabled()); // core 0
|
||||
assert!(g.group(11).enabled()); // core 11
|
||||
assert!(!g.group(12).enabled()); // core 12
|
||||
assert!(!g.group(15).enabled()); // core 15
|
||||
|
||||
// Cycle to size 2 and rebuild: cores 0-3 are still tracked as enabled, so
|
||||
// [0,1] and [2,3] come back on while [4,5] and [6,7] stay off — the choice
|
||||
// survived two regroups (not derived from the all-off size-8 grouping).
|
||||
// Cycle to size 2 and rebuild: cores 0-11 are still tracked as enabled, so
|
||||
// [0,1]..[10,11] come back on while [12,13],[14,15] stay off — the choice
|
||||
// survived two regroups.
|
||||
while m.group_size() != 2 {
|
||||
m.cycle_group_size();
|
||||
}
|
||||
let g = m.rebuild();
|
||||
assert_eq!(g.len(), 4);
|
||||
assert_eq!(g.len(), 8);
|
||||
assert!(g.group(0).enabled()); // [0,1]
|
||||
assert!(g.group(1).enabled()); // [2,3]
|
||||
assert!(!g.group(2).enabled()); // [4,5]
|
||||
assert!(!g.group(3).enabled()); // [6,7]
|
||||
assert!(g.group(5).enabled()); // [10,11]
|
||||
assert!(!g.group(6).enabled()); // [12,13]
|
||||
assert!(!g.group(7).enabled()); // [14,15]
|
||||
assert!(m.generation() >= 2);
|
||||
}
|
||||
|
||||
|
||||
+442
-119
@@ -131,23 +131,31 @@ fn leaf_row(base: &State, idx: u32) -> Row {
|
||||
// histogram that has to be zeroed, scatter-filled, and cloned every round). Two
|
||||
// entries can collide only when their *whole* leading block matches, so they
|
||||
// always land in the same partition — making the partitions independent
|
||||
// collision domains that are processed in parallel across the rayon pool the
|
||||
// single CPU worker otherwise leaves idle.
|
||||
// collision domains processed in parallel across the rayon pool.
|
||||
//
|
||||
// The data layout follows xenoncat's packed slots and targets AVX2 throughput:
|
||||
// * Entries live in fixed 32-byte `SLOT`s (one `__m256i`); producing a child
|
||||
// is a single load/xor/permute/store (`xor_child_avx2`).
|
||||
// * A dense parallel `keys[]` array mirrors each entry's leading block, so the
|
||||
// `partition_top`/`count_pairs`/`low_group` histogram sweeps stream over
|
||||
// 4 bytes/entry instead of striding the slots.
|
||||
// The data layout targets AVX2 throughput; the round is memory-bandwidth bound,
|
||||
// so every choice below exists to cut DRAM traffic / random access:
|
||||
// * `partition_top` is a parallel counting sort (per-chunk histograms ->
|
||||
// per-chunk base offsets -> disjoint-region scatter), not a serial pass.
|
||||
// * It also emits `keys_part` (the leading keys in partition order) so the
|
||||
// per-partition `count_pairs`/`low_group` sweeps read keys sequentially
|
||||
// instead of gathering `keys[order[..]]`; `low_group` likewise emits
|
||||
// `keys_sorted` so the emit group walk streams a dense local copy.
|
||||
// * Slots hold residual hash words in a *narrowing* packed slot: round 0 is
|
||||
// `SLOT` words; each round consumes the leading block, so round r packs at
|
||||
// `SLOT - r` words. The XOR child is one 256-bit load/xor/permute that
|
||||
// masked-stores the live lanes (`xor_child_avx2`); buffers carry a
|
||||
// `SLOT_SLACK` pad for the 256-bit over-read of a narrow tail slot.
|
||||
// * `collide` runs two passes — count pairs, then emit children directly into
|
||||
// one pre-sized arena at per-partition offsets — so there is no per-bucket
|
||||
// allocation and no final concatenation copy; the `keys`/`slots` buffers
|
||||
// ping-pong between rounds.
|
||||
// * Round 0 repacks BLAKE2b digests into 24-bit blocks with `pshufb`.
|
||||
// A further bandwidth step (not yet taken) is xenoncat's full radix scatter of
|
||||
// the payload into bucket-contiguous storage, turning the inner-loop slot
|
||||
// gathers into sequential reads.
|
||||
// The remaining bottleneck is the random slot gather in `emit_bucket` over the
|
||||
// ~1 GiB round buffer — the Equihash memory-hardness floor. A full radix scatter
|
||||
// of the payload into bucket-contiguous storage (to make that gather sequential)
|
||||
// was evaluated and loses: the wide-slot scatter costs more than the gather it
|
||||
// saves on the early, wide rounds.
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// 24-bit collision blocks carried by a round-0 entry (= 8 for 192,7).
|
||||
@@ -164,36 +172,53 @@ const LOW_BUCKETS: usize = 1 << LOW_BITS;
|
||||
/// Mask isolating the low resolved bits.
|
||||
const LOW_MASK: u32 = (LOW_BUCKETS - 1) as u32;
|
||||
|
||||
/// Number of u32 words in a padded entry slot (32 bytes = one AVX2 register).
|
||||
/// Every round stores its residual hash words in a fixed 8-word slot so the XOR
|
||||
/// that produces a child is a single 256-bit load/xor/permute/store and every
|
||||
/// slot access is naturally aligned — xenoncat's packed-slot trick (and the same
|
||||
/// `uint4`-aligned-slot idea the CUDA backend already uses), on the CPU. The
|
||||
/// leading collision word lives in lane 0 and is mirrored into a dense parallel
|
||||
/// `keys[]` array so the histogram passes stream over 4 bytes/entry instead of
|
||||
/// striding the 32-byte slots.
|
||||
/// Number of u32 words in a round-0 entry slot (32 bytes = one AVX2 register).
|
||||
/// Round 0 stores all eight 24-bit blocks; the XOR that produces a child is a
|
||||
/// single 256-bit load/xor/permute — xenoncat's packed-slot trick (and the same
|
||||
/// `uint4`-aligned-slot idea the CUDA backend uses), on the CPU. The leading
|
||||
/// collision word lives in lane 0 and is mirrored into a dense parallel `keys[]`
|
||||
/// array so the histogram passes stream over 4 bytes/entry instead of striding
|
||||
/// the slots.
|
||||
///
|
||||
/// Later rounds use a *narrower* pitch: each collision round consumes the leading
|
||||
/// block, so a round-`r` output entry only carries `8 - r` meaningful words. The
|
||||
/// solver packs each round's slots at that width (see `collide`'s `w_out`),
|
||||
/// cutting the per-round slot-buffer DRAM traffic that bounds the collision rounds.
|
||||
/// The XOR producer still loads a full 256-bit register (over-reading up to
|
||||
/// `SLOT` words past a narrow tail slot, hence the `SLOT_SLACK` pad) but
|
||||
/// masked-stores only the `w_out` live lanes so packed neighbours aren't touched.
|
||||
const SLOT: usize = 8;
|
||||
|
||||
/// Scalar child producer: `out[0..8] = (a XOR b)` rotated left one lane; returns
|
||||
/// the child's new leading word (lane 1 of the XOR). Lane 0 of the XOR is the
|
||||
/// just-collided block (always zero) and is rotated out.
|
||||
/// Trailing pad (in u32 words) on every slot buffer so the XOR producer's 256-bit
|
||||
/// load over a narrow tail slot stays in bounds. The over-read reaches at most
|
||||
/// `(n-1)*w + SLOT` words for pitch `w`, i.e. `SLOT - w ≤ SLOT` words past the
|
||||
/// `n*w` payload; `SLOT` words always covers it.
|
||||
const SLOT_SLACK: usize = SLOT;
|
||||
|
||||
/// Scalar child producer: writes the `w_out` live words of `(a XOR b)` rotated
|
||||
/// left one lane into `out`, and returns the child's new leading word (lane 1 of
|
||||
/// the XOR). Lane 0 of the XOR is the just-collided block (always zero) and is
|
||||
/// rotated out. `a`/`b` are read a full `SLOT` words wide (the caller pads each
|
||||
/// slot buffer by `SLOT_SLACK`); only `out[0..w_out]` is written.
|
||||
#[inline]
|
||||
unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
|
||||
unsafe fn xor_child_scalar(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
|
||||
let mut x = [0u32; SLOT];
|
||||
for t in 0..SLOT {
|
||||
x[t] = *a.add(t) ^ *b.add(t);
|
||||
}
|
||||
for t in 0..SLOT {
|
||||
*out.add(t) = x[(t + 1) % SLOT];
|
||||
// out[t] = x[(t + 1) % SLOT]; for t < w_out <= SLOT-1 the modulo is a no-op.
|
||||
for t in 0..w_out {
|
||||
*out.add(t) = x[t + 1];
|
||||
}
|
||||
x[1]
|
||||
}
|
||||
|
||||
/// AVX2 child producer: one `vpxor` + one `vpermd` (rotate the 8 lanes left by
|
||||
/// one) + one store. Replaces the per-word scalar XOR loop.
|
||||
/// one), then a masked store of the low `w_out` lanes so the packed, `w_out`-
|
||||
/// pitched output never clobbers the next slot.
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
#[target_feature(enable = "avx2")]
|
||||
unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
|
||||
unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32, w_out: usize) -> u32 {
|
||||
use core::arch::x86_64::*;
|
||||
let x = _mm256_xor_si256(
|
||||
_mm256_loadu_si256(a as *const __m256i),
|
||||
@@ -201,7 +226,12 @@ unsafe fn xor_child_avx2(out: *mut u32, a: *const u32, b: *const u32) -> u32 {
|
||||
);
|
||||
// rotate left by one 32-bit lane: out[i] = x[(i + 1) % 8]
|
||||
let p = _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0));
|
||||
_mm256_storeu_si256(out as *mut __m256i, p);
|
||||
// mask = lanes [0, w_out) -> all-ones; maskstore writes only those.
|
||||
let mask = _mm256_cmpgt_epi32(
|
||||
_mm256_set1_epi32(w_out as i32),
|
||||
_mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
|
||||
);
|
||||
_mm256_maskstore_epi32(out as *mut i32, mask, p);
|
||||
_mm_cvtsi128_si32(_mm256_castsi256_si128(p)) as u32
|
||||
}
|
||||
|
||||
@@ -244,38 +274,113 @@ fn repack_index(src: &[u8], dst: &mut [u32]) {
|
||||
repack_index_scalar(src, dst);
|
||||
}
|
||||
|
||||
/// A raw `*mut u32` shared across rayon workers for the parallel scatter below.
|
||||
/// Sound only because the workers write provably disjoint position sets.
|
||||
#[derive(Clone, Copy)]
|
||||
struct OrderPtr(*mut u32);
|
||||
unsafe impl Send for OrderPtr {}
|
||||
unsafe impl Sync for OrderPtr {}
|
||||
|
||||
/// Partition the `n` entries into `TOP_BUCKETS` runs by the high `TOP_BITS` of
|
||||
/// their (dense) leading block. Returns `(starts, order)`, where partition `v`
|
||||
/// owns the input indices `order[starts[v]..starts[v + 1]]`. The histogram
|
||||
/// passes stream over `keys[]` (4 bytes/entry) instead of striding the slots.
|
||||
fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>) {
|
||||
/// their (dense) leading block. Returns `(starts, order, keys_part)`: partition
|
||||
/// `v` owns the input indices `order[starts[v]..starts[v + 1]]`, and `keys_part`
|
||||
/// is the entries' leading words in the same partition-contiguous order
|
||||
/// (`keys_part[p] == keys[order[p]]`). Carrying `keys_part` lets the per-partition
|
||||
/// `count_pairs`/`low_group` sweeps read keys sequentially instead of gathering
|
||||
/// `keys[order[..]]` over the whole array — at the cost of one extra 4-byte/entry
|
||||
/// scatter here, folded into the (already parallel) phase-3 pass.
|
||||
///
|
||||
/// Parallel counting sort: the input is split into one contiguous chunk per
|
||||
/// rayon worker. Each chunk histograms its slice (phase 1), a small serial pass
|
||||
/// turns those into per-chunk base offsets within each bucket's output region
|
||||
/// (phase 2), and each chunk scatters its entries into `order`/`keys_part`
|
||||
/// (phase 3). Chunk `c`'s bucket-`b` writes land in `[off[c][b], off[c+1][b])`,
|
||||
/// disjoint from every other chunk and bucket, so the concurrent writes never
|
||||
/// alias. Entries within a bucket end up chunk-major rather than index-major;
|
||||
/// that reordering is immaterial — `count_pairs`/`low_group` depend only on the
|
||||
/// multiset of low keys, and final solutions are canonicalised, de-duplicated,
|
||||
/// and verified.
|
||||
fn partition_top(keys: &[u32], n: usize) -> (Vec<u32>, Vec<u32>, Vec<u32>) {
|
||||
let mut starts = vec![0u32; TOP_BUCKETS + 1];
|
||||
for k in 0..n {
|
||||
starts[(keys[k] >> LOW_BITS) as usize + 1] += 1;
|
||||
}
|
||||
for i in 0..TOP_BUCKETS {
|
||||
starts[i + 1] += starts[i];
|
||||
}
|
||||
let mut order = vec![0u32; n];
|
||||
let mut cur = starts.clone(); // small: TOP_BUCKETS + 1 entries
|
||||
for k in 0..n {
|
||||
let b = (keys[k] >> LOW_BITS) as usize;
|
||||
order[cur[b] as usize] = k as u32;
|
||||
let mut keys_part = vec![0u32; n];
|
||||
if n == 0 {
|
||||
return (starts, order, keys_part);
|
||||
}
|
||||
|
||||
let nthreads = rayon::current_num_threads().max(1);
|
||||
let chunk = n.div_ceil(nthreads);
|
||||
let nchunks = n.div_ceil(chunk);
|
||||
|
||||
// Phase 1 (parallel): per-chunk top-bucket histograms.
|
||||
let local_hists: Vec<Vec<u32>> = (0..nchunks)
|
||||
.into_par_iter()
|
||||
.map(|c| {
|
||||
let lo = c * chunk;
|
||||
let hi = ((c + 1) * chunk).min(n);
|
||||
let mut h = vec![0u32; TOP_BUCKETS];
|
||||
for &key in &keys[lo..hi] {
|
||||
h[(key >> LOW_BITS) as usize] += 1;
|
||||
}
|
||||
h
|
||||
})
|
||||
.collect();
|
||||
|
||||
// Phase 2 (serial, nchunks * TOP_BUCKETS work): bucket starts, then each
|
||||
// chunk's per-bucket base offset within its bucket's output region.
|
||||
let mut totals = vec![0u32; TOP_BUCKETS];
|
||||
for h in &local_hists {
|
||||
for b in 0..TOP_BUCKETS {
|
||||
totals[b] += h[b];
|
||||
}
|
||||
}
|
||||
for b in 0..TOP_BUCKETS {
|
||||
starts[b + 1] = starts[b] + totals[b];
|
||||
}
|
||||
let mut offsets = vec![vec![0u32; TOP_BUCKETS]; nchunks];
|
||||
let mut running = starts[..TOP_BUCKETS].to_vec(); // running[b] starts at starts[b]
|
||||
for c in 0..nchunks {
|
||||
for b in 0..TOP_BUCKETS {
|
||||
offsets[c][b] = running[b];
|
||||
running[b] += local_hists[c][b];
|
||||
}
|
||||
}
|
||||
|
||||
// Phase 3 (parallel): each chunk scatters into its disjoint sub-ranges.
|
||||
let optr = OrderPtr(order.as_mut_ptr());
|
||||
let kptr = OrderPtr(keys_part.as_mut_ptr());
|
||||
offsets.into_par_iter().enumerate().for_each(|(c, mut cur)| {
|
||||
let (optr, kptr) = (optr, kptr); // capture whole (Sync) wrappers
|
||||
let lo = c * chunk;
|
||||
let hi = ((c + 1) * chunk).min(n);
|
||||
let (obase, kbase) = (optr.0, kptr.0);
|
||||
for k in lo..hi {
|
||||
let key = keys[k];
|
||||
let b = (key >> LOW_BITS) as usize;
|
||||
let pos = cur[b] as usize;
|
||||
// SAFETY: `pos` ranges over `[off[c][b], off[c+1][b])`, a range owned
|
||||
// exclusively by chunk `c` and within `order`/`keys_part` bounds.
|
||||
unsafe {
|
||||
*obase.add(pos) = k as u32;
|
||||
*kbase.add(pos) = key;
|
||||
}
|
||||
cur[b] += 1;
|
||||
}
|
||||
(starts, order)
|
||||
});
|
||||
|
||||
(starts, order, keys_part)
|
||||
}
|
||||
|
||||
/// Count the colliding pairs a partition will emit, from the low-bit histogram
|
||||
/// alone (no reordering). `clamp` caps each exact-collision group, matching the
|
||||
/// emit pass so the output offsets line up. `hist` is reusable `LOW_BUCKETS`
|
||||
/// scratch.
|
||||
fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
|
||||
/// of its (partition-contiguous) leading keys `keys_run`. `clamp` caps each
|
||||
/// exact-collision group, matching the emit pass so the output offsets line up.
|
||||
/// `hist` is reusable `LOW_BUCKETS` scratch.
|
||||
fn count_pairs(keys_run: &[u32], hist: &mut [u32], clamp: usize) -> usize {
|
||||
for h in hist.iter_mut() {
|
||||
*h = 0;
|
||||
}
|
||||
for &k in run {
|
||||
hist[(keys[k as usize] & LOW_MASK) as usize] += 1;
|
||||
for &key in keys_run {
|
||||
hist[(key & LOW_MASK) as usize] += 1;
|
||||
}
|
||||
let mut pairs = 0usize;
|
||||
for i in 0..LOW_BUCKETS {
|
||||
@@ -285,31 +390,44 @@ fn count_pairs(keys: &[u32], run: &[u32], hist: &mut [u32], clamp: usize) -> usi
|
||||
pairs
|
||||
}
|
||||
|
||||
/// Within one partition, group `run`'s entries by the low bits of their leading
|
||||
/// block, writing the grouped indices into `sorted`. `hist` is reusable
|
||||
/// `LOW_BUCKETS + 1` scratch. After this call `sorted` lists the run's indices
|
||||
/// with equal low keys contiguous, so callers recover each exact-collision
|
||||
/// group by walking adjacent equal keys.
|
||||
fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>) {
|
||||
let m = run.len();
|
||||
/// Within one partition, group its entries by the low bits of their leading
|
||||
/// block. Inputs are the partition-contiguous slabs `keys_run` (leading keys) and
|
||||
/// `order_run` (matching global indices). Outputs, in low-key-sorted order:
|
||||
/// `sorted` (the global indices, for the emit slot gather + back-refs) and
|
||||
/// `keys_sorted` (the leading keys, so the emit group walk streams a dense local
|
||||
/// array instead of gathering `keys[sorted[i]]`). `hist` is reusable
|
||||
/// `LOW_BUCKETS + 1` scratch. Both reads are sequential over the slabs.
|
||||
fn low_group(
|
||||
keys_run: &[u32],
|
||||
order_run: &[u32],
|
||||
hist: &mut [u32],
|
||||
sorted: &mut Vec<u32>,
|
||||
keys_sorted: &mut Vec<u32>,
|
||||
) {
|
||||
let m = order_run.len();
|
||||
sorted.clear();
|
||||
keys_sorted.clear();
|
||||
if m == 0 {
|
||||
return;
|
||||
}
|
||||
for h in hist.iter_mut() {
|
||||
*h = 0;
|
||||
}
|
||||
for &k in run {
|
||||
hist[(keys[k as usize] & LOW_MASK) as usize + 1] += 1;
|
||||
for &key in keys_run {
|
||||
hist[(key & LOW_MASK) as usize + 1] += 1;
|
||||
}
|
||||
for i in 0..LOW_BUCKETS {
|
||||
hist[i + 1] += hist[i];
|
||||
}
|
||||
sorted.resize(m, 0);
|
||||
keys_sorted.resize(m, 0);
|
||||
// hist[low] now holds the run-start offset; reuse it as the live cursor.
|
||||
for &k in run {
|
||||
let low = (keys[k as usize] & LOW_MASK) as usize;
|
||||
sorted[hist[low] as usize] = k;
|
||||
for i in 0..m {
|
||||
let key = keys_run[i];
|
||||
let low = (key & LOW_MASK) as usize;
|
||||
let pos = hist[low] as usize;
|
||||
sorted[pos] = order_run[i];
|
||||
keys_sorted[pos] = key;
|
||||
hist[low] += 1;
|
||||
}
|
||||
}
|
||||
@@ -320,26 +438,42 @@ fn low_group(keys: &[u32], run: &[u32], hist: &mut [u32], sorted: &mut Vec<u32>)
|
||||
/// Monomorphised over the XOR producer so the AVX2 intrinsics inline cleanly
|
||||
/// inside a `target_feature` wrapper while sharing one source of truth.
|
||||
macro_rules! emit_bucket_body {
|
||||
($keys:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $xor:path) => {{
|
||||
($keys_sorted:expr, $slots:expr, $sorted:expr, $kout:expr, $sout:expr, $pout:expr, $clamp:expr, $w_in:expr, $w_out:expr, $prefetch:expr, $xor:path) => {{
|
||||
let s = $sorted;
|
||||
let ks = $keys_sorted; // leading keys in `s`-order; group walk streams it
|
||||
let m = s.len();
|
||||
let w_in = $w_in;
|
||||
let w_out = $w_out;
|
||||
let mut w = 0usize;
|
||||
let mut i = 0;
|
||||
while i < m {
|
||||
let key = $keys[s[i] as usize] & LOW_MASK;
|
||||
let key = ks[i] & LOW_MASK;
|
||||
let mut j = i + 1;
|
||||
while j < m && ($keys[s[j] as usize] & LOW_MASK) == key {
|
||||
while j < m && (ks[j] & LOW_MASK) == key {
|
||||
j += 1;
|
||||
}
|
||||
let hi = j.min(i.saturating_add($clamp));
|
||||
// The pair loops below read `slots[s[a]]` / `slots[s[b]]` at random
|
||||
// global positions in the ~1 GB slot arena — each a likely cache
|
||||
// miss. Prefetch this group's member slots up front so the misses
|
||||
// overlap with the (L1-resident) pair XORs that follow.
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
if $prefetch {
|
||||
for a in i..hi {
|
||||
core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T0 }>(
|
||||
$slots.as_ptr().add(s[a] as usize * w_in) as *const i8,
|
||||
);
|
||||
}
|
||||
}
|
||||
for a in i..hi {
|
||||
let l = s[a] as usize;
|
||||
for b in (a + 1)..hi {
|
||||
let mr = s[b] as usize;
|
||||
let nk = $xor(
|
||||
$sout.as_mut_ptr().add(w * SLOT),
|
||||
$slots.as_ptr().add(l * SLOT),
|
||||
$slots.as_ptr().add(mr * SLOT),
|
||||
$sout.as_mut_ptr().add(w * w_out),
|
||||
$slots.as_ptr().add(l * w_in),
|
||||
$slots.as_ptr().add(mr * w_in),
|
||||
w_out,
|
||||
);
|
||||
$kout[w] = nk;
|
||||
$pout[w] = ((l as u64) << 32) | mr as u64;
|
||||
@@ -353,69 +487,101 @@ macro_rules! emit_bucket_body {
|
||||
}
|
||||
|
||||
unsafe fn emit_bucket_scalar(
|
||||
keys: &[u32],
|
||||
keys_sorted: &[u32],
|
||||
slots: &[u32],
|
||||
sorted: &[u32],
|
||||
kout: &mut [u32],
|
||||
sout: &mut [u32],
|
||||
pout: &mut [u64],
|
||||
clamp: usize,
|
||||
w_in: usize,
|
||||
w_out: usize,
|
||||
prefetch: bool,
|
||||
) -> usize {
|
||||
emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_scalar)
|
||||
emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_scalar)
|
||||
}
|
||||
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
#[target_feature(enable = "avx2")]
|
||||
unsafe fn emit_bucket_avx2(
|
||||
keys: &[u32],
|
||||
keys_sorted: &[u32],
|
||||
slots: &[u32],
|
||||
sorted: &[u32],
|
||||
kout: &mut [u32],
|
||||
sout: &mut [u32],
|
||||
pout: &mut [u64],
|
||||
clamp: usize,
|
||||
w_in: usize,
|
||||
w_out: usize,
|
||||
prefetch: bool,
|
||||
) -> usize {
|
||||
emit_bucket_body!(keys, slots, sorted, kout, sout, pout, clamp, xor_child_avx2)
|
||||
emit_bucket_body!(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch, xor_child_avx2)
|
||||
}
|
||||
|
||||
/// Emit a partition's children, dispatching to the AVX2 producer when available.
|
||||
/// `w_in`/`w_out` are the input/output slot pitches (`w_out == w_in - 1`).
|
||||
/// `prefetch` software-prefetches each group's randomly-gathered member slots.
|
||||
unsafe fn emit_bucket(
|
||||
keys: &[u32],
|
||||
keys_sorted: &[u32],
|
||||
slots: &[u32],
|
||||
sorted: &[u32],
|
||||
kout: &mut [u32],
|
||||
sout: &mut [u32],
|
||||
pout: &mut [u64],
|
||||
clamp: usize,
|
||||
w_in: usize,
|
||||
w_out: usize,
|
||||
prefetch: bool,
|
||||
) -> usize {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
{
|
||||
if is_x86_feature_detected!("avx2") {
|
||||
return emit_bucket_avx2(keys, slots, sorted, kout, sout, pout, clamp);
|
||||
return emit_bucket_avx2(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch);
|
||||
}
|
||||
}
|
||||
emit_bucket_scalar(keys, slots, sorted, kout, sout, pout, clamp)
|
||||
emit_bucket_scalar(keys_sorted, slots, sorted, kout, sout, pout, clamp, w_in, w_out, prefetch)
|
||||
}
|
||||
|
||||
/// Group `n` entries by their leading block, then emit one child per colliding
|
||||
/// pair: the XOR of the residual blocks (rotated into a fresh 8-word slot) plus
|
||||
/// a packed `(l << 32) | mr` back-reference. Two passes — count, then emit
|
||||
/// directly into one pre-sized arena — so there is no per-partition allocation
|
||||
/// or final concatenation copy. Returns `(keys_out, slots_out, parents)`.
|
||||
fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
|
||||
let (starts, order) = partition_top(keys, n);
|
||||
/// pair: the XOR of the residual blocks (rotated into a fresh slot) plus a packed
|
||||
/// `(l << 32) | mr` back-reference. Two passes — count, then emit directly into
|
||||
/// one pre-sized arena — so there is no per-partition allocation or final
|
||||
/// concatenation copy. Returns `(keys_out, slots_out, parents)`.
|
||||
///
|
||||
/// `w_in` is the input slot pitch; the round consumes one block, so the output is
|
||||
/// packed at `w_out = w_in - 1` words/slot. `slots_out` carries a `SLOT_SLACK`
|
||||
/// trailing pad so the next round's 256-bit over-read stays in bounds.
|
||||
fn collide(
|
||||
keys: &[u32],
|
||||
slots: &[u32],
|
||||
n: usize,
|
||||
clamp: usize,
|
||||
w_in: usize,
|
||||
) -> (Vec<u32>, Vec<u32>, Vec<u64>) {
|
||||
let w_out = w_in - 1;
|
||||
// Sub-phase timing, gated on `EQ_PROFILE`. Prints partition / count / alloc /
|
||||
// emit splits so we can see which part of the round dominates.
|
||||
let prof = std::env::var_os("EQ_PROFILE").is_some();
|
||||
// Software-prefetch the emit gather (on by default; `EQ_NO_PREFETCH` disables
|
||||
// it for A/B benchmarking). Read once, not per group.
|
||||
let do_prefetch = std::env::var_os("EQ_NO_PREFETCH").is_none();
|
||||
let t0 = std::time::Instant::now();
|
||||
|
||||
// Pass 1: per-partition child counts (histogram-derived, no reordering).
|
||||
let (starts, order, keys_part) = partition_top(keys, n);
|
||||
let t_part = std::time::Instant::now();
|
||||
|
||||
// Pass 1: per-partition child counts from each partition's contiguous keys.
|
||||
let counts: Vec<usize> = (0..TOP_BUCKETS)
|
||||
.into_par_iter()
|
||||
.map_init(
|
||||
|| vec![0u32; LOW_BUCKETS],
|
||||
|hist, v| {
|
||||
let run = &order[starts[v] as usize..starts[v + 1] as usize];
|
||||
count_pairs(keys, run, hist, clamp)
|
||||
let keys_run = &keys_part[starts[v] as usize..starts[v + 1] as usize];
|
||||
count_pairs(keys_run, hist, clamp)
|
||||
},
|
||||
)
|
||||
.collect();
|
||||
let t_count = std::time::Instant::now();
|
||||
|
||||
let mut out_starts = vec![0usize; TOP_BUCKETS + 1];
|
||||
for v in 0..TOP_BUCKETS {
|
||||
@@ -423,7 +589,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
||||
}
|
||||
let total = out_starts[TOP_BUCKETS];
|
||||
let mut keys_out = vec![0u32; total];
|
||||
let mut slots_out = vec![0u32; total * SLOT];
|
||||
// Packed at `w_out` words/slot, plus a slack pad for the next round's over-read.
|
||||
let mut slots_out = vec![0u32; total * w_out + SLOT_SLACK];
|
||||
let mut parents = vec![0u64; total];
|
||||
|
||||
// Carve the output arena into disjoint per-partition sub-slices so workers
|
||||
@@ -435,7 +602,7 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
||||
let (mut kr, mut sr, mut pr) = (&mut keys_out[..], &mut slots_out[..], &mut parents[..]);
|
||||
for &c in &counts {
|
||||
let (kh, kt) = kr.split_at_mut(c);
|
||||
let (sh, st) = sr.split_at_mut(c * SLOT);
|
||||
let (sh, st) = sr.split_at_mut(c * w_out);
|
||||
let (ph, pt) = pr.split_at_mut(c);
|
||||
kparts.push(kh);
|
||||
sparts.push(sh);
|
||||
@@ -446,6 +613,8 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
||||
}
|
||||
}
|
||||
|
||||
let t_alloc = std::time::Instant::now();
|
||||
|
||||
// Pass 2: group each partition and emit its colliding pairs in place.
|
||||
kparts
|
||||
.into_par_iter()
|
||||
@@ -453,37 +622,53 @@ fn collide(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> (Vec<u32>, Ve
|
||||
.zip(pparts)
|
||||
.enumerate()
|
||||
.for_each(|(v, ((kout, sout), pout))| {
|
||||
let run = &order[starts[v] as usize..starts[v + 1] as usize];
|
||||
let lo = starts[v] as usize;
|
||||
let hi = starts[v + 1] as usize;
|
||||
let mut hist = vec![0u32; LOW_BUCKETS + 1];
|
||||
let mut sorted = Vec::new();
|
||||
low_group(keys, run, &mut hist, &mut sorted);
|
||||
let w = unsafe { emit_bucket(keys, slots, &sorted, kout, sout, pout, clamp) };
|
||||
let mut keys_sorted = Vec::new();
|
||||
low_group(&keys_part[lo..hi], &order[lo..hi], &mut hist, &mut sorted, &mut keys_sorted);
|
||||
let w = unsafe {
|
||||
emit_bucket(&keys_sorted, slots, &sorted, kout, sout, pout, clamp, w_in, w_out, do_prefetch)
|
||||
};
|
||||
debug_assert_eq!(w, kout.len());
|
||||
});
|
||||
|
||||
if prof {
|
||||
let ms = |a: std::time::Instant, b: std::time::Instant| (b - a).as_secs_f64() * 1000.0;
|
||||
eprintln!(
|
||||
" [collide n={n}] partition {:6.1} count {:6.1} alloc {:6.1} emit(group+xor) {:6.1} ms",
|
||||
ms(t0, t_part),
|
||||
ms(t_part, t_count),
|
||||
ms(t_count, t_alloc),
|
||||
ms(t_alloc, std::time::Instant::now()),
|
||||
);
|
||||
}
|
||||
|
||||
(keys_out, slots_out, parents)
|
||||
}
|
||||
|
||||
/// Final round (slots hold `[w0, w1, …]`): among entries sharing leading block
|
||||
/// `w0`, a pair whose `w1` also matches XORs the last two blocks to zero — a
|
||||
/// candidate. Returns the `(l, mr)` parents of each candidate.
|
||||
fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u32, u32)> {
|
||||
let (starts, order) = partition_top(keys, n);
|
||||
/// Final round (slots hold `[w0, w1, …]` at pitch `w_in`): among entries sharing
|
||||
/// leading block `w0`, a pair whose `w1` also matches XORs the last two blocks to
|
||||
/// zero — a candidate. Returns the `(l, mr)` parents of each candidate.
|
||||
fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize, w_in: usize) -> Vec<(u32, u32)> {
|
||||
let (starts, order, keys_part) = partition_top(keys, n);
|
||||
|
||||
(0..TOP_BUCKETS)
|
||||
.into_par_iter()
|
||||
.map_init(
|
||||
|| (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new()),
|
||||
|(hist, sorted), v| {
|
||||
let run = &order[starts[v] as usize..starts[v + 1] as usize];
|
||||
low_group(keys, run, hist, sorted);
|
||||
|| (vec![0u32; LOW_BUCKETS + 1], Vec::<u32>::new(), Vec::<u32>::new()),
|
||||
|(hist, sorted, keys_sorted), v| {
|
||||
let lo = starts[v] as usize;
|
||||
let hi = starts[v + 1] as usize;
|
||||
low_group(&keys_part[lo..hi], &order[lo..hi], hist, sorted, keys_sorted);
|
||||
let m = sorted.len();
|
||||
let mut local = Vec::new();
|
||||
let mut i = 0;
|
||||
while i < m {
|
||||
let key = keys[sorted[i] as usize] & LOW_MASK;
|
||||
let key = keys_sorted[i] & LOW_MASK;
|
||||
let mut j = i + 1;
|
||||
while j < m && (keys[sorted[j] as usize] & LOW_MASK) == key {
|
||||
while j < m && (keys_sorted[j] & LOW_MASK) == key {
|
||||
j += 1;
|
||||
}
|
||||
let hi = j.min(i.saturating_add(clamp));
|
||||
@@ -491,7 +676,7 @@ fn collide_final(keys: &[u32], slots: &[u32], n: usize, clamp: usize) -> Vec<(u3
|
||||
let l = sorted[a] as usize;
|
||||
for b in (a + 1)..hi {
|
||||
let mr = sorted[b] as usize;
|
||||
if slots[l * SLOT + 1] == slots[mr * SLOT + 1] {
|
||||
if slots[l * w_in + 1] == slots[mr * w_in + 1] {
|
||||
local.push((l as u32, mr as u32));
|
||||
}
|
||||
}
|
||||
@@ -532,6 +717,22 @@ pub fn solve(header: &[u8]) -> Vec<Vec<u32>> {
|
||||
pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
||||
let clamp = clamp.unwrap_or(usize::MAX);
|
||||
|
||||
// Optional per-phase timing, gated on `EQ_PROFILE` (any value). Zero cost
|
||||
// when unset — the only overhead is one env lookup. Each `phase(label, n)`
|
||||
// prints the wall time since the previous call and the live entry count.
|
||||
let prof = std::env::var_os("EQ_PROFILE").is_some();
|
||||
let mut t_last = std::time::Instant::now();
|
||||
let mut phase = |label: &str, n: usize| {
|
||||
if prof {
|
||||
let now = std::time::Instant::now();
|
||||
eprintln!(
|
||||
" [profile] {label:<13} {:8.1} ms (n={n})",
|
||||
(now - t_last).as_secs_f64() * 1000.0
|
||||
);
|
||||
t_last = now;
|
||||
}
|
||||
};
|
||||
|
||||
// ---- round 0: hash every index into NBLK0 big-endian 24-bit blocks, stored
|
||||
// in padded 8-word slots with the leading block mirrored into `keys`. Entry
|
||||
// k corresponds to leaf index k (the gen order), so no leaf table is needed.
|
||||
@@ -541,18 +742,29 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
||||
let hasher = BatchHasher::new(header);
|
||||
let mut keys: Vec<u32> = vec![0u32; n0];
|
||||
let mut slots: Vec<u32> = vec![0u32; n0 * SLOT];
|
||||
let kgroup = 4 * INDICES_PER_HASH_OUTPUT; // eight entries
|
||||
// Hash eight `g` values per chunk via `hash8`, which uses one AVX-512
|
||||
// compression where available and falls back to two AVX2 `hash4`s (or scalar)
|
||||
// otherwise — so this single path covers every CPU.
|
||||
let kgroup = 8 * INDICES_PER_HASH_OUTPUT; // sixteen entries
|
||||
let sgroup = kgroup * SLOT;
|
||||
debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 4-g groups");
|
||||
debug_assert_eq!(n0 % kgroup, 0, "round-0 buffer must split into whole 8-g groups");
|
||||
// `EQ_NO_AVX512` forces the AVX2 fallback (two hash4) for A/B benchmarking.
|
||||
let use_avx512 = std::env::var_os("EQ_NO_AVX512").is_none();
|
||||
slots
|
||||
.par_chunks_mut(sgroup)
|
||||
.zip(keys.par_chunks_mut(kgroup))
|
||||
.enumerate()
|
||||
.for_each(|(c, (schunk, kchunk))| {
|
||||
let g0 = (c * 4) as u32;
|
||||
let mut hs = [[0u8; HASH_OUTPUT]; 4];
|
||||
hasher.hash4(g0, &mut hs);
|
||||
for j in 0..4 {
|
||||
let g0 = (c * 8) as u32;
|
||||
let mut hs = [[0u8; HASH_OUTPUT]; 8];
|
||||
if use_avx512 {
|
||||
hasher.hash8(g0, &mut hs);
|
||||
} else {
|
||||
let (lo, hi) = hs.split_at_mut(4);
|
||||
hasher.hash4(g0, (&mut lo[..4]).try_into().unwrap());
|
||||
hasher.hash4(g0 + 4, (&mut hi[..4]).try_into().unwrap());
|
||||
}
|
||||
for j in 0..8 {
|
||||
for i in 0..INDICES_PER_HASH_OUTPUT {
|
||||
let e = j * INDICES_PER_HASH_OUTPUT + i;
|
||||
let src = &hs[j][i * HASH_BYTES..i * HASH_BYTES + HASH_BYTES];
|
||||
@@ -567,20 +779,27 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
||||
// parents[t] (t = 0..K-2) maps a round-(t+1) entry to its two round-t parents.
|
||||
// `keys`/`slots` ping-pong between rounds (the previous buffers are freed as
|
||||
// the new ones replace them).
|
||||
phase("round0-hash", n0);
|
||||
let mut parents: Vec<Vec<u64>> = Vec::with_capacity(K - 1);
|
||||
let mut n = n0;
|
||||
for _ in 0..(K - 1) {
|
||||
let (ok, os, op) = collide(&keys, &slots, n, clamp);
|
||||
// Round-0 slots carry all SLOT words; each collision round consumes the
|
||||
// leading block, so the next round's slots are one word narrower.
|
||||
let mut width = SLOT;
|
||||
for r in 0..(K - 1) {
|
||||
let (ok, os, op) = collide(&keys, &slots, n, clamp, width);
|
||||
n = op.len();
|
||||
parents.push(op);
|
||||
keys = ok;
|
||||
slots = os;
|
||||
width -= 1;
|
||||
phase(&format!("collide r{}", r + 1), n);
|
||||
if n == 0 {
|
||||
return Vec::new();
|
||||
}
|
||||
}
|
||||
|
||||
let candidates = collide_final(&keys, &slots, n, clamp);
|
||||
let candidates = collide_final(&keys, &slots, n, clamp, width);
|
||||
phase("collide-final", candidates.len());
|
||||
if candidates.is_empty() {
|
||||
return Vec::new();
|
||||
}
|
||||
@@ -605,8 +824,11 @@ pub fn solve_with(header: &[u8], clamp: Option<usize>) -> Vec<Vec<u32>> {
|
||||
// refs are now round-0 indices == leaf indices.
|
||||
recovered.extend_from_slice(&refs);
|
||||
}
|
||||
phase("recover", recovered.len() / SOLUTION_INDICES);
|
||||
|
||||
filter_candidates(&base_state(header), &recovered)
|
||||
let result = filter_candidates(&base_state(header), &recovered);
|
||||
phase("filter+verify", result.len());
|
||||
result
|
||||
}
|
||||
|
||||
/// Re-order recovered leaf indices into the canonical solution ordering: at
|
||||
@@ -741,26 +963,34 @@ mod tests {
|
||||
assert_eq!(a[0], (src[0] as u32) << 16 | (src[1] as u32) << 8 | src[2] as u32);
|
||||
}
|
||||
|
||||
// AVX2 XOR-child (xor + rotate-left-one-lane) must match the scalar version.
|
||||
// The XOR-child producers (scalar + AVX2) must agree for every output width,
|
||||
// write exactly `w_out` words (rotate-left-one of the lane-wise XOR), and
|
||||
// leave the rest of the packed buffer untouched (the AVX2 path masked-stores).
|
||||
#[test]
|
||||
fn xor_child_matches_scalar() {
|
||||
const SENT: u32 = 0xDEAD_BEEF;
|
||||
let pa: [u32; SLOT] = [9, 8, 7, 6, 5, 4, 3, 2];
|
||||
let pb: [u32; SLOT] = [1, 2, 3, 4, 5, 6, 7, 8];
|
||||
let mut o1 = [0u32; SLOT];
|
||||
let mut o2 = [0u32; SLOT];
|
||||
let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
|
||||
// Reference: rotate-left-one of the lane-wise XOR; new key = lane 1.
|
||||
let x: Vec<u32> = (0..SLOT).map(|i| pa[i] ^ pb[i]).collect();
|
||||
let expect: Vec<u32> = (0..SLOT).map(|i| x[(i + 1) % SLOT]).collect();
|
||||
assert_eq!(&o1[..], &expect[..]);
|
||||
for w_out in 1..SLOT {
|
||||
let mut o1 = [SENT; SLOT];
|
||||
let k1 = unsafe { xor_child_scalar(o1.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
|
||||
for t in 0..w_out {
|
||||
assert_eq!(o1[t], x[(t + 1) % SLOT], "scalar word {t} (w_out={w_out})");
|
||||
}
|
||||
for t in w_out..SLOT {
|
||||
assert_eq!(o1[t], SENT, "scalar wrote past w_out at {t} (w_out={w_out})");
|
||||
}
|
||||
assert_eq!(k1, x[1]);
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
if is_x86_feature_detected!("avx2") {
|
||||
let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr()) };
|
||||
assert_eq!(o1, o2, "avx2 xor_child != scalar");
|
||||
let mut o2 = [SENT; SLOT];
|
||||
let k2 = unsafe { xor_child_avx2(o2.as_mut_ptr(), pa.as_ptr(), pb.as_ptr(), w_out) };
|
||||
assert_eq!(o1, o2, "avx2 xor_child != scalar (w_out={w_out})");
|
||||
assert_eq!(k1, k2);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Clamped solve over a fixed header must find valid solutions (Equihash
|
||||
// 192,7 yields ~2 per nonce). The clamp bounds the naive algorithm's
|
||||
@@ -780,4 +1010,97 @@ mod tests {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// A deterministic dense 140-byte header for benchmarking, filled by an
|
||||
/// xorshift64* stream keyed on `seed` (the same scheme as `main::pseudo_header`)
|
||||
/// so each nonce exercises a realistic, fully-populated header rather than the
|
||||
/// degenerate all-`0x42` one.
|
||||
#[cfg(test)]
|
||||
fn bench_header(seed: u64) -> Vec<u8> {
|
||||
let mut header = vec![0u8; HEADER_LEN];
|
||||
let mut x = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
|
||||
for b in header.iter_mut() {
|
||||
x ^= x >> 12;
|
||||
x ^= x << 25;
|
||||
x ^= x >> 27;
|
||||
*b = (x.wrapping_mul(0x2545F4914F6CDD1D) >> 33) as u8;
|
||||
}
|
||||
header
|
||||
}
|
||||
|
||||
/// Steady-state throughput baseline for the CPU solver. Ignored by default;
|
||||
/// run it in release (the only meaningful configuration — `debug_assert`s in
|
||||
/// the collision hot path are off and `opt-level=3`/LTO is on) with output
|
||||
/// visible:
|
||||
///
|
||||
/// ```text
|
||||
/// cargo test --release full_solve_baseline -- --ignored --nocapture
|
||||
/// ```
|
||||
///
|
||||
/// Tunables via env:
|
||||
/// * `EQ_BENCH_ITERS` — timed solves per clamp (default 8).
|
||||
/// * `EQ_BENCH_CLAMPS` — comma-separated clamp sweep (default `32`, the
|
||||
/// value the miner and `--selftest` use).
|
||||
///
|
||||
/// Each timed iteration uses a distinct nonce so the figures average over
|
||||
/// per-header variance; one warm-up solve per clamp (pages in the ~GB of
|
||||
/// round buffers, warms caches) is excluded. This measures one solve at a
|
||||
/// time across the whole rayon pool — i.e. solver latency and its memory
|
||||
/// behaviour, which is exactly what the planned radix-scatter change targets
|
||||
/// — not the aggregate multi-group mining rate `--bench` reports.
|
||||
#[test]
|
||||
#[ignore]
|
||||
fn full_solve_baseline() {
|
||||
use std::time::Instant;
|
||||
|
||||
let iters: usize = std::env::var("EQ_BENCH_ITERS")
|
||||
.ok()
|
||||
.and_then(|v| v.parse().ok())
|
||||
.unwrap_or(8);
|
||||
let clamps: Vec<usize> = std::env::var("EQ_BENCH_CLAMPS")
|
||||
.ok()
|
||||
.map(|v| v.split(',').filter_map(|s| s.trim().parse().ok()).collect())
|
||||
.filter(|v: &Vec<usize>| !v.is_empty())
|
||||
.unwrap_or_else(|| vec![32]);
|
||||
|
||||
eprintln!(
|
||||
"CPU solver baseline — {iters} timed solve(s)/clamp, clamps={clamps:?}, \
|
||||
rayon threads={}",
|
||||
rayon::current_num_threads()
|
||||
);
|
||||
|
||||
for &clamp in &clamps {
|
||||
// Warm-up (excluded): fault in buffers, warm caches.
|
||||
let _ = solve_with(&bench_header(0), Some(clamp));
|
||||
|
||||
let mut times = Vec::with_capacity(iters);
|
||||
let mut total_sols = 0usize;
|
||||
for i in 0..iters {
|
||||
let header = bench_header(i as u64 + 1);
|
||||
let base = base_state(&header);
|
||||
let t = Instant::now();
|
||||
let sols = solve_with(&header, Some(clamp));
|
||||
times.push(t.elapsed().as_secs_f64());
|
||||
for s in &sols {
|
||||
assert!(is_valid_solution(&base, s), "solver returned an invalid solution");
|
||||
}
|
||||
total_sols += sols.len();
|
||||
}
|
||||
|
||||
times.sort_by(|a, b| a.partial_cmp(b).unwrap());
|
||||
let n = times.len();
|
||||
let sum: f64 = times.iter().sum();
|
||||
let mean = 1000.0 * sum / n as f64;
|
||||
let median = 1000.0 * times[n / 2];
|
||||
let min = 1000.0 * times[0];
|
||||
let max = 1000.0 * times[n - 1];
|
||||
let solves_per_s = n as f64 / sum;
|
||||
eprintln!(
|
||||
"clamp={clamp:>3}: {solves_per_s:6.2} solve/s | mean {mean:6.0} ms \
|
||||
median {median:6.0} ms min {min:6.0} ms max {max:6.0} ms | \
|
||||
{total_sols} sol over {n} solve(s) ({:.2}/solve)",
|
||||
total_sols as f64 / n as f64,
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
+148
-50
@@ -411,6 +411,9 @@ impl LegacySolver {
|
||||
/// (`equihash.cl`) everywhere else. Forceable with `ZCL_OPENCL_KERNEL=amd|legacy`.
|
||||
pub struct GpuSolver {
|
||||
inner: SolverInner,
|
||||
/// Per-card telemetry handle (AMD amdgpu sysfs / NVML), matched to the device
|
||||
/// by PCI bus id. `None` when no telemetry backend matches.
|
||||
tuner: Option<Box<dyn crate::gpu_tune::GpuTuner>>,
|
||||
}
|
||||
|
||||
enum SolverInner {
|
||||
@@ -423,13 +426,16 @@ impl GpuSolver {
|
||||
/// device vendor (AMD → `equihash192_7.cl`).
|
||||
pub fn new(device_index: usize) -> Result<Self> {
|
||||
let (platform, device) = pick_device(device_index)?;
|
||||
// Resolve a telemetry handle (AMD sysfs / NVML) from the device's PCI bus
|
||||
// before `device` is consumed by the inner solver.
|
||||
let tuner = device_pci_bus_id(&device).and_then(|bus| crate::gpu_tune::open(&bus));
|
||||
let inner = if use_amd_kernel(&device) {
|
||||
log::info!("OpenCL: AMD device — using the equihash192_7 kernel");
|
||||
SolverInner::Amd(crate::gpu_amd::AmdSolver::new(platform, device)?)
|
||||
} else {
|
||||
SolverInner::Legacy(LegacySolver::new(platform, device)?)
|
||||
};
|
||||
Ok(Self { inner })
|
||||
Ok(Self { inner, tuner })
|
||||
}
|
||||
|
||||
/// This device's product name, if available.
|
||||
@@ -440,6 +446,34 @@ impl GpuSolver {
|
||||
}
|
||||
}
|
||||
|
||||
/// Current board power draw in watts (telemetry handle), if available.
|
||||
pub fn power_watts(&self) -> Option<f64> {
|
||||
self.tuner.as_ref().and_then(|t| t.watts())
|
||||
}
|
||||
|
||||
/// Current GPU temperature in °C, if available.
|
||||
pub fn temperature_c(&self) -> Option<u32> {
|
||||
self.tuner.as_ref().and_then(|t| t.temperature_c())
|
||||
}
|
||||
|
||||
/// Currently enforced power limit in watts, if available.
|
||||
pub fn current_power_limit_w(&self) -> Option<u32> {
|
||||
self.tuner.as_ref().and_then(|t| t.current_power_limit_w())
|
||||
}
|
||||
|
||||
/// (min, max) settable power limit in watts, if available.
|
||||
pub fn power_limit_range_w(&self) -> Option<(u32, u32)> {
|
||||
self.tuner.as_ref().and_then(|t| t.power_limit_range_w())
|
||||
}
|
||||
|
||||
/// Current (core, memory) clocks in MHz, each `None` if unavailable.
|
||||
pub fn current_clocks_mhz(&self) -> (Option<u32>, Option<u32>) {
|
||||
match &self.tuner {
|
||||
Some(t) => (t.core_clock_mhz(), t.mem_clock_mhz()),
|
||||
None => (None, None),
|
||||
}
|
||||
}
|
||||
|
||||
/// Solve the puzzle for `header` (140 bytes).
|
||||
pub fn solve(&self, header: &[u8]) -> Result<Vec<Vec<u32>>> {
|
||||
match &self.inner {
|
||||
@@ -475,11 +509,32 @@ impl GpuSolver {
|
||||
pub fn hash_all(&self, header: &[u8]) -> Result<Vec<u8>> {
|
||||
match &self.inner {
|
||||
SolverInner::Legacy(s) => s.hash_all(header),
|
||||
SolverInner::Amd(_) => {
|
||||
Err(anyhow!("hash_all is not supported by the AMD kernel"))
|
||||
SolverInner::Amd(_) => Err(anyhow!("hash_all is not supported by the AMD kernel")),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// The device's PCI address as `"DDDD:BB:DD.F"` (lowercase), for matching the
|
||||
/// physical card to a telemetry backend. Prefers `cl_khr_pci_bus_info`; falls
|
||||
/// back to `cl_device_topology_amd`. `None` if the device exposes neither.
|
||||
fn device_pci_bus_id(device: &ocl::Device) -> Option<String> {
|
||||
const CL_DEVICE_PCI_BUS_INFO_KHR: u32 = 0x10F2;
|
||||
const CL_DEVICE_TOPOLOGY_AMD: u32 = 0x4037;
|
||||
// cl_device_pci_bus_info_khr = { u32 pci_domain, pci_bus, pci_device, pci_function }.
|
||||
if let Ok(b) = device.info_raw(CL_DEVICE_PCI_BUS_INFO_KHR) {
|
||||
if b.len() >= 16 {
|
||||
let rd = |i: usize| u32::from_ne_bytes([b[i], b[i + 1], b[i + 2], b[i + 3]]);
|
||||
return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", rd(0), rd(4), rd(8), rd(12)));
|
||||
}
|
||||
}
|
||||
// cl_device_topology_amd (PCIE branch): 24-byte struct, bus/device/function
|
||||
// are the last three bytes; domain isn't exposed (assume 0000).
|
||||
if let Ok(b) = device.info_raw(CL_DEVICE_TOPOLOGY_AMD) {
|
||||
if b.len() >= 24 {
|
||||
return Some(format!("0000:{:02x}:{:02x}.{:x}", b[21], b[22], b[23]));
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
/// Decide whether to drive `device` with the AMD `equihash192_7.cl` kernel.
|
||||
@@ -501,19 +556,81 @@ fn use_amd_kernel(device: &ocl::Device) -> bool {
|
||||
}
|
||||
}
|
||||
|
||||
/// A cross-platform key identifying the physical GPU, as a canonical PCI address
|
||||
/// `"DDDD:BB:DD.F"`. Vendor runtimes (ROCm/NVIDIA) expose `cl_khr_pci_bus_info`;
|
||||
/// rusticl/Mesa doesn't, but its `cl_khr_device_uuid` *encodes* the PCI address
|
||||
/// ({u32 domain LE, u8 bus, u8 device, u8 function, ...}), so the same physical
|
||||
/// card yields the same key on both platforms. `None` if neither is available
|
||||
/// (then the device is never deduped — safe).
|
||||
fn device_dedup_key(device: &ocl::Device) -> Option<String> {
|
||||
if let Some(pci) = device_pci_bus_id(device) {
|
||||
return Some(pci);
|
||||
}
|
||||
const CL_DEVICE_UUID_KHR: u32 = 0x106A;
|
||||
if let Ok(b) = device.info_raw(CL_DEVICE_UUID_KHR) {
|
||||
if b.len() >= 7 {
|
||||
let domain = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
|
||||
return Some(format!("{:04x}:{:02x}:{:02x}.{:x}", domain, b[4], b[5], b[6]));
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
/// Lower = preferred when the same physical GPU is exposed by multiple OpenCL
|
||||
/// platforms. De-prioritise the Mesa Gallium drivers (rusticl/clover) relative to
|
||||
/// the vendor runtimes (ROCm / NVIDIA / Intel), which are faster and complete.
|
||||
fn platform_rank(p: &ocl::Platform) -> u8 {
|
||||
let name = p.name().unwrap_or_default().to_ascii_lowercase();
|
||||
if name.contains("rusticl") || name.contains("clover") || name.contains("mesa") {
|
||||
1
|
||||
} else {
|
||||
0
|
||||
}
|
||||
}
|
||||
|
||||
/// All usable OpenCL `(platform, device)` pairs in a stable flat order, with each
|
||||
/// physical GPU de-duplicated across platforms by PCI bus id — a card exposed by
|
||||
/// both ROCm and rusticl appears once (the vendor runtime wins over Mesa), so
|
||||
/// mining "all" doesn't run the same card twice. This is the single source of
|
||||
/// truth for the flat device index used by `--devices`, `--list-devices`, and
|
||||
/// [`pick_device`]. Devices without a PCI bus id (CPU / PoCL) are never deduped.
|
||||
fn enumerate_devices() -> Vec<(ocl::Platform, ocl::Device)> {
|
||||
use ocl::{Device, Platform};
|
||||
let mut out: Vec<(Platform, Device)> = Vec::new();
|
||||
let mut by_pci: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
|
||||
for platform in Platform::list() {
|
||||
for device in Device::list_all(platform).unwrap_or_default() {
|
||||
match device_dedup_key(&device) {
|
||||
Some(pci) => match by_pci.get(&pci).copied() {
|
||||
// Same physical GPU already listed: keep the preferred platform.
|
||||
Some(existing) => {
|
||||
if platform_rank(&platform) < platform_rank(&out[existing].0) {
|
||||
out[existing] = (platform, device);
|
||||
}
|
||||
}
|
||||
None => {
|
||||
by_pci.insert(pci, out.len());
|
||||
out.push((platform, device));
|
||||
}
|
||||
},
|
||||
None => out.push((platform, device)), // no PCI id → can't dedup
|
||||
}
|
||||
}
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
/// List `(platform, device)` names so the user can choose `--device`.
|
||||
pub fn list_devices() -> Result<Vec<String>> {
|
||||
use ocl::{Device, Platform};
|
||||
let mut names = Vec::new();
|
||||
let mut idx = 0;
|
||||
for platform in Platform::list() {
|
||||
let names = enumerate_devices()
|
||||
.into_iter()
|
||||
.enumerate()
|
||||
.map(|(idx, (platform, device))| {
|
||||
let pname = platform.name().unwrap_or_else(|_| "?".into());
|
||||
for device in Device::list_all(platform).unwrap_or_default() {
|
||||
let dname = device.name().unwrap_or_else(|_| "?".into());
|
||||
names.push(format!("[{idx}] {pname} / {dname}"));
|
||||
idx += 1;
|
||||
}
|
||||
}
|
||||
format!("[{idx}] {pname} / {dname}")
|
||||
})
|
||||
.collect();
|
||||
Ok(names)
|
||||
}
|
||||
|
||||
@@ -522,18 +639,15 @@ pub fn list_devices() -> Result<Vec<String>> {
|
||||
/// hand NVIDIA cards to CUDA (and mine only the non-NVIDIA OpenCL devices).
|
||||
pub fn device_is_nvidia() -> Vec<bool> {
|
||||
use ocl::enums::{DeviceInfo, DeviceInfoResult};
|
||||
use ocl::{Device, Platform};
|
||||
let mut out = Vec::new();
|
||||
for platform in Platform::list() {
|
||||
for device in Device::list_all(platform).unwrap_or_default() {
|
||||
let is_nv = matches!(
|
||||
enumerate_devices()
|
||||
.into_iter()
|
||||
.map(|(_, device)| {
|
||||
matches!(
|
||||
device.info(DeviceInfo::Vendor),
|
||||
Ok(DeviceInfoResult::Vendor(v)) if v.to_ascii_lowercase().contains("nvidia")
|
||||
);
|
||||
out.push(is_nv);
|
||||
}
|
||||
}
|
||||
out
|
||||
)
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
/// The flat OpenCL device index of the first CPU-type device (e.g. PoCL), if any.
|
||||
@@ -541,36 +655,20 @@ pub fn device_is_nvidia() -> Vec<bool> {
|
||||
/// [`list_devices`] / `--devices`.
|
||||
pub fn cpu_device_index() -> Option<usize> {
|
||||
use ocl::enums::{DeviceInfo, DeviceInfoResult};
|
||||
use ocl::{Device, Platform};
|
||||
let mut idx = 0;
|
||||
for platform in Platform::list() {
|
||||
for device in Device::list_all(platform).unwrap_or_default() {
|
||||
let is_cpu = matches!(
|
||||
enumerate_devices().into_iter().position(|(_, device)| {
|
||||
matches!(
|
||||
device.info(DeviceInfo::Type).ok(),
|
||||
Some(DeviceInfoResult::Type(t)) if t.contains(ocl::flags::DeviceType::CPU)
|
||||
);
|
||||
if is_cpu {
|
||||
return Some(idx);
|
||||
}
|
||||
idx += 1;
|
||||
}
|
||||
}
|
||||
None
|
||||
)
|
||||
})
|
||||
}
|
||||
|
||||
/// Resolve a flat device index across all platforms, returning the device along
|
||||
/// with the platform it belongs to (needed to build the context against the
|
||||
/// right platform).
|
||||
fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
|
||||
use ocl::{Device, Platform};
|
||||
let mut idx = 0;
|
||||
for platform in Platform::list() {
|
||||
for device in Device::list_all(platform).unwrap_or_default() {
|
||||
if idx == index {
|
||||
return Ok((platform, device));
|
||||
}
|
||||
idx += 1;
|
||||
}
|
||||
}
|
||||
Err(anyhow!("no OpenCL device with index {index}"))
|
||||
/// Resolve a flat device index (into the de-duplicated [`enumerate_devices`]
|
||||
/// list), returning the device along with the platform it belongs to (needed to
|
||||
/// build the context against the right platform).
|
||||
pub(crate) fn pick_device(index: usize) -> Result<(ocl::Platform, ocl::Device)> {
|
||||
enumerate_devices()
|
||||
.into_iter()
|
||||
.nth(index)
|
||||
.ok_or_else(|| anyhow!("no OpenCL device with index {index}"))
|
||||
}
|
||||
|
||||
+12
-6
@@ -128,19 +128,25 @@ pub trait GpuTuner: Send {
|
||||
/// Open a control handle for the GPU at `pci_bus_id` (matches the physical card
|
||||
/// regardless of CUDA-vs-driver index ordering). `None` if unavailable.
|
||||
///
|
||||
/// NVML is the backend on both Linux (`libnvidia-ml`) and Windows (`nvml.dll`);
|
||||
/// the C API is identical, so the same [`crate::nvml`] code serves both.
|
||||
/// Tries the NVIDIA backend first (NVML, `libnvidia-ml`/`nvml.dll`), then the AMD
|
||||
/// backend ([`crate::amd_smi`], Linux amdgpu sysfs). A non-matching bus id makes
|
||||
/// each backend return `None`, so probing both is safe on mixed-vendor hosts.
|
||||
pub fn open(pci_bus_id: &str) -> Option<Box<dyn GpuTuner>> {
|
||||
#[cfg(any(unix, windows))]
|
||||
#[cfg(feature = "cuda")]
|
||||
{
|
||||
crate::nvml::open(pci_bus_id)
|
||||
if let Some(t) = crate::nvml::open(pci_bus_id) {
|
||||
return Some(t);
|
||||
}
|
||||
#[cfg(not(any(unix, windows)))]
|
||||
}
|
||||
#[cfg(feature = "gpu")]
|
||||
{
|
||||
if let Some(t) = crate::amd_smi::open(pci_bus_id) {
|
||||
return Some(t);
|
||||
}
|
||||
}
|
||||
let _ = pci_bus_id;
|
||||
None
|
||||
}
|
||||
}
|
||||
|
||||
static WARNED_PRIVS: AtomicBool = AtomicBool::new(false);
|
||||
|
||||
|
||||
+64
-12
@@ -18,6 +18,10 @@ mod gpu;
|
||||
#[cfg(feature = "gpu")]
|
||||
mod gpu_amd;
|
||||
|
||||
// AMD GPU telemetry via Linux amdgpu sysfs (a `gpu_tune::GpuTuner` backend).
|
||||
#[cfg(feature = "gpu")]
|
||||
mod amd_smi;
|
||||
|
||||
// Runtime dynamic-library loader (dlopen) for the CUDA driver + NVML.
|
||||
#[cfg(feature = "cuda")]
|
||||
mod dylib;
|
||||
@@ -28,7 +32,9 @@ mod cuda;
|
||||
#[cfg(feature = "cuda")]
|
||||
mod nvml;
|
||||
|
||||
#[cfg(feature = "cuda")]
|
||||
// Platform-agnostic GPU tuning/telemetry surface. The trait + policy compile for
|
||||
// either GPU backend; NVML (cuda) and amd_smi (gpu) are the implementations.
|
||||
#[cfg(any(feature = "cuda", feature = "gpu"))]
|
||||
mod gpu_tune;
|
||||
|
||||
use std::io::IsTerminal;
|
||||
@@ -125,11 +131,12 @@ struct Args {
|
||||
#[arg(long, value_name = "SPEC")]
|
||||
cpu_cores: Option<String>,
|
||||
|
||||
/// Cores per CPU mining row (default 4). Each row runs one shared solve
|
||||
/// across its cores; larger groups cut memory sharply: total RAM is ~4 GB ×
|
||||
/// (enabled cores / this size). Use 1 for one row (and one solve) per core.
|
||||
/// Rows are aligned to core-index blocks of this size, so a row never
|
||||
/// straddles a boundary. Cycle it live in the dashboard with 'g'.
|
||||
/// Cores per CPU mining row. Each row runs one shared solve across its
|
||||
/// cores; larger groups cut memory sharply: total RAM is ~4 GB × (enabled
|
||||
/// cores / this size). Rows align to core-index blocks of this size. Capped
|
||||
/// by core count so the row count stays manageable — ≤4 cores toggle
|
||||
/// individually (1), 5-8 cores in groups of ≤2, more than 8 in groups of ≤4
|
||||
/// — and the default is that cap. Cycle it live (within the cap) with 'g'.
|
||||
#[arg(long, value_name = "N", default_value_t = 4)]
|
||||
cpu_group_size: usize,
|
||||
|
||||
@@ -203,6 +210,13 @@ struct Args {
|
||||
#[arg(long)]
|
||||
auto_tune: bool,
|
||||
|
||||
/// Sustained-Sol/s governor: hold each GPU at/below this edge temperature (°C)
|
||||
/// by pacing the solve cadence (no hardware writes, no root). Trades a little
|
||||
/// throughput for lower temp/power; off by default (runs flat-out). Needs a
|
||||
/// backend that reports temperature (AMD amdgpu / NVIDIA).
|
||||
#[arg(long, value_name = "CELSIUS")]
|
||||
target_temp: Option<u32>,
|
||||
|
||||
/// Efficiency: cap each GPU's power limit in watts (default: card max).
|
||||
/// Lower power trades a little hashrate for much better Sol/W.
|
||||
#[arg(long, value_name = "WATTS")]
|
||||
@@ -619,6 +633,8 @@ fn main() -> Result<()> {
|
||||
args.power_limit.unwrap_or(0),
|
||||
args.unlock_controls,
|
||||
);
|
||||
// Software temp governor target (paces solve cadence; no hardware writes).
|
||||
miner::set_target_temp(args.target_temp);
|
||||
miner::run(client, specs, running, job_timeout, tui, format!("{host}:{port}"), controls, cpu_mining, cpu_clamp, args.control_port)
|
||||
}
|
||||
|
||||
@@ -1076,10 +1092,21 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
|
||||
use std::time::Instant;
|
||||
info!("benchmarking {runs} solve(s) per worker across {} worker(s)", specs.len());
|
||||
|
||||
/// Per-worker benchmark result, including a steady-state telemetry snapshot
|
||||
/// (sampled right after the timed loop, while the card is warm).
|
||||
struct WorkerResult {
|
||||
sols: usize,
|
||||
dt: f64,
|
||||
watts: Option<f64>,
|
||||
temp_c: Option<u32>,
|
||||
core_mhz: Option<u32>,
|
||||
mem_mhz: Option<u32>,
|
||||
}
|
||||
|
||||
let start = Instant::now();
|
||||
let mut handles = Vec::new();
|
||||
for (id, spec) in specs.into_iter().enumerate() {
|
||||
handles.push(std::thread::spawn(move || -> Result<(usize, f64)> {
|
||||
handles.push(std::thread::spawn(move || -> Result<WorkerResult> {
|
||||
let backend = spec.build()?;
|
||||
backend.solve(&pseudo_header(id as u64))?; // warm up (excluded)
|
||||
let t = Instant::now();
|
||||
@@ -1089,7 +1116,17 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
|
||||
let seed = ((id as u64) << 40) | (i as u64 + 1);
|
||||
sols += backend.solve(&pseudo_header(seed))?.len();
|
||||
}
|
||||
Ok((sols, t.elapsed().as_secs_f64()))
|
||||
let dt = t.elapsed().as_secs_f64();
|
||||
// Snapshot telemetry while the card is still under load.
|
||||
let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
|
||||
Ok(WorkerResult {
|
||||
sols,
|
||||
dt,
|
||||
watts: backend.power_watts(),
|
||||
temp_c: backend.temperature_c(),
|
||||
core_mhz,
|
||||
mem_mhz,
|
||||
})
|
||||
}));
|
||||
}
|
||||
|
||||
@@ -1098,11 +1135,26 @@ fn benchmark(specs: Vec<BackendSpec>, runs: usize) -> Result<()> {
|
||||
let mut workers = 0usize;
|
||||
for h in handles {
|
||||
match h.join().unwrap() {
|
||||
Ok((sols, dt)) => {
|
||||
let sol_s = sols as f64 / dt;
|
||||
Ok(r) => {
|
||||
let sol_s = r.sols as f64 / r.dt;
|
||||
// Optional telemetry tail: " | 142 W, 41.7 Sol/W, 68°C, 2700/2500 MHz".
|
||||
let mut tail = String::new();
|
||||
if let Some(w) = r.watts {
|
||||
tail.push_str(&format!(" | {w:.0} W"));
|
||||
if w > 0.0 {
|
||||
tail.push_str(&format!(", {:.2} Sol/W", sol_s / w));
|
||||
}
|
||||
}
|
||||
if let Some(t) = r.temp_c {
|
||||
tail.push_str(&format!(", {t}°C"));
|
||||
}
|
||||
if let (Some(c), m) = (r.core_mhz, r.mem_mhz) {
|
||||
tail.push_str(&format!(", {c}/{} MHz", m.map(|m| m.to_string()).unwrap_or_else(|| "?".into())));
|
||||
}
|
||||
info!(
|
||||
" worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {sols} solutions",
|
||||
1000.0 * dt / runs as f64
|
||||
" worker {workers}: {sol_s:.2} Sol/s ({:.0} ms/solve), {} solutions{tail}",
|
||||
1000.0 * r.dt / runs as f64,
|
||||
r.sols
|
||||
);
|
||||
agg_sols += sol_s;
|
||||
workers += 1;
|
||||
|
||||
+105
@@ -14,6 +14,22 @@ use crate::equihash;
|
||||
use crate::params::{HEADER_LEN, SOLUTION_BYTES};
|
||||
use crate::stratum::{StratumClient, Work};
|
||||
|
||||
/// Process-wide target edge temperature (°C) for the software solve-cadence
|
||||
/// governor; `None` ⇒ run flat-out. Set once at startup from `--target-temp`.
|
||||
static TARGET_TEMP_C: OnceLock<Option<u32>> = OnceLock::new();
|
||||
|
||||
/// Install the governor's target temperature (call once, before workers start).
|
||||
pub fn set_target_temp(c: Option<u32>) {
|
||||
if let Some(t) = c {
|
||||
info!("temperature governor enabled: holding GPUs ≤{t}°C (paced cadence)");
|
||||
}
|
||||
let _ = TARGET_TEMP_C.set(c);
|
||||
}
|
||||
|
||||
fn target_temp_c() -> Option<u32> {
|
||||
TARGET_TEMP_C.get().copied().flatten()
|
||||
}
|
||||
|
||||
/// Double SHA-256, as used for the Zcash/ZClassic block PoW hash.
|
||||
fn sha256d(data: &[u8]) -> [u8; 32] {
|
||||
let first = Sha256::digest(data);
|
||||
@@ -136,6 +152,8 @@ impl Backend {
|
||||
match self {
|
||||
#[cfg(feature = "cuda")]
|
||||
Backend::Cuda(solver) => solver.power_watts(),
|
||||
#[cfg(feature = "gpu")]
|
||||
Backend::Gpu(solver) => solver.power_watts(),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
@@ -145,6 +163,8 @@ impl Backend {
|
||||
match self {
|
||||
#[cfg(feature = "cuda")]
|
||||
Backend::Cuda(solver) => solver.temperature_c(),
|
||||
#[cfg(feature = "gpu")]
|
||||
Backend::Gpu(solver) => solver.temperature_c(),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
@@ -154,6 +174,8 @@ impl Backend {
|
||||
match self {
|
||||
#[cfg(feature = "cuda")]
|
||||
Backend::Cuda(solver) => solver.current_power_limit_w(),
|
||||
#[cfg(feature = "gpu")]
|
||||
Backend::Gpu(solver) => solver.current_power_limit_w(),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
@@ -163,6 +185,8 @@ impl Backend {
|
||||
match self {
|
||||
#[cfg(feature = "cuda")]
|
||||
Backend::Cuda(solver) => solver.power_limit_range_w(),
|
||||
#[cfg(feature = "gpu")]
|
||||
Backend::Gpu(solver) => solver.power_limit_range_w(),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
@@ -190,6 +214,8 @@ impl Backend {
|
||||
match self {
|
||||
#[cfg(feature = "cuda")]
|
||||
Backend::Cuda(solver) => solver.current_clocks_mhz(),
|
||||
#[cfg(feature = "gpu")]
|
||||
Backend::Gpu(solver) => solver.current_clocks_mhz(),
|
||||
_ => (None, None),
|
||||
}
|
||||
}
|
||||
@@ -513,6 +539,14 @@ fn worker(
|
||||
let mut last_job = Instant::now();
|
||||
let mut paused = false;
|
||||
let mut disabled_pause = false;
|
||||
// Software temperature governor: pace the solve cadence to hold edge temp at
|
||||
// or below `--target-temp` (no hardware writes). `gov_sleep` is the per-pass
|
||||
// pause, nudged from the periodic temperature sample.
|
||||
let gov_target = target_temp_c();
|
||||
let mut gov_sleep = Duration::ZERO;
|
||||
if let Some(target) = gov_target {
|
||||
info!("worker {id}: temperature governor active — pacing cadence to hold ≤{target}°C");
|
||||
}
|
||||
|
||||
while running.load(Ordering::Relaxed) {
|
||||
if work_handle.epoch() != current.epoch {
|
||||
@@ -583,6 +617,9 @@ fn worker(
|
||||
}
|
||||
if let Some(t) = backend.temperature_c() {
|
||||
stats.workers[id].temp_c.store(t, Ordering::Relaxed);
|
||||
if let Some(target) = gov_target {
|
||||
gov_sleep = govern_cadence(gov_sleep, t, target);
|
||||
}
|
||||
}
|
||||
let (core_mhz, mem_mhz) = backend.current_clocks_mhz();
|
||||
if let Some(c) = core_mhz {
|
||||
@@ -614,12 +651,34 @@ fn worker(
|
||||
let ctx = inflight.pop_front().unwrap();
|
||||
process_results(id, &client, &ctx, &solutions, &stats);
|
||||
}
|
||||
|
||||
// Temperature governor: pace the cadence (held off entirely when flat-out).
|
||||
if !gov_sleep.is_zero() {
|
||||
std::thread::sleep(gov_sleep);
|
||||
}
|
||||
}
|
||||
|
||||
drain_pipeline(id, &mut backend, &mut inflight, &client, &stats)?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Nudge the per-pass governor pause to hold edge temperature near `target` (°C).
|
||||
/// Over target → lengthen the pause (proportional to the overshoot); comfortably
|
||||
/// under → shorten it. Called from the periodic temperature sample (~1 Hz), so
|
||||
/// this is a slow integral controller — fine for the card's thermal time constant.
|
||||
fn govern_cadence(cur: Duration, temp: u32, target: u32) -> Duration {
|
||||
const MAX_SLEEP: Duration = Duration::from_millis(100);
|
||||
let (t, target) = (temp as i64, target as i64);
|
||||
if t > target {
|
||||
cur.saturating_add(Duration::from_micros(400 * (t - target) as u64))
|
||||
.min(MAX_SLEEP)
|
||||
} else if t < target - 1 {
|
||||
cur.saturating_sub(Duration::from_millis(1))
|
||||
} else {
|
||||
cur
|
||||
}
|
||||
}
|
||||
|
||||
/// Supervise CPU mining: spawn one worker per group for the current group size,
|
||||
/// and whenever the dashboard cycles the size, stop those workers (`gen_running`),
|
||||
/// rebuild the grouping, and respawn. Runs until `running` is cleared.
|
||||
@@ -843,3 +902,49 @@ fn build_nonce(nonce1: &[u8], counter: u64) -> Result<[u8; 32]> {
|
||||
tail[..n].copy_from_slice(&counter.to_le_bytes()[..n]);
|
||||
Ok(nonce)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod governor_tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn governor_idles_below_target() {
|
||||
// At/under target → no pause is introduced.
|
||||
assert_eq!(govern_cadence(Duration::ZERO, 60, 70), Duration::ZERO);
|
||||
assert_eq!(govern_cadence(Duration::ZERO, 70, 70), Duration::ZERO);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn governor_lengthens_over_target_and_clamps() {
|
||||
// Over target → pause grows proportional to the overshoot.
|
||||
let s = govern_cadence(Duration::ZERO, 75, 70); // +5°C
|
||||
assert_eq!(s, Duration::from_micros(400 * 5));
|
||||
// Far over target saturates at the cap, never beyond.
|
||||
let hot = govern_cadence(Duration::from_millis(99), 200, 70);
|
||||
assert_eq!(hot, Duration::from_millis(100));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn governor_eases_off_when_cool() {
|
||||
// 2°C+ under target → pause shrinks; floors at zero (no underflow).
|
||||
let s = govern_cadence(Duration::from_millis(3), 67, 70);
|
||||
assert_eq!(s, Duration::from_millis(2));
|
||||
assert_eq!(govern_cadence(Duration::ZERO, 50, 70), Duration::ZERO);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn governor_converges_to_hold_target() {
|
||||
// Simulate a card whose temp rises with duty: more pause → cooler. The
|
||||
// controller should settle near the target without runaway.
|
||||
let target = 70u32;
|
||||
let mut sleep = Duration::ZERO;
|
||||
let mut temp = 85i32; // starts hot
|
||||
for _ in 0..200 {
|
||||
sleep = govern_cadence(sleep, temp as u32, target);
|
||||
// crude plant: each ms of pause sheds ~1.5°C off a 90°C flat-out temp.
|
||||
let modeled = 90.0 - 1.5 * sleep.as_millis() as f64;
|
||||
temp = modeled.round() as i32;
|
||||
}
|
||||
assert!((target as i32 - 2..=target as i32 + 2).contains(&temp), "settled at {temp}°C, want ~{target}");
|
||||
}
|
||||
}
|
||||
|
||||
@@ -0,0 +1,124 @@
|
||||
; OpenCL import library definition for jackpotminer's Windows build.
|
||||
; Generated from the cl-sys 0.4.3 exports so a vendor-neutral OpenCL.lib /
|
||||
; libOpenCL.a can be produced at build time without an OpenCL SDK. The actual
|
||||
; OpenCL.dll (ICD loader) is provided at runtime by the GPU driver.
|
||||
LIBRARY OpenCL
|
||||
EXPORTS
|
||||
clBuildProgram
|
||||
clCloneKernel
|
||||
clCompileProgram
|
||||
clCreateBuffer
|
||||
clCreateCommandQueue
|
||||
clCreateCommandQueueWithProperties
|
||||
clCreateContext
|
||||
clCreateContextFromType
|
||||
clCreateFromGLBuffer
|
||||
clCreateFromGLRenderbuffer
|
||||
clCreateFromGLTexture
|
||||
clCreateFromGLTexture2D
|
||||
clCreateFromGLTexture3D
|
||||
clCreateImage
|
||||
clCreateImage2D
|
||||
clCreateImage3D
|
||||
clCreateKernel
|
||||
clCreateKernelsInProgram
|
||||
clCreatePipe
|
||||
clCreateProgramWithBinary
|
||||
clCreateProgramWithBuiltInKernels
|
||||
clCreateProgramWithIL
|
||||
clCreateProgramWithSource
|
||||
clCreateSampler
|
||||
clCreateSamplerWithProperties
|
||||
clCreateSubBuffer
|
||||
clCreateSubDevices
|
||||
clCreateUserEvent
|
||||
clEnqueueAcquireGLObjects
|
||||
clEnqueueBarrier
|
||||
clEnqueueBarrierWithWaitList
|
||||
clEnqueueCopyBuffer
|
||||
clEnqueueCopyBufferRect
|
||||
clEnqueueCopyBufferToImage
|
||||
clEnqueueCopyImage
|
||||
clEnqueueCopyImageToBuffer
|
||||
clEnqueueFillBuffer
|
||||
clEnqueueFillImage
|
||||
clEnqueueMapBuffer
|
||||
clEnqueueMapImage
|
||||
clEnqueueMarker
|
||||
clEnqueueMarkerWithWaitList
|
||||
clEnqueueMigrateMemObjects
|
||||
clEnqueueNativeKernel
|
||||
clEnqueueNDRangeKernel
|
||||
clEnqueueReadBuffer
|
||||
clEnqueueReadBufferRect
|
||||
clEnqueueReadImage
|
||||
clEnqueueReleaseGLObjects
|
||||
clEnqueueSVMFree
|
||||
clEnqueueSVMMap
|
||||
clEnqueueSVMMemcpy
|
||||
clEnqueueSVMMemFill
|
||||
clEnqueueSVMMigrateMem
|
||||
clEnqueueSVMUnmap
|
||||
clEnqueueTask
|
||||
clEnqueueUnmapMemObject
|
||||
clEnqueueWaitForEvents
|
||||
clEnqueueWriteBuffer
|
||||
clEnqueueWriteBufferRect
|
||||
clEnqueueWriteImage
|
||||
clFinish
|
||||
clFlush
|
||||
clGetCommandQueueInfo
|
||||
clGetContextInfo
|
||||
clGetDeviceAndHostTimer
|
||||
clGetDeviceIDs
|
||||
clGetDeviceInfo
|
||||
clGetEventInfo
|
||||
clGetEventProfilingInfo
|
||||
clGetExtensionFunctionAddress
|
||||
clGetExtensionFunctionAddressForPlatform
|
||||
clGetGLContextInfoKHR
|
||||
clGetGLObjectInfo
|
||||
clGetGLTextureInfo
|
||||
clGetHostTimer
|
||||
clGetImageInfo
|
||||
clGetKernelArgInfo
|
||||
clGetKernelInfo
|
||||
clGetKernelSubGroupInfo
|
||||
clGetKernelWorkGroupInfo
|
||||
clGetMemObjectInfo
|
||||
clGetPipeInfo
|
||||
clGetPlatformIDs
|
||||
clGetPlatformInfo
|
||||
clGetProgramBuildInfo
|
||||
clGetProgramInfo
|
||||
clGetSamplerInfo
|
||||
clGetSupportedImageFormats
|
||||
clLinkProgram
|
||||
clReleaseCommandQueue
|
||||
clReleaseContext
|
||||
clReleaseDevice
|
||||
clReleaseEvent
|
||||
clReleaseKernel
|
||||
clReleaseMemObject
|
||||
clReleaseProgram
|
||||
clReleaseSampler
|
||||
clRetainCommandQueue
|
||||
clRetainContext
|
||||
clRetainDevice
|
||||
clRetainEvent
|
||||
clRetainKernel
|
||||
clRetainMemObject
|
||||
clRetainProgram
|
||||
clRetainSampler
|
||||
clSetDefaultDeviceCommandQueue
|
||||
clSetEventCallback
|
||||
clSetKernelArg
|
||||
clSetKernelArgSVMPointer
|
||||
clSetKernelExecInfo
|
||||
clSetMemObjectDestructorCallback
|
||||
clSetUserEventStatus
|
||||
clSVMAlloc
|
||||
clSVMFree
|
||||
clUnloadCompiler
|
||||
clUnloadPlatformCompiler
|
||||
clWaitForEvents
|
||||
Reference in New Issue
Block a user