Remove collab/jmprcx-solver

Drop the standalone collaborator Equihash 192,7 solver crate; it is not
part of the main miner build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jackpotincorporated
2026-06-05 23:34:18 -04:00
parent 4dd54cb839
commit f3ca6a1ee4
10 changed files with 0 additions and 1084 deletions
-12
@@ -1,12 +0,0 @@
[package]
name = "jmprcx-solver"
version = "0.1.0"
edition = "2021"
description = "Load and drive the Equihash 192,7 GPU solver fatbin via the CUDA Driver API"
[[bin]]
name = "jmprcx-solver"
path = "src/main.rs"
[dependencies]
# none on purpose: raw FFI to the system CUDA driver (libcuda), no network needed
-85
@@ -1,85 +0,0 @@
# miniz-solver-rs
Basic Rust program that **uses the extracted miniZ Equihash 192,7 GPU solver**.
It loads the captured CUDA fatbin (`../miniz-dump/solver_192_7/equihash192_7.fatbin`)
through the CUDA Driver API (raw FFI to `libcuda`, no external crates) and drives
its kernels on the GPU.
## Build & run
```sh
cargo build --release
./target/release/miniz-solver # load + enumerate all 57 kernels
./target/release/miniz-solver --launch # also execute a real solver kernel
./target/release/miniz-solver --round0 # replay round 0 (digit_f) with a captured midstate
./target/release/miniz-solver /path/to.fatbin # use a different fatbin
```
Requires an NVIDIA GPU + driver (`/usr/lib/libcuda.so`). The fatbin contains
`sm_80`/`sm_86`/`sm_120` cubins; the driver auto-picks the one for your GPU.
## What it does
- `cuInit` → context on GPU#0
- `cuModuleLoadData` on the raw fatbin (magic `0xBA55ED50`)
- `cuModuleEnumerateFunctions` + `cuFuncGetName` + `cuFuncGetAttribute`:
lists every kernel with regs / shared / local / max-threads and labels the
Wagner `n=192,k=7` pipeline:
`digit_f` (round 0: BLAKE2b + bucketing) → `digit_1..3`, `digit_4w/5w/6w`
(rounds 1-6) → `digit_l` (round 7: solution recovery) → `sort_and_compress`.
- with `--launch`: allocates a device buffer and launches the real
`cleanup<64>(void*, uint)` kernel, then `cuCtxSynchronize`.
- with `--round0`: drives the real **round 0** (`digit_f`) — allocates the four
buffers at their template sizes, launches the exact runtime variant
(grid=65536, block=256) with a BLAKE2b midstate captured from a live job, and
reads back the bucket counters. Verified output: **33,554,432 = 2^25** entries
bucketed into 12288 buckets (the correct 192,7 initial-entry count).
- with `--replay [rec.log]`: **runs the entire solver** — parses a recorded pass
(`recording.log`), allocates one arena, rebases every device pointer, and
executes all 10 kernels (`cleanup → digit_f → digit_1..6 → digit_l →
sort_and_compress`). All kernels complete; extracts a 128-index candidate.
- with `--header <hex>`: computes a BLAKE2b(192,7) midstate from a 140-byte
header, injects it, and runs the full pipeline (mint a new job).
- with `--selftest`: BLAKE2b-512 known-answer test (RFC 7693) — PASS.
- with `--verify-share`: verify a real pool-accepted share (BLAKE2b + Wagner) — VALID.
- with `--solve`: **the complete solver** — inject a known header's midstate+tail,
run the GPU pipeline, and harvest a solution from the container that the verifier
accepts. Reproducibly prints `SOLUTION HARVESTED FROM GPU — VALID ✓`.
See `../miniz-dump/solver_192_7/ORCHESTRATION.md` for the full pipeline + recovery.
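The pointer rebasing that `--replay` relies on can be sketched as below. This is a minimal illustration, not the crate's actual code: the function name `rebase_args` and its parameters are hypothetical, and it assumes device pointers sit at 8-byte strides after a by-value prefix and that any word falling inside the recorded arena is a pointer.

```rust
/// Rewrite every 8-byte little-endian word in `arg[skip..]` that points into
/// the recorded arena `[old_base, old_base + size)` so that it points at the
/// same offset inside a new arena starting at `new_base`.
/// `skip` is the by-value prefix (for example a midstate) that is left alone.
/// Returns the number of words that were rewritten.
/// (Hypothetical helper; names and layout are assumptions for illustration.)
fn rebase_args(arg: &mut [u8], skip: usize, old_base: u64, size: u64, new_base: u64) -> usize {
    let mut rewritten = 0;
    let mut off = skip;
    while off + 8 <= arg.len() {
        // Read one candidate pointer.
        let mut word = [0u8; 8];
        word.copy_from_slice(&arg[off..off + 8]);
        let v = u64::from_le_bytes(word);
        // Only words inside the recorded arena are treated as pointers.
        if v >= old_base && v - old_base < size {
            let moved = new_base + (v - old_base);
            arg[off..off + 8].copy_from_slice(&moved.to_le_bytes());
            rewritten += 1;
        }
        off += 8;
    }
    rewritten
}
```

Scalars that happen to fall inside the arena's address range would be rewritten too, which is why the by-value prefix is skipped explicitly rather than detected.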
### Status (honest)
- **Pipeline: complete.** All 10 kernels run standalone; round 0 verified bit-exact
(2^25 entries). Faithful end-to-end replay of miniZ's 192,7 solver.
- **Hash model + verification: SOLVED.** Captured live stratum (plaintext) via a
logging relay; a real pool-accepted share verifies exactly under
`hash(i) = BLAKE2b(header‖LE32(i/2), person="ZcashPoW"+LE32(192)+LE32(7),
digest=48)[(i%2)*24..]`. `--verify-share` reproduces VALID ✓ (192/192 zero bits,
all 7 Wagner levels) in Rust. So `--selftest`, `blake2b.rs`, `verify.rs` and the
solution decoder are all proven against ground truth.
- **Complete (`--solve`).** Container = 128 consecutive u32 indices at offset 0;
the midstate is textbook BLAKE2b-after-128B and the digit_f `uint` is the 4
varying header-tail bytes (nonce[28..31]; nonce[20..27] are constant 0). So:
`header → midstate+tail → GPU pipeline → container[0..128] → VALID solution`,
reproducibly. The miniZ Equihash 192,7 solver is fully reverse-engineered.
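The numbers quoted in this section (2^25 initial entries, 128 indices, 48-byte digest, 24-byte halves) all follow from `n=192, k=7`. A small sketch of that arithmetic, assuming the usual Equihash conventions; the `EquihashParams` type and `hash_slot` helper are hypothetical names introduced here for illustration:

```rust
/// Derived sizes for Equihash(n, k). Hypothetical helper type, not part of the crate.
struct EquihashParams {
    n: u32,
    k: u32,
}

impl EquihashParams {
    /// Bits cancelled per Wagner round: n / (k + 1).
    fn collision_bits(&self) -> u32 { self.n / (self.k + 1) }
    /// Bits per index: one more than the collision width.
    fn index_bits(&self) -> u32 { self.collision_bits() + 1 }
    /// Number of initial entries produced in round 0.
    fn initial_entries(&self) -> u64 { 1u64 << self.index_bits() }
    /// Indices in one solution: 2^k.
    fn indices_per_solution(&self) -> u32 { 1 << self.k }
    /// How many n-bit hashes fit in one 512-bit BLAKE2b output.
    fn hashes_per_digest(&self) -> u32 { 512 / self.n }
    /// BLAKE2b digest length in bytes.
    fn digest_bytes(&self) -> u32 { self.hashes_per_digest() * self.n / 8 }
    /// Packed solution size in bytes (without the length prefix).
    fn solution_bytes(&self) -> u32 { self.indices_per_solution() * self.index_bits() / 8 }
}

/// For index `i`, return the counter hashed after the header and the byte
/// offset of this index's slice inside the digest.
fn hash_slot(i: u32, p: &EquihashParams) -> (u32, usize) {
    let per = p.hashes_per_digest();
    (i / per, ((i % per) * p.n / 8) as usize)
}
```

For `EquihashParams { n: 192, k: 7 }` this reproduces the `LE32(i/2)` counter and the `(i%2)*24` offset from the hash formula above.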
## What it does NOT do (scope)
It does **not** mine or produce valid Equihash solutions. A working solver also
needs miniZ's host orchestration, which is not part of the extracted kernels:
- exact device-buffer sizing per round (the kernels' template/array dims give the
bucket geometry, e.g. `uint4[180][6656][32]`, but the host owns allocation)
- the precise `digit_f → digit_1..6 → digit_l → sort_and_compress` launch
sequence with the correct grid/block dims and shared-mem config per round
- BLAKE2b midstate setup from the block header + nonce, and the `equi<...>` /
`ScontainerReal192` struct layouts passed between kernels
That host logic lives in miniZ's encrypted blob. Reconstructing it (from the SASS
in `../miniz-dump/solver_192_7/equihash192_7.sm_120.sass` plus the kernel
signatures in `kernels_demangled.txt`) is the next step toward a standalone miner.
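As a rough illustration of the buffer-sizing point, the byte size implied by an array type such as `uint4[180][6656][32]` can be computed directly. This sketch only assumes that a CUDA `uint4` is 16 bytes and that the array is dense; the helper name `uint4_array_bytes` is hypothetical, and whether miniZ's host code allocates exactly this much is not known from the kernels alone.

```rust
/// Size in bytes of a dense array of CUDA `uint4` elements with the given
/// dimensions. Returns `None` if the product overflows `u64`.
/// (Hypothetical helper for illustration.)
fn uint4_array_bytes(dims: &[u64]) -> Option<u64> {
    const UINT4_BYTES: u64 = 16; // four 32-bit words
    dims.iter()
        .try_fold(UINT4_BYTES, |acc, &d| acc.checked_mul(d))
}
```

For the dimensions quoted above, `uint4_array_bytes(&[180, 6656, 32])` gives 613,416,960 bytes, roughly 585 MiB for that one buffer.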
## Files
- `src/cuda.rs` — minimal CUDA Driver API FFI bindings
- `src/main.rs` — loader / enumerator / launch demo
- `build.rs` — links `libcuda`
-8
@@ -1,8 +0,0 @@
// Link against the system CUDA driver (libcuda.so -> libcuda.so.1).
// Falls back to the CUDA toolkit stub if the driver symlink isn't in /usr/lib.
fn main() {
println!("cargo:rustc-link-search=native=/usr/lib");
println!("cargo:rustc-link-search=native=/usr/lib64");
println!("cargo:rustc-link-search=native=/opt/cuda/targets/x86_64-linux/lib/stubs");
println!("cargo:rustc-link-lib=dylib=cuda");
}
-173
@@ -1,173 +0,0 @@
//! BLAKE2b-512 with Equihash (ZcashPoW, n=192, k=7) personalization, plus the
//! "midstate" the client hands to `digit_f`: the BLAKE2b state after compressing
//! the first 128-byte block of the block header.
//!
//! Equihash params for 192,7:
//! personalization = "ZcashPoW" || LE32(192) || LE32(7)
//! digest_length = (512/192)*192/8 = 48 bytes (2 indices x 24 bytes)
//!
//! The 140-byte header + 4-byte index = 144 bytes hashed per index. The first
//! 128 bytes do not depend on the index, so the client precompresses them on
//! the CPU into the 64-byte (8x u64) midstate; the GPU finishes per index.
const IV: [u64; 8] = [
0x6a09e667f3bcc908, 0xbb67ae8584caa73b, 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1,
0x510e527fade682d1, 0x9b05688c2b3e6c1f, 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179,
];
const SIGMA: [[usize; 16]; 12] = [
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3],
[11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4],
[7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8],
[9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13],
[2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9],
[12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11],
[13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10],
[6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5],
[10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3],
];
#[inline]
fn g(v: &mut [u64; 16], a: usize, b: usize, c: usize, d: usize, x: u64, y: u64) {
v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
v[d] = (v[d] ^ v[a]).rotate_right(32);
v[c] = v[c].wrapping_add(v[d]);
v[b] = (v[b] ^ v[c]).rotate_right(24);
v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
v[d] = (v[d] ^ v[a]).rotate_right(16);
v[c] = v[c].wrapping_add(v[d]);
v[b] = (v[b] ^ v[c]).rotate_right(63);
}
/// One BLAKE2b compression of a 128-byte block into state `h`.
fn compress(h: &mut [u64; 8], block: &[u8; 128], t: u128, last: bool) {
let mut m = [0u64; 16];
for i in 0..16 {
m[i] = u64::from_le_bytes(block[i * 8..i * 8 + 8].try_into().unwrap());
}
let mut v = [0u64; 16];
v[..8].copy_from_slice(h);
v[8..].copy_from_slice(&IV);
v[12] ^= t as u64;
v[13] ^= (t >> 64) as u64;
if last {
v[14] = !v[14];
}
for r in 0..12 {
let s = &SIGMA[r];
g(&mut v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
g(&mut v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
g(&mut v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
g(&mut v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
g(&mut v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
g(&mut v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
g(&mut v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
g(&mut v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
}
for i in 0..8 {
h[i] ^= v[i] ^ v[i + 8];
}
}
/// Known-answer self-test of the core compression: standard BLAKE2b-512("abc").
pub fn selftest() -> bool {
let mut h = IV;
h[0] ^= 0x0101_0000 ^ 64; // digest_length=64, fanout=1, depth=1, no personalization
let msg = b"abc";
let mut block = [0u8; 128];
block[..3].copy_from_slice(msg);
compress(&mut h, &block, 3, true);
let mut out = [0u8; 64];
for i in 0..8 {
out[i * 8..i * 8 + 8].copy_from_slice(&h[i].to_le_bytes());
}
// RFC 7693 test vector for BLAKE2b-512("abc")
let expect: [u8; 64] = [
0xba, 0x80, 0xa5, 0x3f, 0x98, 0x1c, 0x4d, 0x0d, 0x6a, 0x27, 0x97, 0xb6, 0x9f, 0x12, 0xf6, 0xe9,
0x4c, 0x21, 0x2f, 0x14, 0x68, 0x5a, 0xc4, 0xb7, 0x4b, 0x12, 0xbb, 0x6f, 0xdb, 0xff, 0xa2, 0xd1,
0x7d, 0x87, 0xc5, 0x39, 0x2a, 0xab, 0x79, 0x2d, 0xc2, 0x52, 0xd5, 0xde, 0x45, 0x33, 0xcc, 0x95,
0x18, 0xd3, 0x8a, 0xa8, 0xdb, 0xf1, 0x92, 0x5a, 0xb9, 0x23, 0x86, 0xed, 0xd4, 0x00, 0x99, 0x23,
];
out == expect
}
/// Initial BLAKE2b state for Equihash(192,7).
pub fn init_state() -> [u64; 8] {
let mut personal = [0u8; 16];
personal[..8].copy_from_slice(b"ZcashPoW");
personal[8..12].copy_from_slice(&192u32.to_le_bytes());
personal[12..16].copy_from_slice(&7u32.to_le_bytes());
let mut h = IV;
// param block: digest_length=48, key=0, fanout=1, depth=1
h[0] ^= 0x0101_0000 ^ 48;
// words 6,7 hold the 16-byte personalization
h[6] ^= u64::from_le_bytes(personal[0..8].try_into().unwrap());
h[7] ^= u64::from_le_bytes(personal[8..16].try_into().unwrap());
h
}
/// The 64-byte midstate digit_f expects: state after compressing header[0..128].
/// `header` must be the 140-byte block header.
pub fn midstate(header: &[u8]) -> [u8; 64] {
assert!(header.len() >= 128, "header must be >= 128 bytes");
let mut h = init_state();
let mut block = [0u8; 128];
block.copy_from_slice(&header[0..128]);
compress(&mut h, &block, 128, false);
let mut out = [0u8; 64];
for i in 0..8 {
out[i * 8..i * 8 + 8].copy_from_slice(&h[i].to_le_bytes());
}
out
}
/// Finalize a hash directly from a 64-byte midstate (h[0..8]) plus a final block
/// whose first `idx_len` bytes are LE(idx_word) and the rest zero, with the
/// total byte counter `t_total`. Returns the 48-byte digest (h[0..6]).
/// Used to test the GPU's per-index hash construction (midstate + index, no tail).
pub fn digest_from_midstate(mid: &[u8; 64], idx_word: u32, idx_len: usize, t_total: u128) -> [u8; 48] {
let mut h = [0u64; 8];
for i in 0..8 {
h[i] = u64::from_le_bytes(mid[i * 8..i * 8 + 8].try_into().unwrap());
}
let mut block = [0u8; 128];
let w = idx_word.to_le_bytes();
block[..idx_len.min(4)].copy_from_slice(&w[..idx_len.min(4)]);
compress(&mut h, &block, t_total, true);
let mut out = [0u8; 48];
for i in 0..6 {
out[i * 8..i * 8 + 8].copy_from_slice(&h[i].to_le_bytes());
}
out
}
/// Full Equihash(192,7) per-index hash: BLAKE2b(header || LE32(g)) -> 48 bytes,
/// where g = index / 2. Returns the 24-byte half selected by (index & 1).
/// Used for solution verification (reference path, midstate not required).
pub fn index_hash(header: &[u8], index: u32) -> [u8; 24] {
let mut h = init_state();
// header is 140 bytes; append LE32(g) -> 144 bytes total = one full block + 16
let g_word = (index / 2).to_le_bytes();
let mut input = Vec::with_capacity(144);
input.extend_from_slice(&header[..140]);
input.extend_from_slice(&g_word);
// block 1
let mut b0 = [0u8; 128];
b0.copy_from_slice(&input[0..128]);
compress(&mut h, &b0, 128, false);
// final block (16 bytes used, rest zero), t = 144, last
let mut b1 = [0u8; 128];
b1[..16].copy_from_slice(&input[128..144]);
compress(&mut h, &b1, 144, true);
let mut full = [0u8; 64];
for i in 0..8 {
full[i * 8..i * 8 + 8].copy_from_slice(&h[i].to_le_bytes());
}
// 48-byte digest -> two 24-byte index hashes
let half = (index & 1) as usize * 24;
full[half..half + 24].try_into().unwrap()
}
-93
@@ -1,93 +0,0 @@
//! Minimal raw FFI bindings to the CUDA Driver API (libcuda) — only what we need
//! to load the Equihash 192,7 fatbin and drive its kernels.
#![allow(non_camel_case_types, non_snake_case, dead_code)]
use std::ffi::{c_char, c_int, c_uint, c_void, CStr};
pub type CUresult = c_int;
pub type CUdevice = c_int;
pub type CUcontext = *mut c_void;
pub type CUmodule = *mut c_void;
pub type CUfunction = *mut c_void;
pub type CUstream = *mut c_void;
pub type CUdeviceptr = u64;
pub const CUDA_SUCCESS: CUresult = 0;
// CUfunction_attribute values (cuda.h)
pub const CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK: c_int = 0;
pub const CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES: c_int = 1;
pub const CU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES: c_int = 2;
pub const CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES: c_int = 3;
pub const CU_FUNC_ATTRIBUTE_NUM_REGS: c_int = 4;
pub const CU_FUNC_ATTRIBUTE_PTX_VERSION: c_int = 5;
pub const CU_FUNC_ATTRIBUTE_BINARY_VERSION: c_int = 6;
pub const CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES: c_int = 8;
// cuLaunchKernel `extra` directives
pub const CU_LAUNCH_PARAM_END: usize = 0x00;
pub const CU_LAUNCH_PARAM_BUFFER_POINTER: usize = 0x01;
pub const CU_LAUNCH_PARAM_BUFFER_SIZE: usize = 0x02;
extern "C" {
pub fn cuInit(flags: c_uint) -> CUresult;
pub fn cuDriverGetVersion(version: *mut c_int) -> CUresult;
pub fn cuDeviceGet(device: *mut CUdevice, ordinal: c_int) -> CUresult;
pub fn cuDeviceGetName(name: *mut c_char, len: c_int, dev: CUdevice) -> CUresult;
pub fn cuCtxCreate_v2(pctx: *mut CUcontext, flags: c_uint, dev: CUdevice) -> CUresult;
pub fn cuCtxDestroy_v2(ctx: CUcontext) -> CUresult;
pub fn cuCtxSynchronize() -> CUresult;
// Module / kernel loading + introspection
pub fn cuModuleLoadData(module: *mut CUmodule, image: *const c_void) -> CUresult;
pub fn cuModuleUnload(module: CUmodule) -> CUresult;
pub fn cuModuleGetFunction(func: *mut CUfunction, module: CUmodule, name: *const c_char) -> CUresult;
pub fn cuModuleGetFunctionCount(count: *mut c_uint, module: CUmodule) -> CUresult;
pub fn cuModuleEnumerateFunctions(functions: *mut CUfunction, num: c_uint, module: CUmodule) -> CUresult;
pub fn cuFuncGetName(name: *mut *const c_char, func: CUfunction) -> CUresult;
pub fn cuFuncGetAttribute(pi: *mut c_int, attrib: c_int, func: CUfunction) -> CUresult;
pub fn cuFuncSetAttribute(func: CUfunction, attrib: c_int, value: c_int) -> CUresult;
// Memory + launch
pub fn cuMemAlloc_v2(dptr: *mut CUdeviceptr, bytesize: usize) -> CUresult;
pub fn cuMemsetD8_v2(dptr: CUdeviceptr, uc: u8, n: usize) -> CUresult;
pub fn cuMemsetD32_v2(dptr: CUdeviceptr, ui: u32, n: usize) -> CUresult;
pub fn cuMemFree_v2(dptr: CUdeviceptr) -> CUresult;
pub fn cuMemcpyDtoH_v2(dst: *mut c_void, src: CUdeviceptr, n: usize) -> CUresult;
pub fn cuMemGetInfo_v2(free: *mut usize, total: *mut usize) -> CUresult;
pub fn cuLaunchKernel(
f: CUfunction,
gx: c_uint, gy: c_uint, gz: c_uint,
bx: c_uint, by: c_uint, bz: c_uint,
shared_mem_bytes: c_uint,
stream: CUstream,
kernel_params: *mut *mut c_void,
extra: *mut *mut c_void,
) -> CUresult;
pub fn cuGetErrorName(error: CUresult, s: *mut *const c_char) -> CUresult;
pub fn cuGetErrorString(error: CUresult, s: *mut *const c_char) -> CUresult;
}
/// Human-readable "NAME: description" for a CUresult.
pub fn err_str(code: CUresult) -> String {
unsafe {
let mut name: *const c_char = std::ptr::null();
let mut desc: *const c_char = std::ptr::null();
cuGetErrorName(code, &mut name);
cuGetErrorString(code, &mut desc);
let n = if name.is_null() { "?".into() } else { CStr::from_ptr(name).to_string_lossy().into_owned() };
let d = if desc.is_null() { "".into() } else { CStr::from_ptr(desc).to_string_lossy().into_owned() };
format!("{n} ({code}): {d}")
}
}
/// Turn a CUresult into a Result, with context.
pub fn check(code: CUresult, what: &str) -> Result<(), String> {
if code == CUDA_SUCCESS {
Ok(())
} else {
Err(format!("{what} failed: {}", err_str(code)))
}
}
Binary file not shown.
-310
@@ -1,310 +0,0 @@
//! jmprcx-solver — load the Equihash 192,7 GPU solver fatbin
//! and drive its kernels through the CUDA Driver API.
//!
//! What this does (always, safely):
//! * cuInit + create a context on the chosen GPU
//! * load the captured fatbin (auto-selects the cubin matching your GPU arch)
//! * enumerate every kernel, print attributes, and label the Wagner pipeline
//!
//! What this does with `--launch` (experimental):
//! * actually launches one real solver kernel (`cleanup<64>(void*, uint)`),
//! which has a simple, known signature, and synchronizes.
//!
mod cuda;
mod round0;
mod replay;
mod blake2b;
mod verify;
use cuda::*;
use std::ffi::{c_void, CStr, CString};
use std::ptr;
const DEFAULT_FATBIN: &str =
"/home/access/RustroverProjects/zclminer/collab/jmprcx-solver/src/equihash192_7.fatbin";
const CLEANUP_MANGLED: &str = "_Z7cleanupILj64EEvPvj"; // void cleanup<64u>(void*, unsigned)
/// Best-effort role label from the (mangled) kernel name.
fn role(name: &str) -> &'static str {
if name.contains("7digit_f") { "round 0 : BLAKE2b hash + initial bucketing" }
else if name.contains("7digit_1") { "round 1 : Wagner collision" }
else if name.contains("7digit_2") { "round 2 : Wagner collision" }
else if name.contains("7digit_3") { "round 3 : Wagner collision" }
else if name.contains("8digit_4w") { "round 4 : Wagner collision (wide)" }
else if name.contains("8digit_5w") { "round 5 : Wagner collision (wide)" }
else if name.contains("8digit_6w") { "round 6 : Wagner collision (wide)" }
else if name.contains("7digit_l") { "round 7 : final collision + solution recovery" }
else if name.contains("sort_and_compress") { "post : sort + compress solutions" }
else if name.contains("7cleanup") { "util : buffer cleanup" }
else { "other" }
}
fn family(name: &str) -> &'static str {
for k in ["7digit_f","7digit_1","7digit_2","7digit_3","8digit_4w","8digit_5w",
"8digit_6w","7digit_l","sort_and_compress","7cleanup"] {
if name.contains(k) { return k; }
}
"other"
}
/// A real, pool-accepted 192,7 block header (job 19ae0) captured from the wire.
/// Used by `--solve` as a known-good header so the GPU output can be verified.
const KNOWN_HEADER: &str = "040000002ba84c97ffc202b55a5843d55837d256fdc32410390b8e95502bd8f648040000cb560c7083a13e06273570350805668e83c3e2362e39e131612fead6f4ea9937a19ceba5b597e2217d7e0c53ba24de3d36b92cf97743550c2745c9464f4dc847ba9e1e6a34cf101e80032bb40ae5118877fccacf8d961e648f6a228d0000000000000000ce856809";
/// Scan a container dump for a 128-index group the verifier accepts, using the
/// proven per-index hash as an oracle. Random GPU memory is effectively never
/// going to pass the range filter (128 consecutive distinct u32 values, all in
/// (0, 2^25)), so the expensive XOR check runs only on solution-shaped windows.
fn scan_container(header: &[u8], bytes: &[u8]) -> Option<Vec<u32>> {
let u: Vec<u32> = bytes.chunks_exact(4).map(|c| u32::from_le_bytes(c.try_into().unwrap())).collect();
if u.len() < 128 { return None; }
let mut checked = 0u64;
for start in 0..=u.len() - 128 {
let w = &u[start..start + 128];
if !w.iter().all(|&x| x > 0 && x < (1 << 25)) { continue; }
let mut d = w.to_vec(); d.sort_unstable(); d.dedup();
if d.len() != 128 { continue; }
checked += 1;
if verify::top_xor_zero_bits(w, |i| blake2b::index_hash(header, i)) >= 168 {
let (ok, _) = verify::verify(w, |i| blake2b::index_hash(header, i));
if ok {
println!(" found at u32 offset {start} (after {checked} solution-shaped windows)");
return Some(w.to_vec());
}
}
}
println!(" {checked} solution-shaped windows checked, none verified");
None
}
/// Decode an Equihash 192,7 stratum solution (varint length + 128 x 25-bit
/// big-endian indices) into 128 indices.
fn decode_solution(hex: &str) -> Vec<u32> {
let raw = parse_hex(hex);
// strip the compactsize/varint length prefix (0xfd => 2-byte LE length)
let body = if raw.first() == Some(&0xfd) { &raw[3..] } else { &raw[1..] };
let (mut acc, mut bits, mut out) = (0u64, 0u32, Vec::with_capacity(128));
for &b in body {
acc = (acc << 8) | b as u64;
bits += 8;
while bits >= 25 {
bits -= 25;
out.push(((acc >> bits) & 0x1ff_ffff) as u32);
}
}
out.truncate(128);
out
}
fn verify_share() {
const SOLUTION: &str = "fd900101420199f2d450c74cdec6d8f3437c5bb217e1e37cb50bacf43cb332bb3ded21346edbc173c868e724d1496f04f3f38bab5705abbb7b168e947bc16b75d4043ce7fb16c10f417c6de5ce8306b1aa5dcd02b7c9e49e6001193aae954c3a733f4f55ce5a9703692af8dea5014a587a1ba2d3a0cf03902cfd212fe5846bc9096bdc615a22e4c1f232d9b945de079c2f29aa3a9c87d0681612d8804a8ccf24c752df1837d4c31bb61b5266328dafeb46af26f96ecc74f2d59ad96c9bff231b4a5e7d87aa33bd916270e703c1d6f090ad8ad02cb86c0550f37585042135ae202f5848bb0b0e695cfe638dfdf89c325833a98125c0f765c6d535e886c915cc01f775b9a35a5972c4ecc40afeb4ff083a7493ab8c238f188b2231218771810cb907f02506020d8f2525a627573126d20955d552328cd1557e34e225b4a2f09c411377055c039163df1c499a4e92a011bf71fc4e58839d23f5822d0a200f65ef194d0a3cf0919b35091b681db6db5293d49e2e12960994436d15300bef5f53799ba98e9e752af7842374f4abc6b5eecd5775de07";
let header = parse_hex(KNOWN_HEADER);
let sol = decode_solution(SOLUTION);
println!("known-answer share (job 19ae0): header {} B, {} indices, {} distinct",
header.len(), sol.len(), { let mut s = sol.clone(); s.sort_unstable(); s.dedup(); s.len() });
let zb = verify::top_xor_zero_bits(&sol, |i| blake2b::index_hash(&header, i));
let (ok, msg) = verify::verify(&sol, |i| blake2b::index_hash(&header, i));
println!(" full 128-leaf XOR leading zero bits = {zb} / 192");
println!(" verify: {}{msg}", if ok { "VALID ✓ (matches pool)" } else { "INVALID" });
}
fn parse_hex(s: &str) -> Vec<u8> {
let s: String = s.chars().filter(|c| c.is_ascii_hexdigit()).collect();
(0..s.len() / 2).map(|i| u8::from_str_radix(&s[2 * i..2 * i + 2], 16).unwrap_or(0)).collect()
}
fn main() {
if let Err(e) = run() {
eprintln!("\nerror: {e}");
std::process::exit(1);
}
}
fn run() -> Result<(), String> {
let args: Vec<String> = std::env::args().collect();
let do_launch = args.iter().any(|a| a == "--launch");
let do_round0 = args.iter().any(|a| a == "--round0");
let do_replay = args.iter().any(|a| a == "--replay");
if args.iter().any(|a| a == "--selftest") {
println!("BLAKE2b-512 known-answer self-test: {}",
if blake2b::selftest() { "PASS" } else { "FAIL" });
return Ok(());
}
if args.iter().any(|a| a == "--verify-share") {
verify_share();
return Ok(());
}
let fatbin_path = args.iter().skip(1)
.find(|a| a.ends_with(".fatbin"))
.cloned()
.unwrap_or_else(|| DEFAULT_FATBIN.to_string());
// --- read the captured solver fatbin ---
let image = std::fs::read(&fatbin_path)
.map_err(|e| format!("reading fatbin {fatbin_path}: {e}"))?;
if image.len() < 4 || &image[0..4] != [0x50, 0xed, 0x55, 0xba] {
eprintln!("warning: {fatbin_path} does not start with the fatbin magic 0xBA55ED50");
}
println!("== jmprcx Equihash 192,7 solver loader ==");
println!("fatbin : {fatbin_path} ({} bytes)", image.len());
unsafe {
// --- init driver + device + context ---
check(cuInit(0), "cuInit")?;
let mut ver = 0;
cuDriverGetVersion(&mut ver);
println!("driver : CUDA {}.{}", ver / 1000, (ver % 1000) / 10);
let mut dev: CUdevice = 0;
check(cuDeviceGet(&mut dev, 0), "cuDeviceGet")?;
let mut name = [0i8; 128];
cuDeviceGetName(name.as_mut_ptr() as *mut _, 128, dev);
let gpu = CStr::from_ptr(name.as_ptr() as *const _).to_string_lossy().into_owned();
println!("device : GPU#0 {gpu}");
let mut ctx: CUcontext = ptr::null_mut();
check(cuCtxCreate_v2(&mut ctx, 0, dev), "cuCtxCreate")?;
// --- load the fatbin (driver picks the cubin matching this GPU's arch) ---
let mut module: CUmodule = ptr::null_mut();
check(cuModuleLoadData(&mut module, image.as_ptr() as *const c_void),
"cuModuleLoadData")
.map_err(|e| format!("{e}\n(the fatbin has sm_80/sm_86/sm_120; the driver needs the cubin matching this GPU)"))?;
println!("module : loaded OK\n");
// --- enumerate every kernel in the solver ---
let mut count: u32 = 0;
check(cuModuleGetFunctionCount(&mut count, module), "cuModuleGetFunctionCount")?;
let mut funcs: Vec<CUfunction> = vec![ptr::null_mut(); count as usize];
check(cuModuleEnumerateFunctions(funcs.as_mut_ptr(), count, module),
"cuModuleEnumerateFunctions")?;
println!("solver exposes {count} device kernels:\n");
println!(" {:<22} {:>5} {:>7} {:>7} {:>6} role", "name", "regs", "shared", "local", "maxT");
println!(" {}", "-".repeat(86));
use std::collections::BTreeMap;
let mut by_family: BTreeMap<&str, u32> = BTreeMap::new();
for &f in &funcs {
let mut np: *const std::ffi::c_char = ptr::null();
let fname = if cuFuncGetName(&mut np, f) == CUDA_SUCCESS && !np.is_null() {
CStr::from_ptr(np).to_string_lossy().into_owned()
} else { "<unknown>".into() };
let attr = |a: i32| -> i32 { let mut v = 0; cuFuncGetAttribute(&mut v, a, f); v };
let regs = attr(CU_FUNC_ATTRIBUTE_NUM_REGS);
let shared = attr(CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES);
let local = attr(CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES);
let maxt = attr(CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK);
*by_family.entry(family(&fname)).or_insert(0) += 1;
// show a short, fixed-width handle: the first 22 characters of the (mangled) name
let short: String = fname.chars().take(22).collect();
println!(" {:<22} {:>5} {:>7} {:>7} {:>6} {}", short, regs, shared, local, maxt, role(&fname));
}
println!("\nkernel families (Wagner n=192, k=7 pipeline):");
for (fam, n) in &by_family {
println!(" {:<20} x{:<3} {}", fam.trim_start_matches(char::is_numeric), n, role(fam));
}
// --- optional: actually launch one real solver kernel ---
if do_launch {
println!("\n--launch: running cleanup<64>(void*, uint) on the GPU ...");
let cname = CString::new(CLEANUP_MANGLED).unwrap();
let mut cf: CUfunction = ptr::null_mut();
match check(cuModuleGetFunction(&mut cf, module, cname.as_ptr()), "cuModuleGetFunction(cleanup)") {
Err(e) => println!(" skipped: {e}"),
Ok(()) => {
let bytes: usize = 64 * 1024 * 1024;
let mut dptr: CUdeviceptr = 0;
check(cuMemAlloc_v2(&mut dptr, bytes), "cuMemAlloc")?;
check(cuMemsetD8_v2(dptr, 0xCC, bytes), "cuMemset")?; // poison so we can see it run
let n: u32 = 1024;
let block: u32 = 64;
let grid: u32 = (n + block - 1) / block;
let mut p_buf: CUdeviceptr = dptr;
let mut p_n: u32 = n;
let mut params: [*mut c_void; 2] = [
&mut p_buf as *mut _ as *mut c_void,
&mut p_n as *mut _ as *mut c_void,
];
let rc = cuLaunchKernel(cf, grid, 1, 1, block, 1, 1, 0,
ptr::null_mut(), params.as_mut_ptr(), ptr::null_mut());
if rc != CUDA_SUCCESS {
println!(" launch returned: {}", err_str(rc));
} else {
let sync = cuCtxSynchronize();
if sync == CUDA_SUCCESS {
println!(" launch OK: grid={grid} block={block} — kernel executed and synchronized.");
} else {
println!(" launched, but sync error: {}", err_str(sync));
println!(" (expected-ish: exact element count/indexing for cleanup is unverified)");
}
}
cuMemFree_v2(dptr);
}
}
} else if !do_round0 {
println!("\n(tip: `--launch` runs cleanup<64>; `--round0` replays digit_f round 0)");
}
// --- optional: drive the real round-0 (digit_f) pipeline stage ---
if do_round0 {
if let Err(e) = round0::run(module) {
println!("round 0: {e}");
}
}
// --- replay the pipeline; optionally solve a known header via the verifier oracle ---
let header_hex = args.iter().position(|a| a == "--header").and_then(|i| args.get(i + 1)).cloned();
let do_solve = args.iter().any(|a| a == "--solve");
if do_replay || do_solve || header_hex.is_some() {
let rec_path = args.iter().skip(1).find(|a| a.ends_with(".log")).cloned()
.unwrap_or_else(|| "recording.log".to_string());
match replay::parse_recording(&rec_path) {
Err(e) => println!("replay: {e}"),
Ok(rec) => {
// header to solve: --solve uses the captured known-good job; --header is user-supplied
let header: Option<Vec<u8>> = if do_solve {
Some(parse_hex(KNOWN_HEADER))
} else {
header_hex.as_ref().map(|h| parse_hex(h)).filter(|h| h.len() >= 140)
};
let inject = header.as_ref().map(|h| {
let mid = blake2b::midstate(h);
replay::Inject { midstate: mid, tail4: [h[136], h[137], h[138], h[139]] }
});
if let Some(h) = &header {
println!("solving header ({} B); midstate=compress(header[0..128]), tail={:02x?}",
h.len(), &h[136..140]);
}
match replay::run(module, &rec, inject) {
Err(e) => println!("replay: {e}"),
Ok((_first, _mid, container)) => match &header {
None => println!("pipeline ran (no header to verify against)"),
Some(h) => {
println!("\nscanning container ({} MB) with the proven verifier as oracle...", container.len() / 1048576);
match scan_container(h, &container) {
Some(sol) => {
let (ok, msg) = verify::verify(&sol, |i| blake2b::index_hash(h, i));
println!("\n*** SOLUTION HARVESTED FROM GPU — {} ***", if ok { "VALID ✓" } else { "?" });
println!(" {msg}");
println!(" indices: {:?}{}", &sol[..8], " ...");
}
None => println!(" no verifying 128-index group in the dumped window"),
}
}
},
}
}
}
}
cuModuleUnload(module);
cuCtxDestroy_v2(ctx);
}
Ok(())
}
-213
@@ -1,213 +0,0 @@
//! Full-pipeline replay of an Equihash 192,7 solve.
//!
//! The whole pipeline addresses a single ~16 GB arena, so here we:
//! 1. allocate our own arena,
//! 2. for each recorded launch, rebase every device pointer in its arg buffer
//! (arena_base + (ptr - recorded_arena_base)),
//! 3. launch the same kernel with the same grid/block/shared via the
//! `extra`/BUFFER_POINTER mechanism,
//! 4. run cleanup -> digit_f -> digit_1..6 -> digit_l -> sort_and_compress.
//!
//! `inject` (Some(Inject)) overrides digit_f's midstate and header tail so a
//! caller can mint a new job from a header (see blake2b.rs).
use crate::cuda::*;
use std::ffi::{c_void, CString};
use std::ptr;
pub struct Launch {
pub name: String,
pub grid: (u32, u32, u32),
pub block: (u32, u32, u32),
pub shared: u32,
pub arg: Vec<u8>,
}
pub struct Recording {
pub allocs: Vec<(u64, u64)>, // (base, size)
pub pass: Vec<Launch>, // first full 10-kernel pass
}
fn triplet(s: &str) -> (u32, u32, u32) {
let v: Vec<u32> = s.split(',').filter_map(|x| x.parse().ok()).collect();
(v[0], v[1], v[2])
}
pub fn parse_recording(path: &str) -> Result<Recording, String> {
let text = std::fs::read_to_string(path).map_err(|e| format!("read {path}: {e}"))?;
let mut allocs = Vec::new();
let mut launches = Vec::new();
for line in text.lines() {
if let Some(rest) = line.strip_prefix("[alloc] ") {
// "<size> bytes @ 0x<base>"
let parts: Vec<&str> = rest.split_whitespace().collect();
if parts.len() >= 4 {
if let (Ok(size), Some(hex)) = (parts[0].parse::<u64>(), parts[3].strip_prefix("0x")) {
if let Ok(base) = u64::from_str_radix(hex, 16) {
allocs.push((base, size));
}
}
}
} else if let Some(rest) = line.strip_prefix("[REC] ") {
// "<name> g=.. b=.. sh=N sz=N arg=<hex>"
let mut name = "";
let (mut g, mut b, mut sh, mut arg) = ("", "", 0u32, "");
for (i, tok) in rest.split_whitespace().enumerate() {
if i == 0 { name = tok; }
else if let Some(v) = tok.strip_prefix("g=") { g = v; }
else if let Some(v) = tok.strip_prefix("b=") { b = v; }
else if let Some(v) = tok.strip_prefix("sh=") { sh = v.parse().unwrap_or(0); }
else if let Some(v) = tok.strip_prefix("arg=") { arg = v; }
}
let bytes = (0..arg.len() / 2)
.map(|i| u8::from_str_radix(&arg[2 * i..2 * i + 2], 16).unwrap_or(0))
.collect();
launches.push(Launch { name: name.to_string(), grid: triplet(g), block: triplet(b), shared: sh, arg: bytes });
}
}
// take the first full pass (cleanup .. sort_and_compress); allocs are kept as recorded
let start = launches.iter().position(|l| l.name.contains("7cleanup")).ok_or("no cleanup launch in recording")?;
let end = launches[start..].iter().position(|l| l.name.contains("sort_and_compress")).ok_or("no sort_and_compress in recording")? + start;
let pass: Vec<Launch> = launches.drain(start..=end).collect();
Ok(Recording { allocs, pass })
}
/// number of bytes at the start of a kernel's arg buffer that are by-value
/// (not device pointers) and must NOT be rebased.
fn byval_prefix(name: &str) -> usize {
if name.contains("7digit_f") { 64 } // two ulonglong4 (BLAKE2b midstate)
else if name.contains("sort_and_compress") { 112 } // SHA256_CTX by value
else { 0 }
}
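// Sketch, not part of the original crate: the pointer-rebasing step that `run`
// performs below, written as a standalone function. Every 8-byte little-endian
// word after the by-value prefix that falls inside the recorded arena
// [old_base, old_base + size) is moved to the same offset in the new arena.
// `rebase_arg_pointers` is a hypothetical name, and this sketch omits the
// `in_dev` address-range filter that `run` also applies.
#[allow(dead_code)]
fn rebase_arg_pointers(arg: &mut [u8], skip: usize, old_base: u64, size: u64, new_base: u64) -> usize {
    let mut rebased = 0;
    let mut off = skip; // by-value bytes (midstate, SHA256_CTX) are never touched
    while off + 8 <= arg.len() {
        let mut word = [0u8; 8];
        word.copy_from_slice(&arg[off..off + 8]);
        let v = u64::from_le_bytes(word);
        // `v - old_base < size` avoids the overflow that `old_base + size` could hit
        if v >= old_base && v - old_base < size {
            let moved = new_base + (v - old_base);
            arg[off..off + 8].copy_from_slice(&moved.to_le_bytes());
            rebased += 1;
        }
        off += 8;
    }
    rebased
}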
/// Optional injection to make the GPU solve a header we know:
/// the 64-byte BLAKE2b midstate (= compress(header[0..128])) and the 4 header
/// tail bytes header[136..140] (digit_f's trailing `uint` arg); header[128..136]
/// is constant zero.
pub struct Inject {
pub midstate: [u8; 64],
pub tail4: [u8; 4],
}
pub unsafe fn run(module: CUmodule, rec: &Recording, inject: Option<Inject>) -> Result<(Vec<u32>, [u8; 64], Vec<u8>), String> {
println!("\n== full-pipeline replay ({} kernels) ==", rec.pass.len());
// identify the arena: the alloc that the most pass pointers fall into
let in_dev = |v: u64| (0x7000_0000_0000..0x8000_0000_0000).contains(&v);
let mut votes = vec![0u32; rec.allocs.len()];
for l in &rec.pass {
let skip = byval_prefix(&l.name);
let mut off = skip;
while off + 8 <= l.arg.len() {
let v = u64::from_le_bytes(l.arg[off..off + 8].try_into().unwrap());
if in_dev(v) {
if let Some(i) = rec.allocs.iter().position(|&(b, s)| v >= b && v < b + s) {
votes[i] += 1;
}
}
off += 8;
}
}
let ai = votes.iter().enumerate().max_by_key(|(_, &v)| v).map(|(i, _)| i).ok_or("no arena found")?;
let (arena_base, arena_size) = rec.allocs[ai];
println!("arena : recorded base=0x{arena_base:x} size={} ({:.2} GB), {} ptrs", arena_size, arena_size as f64 / 1e9, votes[ai]);
// allocate our arena: as much as fits (pipeline only touches the low ~7 GB)
let mut free = 0usize; let mut total = 0usize;
cuMemGetInfo_v2(&mut free, &mut total);
let alloc_size = (arena_size as usize).min(free.saturating_sub(1_500_000_000));
let mut arena: CUdeviceptr = 0;
check(cuMemAlloc_v2(&mut arena, alloc_size), "alloc arena")?;
cuMemsetD8_v2(arena, 0, alloc_size);
println!("arena : allocated {:.2} GB at 0x{arena:x} (vram free {:.2} GB)", alloc_size as f64 / 1e9, free as f64 / 1e9);
let rebase = |v: u64| -> u64 { arena + (v - arena_base) };
// replay every kernel
for (idx, l) in rec.pass.iter().enumerate() {
let cname = CString::new(l.name.clone()).unwrap();
let mut f: CUfunction = ptr::null_mut();
check(cuModuleGetFunction(&mut f, module, cname.as_ptr()), &format!("get {}", short(&l.name)))?;
if l.shared > 0 {
// opt in to large dynamic shared memory (>48 KB)
cuFuncSetAttribute(f, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, l.shared as i32);
}
// rebase pointers in a copy of the arg buffer
let mut arg = l.arg.clone();
if l.name.contains("7digit_f") {
if let Some(inj) = &inject {
arg[0..64].copy_from_slice(&inj.midstate); // midstate
arg[96..100].copy_from_slice(&inj.tail4); // trailing uint = header[136..140]
}
}
let skip = byval_prefix(&l.name);
let mut off = skip;
let mut rebased = 0;
while off + 8 <= arg.len() {
let v = u64::from_le_bytes(arg[off..off + 8].try_into().unwrap());
if in_dev(v) && v >= arena_base && v < arena_base + arena_size {
arg[off..off + 8].copy_from_slice(&rebase(v).to_le_bytes());
rebased += 1;
}
off += 8;
}
// launch via the extra / BUFFER_POINTER mechanism
let mut argsz = arg.len();
let mut extra: [*mut c_void; 5] = [
CU_LAUNCH_PARAM_BUFFER_POINTER as *mut c_void,
arg.as_mut_ptr() as *mut c_void,
CU_LAUNCH_PARAM_BUFFER_SIZE as *mut c_void,
&mut argsz as *mut _ as *mut c_void,
CU_LAUNCH_PARAM_END as *mut c_void,
];
let rc = cuLaunchKernel(
f, l.grid.0, l.grid.1, l.grid.2, l.block.0, l.block.1, l.block.2,
l.shared, ptr::null_mut(), ptr::null_mut(), extra.as_mut_ptr(),
);
if rc != CUDA_SUCCESS {
cuMemFree_v2(arena);
return Err(format!("launch #{idx} {} failed: {}", short(&l.name), err_str(rc)));
}
let s = cuCtxSynchronize();
if s != CUDA_SUCCESS {
cuMemFree_v2(arena);
return Err(format!("kernel #{idx} {} sync error: {}", short(&l.name), err_str(s)));
}
println!(" [{idx}] {:<18} grid={:<6} block={:<5} shmem={:<6} rebased {rebased} ptr(s) OK",
short(&l.name), l.grid.0, l.block.0, l.shared);
}
// dump digit_l's container (+ first candidate) for oracle scanning
println!("\nreading digit_l container:");
let mut sol: Vec<u32> = Vec::new();
let mut container_bytes: Vec<u8> = Vec::new();
if let Some(dl) = rec.pass.iter().find(|l| l.name.contains("7digit_l")) {
let p = |off: usize| u64::from_le_bytes(dl.arg[off..off + 8].try_into().unwrap());
let counter = rebase(p(8));
let container = rebase(p(16));
let mut cnt = [0u32; 8];
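// check both copies: a failed copy would otherwise leave zeros that look like an empty result
let rc = cuMemcpyDtoH_v2(cnt.as_mut_ptr() as *mut c_void, counter, 32);
if rc != CUDA_SUCCESS {
cuMemFree_v2(arena);
return Err(format!("read digit_l counter: {}", err_str(rc)));
}
let dump = 32 * 1024 * 1024usize; // 32 MB window of the container
container_bytes = vec![0u8; dump];
let rc = cuMemcpyDtoH_v2(container_bytes.as_mut_ptr() as *mut c_void, container, dump);
if rc != CUDA_SUCCESS {
cuMemFree_v2(arena);
return Err(format!("read digit_l container: {}", err_str(rc)));
}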
sol = container_bytes[..512].chunks_exact(4).map(|c| u32::from_le_bytes(c.try_into().unwrap())).collect();
println!(" counter[0]={} container[0..4]={:?} (dumped {} MB)", cnt[0], &sol[..4], dump / 1048576);
}
// the midstate actually used by digit_f (injected, or from the recording)
let mut midstate = [0u8; 64];
if let Some(df) = rec.pass.iter().find(|l| l.name.contains("7digit_f")) {
midstate.copy_from_slice(&df.arg[0..64]);
}
if let Some(inj) = &inject { midstate = inj.midstate; }
cuMemFree_v2(arena);
Ok((sol, midstate, container_bytes))
}
fn short(name: &str) -> String {
name.split(['I', 'E']).next().unwrap_or(name).trim_start_matches('_').trim_start_matches("Z7").trim_start_matches("Z8").trim_start_matches("Z17").to_string()
}
-109
View File
@@ -1,109 +0,0 @@
//! Round 0 (`digit_f`) standalone driver for the Equihash 192,7 solver.
//!
//! This module hard-codes:
//! * launch config: grid=65536, block=256, shmem=0
//! * argument layout: (ulonglong4 mid0, ulonglong4 mid1, uint4* A, uint4* B,
//! uchar* C, uint* counters, uint nonce)
//! * a real 64-byte BLAKE2b midstate + nonce captured from one job
//! * buffer sizes derived from the kernel template array dims
//!
//! We replay that exact job's round 0: hash + bucket on the GPU, then read back
//! the per-bucket counters to prove the round executed and distributed entries.
use crate::cuda::*;
use std::ffi::{c_void, CString};
use std::ptr;
// Exact runtime variant (from the fatbin); demangled:
// void digit_f<656825858919744ul,2u,14u,12288u,3392u,1u,5498900316166ul,
// uint4[106][12288][32], uint4[106][12288][32], unsigned char[53][12288][64]>
// (ulonglong4, ulonglong4, uint4(*)[106][12288][32], uint4(*)[106][12288][32],
// unsigned char(*)[53][12288][64], unsigned int*, unsigned int)
const DIGIT_F: &str = "_Z7digit_fILm656825858919744ELj2ELj14ELj12288ELj3392ELj1ELm5498900316166EA106_A12288_A32_5uint4S3_A53_A12288_A64_hEv10ulonglong4S7_PT6_PT7_PT8_Pjj";
// 64-byte BLAKE2b midstate (8x u64 state) captured from a live job, passed as
// two ulonglong4 by value.
const MIDSTATE0: [u8; 32] = [
0x2d, 0xc6, 0x4e, 0x32, 0xef, 0x89, 0x19, 0x16, 0x30, 0xe1, 0x2d, 0x16, 0x17, 0xb9, 0xeb, 0xee,
0x33, 0x8a, 0x63, 0xc6, 0xbb, 0xb3, 0x96, 0x33, 0xf1, 0x79, 0x25, 0x9a, 0x7a, 0x26, 0xae, 0x67,
];
const MIDSTATE1: [u8; 32] = [
0x37, 0x5f, 0x85, 0x39, 0x46, 0x27, 0x08, 0xc0, 0xad, 0x3c, 0x08, 0xe3, 0xda, 0x65, 0xdf, 0xdd,
0x27, 0x73, 0x1f, 0x13, 0x4d, 0x6f, 0xea, 0x58, 0x96, 0x0d, 0x8b, 0xf3, 0x7c, 0x29, 0x29, 0x9a,
];
const NONCE_ARG: u32 = 1_508_556_231;
// Buffer sizes from the template array dimensions.
const BUF_A: usize = 106 * 12288 * 32 * 16; // uint4[106][12288][32] ≈ 636 MB
const BUF_C: usize = 53 * 12288 * 64; // uchar[53][12288][64] ≈ 40 MB
const COUNTERS: usize = 64 * 1024 * 1024; // generous (observed array ≈ 1.5 MB)
const COUNT_READBACK: usize = 12288 * 32; // per-bucket-slot counters to inspect
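// Sketch, not part of the original crate: the size arithmetic behind the
// constants above, as a tiny helper that multiplies the template array
// dimensions by the element size. `array_bytes` is a hypothetical name.
// array_bytes(&[106, 12288, 32], 16) gives the uint4[106][12288][32] size
// (666_894_336 bytes, i.e. 636 MiB), matching BUF_A; array_bytes(&[53, 12288, 64], 1)
// gives the uchar[53][12288][64] size (41_680_896 bytes), matching BUF_C.
#[allow(dead_code)]
fn array_bytes(dims: &[usize], elem_size: usize) -> usize {
    dims.iter().product::<usize>() * elem_size
}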
pub unsafe fn run(module: CUmodule) -> Result<(), String> {
println!("\n== round 0 (digit_f) standalone replay ==");
let mut free: usize = 0;
let mut total: usize = 0;
cuMemGetInfo_v2(&mut free, &mut total);
println!(
"vram : {} MB free / {} MB total; need ~{} MB",
free / 1048576, total / 1048576, (2 * BUF_A + BUF_C + COUNTERS) / 1048576
);
let cname = CString::new(DIGIT_F).unwrap();
let mut f: CUfunction = ptr::null_mut();
check(cuModuleGetFunction(&mut f, module, cname.as_ptr()), "cuModuleGetFunction(digit_f)")?;
println!("kernel : digit_f<...12288...> resolved, launching grid=65536 block=256");
// allocate the four device buffers
let (mut a, mut b, mut c, mut cnt): (CUdeviceptr, CUdeviceptr, CUdeviceptr, CUdeviceptr) = (0, 0, 0, 0);
check(cuMemAlloc_v2(&mut a, BUF_A), "alloc bufA")?;
check(cuMemAlloc_v2(&mut b, BUF_A), "alloc bufB")?;
check(cuMemAlloc_v2(&mut c, BUF_C), "alloc bufC")?;
check(cuMemAlloc_v2(&mut cnt, COUNTERS), "alloc counters")?;
cuMemsetD8_v2(a, 0, BUF_A);
cuMemsetD8_v2(b, 0, BUF_A);
cuMemsetD8_v2(c, 0, BUF_C);
cuMemsetD32_v2(cnt, 0, COUNTERS / 4); // cleanup<64> does this in the real pipeline
let mut mid0 = MIDSTATE0;
let mut mid1 = MIDSTATE1;
let (mut pa, mut pb, mut pc, mut pcnt) = (a, b, c, cnt);
let mut nonce = NONCE_ARG;
let mut params: [*mut c_void; 7] = [
mid0.as_mut_ptr() as *mut c_void,
mid1.as_mut_ptr() as *mut c_void,
&mut pa as *mut _ as *mut c_void,
&mut pb as *mut _ as *mut c_void,
&mut pc as *mut _ as *mut c_void,
&mut pcnt as *mut _ as *mut c_void,
&mut nonce as *mut _ as *mut c_void,
];
let rc = cuLaunchKernel(f, 65536, 1, 1, 256, 1, 1, 0, ptr::null_mut(), params.as_mut_ptr(), ptr::null_mut());
let result = if rc != CUDA_SUCCESS {
Err(format!("launch failed: {}", err_str(rc)))
} else {
let s = cuCtxSynchronize();
if s != CUDA_SUCCESS {
Err(format!("kernel sync error: {}", err_str(s)))
} else {
// read back the bucket counters and summarize
let mut host = vec![0u32; COUNT_READBACK];
cuMemcpyDtoH_v2(host.as_mut_ptr() as *mut c_void, cnt, COUNT_READBACK * 4);
let nz = host.iter().filter(|&&x| x != 0).count();
let sum: u64 = host.iter().map(|&x| x as u64).sum();
let mx = host.iter().copied().max().unwrap_or(0);
println!("result : round 0 executed OK");
println!(" {nz}/{COUNT_READBACK} counter slots non-zero");
println!(" total bucketed entries = {sum} (max per slot = {mx})");
println!(" (2^24 = {} threads each hashed; ~2^25 entries expected)", 1u64 << 24);
Ok(())
}
};
cuMemFree_v2(a);
cuMemFree_v2(b);
cuMemFree_v2(c);
cuMemFree_v2(cnt);
result
}
-81
View File
@@ -1,81 +0,0 @@
//! Equihash (n=192, k=7) solution verification (Wagner tree).
//!
//! A solution is 2^k = 128 indices. With collision length c = n/(k+1) = 24 bits
//! and each per-index hash being n=192 bits (24 bytes):
//! * all indices distinct
//! * canonical ordering: at every tree node, the smallest index of the left
//! subtree < that of the right subtree
//! * at level r (1..=k), each block of 2^r leaves XORs to zero in its first
//! r*24 bits; the full 128-leaf XOR is zero over all 192 bits.
const N_BITS: usize = 192;
const K: usize = 7;
const COLL: usize = N_BITS / (K + 1); // 24
/// number of leading zero bits in a 24-byte big-endian-ish hash (byte 0 = MSB).
fn leading_zero_bits(h: &[u8; 24]) -> usize {
let mut n = 0;
for &b in h {
if b == 0 { n += 8; } else { n += b.leading_zeros() as usize; break; }
}
n
}
fn xor24(a: &[u8; 24], b: &[u8; 24]) -> [u8; 24] {
let mut o = [0u8; 24];
for i in 0..24 { o[i] = a[i] ^ b[i]; }
o
}
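/// Verify a 128-index solution given a per-index hash function.
/// Checks distinctness and the per-level collision prefixes; the canonical
/// ordering condition from the module docs is not checked here.
/// Returns (valid, diagnostic_string).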
pub fn verify(indices: &[u32], hash: impl Fn(u32) -> [u8; 24]) -> (bool, String) {
if indices.len() != 128 {
return (false, format!("expected 128 indices, got {}", indices.len()));
}
// distinctness
let mut sorted = indices.to_vec();
sorted.sort_unstable();
sorted.dedup();
if sorted.len() != 128 {
return (false, format!("indices not distinct ({} unique)", sorted.len()));
}
// leaf hashes
let mut level: Vec<[u8; 24]> = indices.iter().map(|&i| hash(i)).collect();
// bottom-up: each level halves; check collision prefix grows by COLL bits
let mut worst_zero = usize::MAX;
for r in 1..=K {
let need = r * COLL;
let mut next = Vec::with_capacity(level.len() / 2);
for pair in level.chunks(2) {
let x = xor24(&pair[0], &pair[1]);
let z = leading_zero_bits(&x);
worst_zero = worst_zero.min(z);
if z < need {
return (false, format!("level {r}: only {z} leading zero bits, need {need}"));
}
next.push(x);
}
level = next;
}
let full_zero = level.len() == 1 && level[0].iter().all(|&b| b == 0);
let msg = format!(
"all {K} levels pass collision checks; final XOR {} (min prefix zeros seen = {})",
if full_zero { "= 0 (VALID)" } else { "!= 0" }, worst_zero
);
(full_zero, msg)
}
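// Sketch, not part of the original crate: `verify` above checks distinctness
// and the collision prefixes, while the module docs also list a canonical
// ordering condition. Following that wording (smallest index of the left
// subtree < smallest index of the right subtree, at every tree node), a
// separate check could look like this. `is_canonically_ordered` is a
// hypothetical name, and this is one reading of the rule, not a reference
// implementation.
#[allow(dead_code)]
fn is_canonically_ordered(indices: &[u32]) -> bool {
    // a single leaf (or an empty slice) is trivially ordered
    if indices.len() < 2 {
        return true;
    }
    // a Wagner tree always has a power-of-two number of leaves
    if !indices.len().is_power_of_two() {
        return false;
    }
    let (left, right) = indices.split_at(indices.len() / 2);
    // both halves are non-empty here, so both minima are `Some`
    left.iter().min() < right.iter().min()
        && is_canonically_ordered(left)
        && is_canonically_ordered(right)
}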
/// Quick diagnostic when the hash model may be off: report the number of
/// leading zero bits in the full 128-leaf XOR (168 or more suggests the hash
/// model is correct).
pub fn top_xor_zero_bits(indices: &[u32], hash: impl Fn(u32) -> [u8; 24]) -> usize {
let mut acc = [0u8; 24];
for &i in indices {
let h = hash(i);
for j in 0..24 { acc[j] ^= h[j]; }
}
leading_zero_bits(&acc)
}