DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. More...

Detailed Description

DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.

Overview: Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence where each element is computed to be the reduction of the elements occurring earlier in the input sequence. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the i^th output reduction incorporates the i^th input. The term exclusive indicates the i^th input is not incorporated into the i^th output reduction.

: As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]. The central idea is to leverage a small, constant factor of redundant work in order to overlap the latencies of global prefix propagation with local computation. As such, our algorithm requires only ~2n data movement (n inputs are read, n outputs are written), and typically proceeds at "memcpy" speeds.

: [1] Duane Merrill and Michael Garland. "Single-pass Parallel Prefix Scan with Decoupled Look-back", NVIDIA Technical Report NVR-2016-002, 2016.

Usage Considerations: \cdp_class{DeviceScan}

Performance: \linear_performance{prefix scan}

: The following chart illustrates DeviceScan::ExclusiveSum performance across different CUDA architectures for int32 keys. \plots_below

Definition at line 89 of file device_scan.cuh.

Static Public Member Functions
Exclusive scans
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t	ExclusiveSum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide exclusive prefix sum. The value of 0 is applied as the initial value, and is assigned to *d_out.

template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT , typename InitValueT >
static CUB_RUNTIME_FUNCTION cudaError_t	ExclusiveScan (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide exclusive prefix scan using the specified binary `scan_op` functor. The `init_value` value is applied as the initial value, and is assigned to *d_out.

Inclusive scans
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t	InclusiveSum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide inclusive prefix sum.

template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT >
static CUB_RUNTIME_FUNCTION cudaError_t	InclusiveScan (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide inclusive prefix scan using the specified binary `scan_op` functor.

Member Function Documentation

◆ ExclusiveScan()

template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT , typename InitValueT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveScan	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		ScanOpT	scan_op,
		InitValueT	init_value,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. The init_value value is applied as the initial value, and is assigned to *d_out.

Supports non-commutative scan operators.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates the exclusive prefix min-scan of an int device vector

: #include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>

// CustomMin functor

struct CustomMin

{

template <typename T>

CUB_RUNTIME_FUNCTION __forceinline__

T operator()(const T &a, const T &b) const {

return (b < a) ? b : a;

}

};

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [ , , , , , , ]

CustomMin min_op

...

// Determine temporary device storage requirements for exclusive prefix scan

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, (int) MAX_INT, num_items);

// Allocate temporary storage for exclusive prefix scan

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix min-scan

cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, (int) MAX_INT, num_items);

// d_out <-- [2147483647, 8, 6, 6, 5, 3, 0]

cub::num_items
KeyT const ValueT ValueT OffsetT OffsetT num_items
[in] Total number of input data items
Definition dispatch_radix_sort.cuh:168

cub::d_out
OutputIteratorT d_out
< [in] Pointer to the input sequence of data items
Definition dispatch_reduce.cuh:71

cub::DeviceScan::ExclusiveScan
static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide exclusive prefix scan using the specified binary scan_op functor....
Definition device_scan.cuh:243

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT	[inferred] Random-access output iterator type for writing scan outputs \iterator
ScanOp	[inferred] Binary scan functor type having member `T operator()(const T &a, const T &b)`
Identity	[inferred] Type of the `identity` value used Binary scan functor type having member `T operator()(const T &a, const T &b)`

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output sequence of data items
[in]	scan_op	Binary scan functor
[in]	init_value	Initial value to seed the exclusive scan (and is assigned to *d_out)
[in]	num_items	Total number of input items (i.e., the length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 243 of file device_scan.cuh.

◆ ExclusiveSum()

template<typename InputIteratorT , typename OutputIteratorT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveSum	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide exclusive prefix sum. The value of 0 is applied as the initial value, and is assigned to *d_out.

Supports non-commutative sum operators.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Performance: The following charts illustrate saturated exclusive sum performance across different CUDA architectures for int32 and int64 items, respectively.

Snippet: The code snippet below illustrates the exclusive prefix sum of an int device vector.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [ , , , , , , ]

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix sum

cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out s<-- [0, 8, 14, 21, 26, 29, 29]

cub::DeviceScan::ExclusiveSum
static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide exclusive prefix sum. The value of 0 is applied as the initial value,...
Definition device_scan.cuh:149

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT	[inferred] Random-access output iterator type for writing scan outputs \iterator

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output sequence of data items
[in]	num_items	Total number of input items (i.e., the length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 149 of file device_scan.cuh.

◆ InclusiveScan()

template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::InclusiveScan	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		ScanOpT	scan_op,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide inclusive prefix scan using the specified binary scan_op functor.

Supports non-commutative scan operators.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates the inclusive prefix min-scan of an int device vector.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>

// CustomMin functor

struct CustomMin

{

template <typename T>

CUB_RUNTIME_FUNCTION __forceinline__

T operator()(const T &a, const T &b) const {

return (b < a) ? b : a;

}

};

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [ , , , , , , ]

CustomMin min_op;

...

// Determine temporary device storage requirements for inclusive prefix scan

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, num_items);

// Allocate temporary storage for inclusive prefix scan

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix min-scan

cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, num_items);

// d_out <-- [8, 6, 6, 5, 3, 0, 0]

cub::DeviceScan::InclusiveScan
static CUB_RUNTIME_FUNCTION cudaError_t InclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide inclusive prefix scan using the specified binary scan_op functor.
Definition device_scan.cuh:407

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT	[inferred] Random-access output iterator type for writing scan outputs \iterator
ScanOp	[inferred] Binary scan functor type having member `T operator()(const T &a, const T &b)`

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output sequence of data items
[in]	scan_op	Binary scan functor
[in]	num_items	Total number of input items (i.e., the length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 407 of file device_scan.cuh.

◆ InclusiveSum()

template<typename InputIteratorT , typename OutputIteratorT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::InclusiveSum	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide inclusive prefix sum.

Supports non-commutative sum operators.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates the inclusive prefix sum of an int device vector.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [ , , , , , , ]

...

// Determine temporary device storage requirements for inclusive prefix sum

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage for inclusive prefix sum

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix sum

cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [8, 14, 21, 26, 29, 29, 38]

cub::DeviceScan::InclusiveSum
static CUB_RUNTIME_FUNCTION cudaError_t InclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide inclusive prefix sum.
Definition device_scan.cuh:323

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT	[inferred] Random-access output iterator type for writing scan outputs \iterator

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output sequence of data items
[in]	num_items	Total number of input items (i.e., the length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 323 of file device_scan.cuh.

The documentation for this struct was generated from the following file:

openfpm_data/src/util/cuda/cub_old/device/device_scan.cuh

Detailed Description

Static Public Member Functions

Member Function Documentation

◆ ExclusiveScan()

◆ ExclusiveSum()

◆ InclusiveScan()

◆ InclusiveSum()