OpenFPM_pdata  3.0.0
Project that contains the implementation of distributed structures
cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. More...

Detailed Description

template<typename T, int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.

Template Parameters
T  Data type being scanned
BLOCK_DIM_X  The thread block length in threads along the X dimension
ALGORITHM  [optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING)
BLOCK_DIM_Y  [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z  [optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH  [optional] \ptxversion
Overview
  • Given a list of input elements and a binary reduction operator, a prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction.
  • \rowmajor
  • BlockScan can be optionally specialized by algorithm to accommodate different workload profiles:
    1. cub::BLOCK_SCAN_RAKING. An efficient (high throughput) "raking reduce-then-scan" prefix scan algorithm. More...
    2. cub::BLOCK_SCAN_RAKING_MEMOIZE. Similar to cub::BLOCK_SCAN_RAKING, but having higher throughput at the expense of additional register pressure for intermediate storage. More...
    3. cub::BLOCK_SCAN_WARP_SCANS. A quick (low latency) "tiled warpscans" prefix scan algorithm. More...
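As a concrete illustration of the inclusive/exclusive distinction above, the host-side C++17 sketch below (plain `std::inclusive_scan`/`std::exclusive_scan`, not CUB) computes the same sequential semantics that BlockScan computes cooperatively across a thread block; the helper names are ours, for illustration only.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Host-side sketch (not CUB): the sequential semantics of a prefix scan.
std::vector<int> inclusive_scan_seq(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    // Inclusive: the i-th output incorporates the i-th input.
    std::inclusive_scan(in.begin(), in.end(), out.begin());
    return out;
}

std::vector<int> exclusive_scan_seq(const std::vector<int>& in, int initial)
{
    std::vector<int> out(in.size());
    // Exclusive: the i-th input is not part of the i-th output;
    // `initial` seeds the first output element.
    std::exclusive_scan(in.begin(), in.end(), out.begin(), initial);
    return out;
}
```

For input `{3, 1, 4, 1, 5}` the inclusive scan yields `{3, 4, 8, 9, 14}` while the exclusive scan seeded with 0 yields `{0, 3, 4, 8, 9}`.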
Performance Considerations
  • \granularity
  • Uses special instructions when applicable (e.g., warp SHFL)
  • Uses synchronization-free communication between warp lanes when applicable
  • Invokes a minimal number of block-wide synchronization barriers (only one or two depending on algorithm selection)
  • Incurs zero bank conflicts for most types
  • Computation is slightly more efficient (i.e., having lower instruction overhead) for:
    • Prefix sum variants (vs. generic scan)
    • \blocksize
  • See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives
A Simple Example
\blockcollective{BlockScan}
The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide exclusive prefix sum
BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
}
Suppose the set of input thread_data across the block of threads is {[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}. The corresponding output thread_data in those threads will be {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}.
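The documented result can be sanity-checked with a sequential host-side equivalent (not CUB; the helper name is ours): an exclusive prefix sum over 512 ones yields 0, 1, 2, ..., 511, so thread t's 4-item segment holds {4t, 4t+1, 4t+2, 4t+3}.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Host-side check (not CUB) of the documented example output.
std::vector<int> block_exclusive_sum_of_ones(int num_items)
{
    std::vector<int> in(num_items, 1), out(num_items);
    // Exclusive prefix sum seeded with 0, matching ExclusiveSum.
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
    return out;
}
```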

Definition at line 193 of file block_scan.cuh.

Data Structures

struct  TempStorage
 \smemstorage{BlockScan} More...
 

Public Member Functions

Collective constructors
__device__ __forceinline__ BlockScan ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ BlockScan (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Exclusive prefix sum operations
__device__ __forceinline__ void ExclusiveSum (T input, T &output)
 Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. More...
 
__device__ __forceinline__ void ExclusiveSum (T input, T &output, T &block_aggregate)
 Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
Exclusive prefix sum operations (multiple data per thread)
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD])
 Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0. More...
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate)
 Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
Exclusive prefix scan operations
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. More...
 
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op, T &block_aggregate)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
Exclusive prefix scan operations (multiple data per thread)
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. More...
 
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
Exclusive prefix scan operations (no initial value, single datum per thread)
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. With no initial value, the output computed for thread0 is undefined. More...
 
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined. More...
 
Exclusive prefix scan operations (no initial value, multiple data per thread)
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. With no initial value, the output computed for thread0 is undefined. More...
 
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate)
 Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined. More...
 
Inclusive prefix sum operations
__device__ __forceinline__ void InclusiveSum (T input, T &output)
 Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. More...
 
__device__ __forceinline__ void InclusiveSum (T input, T &output, T &block_aggregate)
 Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<typename BlockPrefixCallbackOp >
__device__ __forceinline__ void InclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
Inclusive prefix sum operations (multiple data per thread)
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD])
 Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. More...
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate)
 Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp >
__device__ __forceinline__ void InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
Inclusive prefix scan operations
template<typename ScanOp >
__device__ __forceinline__ void InclusiveScan (T input, T &output, ScanOp scan_op)
 Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. More...
 
template<typename ScanOp >
__device__ __forceinline__ void InclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate)
 Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void InclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
Inclusive prefix scan operations (multiple data per thread)
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op)
 Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. More...
 
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate)
 Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)
 Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More...
 

Private Types

enum  { BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z }
 Constants. More...
 
typedef BlockScanWarpScans< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > WarpScans
 
typedef BlockScanRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z,(SAFE_ALGORITHM==BLOCK_SCAN_RAKING_MEMOIZE), PTX_ARCH > Raking
 
typedef If<(SAFE_ALGORITHM==BLOCK_SCAN_WARP_SCANS), WarpScans, Raking >::Type InternalBlockScan
 Define the delegate type for the desired algorithm.
 
typedef InternalBlockScan::TempStorage _TempStorage
 Shared memory storage layout type for BlockScan.
 

Private Member Functions

__device__ __forceinline__ _TempStorage & PrivateStorage ()
 Internal storage allocator.
 

Private Attributes

_TempStorage & temp_storage
 Shared storage reference.
 
unsigned int linear_tid
 Linear thread-id.
 

Static Private Attributes

static const BlockScanAlgorithm SAFE_ALGORITHM
 

Member Enumeration Documentation

◆ anonymous enum

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
anonymous enum
private

Constants.

Enumerator
BLOCK_THREADS 

The thread block size in threads.

Definition at line 202 of file block_scan.cuh.

Constructor & Destructor Documentation

◆ BlockScan()

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockScan ( TempStorage &  temp_storage )
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in] temp_storage  Reference to memory allocation having layout type TempStorage

Definition at line 281 of file block_scan.cuh.

Member Function Documentation

◆ ExclusiveScan() [1/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T  input,
T &  output,
T  initial_value,
ScanOp  scan_op 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.

  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide exclusive prefix max scan
BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max());
}
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126.
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input  Calling thread's input item
[out] output  Calling thread's output item (may be aliased to input)
[in] initial_value  Initial value to seed the exclusive scan (and is assigned to output in thread0)
[in] scan_op  Binary scan functor

Definition at line 728 of file block_scan.cuh.
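The exclusive max-scan semantics of the snippet above can be sketched sequentially on the host (not CUB; the helper name is ours): each output is the maximum of the seed and all earlier inputs, with the current input excluded.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Host-side sketch (not CUB) of an exclusive max scan:
// out[i] = max(initial, in[0], ..., in[i-1]); in[i] itself is excluded.
std::vector<int> exclusive_max_scan(const std::vector<int>& in, int initial)
{
    std::vector<int> out(in.size());
    int running = initial;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = running;                    // exclusive: emit before folding in[i]
        running = std::max(running, in[i]);  // fold current input into the prefix
    }
    return out;
}
```

Feeding it the documented input 0, -1, 2, -3, ..., 126, -127 with seed INT_MIN reproduces the documented output INT_MIN, 0, 0, 2, ..., 124, 126.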

◆ ExclusiveScan() [2/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T  input,
T &  output,
T  initial_value,
ScanOp  scan_op,
T &  block_aggregate 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide exclusive prefix max scan
int block_aggregate;
BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate);
}
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. Furthermore the value 126 will be stored in block_aggregate for all threads.
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input  Calling thread's input item
[out] output  Calling thread's output item (may be aliased to input)
[in] initial_value  Initial value to seed the exclusive scan (and is assigned to output in thread0)
[in] scan_op  Binary scan functor
[out] block_aggregate  Block-wide aggregate reduction of input items

Definition at line 778 of file block_scan.cuh.

◆ ExclusiveScan() [3/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( input,
T &  output,
ScanOp  scan_op,
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix max scan over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockScan for a 1D block of 128 threads
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(INT_MIN);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data = d_data[block_offset];
// Collectively compute the block-wide exclusive prefix max scan
BlockScan(temp_storage).ExclusiveScan(
thread_data, thread_data, INT_MIN, cub::Max(), prefix_op);
// Store scanned items to output segment
d_data[block_offset] = thread_data;
}
}
Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, ..., 124, 126. The output for the second segment will be 126, 128, 128, 130, ..., 252, 254.
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
BlockPrefixCallbackOp  [inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in] input  Calling thread's input item
[out] output  Calling thread's output item (may be aliased to input)
[in] scan_op  Binary scan functor
[in,out] block_prefix_callback_op  [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 867 of file block_scan.cuh.
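The tile-by-tile running-prefix pattern above can be emulated sequentially on the host (not CUB; `RunningMaxPrefix` and `scan_tiles_max` are our names, mirroring the documented `BlockPrefixCallbackOp` contract): each tile is max-scanned exclusively with a seed taken from a callback that carries the running maximum across tiles.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Host-side sketch (not CUB) of the running-prefix callback pattern.
struct RunningMaxPrefix
{
    int running_total;
    // Returns the old prefix and folds the tile aggregate into the total,
    // just like the documented BlockPrefixCallbackOp example.
    int operator()(int tile_aggregate)
    {
        int old_prefix = running_total;
        running_total = std::max(running_total, tile_aggregate);
        return old_prefix;
    }
};

// Exclusive max scan of `data` in place, one tile at a time.
void scan_tiles_max(std::vector<int>& data, int tile_size)
{
    RunningMaxPrefix prefix_op{INT_MIN};
    for (std::size_t base = 0; base < data.size(); base += tile_size)
    {
        // Aggregate of this tile, then the seed from the callback.
        int aggregate = INT_MIN;
        for (int i = 0; i < tile_size; ++i)
            aggregate = std::max(aggregate, data[base + i]);
        int running = prefix_op(aggregate);
        // Exclusive max scan of the tile, seeded with the running prefix.
        for (int i = 0; i < tile_size; ++i)
        {
            int v = data[base + i];
            data[base + i] = running;
            running = std::max(running, v);
        }
    }
}
```

On the documented input 0, -1, 2, -3, 4, -5, ... with 128-item tiles this reproduces the first segment INT_MIN, 0, 0, 2, ... and a second segment starting 126, 128, 128, 130, ....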

◆ ExclusiveScan() [4/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
T  initial_value,
ScanOp  scan_op 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.

  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide exclusive prefix max scan
BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max());
}
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }.
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input  Calling thread's input items
[out] output  Calling thread's output items (may be aliased to input)
[in] initial_value  Initial value to seed the exclusive scan (and is assigned to output[0] in thread0)
[in] scan_op  Binary scan functor

Definition at line 929 of file block_scan.cuh.

◆ ExclusiveScan() [5/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
T  initial_value,
ScanOp  scan_op,
T &  block_aggregate 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide exclusive prefix max scan
int block_aggregate;
BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate);
}
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. Furthermore the value 510 will be stored in block_aggregate for all threads.
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input  Calling thread's input items
[out] output  Calling thread's output items (may be aliased to input)
[in] initial_value  Initial value to seed the exclusive scan (and is assigned to output[0] in thread0)
[in] scan_op  Binary scan functor
[out] block_aggregate  Block-wide aggregate reduction of input items

Definition at line 991 of file block_scan.cuh.

◆ ExclusiveScan() [6/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
ScanOp  scan_op,
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide block_aggregate of all inputs is passed to the functor as its input argument.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix max scan over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
typedef cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE> BlockStore;
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
__shared__ union {
typename BlockLoad::TempStorage load;
typename BlockScan::TempStorage scan;
typename BlockStore::TempStorage store;
} temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(0);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
__syncthreads();    // barrier required before reusing the aliased temp_storage
// Collectively compute the block-wide exclusive prefix max scan
BlockScan(temp_storage.scan).ExclusiveScan(
thread_data, thread_data, INT_MIN, cub::Max(), prefix_op);
__syncthreads();    // barrier required before reusing the aliased temp_storage
// Store scanned items to output segment
BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
}
Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510. The output for the second segment will be 510, 512, 512, 514, 514, 516, ..., 1020, 1022.
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
BlockPrefixCallbackOp  [inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in]  input  Calling thread's input items
[out]  output  Calling thread's output items (may be aliased to input)
[in]  scan_op  Binary scan functor
[in,out]  block_prefix_callback_op  [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 1099 of file block_scan.cuh.

◆ ExclusiveScan() [7/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T  input,
T &  output,
ScanOp  scan_op 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. With no initial value, the output computed for thread0 is undefined.

  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)
[in]  scan_op  Binary scan functor
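This overload is documented without a snippet. As a hypothetical sketch (the kernel name and the d_in/d_out buffers are illustrative, not part of the CUB API), an exclusive prefix max scan without an initial value could look like this:

```cuda
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(int *d_in, int *d_out)
{
    // Specialize BlockScan for a 1D block of 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;
    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;
    // Obtain input item for each thread
    int thread_data = d_in[threadIdx.x];
    // Collectively compute the block-wide exclusive prefix max scan.
    // No initial value is supplied, so the output in thread0 is undefined.
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());
    d_out[threadIdx.x] = thread_data;
}
```

For inputs 0, -1, 2, -3, ..., threads 1 through 127 produce the same results as the seeded variant (0, 0, 2, 2, ...); only thread0's output differs, since it has no predecessor and no initial value.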

Definition at line 1136 of file block_scan.cuh.

◆ ExclusiveScan() [8/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T  input,
T &  output,
ScanOp  scan_op,
T &  block_aggregate 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined.

  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)
[in]  scan_op  Binary scan functor
[out]  block_aggregate  Block-wide aggregate reduction of input items
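No snippet accompanies this overload. The following hypothetical sketch (buffer names d_in, d_out, and d_agg are illustrative assumptions) shows the single-item variant that also returns the block-wide aggregate:

```cuda
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(int *d_in, int *d_out, int *d_agg)
{
    // Specialize BlockScan for a 1D block of 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;
    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;
    // Obtain input item for each thread
    int thread_data = d_in[threadIdx.x];
    // Collectively compute the block-wide exclusive prefix max scan.
    // block_aggregate receives the reduction of all 128 inputs and is valid
    // in every thread; thread0's scan output is undefined (no initial value).
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
    d_out[threadIdx.x] = thread_data;
    if (threadIdx.x == 0) *d_agg = block_aggregate;
}
```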

Definition at line 1156 of file block_scan.cuh.

◆ ExclusiveScan() [9/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
ScanOp  scan_op 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. With no initial value, the output computed for thread0 is undefined.

  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]  input  Calling thread's input items
[out]  output  Calling thread's output items (may be aliased to input)
[in]  scan_op  Binary scan functor
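This multi-item overload also lacks a snippet. A hypothetical sketch (kernel name, d_in/d_out buffers, and the hand-rolled blocked loads are illustrative; BlockLoad could be used instead) for a 1D block of 128 threads with 4 items per thread:

```cuda
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(int *d_in, int *d_out)
{
    // Specialize BlockScan for a 1D block of 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;
    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;
    // Obtain 4 consecutive items per thread (blocked arrangement, 512 items total)
    int thread_data[4];
    for (int i = 0; i < 4; ++i)
        thread_data[i] = d_in[threadIdx.x * 4 + i];
    // Collectively compute the block-wide exclusive prefix max scan.
    // With no initial value, output[0] in thread0 is undefined.
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());
    for (int i = 0; i < 4; ++i)
        d_out[threadIdx.x * 4 + i] = thread_data[i];
}
```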

Definition at line 1187 of file block_scan.cuh.

◆ ExclusiveScan() [10/10]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
ScanOp  scan_op,
T &  block_aggregate 
)
inline

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined.

  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]  input  Calling thread's input items
[out]  output  Calling thread's output items (may be aliased to input)
[in]  scan_op  Binary scan functor
[out]  block_aggregate  Block-wide aggregate reduction of input items
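As with the preceding overloads, no snippet is provided. A hypothetical sketch (d_in, d_out, and d_agg are illustrative buffer names) combining the multi-item scan with the block-wide aggregate:

```cuda
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(int *d_in, int *d_out, int *d_agg)
{
    // Specialize BlockScan for a 1D block of 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;
    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;
    // Obtain 4 consecutive items per thread (blocked arrangement, 512 items total)
    int thread_data[4];
    for (int i = 0; i < 4; ++i)
        thread_data[i] = d_in[threadIdx.x * 4 + i];
    // Collectively compute the block-wide exclusive prefix max scan.
    // block_aggregate receives the max over all 512 items and is valid in
    // every thread; output[0] in thread0 is undefined (no initial value).
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
    for (int i = 0; i < 4; ++i)
        d_out[threadIdx.x * 4 + i] = thread_data[i];
    if (threadIdx.x == 0) *d_agg = block_aggregate;
}
```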

Definition at line 1218 of file block_scan.cuh.

◆ ExclusiveSum() [1/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveSum ( T  input,
T &  output 
)
inline

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0.

  • \identityzero
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide exclusive prefix sum
BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127.
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)

Definition at line 333 of file block_scan.cuh.

◆ ExclusiveSum() [2/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveSum ( T  input,
T &  output,
T &  block_aggregate 
)
inline

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.

  • \identityzero
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide exclusive prefix sum
int block_aggregate;
BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127. Furthermore the value 128 will be stored in block_aggregate for all threads.
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)
[out]  block_aggregate  Block-wide aggregate reduction of input items

Definition at line 380 of file block_scan.cuh.

◆ ExclusiveSum() [3/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveSum ( T  input,
T &  output,
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide block_aggregate of all inputs is passed to the functor as its input argument.

  • \identityzero
  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix sum over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total += block_aggregate;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockScan for a 1D block of 128 threads
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(0);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data = d_data[block_offset];
// Collectively compute the block-wide exclusive prefix sum
BlockScan(temp_storage).ExclusiveSum(
thread_data, thread_data, prefix_op);
// Store scanned items to output segment
d_data[block_offset] = thread_data;
}
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, ..., 127. The output for the second segment will be 128, 129, ..., 255.
Template Parameters
BlockPrefixCallbackOp  [inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)
[in,out]  block_prefix_callback_op  [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 465 of file block_scan.cuh.

◆ ExclusiveSum() [4/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveSum ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD] 
)
inline

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0.

  • \identityzero
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide exclusive prefix sum
BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
Parameters
[in]  input  Calling thread's input items
[out]  output  Calling thread's output items (may be aliased to input)

Definition at line 521 of file block_scan.cuh.

◆ ExclusiveSum() [5/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveSum ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
T &  block_aggregate 
)
inline

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.

  • \identityzero
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide exclusive prefix sum
int block_aggregate;
BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. Furthermore the value 512 will be stored in block_aggregate for all threads.
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
Parameters
[in]  input  Calling thread's input items
[out]  output  Calling thread's output items (may be aliased to input)
[out]  block_aggregate  Block-wide aggregate reduction of input items

Definition at line 572 of file block_scan.cuh.

◆ ExclusiveSum() [6/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExclusiveSum ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide block_aggregate of all inputs is passed to the functor as its input argument.

  • \identityzero
  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix sum over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total += block_aggregate;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
typedef cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE> BlockStore;
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
__shared__ union {
typename BlockLoad::TempStorage load;
typename BlockScan::TempStorage scan;
typename BlockStore::TempStorage store;
} temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(0);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
__syncthreads();    // barrier required before reusing the aliased temp_storage
// Collectively compute the block-wide exclusive prefix sum
BlockScan(temp_storage.scan).ExclusiveSum(
thread_data, thread_data, prefix_op);
__syncthreads();    // barrier required before reusing the aliased temp_storage
// Store scanned items to output segment
BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
}
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, 2, 3, ..., 510, 511. The output for the second segment will be 512, 513, 514, 515, ..., 1022, 1023.
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
BlockPrefixCallbackOp  [inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in]  input  Calling thread's input items
[out]  output  Calling thread's output items (may be aliased to input)
[in,out]  block_prefix_callback_op  [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 673 of file block_scan.cuh.

◆ InclusiveScan() [1/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveScan ( T  input,
T &  output,
ScanOp  scan_op 
)
inline

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.

  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide inclusive prefix max scan
BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max());
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126.
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)
[in]  scan_op  Binary scan functor

Definition at line 1711 of file block_scan.cuh.

◆ InclusiveScan() [2/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveScan ( T  input,
T &  output,
ScanOp  scan_op,
T &  block_aggregate 
)
inline

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide inclusive prefix max scan
int block_aggregate;
BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126. Furthermore the value 126 will be stored in block_aggregate for all threads.
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)
[in]  scan_op  Binary scan functor
[out]  block_aggregate  Block-wide aggregate reduction of input items

Definition at line 1760 of file block_scan.cuh.

◆ InclusiveScan() [3/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveScan ( T  input,
T &  output,
ScanOp  scan_op,
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide block_aggregate of all inputs is passed to the functor as its input argument.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • Supports non-commutative scan operators.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix max scan over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockScan for a 1D block of 128 threads
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(INT_MIN);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data = d_data[block_offset];
// Collectively compute the block-wide inclusive prefix max scan
BlockScan(temp_storage).InclusiveScan(
thread_data, thread_data, cub::Max(), prefix_op);
// Store scanned items to output segment
d_data[block_offset] = thread_data;
}
Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be 0, 0, 2, 2, ..., 126, 126. The output for the second segment will be 128, 128, 130, 130, ..., 254, 254.
Template Parameters
ScanOp  [inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
BlockPrefixCallbackOp  [inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in]  input  Calling thread's input item
[out]  output  Calling thread's output item (may be aliased to input)
[in]  scan_op  Binary scan functor
[in,out]  block_prefix_callback_op  [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 1848 of file block_scan.cuh.

◆ InclusiveScan() [4/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
ScanOp  scan_op 
)
inline

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.

  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide inclusive prefix max scan
BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max());
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }.
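Because the blocked arrangement assigns thread t the items 4t through 4t+3, the logical scan order is just the flattened 512-item sequence. A sequential sketch (plain Python, illustrative only) confirms the output stated above:

```python
from itertools import accumulate

# Blocked arrangement: thread t owns items 4t..4t+3, so the logical
# scan order is the flattened 512-item sequence.
data = [i if i % 2 == 0 else -i for i in range(512)]
scanned = list(accumulate(data, max))
thread_0 = scanned[0:4]        # first thread's four outputs
thread_127 = scanned[508:512]  # last thread's four outputs
```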
Template Parameters
ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.
ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]inputCalling thread's input items
[out]outputCalling thread's output items (may be aliased to input)
[in]scan_opBinary scan functor

Definition at line 1908 of file block_scan.cuh.

◆ InclusiveScan() [5/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
ScanOp  scan_op,
T &  block_aggregate 
)
inline

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide inclusive prefix max scan
int block_aggregate;
BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }. Furthermore the value 510 will be stored in block_aggregate for all threads.
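The relationship between the scan output and block_aggregate can be seen in a sequential sketch (plain Python, illustrative only): for an inclusive scan, the block-wide aggregate is simply the final scanned element, and every thread receives that same value.

```python
from itertools import accumulate

data = [i if i % 2 == 0 else -i for i in range(512)]  # 0, -1, 2, -3, ...
scanned = list(accumulate(data, max))
# For an inclusive scan the block-wide aggregate is the last element
# of the scan; the collective hands this same value to every thread.
block_aggregate = scanned[-1]
```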
Template Parameters
ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.
ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]inputCalling thread's input items
[out]outputCalling thread's output items (may be aliased to input)
[in]scan_opBinary scan functor
[out]block_aggregateblock-wide aggregate reduction of input items

Definition at line 1978 of file block_scan.cuh.

◆ InclusiveScan() [6/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveScan ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
ScanOp  scan_op,
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • Supports non-commutative scan operators.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix max scan over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
typedef cub::BlockLoad<int, 128, 4, BLOCK_LOAD_TRANSPOSE> BlockLoad;
typedef cub::BlockStore<int, 128, 4, BLOCK_STORE_TRANSPOSE> BlockStore;
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
__shared__ union {
typename BlockLoad::TempStorage load;
typename BlockScan::TempStorage scan;
typename BlockStore::TempStorage store;
} temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(0);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
// Collectively compute the block-wide inclusive prefix max scan
BlockScan(temp_storage.scan).InclusiveScan(
thread_data, thread_data, cub::Max(), prefix_op);
// Store scanned items to output segment
BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
}
Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be 0, 0, 2, 2, 4, 4, ..., 510, 510. The output for the second segment will be 512, 512, 514, 514, 516, 516, ..., 1022, 1022.
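The interaction between the stateful callback and the per-tile scans can be modeled sequentially (plain Python, not CUDA; PrefixOp is an illustrative analogue of the BlockPrefixCallbackOp functor in the snippet):

```python
class PrefixOp:
    # Python analogue of the BlockPrefixCallbackOp functor above.
    def __init__(self, running_total):
        self.running_total = running_total
    def __call__(self, block_aggregate):
        old_prefix = self.running_total
        self.running_total = max(old_prefix, block_aggregate)
        return old_prefix

data = [i if i % 2 == 0 else -i for i in range(1024)]  # two 512-item tiles
op, out = PrefixOp(0), []
for start in range(0, len(data), 512):
    tile = data[start:start + 512]
    seed = op(max(tile))        # callback returns the old running prefix
    running = seed
    for x in tile:
        running = max(running, x)
        out.append(running)
```

The model reproduces the outputs stated above: the first tile ends 510, 510 and the second tile runs 512, 512, ..., 1022, 1022.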
Template Parameters
ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.
ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in]inputCalling thread's input items
[out]outputCalling thread's output items (may be aliased to input)
[in]scan_opBinary scan functor
[in,out]block_prefix_callback_op[warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 2092 of file block_scan.cuh.

◆ InclusiveSum() [1/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveSum ( T  input,
T &  output 
)
inline

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.

  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide inclusive prefix sum
BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128.
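A sequential sketch (plain Python, illustrative only) of this prefix sum: with 128 threads each contributing 1, an inclusive sum gives thread t the value t + 1.

```python
from itertools import accumulate

# 128 threads each contribute 1; an inclusive prefix sum gives
# thread t the output t + 1.
thread_data = [1] * 128
scanned = list(accumulate(thread_data))
```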
Parameters
[in]inputCalling thread's input item
[out]outputCalling thread's output item (may be aliased to input)

Definition at line 1279 of file block_scan.cuh.

◆ InclusiveSum() [2/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveSum ( T  input,
T &  output,
T &  block_aggregate 
)
inline

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.

  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain input item for each thread
int thread_data;
...
// Collectively compute the block-wide inclusive prefix sum
int block_aggregate;
BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128. Furthermore the value 128 will be stored in block_aggregate for all threads.
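A sequential sketch (plain Python, illustrative only) showing where block_aggregate comes from: it equals the full sum of the inputs, i.e. the last element of the inclusive scan, and the collective hands that same value to every thread.

```python
thread_data = [1] * 128
scanned, total = [], 0
for x in thread_data:
    total += x
    scanned.append(total)
# The aggregate handed to every thread equals the full sum,
# i.e. the last element of the inclusive scan.
block_aggregate = total
```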
Parameters
[in]inputCalling thread's input item
[out]outputCalling thread's output item (may be aliased to input)
[out]block_aggregateblock-wide aggregate reduction of input items

Definition at line 1324 of file block_scan.cuh.

◆ InclusiveSum() [3/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveSum ( T  input,
T &  output,
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • \rowmajor
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix sum over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total += block_aggregate;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockScan for a 1D block of 128 threads
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(0);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data = d_data[block_offset];
// Collectively compute the block-wide inclusive prefix sum
BlockScan(temp_storage).InclusiveSum(
thread_data, thread_data, prefix_op);
// Store scanned items to output segment
d_data[block_offset] = thread_data;
}
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 1, 2, ..., 128. The output for the second segment will be 129, 130, ..., 256.
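The running-total seeding can be modeled sequentially (plain Python, not CUDA): the callback hands each 128-item tile the old running total as its seed, then folds in the tile's aggregate.

```python
data = [1] * 256                 # two 128-item tiles of ones
running_total, out = 0, []       # prefix_op(0) in the snippet
for start in range(0, len(data), 128):
    tile = data[start:start + 128]
    seed = running_total         # old prefix the callback returns
    running_total += sum(tile)   # callback folds in the tile aggregate
    acc = seed
    for x in tile:
        acc += x
        out.append(acc)
```

The model reproduces the outputs stated above: 1 through 128 for the first segment, 129 through 256 for the second.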
Template Parameters
BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in]inputCalling thread's input item
[out]outputCalling thread's output item (may be aliased to input)
[in,out]block_prefix_callback_op[warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 1408 of file block_scan.cuh.

◆ InclusiveSum() [4/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveSum ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD] 
)
inline

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.

  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide inclusive prefix sum
BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }.
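Since the blocked arrangement scans items in flattened order, a sequential sketch (plain Python, illustrative only) reproduces the per-thread outputs stated above:

```python
from itertools import accumulate

flat = [1] * 512                  # 128 threads x 4 items, all ones
scanned = list(accumulate(flat))
thread_0 = scanned[0:4]           # first thread's four outputs
thread_127 = scanned[508:512]     # last thread's four outputs
```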
Template Parameters
ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.
Parameters
[in]inputCalling thread's input items
[out]outputCalling thread's output items (may be aliased to input)

Definition at line 1463 of file block_scan.cuh.

◆ InclusiveSum() [5/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveSum ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
T &  block_aggregate 
)
inline

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.

  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates an inclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockScan for a 1D block of 128 threads on type int
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively compute the block-wide inclusive prefix sum
int block_aggregate;
BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. Furthermore the value 512 will be stored in block_aggregate for all threads.
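As with the single-item variant, block_aggregate is the total of all 512 inputs, i.e. the last element of the inclusive scan. A one-line sequential check (plain Python, illustrative only):

```python
from itertools import accumulate

scanned = list(accumulate([1] * 512))  # 128 threads x 4 ones
# Every thread additionally receives the grand total as block_aggregate.
block_aggregate = scanned[-1]
```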
Template Parameters
ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.
Parameters
[in]inputCalling thread's input items
[out]outputCalling thread's output items (may be aliased to input)
[out]block_aggregateblock-wide aggregate reduction of input items

Definition at line 1530 of file block_scan.cuh.

◆ InclusiveSum() [6/6]

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp >
__device__ __forceinline__ void cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InclusiveSum ( T(&)  input[ITEMS_PER_THREAD],
T(&)  output[ITEMS_PER_THREAD],
BlockPrefixCallbackOp &  block_prefix_callback_op 
)
inline

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
  • \blocked
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix sum over multiple "tiles" of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
// Running prefix
int running_total;
// Constructor
__device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ int operator()(int block_aggregate)
{
int old_prefix = running_total;
running_total += block_aggregate;
return old_prefix;
}
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
// Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
typedef cub::BlockLoad<int, 128, 4, BLOCK_LOAD_TRANSPOSE> BlockLoad;
typedef cub::BlockStore<int, 128, 4, BLOCK_STORE_TRANSPOSE> BlockStore;
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
__shared__ union {
typename BlockLoad::TempStorage load;
typename BlockScan::TempStorage scan;
typename BlockStore::TempStorage store;
} temp_storage;
// Initialize running total
BlockPrefixCallbackOp prefix_op(0);
// Have the block iterate over segments of items
for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
{
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
// Collectively compute the block-wide inclusive prefix sum
BlockScan(temp_storage.scan).InclusiveSum(
thread_data, thread_data, prefix_op);
// Store scanned items to output segment
BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
}
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 1, 2, 3, 4, ..., 511, 512. The output for the second segment will be 513, 514, 515, 516, ..., 1023, 1024.
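The same running-total pattern, now with 512-item tiles, can be modeled sequentially (plain Python, not CUDA):

```python
data = [1] * 1024                # two 512-item tiles of ones
running_total, out = 0, []       # prefix_op(0) in the snippet
for start in range(0, len(data), 512):
    tile = data[start:start + 512]
    seed = running_total         # old prefix returned by the callback
    running_total += sum(tile)   # callback folds in the tile aggregate
    acc = seed
    for x in tile:
        acc += x
        out.append(acc)
```

The model reproduces the outputs stated above: 1 through 512 for the first segment, 513 through 1024 for the second.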
Template Parameters
ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.
BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)
Parameters
[in]inputCalling thread's input items
[out]outputCalling thread's output items (may be aliased to input)
[in,out]block_prefix_callback_op[warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Definition at line 1642 of file block_scan.cuh.

Field Documentation

◆ SAFE_ALGORITHM

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
const BlockScanAlgorithm cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SAFE_ALGORITHM
static private
Initial value:
=
((ALGORITHM == BLOCK_SCAN_WARP_SCANS) && (BLOCK_THREADS % CUB_WARP_THREADS(PTX_ARCH) != 0)) ?
BLOCK_SCAN_RAKING :
ALGORITHM

Ensure the template parameterization meets the requirements of the specified algorithm. Currently, the BLOCK_SCAN_WARP_SCANS policy cannot be used with thread block sizes not a multiple of the architectural warp size.
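The fallback rule can be sketched as follows (plain Python; the enumerator values are stand-ins, not the actual cub constants):

```python
# Enum stand-ins; the real cub enumerator values are assumptions here.
BLOCK_SCAN_RAKING, BLOCK_SCAN_WARP_SCANS = 0, 2
WARP_THREADS = 32  # architectural warp size

def safe_algorithm(algorithm, block_threads):
    # BLOCK_SCAN_WARP_SCANS requires a whole number of warps per block;
    # otherwise fall back to BLOCK_SCAN_RAKING.
    if algorithm == BLOCK_SCAN_WARP_SCANS and block_threads % WARP_THREADS != 0:
        return BLOCK_SCAN_RAKING
    return algorithm
```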

Definition at line 214 of file block_scan.cuh.


The documentation for this class was generated from the following file:
block_scan.cuh