The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.
T | Data type being scanned |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
ALGORITHM | [optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING) |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
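PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35) (default: the value of __CUDA_ARCH__ during the current compiler pass) |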
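For an exclusive prefix sum over 512 items in a blocked arrangement (128 threads, 4 consecutive items each), suppose the set of input thread_data across the block of threads is {[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}. The corresponding output thread_data in those threads will be {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}.
Definition at line 193 of file block_scan.cuh.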
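The blocked-arrangement example above can be checked on the host with a short sequential model. This is only a sketch of the semantics, not how BlockScan computes the result on the device; the names BlockedData and reference_blocked_exclusive_sum are illustrative assumptions and are not part of CUB.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative container (not a CUB type): element [t][i] is item i owned by thread t.
template <std::size_t BLOCK_THREADS, std::size_t ITEMS_PER_THREAD>
using BlockedData = std::array<std::array<int, ITEMS_PER_THREAD>, BLOCK_THREADS>;

// Host-side reference model, NOT the device algorithm: walks the items in
// blocked order (all items of thread 0, then thread 1, ...) and writes the
// running sum of everything seen before each item.
template <std::size_t BLOCK_THREADS, std::size_t ITEMS_PER_THREAD>
BlockedData<BLOCK_THREADS, ITEMS_PER_THREAD>
reference_blocked_exclusive_sum(const BlockedData<BLOCK_THREADS, ITEMS_PER_THREAD>& input)
{
    BlockedData<BLOCK_THREADS, ITEMS_PER_THREAD> output{};
    int running = 0;  // initial value 0 lands in output[0] of thread 0
    for (std::size_t t = 0; t < BLOCK_THREADS; ++t) {
        for (std::size_t i = 0; i < ITEMS_PER_THREAD; ++i) {
            output[t][i] = running;
            running += input[t][i];
        }
    }
    return output;
}
```

Filling a BlockedData<128, 4> with ones and passing it to this function reproduces the {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]} layout quoted above.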
Data Structures | |
struct | TempStorage |
The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. | |
Public Types | |
typedef std::array< T, dim > | TempStorage |
Public Member Functions | |
BlockScan (TempStorage &tmp) | |
void | ExclusiveSum (T &in, T &out) |
Collective constructors | |
__device__ __forceinline__ | BlockScan () |
Collective constructor using a private static allocation of shared memory as temporary storage. | |
__device__ __forceinline__ | BlockScan (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. | |
Exclusive prefix sum operations | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Exclusive prefix sum operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output [0] in thread0. | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output [0] in thread0. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Exclusive prefix scan operations | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Exclusive prefix scan operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Exclusive prefix scan operations (no initial value, single datum per thread) | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. With no initial value, the output computed for thread0 is undefined. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined. | |
Exclusive prefix scan operations (no initial value, multiple data per thread) | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. With no initial value, the output computed for thread0 is undefined. | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined. | |
Inclusive prefix sum operations | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Inclusive prefix sum operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Inclusive prefix scan operations | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Inclusive prefix scan operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. | |
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. | |
Private Types | |
enum | { BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z } |
Constants. | |
typedef BlockScanWarpScans< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > | WarpScans |
typedef BlockScanRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z,(SAFE_ALGORITHM==BLOCK_SCAN_RAKING_MEMOIZE), PTX_ARCH > | Raking |
typedef If<(SAFE_ALGORITHM==BLOCK_SCAN_WARP_SCANS), WarpScans, Raking >::Type | InternalBlockScan |
Define the delegate type for the desired algorithm. | |
typedef InternalBlockScan::TempStorage | _TempStorage |
Shared memory storage layout type for BlockScan. | |
Private Member Functions | |
__device__ __forceinline__ _TempStorage & | PrivateStorage () |
Internal storage allocator. | |
Private Attributes | |
_TempStorage & | temp_storage |
Shared storage reference. | |
unsigned int | linear_tid |
Linear thread-id. | |
TempStorage & | tmp |
Static Private Attributes | |
static const BlockScanAlgorithm | SAFE_ALGORITHM |
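typedef InternalBlockScan::TempStorage _TempStorage | private |
Shared memory storage layout type for BlockScan.
Definition at line 228 of file block_scan.cuh.
typedef If<(SAFE_ALGORITHM==BLOCK_SCAN_WARP_SCANS), WarpScans, Raking >::Type InternalBlockScan | private |
Define the delegate type for the desired algorithm.
Definition at line 225 of file block_scan.cuh.
typedef BlockScanRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z,(SAFE_ALGORITHM==BLOCK_SCAN_RAKING_MEMOIZE), PTX_ARCH > Raking | private |
Definition at line 220 of file block_scan.cuh.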
typedef std::array<T,dim> cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::TempStorage |
Definition at line 81 of file cudify_alpaka.hpp.
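typedef BlockScanWarpScans< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > WarpScans | private |
Definition at line 219 of file block_scan.cuh.
anonymous enum | private |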
Constants.
Enumerator | |
---|---|
BLOCK_THREADS | The thread block size in threads. |
Definition at line 202 of file block_scan.cuh.
__device__ __forceinline__ BlockScan () | inline |
Collective constructor using a private static allocation of shared memory as temporary storage.
Definition at line 271 of file block_scan.cuh.
__device__ __forceinline__ BlockScan (TempStorage &temp_storage) | inline |
Collective constructor using the specified memory allocation as temporary storage.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
Definition at line 281 of file block_scan.cuh.
BlockScan (TempStorage &tmp) | inline |
Definition at line 90 of file cudify_alpaka.hpp.
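template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. With no initial value, the output computed for thread0 is undefined.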
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
Definition at line 1136 of file block_scan.cuh.
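To make the "undefined for thread0" caveat concrete, here is a minimal host-side sketch, assuming one item per thread. It is not the CUB implementation; MaxOp and reference_exclusive_scan_no_init are hypothetical names. The model never writes slot 0, which stands in for the fact that callers must not read thread0's output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scan functor with the documented shape T operator()(const T&, const T&).
struct MaxOp {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// Host-side reference model: output[i] stands for thread i's output.
// Without an initial value there is nothing to place in output[0], so this
// model leaves it untouched; on the device that value is undefined.
template <typename ScanOp>
void reference_exclusive_scan_no_init(const std::vector<int>& input,
                                      std::vector<int>& output,
                                      ScanOp scan_op)
{
    assert(output.size() == input.size());
    if (input.size() < 2) return;
    output[1] = input[0];  // the first defined output is simply the first input
    for (std::size_t i = 2; i < input.size(); ++i) {
        output[i] = scan_op(output[i - 1], input[i - 1]);
    }
}
```

If a defined value is needed in thread0, the overloads that take initial_value are the ones to use.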
template<typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
For a max scan whose running prefix starts at INT_MIN, suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, ..., 124, 126. The output for the second segment will be 126, 128, 128, 130, ..., 252, 254.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 867 of file block_scan.cuh.
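The interplay between a stateful prefix functor and successive segments can be sketched on the host as follows. This is an illustrative model under stated assumptions, not CUB code: RunningMaxPrefix, MaxOp and reference_segmented_exclusive_scan are hypothetical names, each loop iteration stands for one block-wide scan call, and the single functor call stands for the value lane0 would return.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical scan functor with the documented shape T operator()(const T&, const T&).
struct MaxOp {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// Hypothetical stateful prefix functor with the documented shape
// T operator()(T block_aggregate): it returns the prefix for the current
// segment and then folds that segment's aggregate into its running value.
struct RunningMaxPrefix {
    int running;
    explicit RunningMaxPrefix(int initial) : running(initial) {}
    int operator()(int block_aggregate) {
        int old_prefix = running;
        running = MaxOp()(old_prefix, block_aggregate);
        return old_prefix;
    }
};

// Host-side model: scans `data` in place, one segment at a time.
template <typename ScanOp, typename PrefixOp>
void reference_segmented_exclusive_scan(std::vector<int>& data, std::size_t segment_size,
                                        ScanOp scan_op, PrefixOp& prefix_op)
{
    assert(segment_size > 0);
    for (std::size_t base = 0; base < data.size(); base += segment_size) {
        const std::size_t end = std::min(base + segment_size, data.size());
        // Aggregate of this segment's inputs, handed to the functor.
        int aggregate = data[base];
        for (std::size_t i = base + 1; i < end; ++i) aggregate = scan_op(aggregate, data[i]);
        // The functor's return value seeds the segment's exclusive scan.
        int running = prefix_op(aggregate);
        for (std::size_t i = base; i < end; ++i) {
            const int item = data[i];
            data[i] = running;
            running = scan_op(running, item);
        }
    }
}
```

Running this over 0, -1, 2, -3, ... with a segment size of 128 and RunningMaxPrefix(INT_MIN) yields the two segment outputs listed above.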
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1156 of file block_scan.cuh.
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.
For a max scan seeded with INT_MIN, suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
Definition at line 728 of file block_scan.cuh.
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
For a max scan seeded with INT_MIN, suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. Furthermore the value 126 will be stored in block_aggregate for all threads.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 778 of file block_scan.cuh.
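A small host-side sketch can show how the outputs and block_aggregate relate in the seeded case. It is an assumption-laden model rather than the device code: ScanResult, MaxOp and reference_exclusive_scan are hypothetical names, and, following the parameter description above, the aggregate is taken over the input items only (the initial value is not folded in).

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical scan functor with the documented shape T operator()(const T&, const T&).
struct MaxOp {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// Illustrative result bundle (not a CUB type).
struct ScanResult {
    std::vector<int> output;  // output[i] stands for thread i's output
    int block_aggregate;      // reduction of the input items only
};

// Host-side reference model of a seeded exclusive scan with aggregate.
template <typename ScanOp>
ScanResult reference_exclusive_scan(const std::vector<int>& input, int initial_value,
                                    ScanOp scan_op)
{
    assert(!input.empty());
    ScanResult result{std::vector<int>(input.size()), input[0]};
    int running = initial_value;  // what thread 0 receives
    for (std::size_t i = 0; i < input.size(); ++i) {
        result.output[i] = running;
        running = scan_op(running, input[i]);
        if (i > 0) result.block_aggregate = scan_op(result.block_aggregate, input[i]);
    }
    return result;
}
```

With the 128 inputs 0, -1, 2, -3, ..., 126, -127, MaxOp and INT_MIN as the seed, this model gives the outputs listed above and an aggregate of 126.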
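template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. With no initial value, the output computed for thread0 is undefined.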
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
Definition at line 1187 of file block_scan.cuh.
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
For a max scan whose running prefix starts at INT_MIN, suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510. The output for the second segment will be 510, 512, 512, 514, 514, 516, ..., 1020, 1022.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 1099 of file block_scan.cuh.
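template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. With no initial value, the output computed for thread0 is undefined.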
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1218 of file block_scan.cuh.
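template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.
For a max scan seeded with INT_MIN, suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |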
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
Definition at line 929 of file block_scan.cuh.
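One common way to organize a scan when each thread owns several consecutive items is in three phases: a per-thread reduction, a scan of the per-thread partials, and a per-thread scan seeded with the resulting prefix. The sketch below models that structure sequentially on the host. It is an assumption about how such a scan can be decomposed, not a statement of what block_scan.cuh does internally, and MaxOp and reference_three_phase_exclusive_scan are hypothetical names.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical scan functor with the documented shape T operator()(const T&, const T&).
struct MaxOp {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// Host-side model; data[t][i] is item i of thread t and is scanned in place.
template <std::size_t BLOCK_THREADS, std::size_t ITEMS_PER_THREAD, typename ScanOp>
void reference_three_phase_exclusive_scan(int (&data)[BLOCK_THREADS][ITEMS_PER_THREAD],
                                          int initial_value, ScanOp scan_op)
{
    // Phase 1: each thread reduces its own items to a single partial.
    int partial[BLOCK_THREADS];
    for (std::size_t t = 0; t < BLOCK_THREADS; ++t) {
        partial[t] = data[t][0];
        for (std::size_t i = 1; i < ITEMS_PER_THREAD; ++i)
            partial[t] = scan_op(partial[t], data[t][i]);
    }
    // Phase 2: exclusive scan of the partials, seeded with initial_value,
    // turns each partial into the prefix that precedes that thread's items.
    int running = initial_value;
    for (std::size_t t = 0; t < BLOCK_THREADS; ++t) {
        const int p = partial[t];
        partial[t] = running;
        running = scan_op(running, p);
    }
    // Phase 3: each thread scans its own items starting from its prefix.
    for (std::size_t t = 0; t < BLOCK_THREADS; ++t) {
        int prefix = partial[t];
        for (std::size_t i = 0; i < ITEMS_PER_THREAD; ++i) {
            const int item = data[t][i];
            data[t][i] = prefix;
            prefix = scan_op(prefix, item);
        }
    }
}
```

On the device, only the middle phase needs communication between threads; the first and last phases touch thread-private data.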
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
For a max scan seeded with INT_MIN, suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. Furthermore the value 510 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 991 of file block_scan.cuh.
void ExclusiveSum (T &in, T &out) | inline |
Definition at line 94 of file cudify_alpaka.hpp.
__device__ __forceinline__ void ExclusiveSum (T input, T &output) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0.
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
Definition at line 333 of file block_scan.cuh.
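An exclusive sum is the inclusive running sum shifted right by one position, with 0 placed in front. The host-side sketch below expresses that with the standard library's std::partial_sum; reference_exclusive_sum is a hypothetical helper for checking expected values and says nothing about how the device code is written.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Host-side reference model: output[i] stands for thread i's output.
std::vector<int> reference_exclusive_sum(const std::vector<int>& input)
{
    std::vector<int> output(input.size(), 0);  // output[0] keeps the initial value 0
    if (input.size() > 1) {
        // Inclusive sums of all but the last input, written one slot to the right.
        std::partial_sum(input.begin(), input.end() - 1, output.begin() + 1);
    }
    return output;
}
```

Passing 128 ones returns 0, 1, ..., 127, matching the example above.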
template<typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, ..., 127. The output for the second segment will be 128, 129, ..., 255.
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 465 of file block_scan.cuh.
__device__ __forceinline__ void ExclusiveSum (T input, T &output, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127. Furthermore the value 128 will be stored in block_aggregate for all threads.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 380 of file block_scan.cuh.
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output [0] in thread0.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
Definition at line 521 of file block_scan.cuh.
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. Can be stateful.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, 2, 3, ..., 510, 511. The output for the second segment will be 512, 513, 514, 515, ..., 1022, 1023.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 673 of file block_scan.cuh.
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output [0] in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. Furthermore the value 512 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 572 of file block_scan.cuh.
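To make the blocked arrangement concrete, here is a small sequential sketch of the same computation. It assumes 4 items per thread as in the example above. ThreadItems, kItemsPerThread and blocked_exclusive_sum are hypothetical names, and the outer loop plays the role of the threads in the block.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kItemsPerThread = 4;  // assumption taken from the example
using ThreadItems = std::array<int, kItemsPerThread>;

// Sequential model of a blocked exclusive sum: thread t owns items
// [t * 4, t * 4 + 3] of the logical sequence. The scan is written back
// in place and the block-wide aggregate (sum of all inputs) is returned.
int blocked_exclusive_sum(std::vector<ThreadItems>& thread_data) {
    assert(!thread_data.empty());
    int running = 0;  // 0 is the initial value, assigned to item 0 of thread 0
    for (ThreadItems& items : thread_data) {
        for (int& item : items) {
            int value = item;
            item = running;
            running += value;
        }
    }
    return running;
}
```

With 128 entries of [1,1,1,1], the first entry becomes [0,1,2,3], the last becomes [508,509,510,511], and the returned aggregate is 512.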
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.
Suppose scan_op is a maximum operator and the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
Definition at line 1711 of file block_scan.cuh.
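The values in the example above are what a running maximum produces. The sketch below is a sequential host-side model under that assumption. MaxOp, inclusive_scan_model and alternating_input are hypothetical helpers, and the functor only mirrors the T operator()(const T &a, const T &b) shape named in the parameter list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical binary scan functor returning the larger of its arguments.
struct MaxOp {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// Sequential model of an inclusive scan: output[i] combines input[0..i].
template <typename T, typename ScanOp>
std::vector<T> inclusive_scan_model(const std::vector<T>& input, ScanOp scan_op) {
    assert(!input.empty());
    std::vector<T> output(input.size());
    output[0] = input[0];
    for (std::size_t i = 1; i < input.size(); ++i) {
        output[i] = scan_op(output[i - 1], input[i]);
    }
    return output;
}

// Builds the sequence 0, -1, 2, -3, ... of length n.
std::vector<int> alternating_input(std::size_t n) {
    std::vector<int> values(n);
    for (std::size_t i = 0; i < n; ++i) {
        int v = static_cast<int>(i);
        values[i] = (i % 2 == 0) ? v : -v;
    }
    return values;
}
```

Scanning alternating_input(128) with MaxOp yields 0, 0, 2, 2, ..., 126, 126, matching the output listed above.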
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Suppose scan_op is a maximum operator and the input d_data is 0, -1, 2, -3, 4, -5, ... The corresponding output for the first segment will be 0, 0, 2, 2, ..., 126, 126. The output for the second segment will be 128, 128, 130, 130, ..., 254, 254.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in-out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 1848 of file block_scan.cuh.
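For a generic scan operator there is no universal zero, so the prefix functor has to supply a suitable seed itself. The sketch below models a running maximum across segments on the host. Maximum, RunningMaxPrefixOp, inclusive_scan_tile and segmented_inclusive_max are hypothetical names, and seeding the first tile with INT_MIN is an assumption chosen so that the first segment is unaffected.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical binary scan functor.
struct Maximum {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// Hypothetical stateful prefix functor: returns the maximum seen in all
// earlier tiles, then folds in the current tile's aggregate.
struct RunningMaxPrefixOp {
    int running_max;

    explicit RunningMaxPrefixOp(int initial) : running_max(initial) {}

    int operator()(int block_aggregate) {
        int old_prefix = running_max;
        running_max = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

// Sequential stand-in for one block-wide inclusive scan over a tile.
template <typename ScanOp, typename PrefixOp>
void inclusive_scan_tile(const int* in, int* out, std::size_t n,
                         ScanOp scan_op, PrefixOp& prefix_op) {
    assert(n > 0);
    int aggregate = in[0];
    for (std::size_t i = 1; i < n; ++i) aggregate = scan_op(aggregate, in[i]);

    int running = prefix_op(aggregate);  // seed that logically prefixes the tile
    for (std::size_t i = 0; i < n; ++i) {
        running = scan_op(running, in[i]);
        out[i] = running;
    }
}

// Scans data one tile at a time, carrying the running maximum between tiles.
std::vector<int> segmented_inclusive_max(const std::vector<int>& data, std::size_t tile_items) {
    assert(tile_items > 0);
    std::vector<int> out(data.size());
    RunningMaxPrefixOp prefix_op(INT_MIN);
    for (std::size_t base = 0; base < data.size(); base += tile_items) {
        std::size_t n = std::min(tile_items, data.size() - base);
        inclusive_scan_tile(data.data() + base, out.data() + base, n, Maximum(), prefix_op);
    }
    return out;
}
```

With the input 0, -1, 2, -3, ... and tiles of 128 items, the first tile ends at 126, 126 and the second tile starts at 128, 128, as in the example above.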
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
Suppose scan_op is a maximum operator and the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126. Furthermore, the value 126 will be stored in block_aggregate for all threads.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1760 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.
Suppose scan_op is a maximum operator and the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
Definition at line 1908 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Suppose scan_op is a maximum operator and the input d_data is 0, -1, 2, -3, 4, -5, ... The corresponding output for the first segment will be 0, 0, 2, 2, 4, 4, ..., 510, 510. The output for the second segment will be 512, 512, 514, 514, 516, 516, ..., 1022, 1022.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in-out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 2092 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Suppose scan_op is a maximum operator and the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }. Furthermore, the value 510 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1978 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
Definition at line 1279 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, ... The corresponding output for the first segment will be 1, 2, ..., 128. The output for the second segment will be 129, 130, ..., 256.
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in-out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 1408 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128. Furthermore, the value 128 will be stored in block_aggregate for all threads.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1324 of file block_scan.cuh.
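For an inclusive sum, the block-wide aggregate coincides with the last output element. The sketch below shows that relationship with the standard library's std::partial_sum as a sequential stand-in for the collective. inclusive_sum_model is a hypothetical helper name, not part of the documented class.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential model of an inclusive sum with aggregate: writes the running
// totals of input into output and returns the total of all inputs.
int inclusive_sum_model(const std::vector<int>& input, std::vector<int>& output) {
    assert(!input.empty());
    output.resize(input.size());
    std::partial_sum(input.begin(), input.end(), output.begin());
    return output.back();  // the last inclusive total is the aggregate
}
```

For 128 inputs equal to 1, the output is 1, 2, ..., 128 and the returned aggregate is 128.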
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
Definition at line 1463 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, ... The corresponding output for the first segment will be 1, 2, 3, 4, ..., 511, 512. The output for the second segment will be 513, 514, 515, 516, ..., 1023, 1024.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in-out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
Definition at line 1642 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. Furthermore, the value 512 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1530 of file block_scan.cuh.
|
inline private |
Internal storage allocator.
Definition at line 247 of file block_scan.cuh.
|
private |
Linear thread-id.
Definition at line 239 of file block_scan.cuh.
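A linear thread id collapses a thread's (x, y, z) coordinates into a single index. The sketch below assumes the common row-major convention in which x varies fastest. linear_thread_id is a hypothetical helper, and the source does not state which formula the class uses internally.

```cpp
#include <cassert>

// Row-major linearization of a thread's coordinates within a block of
// dim_x * dim_y * dim_z threads (assumed convention: x varies fastest).
int linear_thread_id(int x, int y, int z, int dim_x, int dim_y, int dim_z) {
    assert(x >= 0 && x < dim_x);
    assert(y >= 0 && y < dim_y);
    assert(z >= 0 && z < dim_z);
    return (z * dim_y + y) * dim_x + x;
}
```

For an 8 x 4 x 2 block under this convention, thread (3, 2, 1) maps to index (1 * 4 + 2) * 8 + 3 = 51.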
|
static private |
Ensure the template parameterization meets the requirements of the specified algorithm. Currently, the BLOCK_SCAN_WARP_SCANS policy cannot be used with thread block sizes that are not a multiple of the architectural warp size.
Definition at line 214 of file block_scan.cuh.
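A requirement of this kind can be checked at compile time. The sketch below is a hypothetical model rather than the library's code: the ScanAlgorithm enum, the safe_algorithm helper, the default warp size of 32, and the decision to fall back to a raking variant when the block size is not a warp multiple are all assumptions made for illustration.

```cpp
// Hypothetical stand-in for the algorithm enumerators.
enum class ScanAlgorithm { Raking, WarpScans };

// Returns the requested algorithm unless it is WarpScans and the block size
// is not a multiple of the warp size, in which case Raking is substituted.
constexpr ScanAlgorithm safe_algorithm(ScanAlgorithm requested,
                                       int block_threads,
                                       int warp_threads = 32) {
    return (requested == ScanAlgorithm::WarpScans && block_threads % warp_threads != 0)
               ? ScanAlgorithm::Raking
               : requested;
}

// 128 threads is a whole number of 32-thread warps, so the request stands.
static_assert(safe_algorithm(ScanAlgorithm::WarpScans, 128) == ScanAlgorithm::WarpScans,
              "warp-multiple block sizes keep the requested algorithm");

// 100 threads is not, so this sketch substitutes the raking variant.
static_assert(safe_algorithm(ScanAlgorithm::WarpScans, 100) == ScanAlgorithm::Raking,
              "other block sizes fall back in this sketch");
```

Resolving the choice in a constexpr function keeps the decision out of the kernel body, so an unsupported combination never reaches run time.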
|
private |
Shared storage reference.
Definition at line 236 of file block_scan.cuh.
|
private |
Definition at line 84 of file cudify_alpaka.hpp.