BlockReduceRaking provides raking-based methods of parallel reduction across a CUDA thread block. Supports non-commutative reduction operators.
Supports non-commutative binary reduction operators. Unlike commutative reduction operators (e.g., addition), a non-commutative reduction operator (e.g., string concatenation) applied across a sequence of inputs must honor the relative ordering of items and partial reductions.
Compared to the implementation of BlockReduceRakingCommutativeOnly (which does not support non-commutative operators), this implementation requires a few extra rounds of inter-thread communication.
PTX_ARCH: The PTX compute capability for which to specialize this collective.
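The ordering requirement above can be illustrated on the host. The following is a minimal sketch, not CUB code: `Concat`, `FoldInOrder`, `ReduceContiguous` and `ReduceStrided` are hypothetical names introduced here. It contrasts a contiguous (raking-style) grouping, which keeps operands in their original order, with a strided grouping that would only be acceptable for a commutative operator such as addition.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical non-commutative operator: string concatenation.
struct Concat
{
    std::string operator()(const std::string& a, const std::string& b) const
    {
        return a + b;
    }
};

// Hypothetical helper: left-to-right fold of a non-empty list of partials.
template <typename T, typename Op>
T FoldInOrder(const std::vector<T>& partials, Op op)
{
    assert(!partials.empty());
    T result = partials[0];
    for (std::size_t i = 1; i < partials.size(); ++i)
        result = op(result, partials[i]);
    return result;
}

// Each segment covers consecutive items, so operand order is preserved
// both inside a segment and when the segment partials are combined.
template <typename T, typename Op>
T ReduceContiguous(const std::vector<T>& items, std::size_t num_segments, Op op)
{
    assert(!items.empty() && num_segments > 0);
    const std::size_t seg_len = (items.size() + num_segments - 1) / num_segments;
    std::vector<T> partials;
    for (std::size_t begin = 0; begin < items.size(); begin += seg_len)
    {
        const std::size_t end =
            (begin + seg_len < items.size()) ? begin + seg_len : items.size();
        T partial = items[begin];
        for (std::size_t i = begin + 1; i < end; ++i)
            partial = op(partial, items[i]);
        partials.push_back(partial);
    }
    return FoldInOrder(partials, op);
}

// Segment s takes items s, s + n, s + 2n, ...; this reorders the operands.
template <typename T, typename Op>
T ReduceStrided(const std::vector<T>& items, std::size_t num_segments, Op op)
{
    assert(!items.empty() && num_segments > 0);
    std::vector<T> partials;
    for (std::size_t s = 0; s < num_segments && s < items.size(); ++s)
    {
        T partial = items[s];
        for (std::size_t i = s + num_segments; i < items.size(); i += num_segments)
            partial = op(partial, items[i]);
        partials.push_back(partial);
    }
    return FoldInOrder(partials, op);
}
```

With the eight inputs "a" through "h" and four segments, the contiguous grouping concatenates to "abcdefgh", while the strided grouping produces "aebfcgdh". For integer addition both groupings agree.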
Definition at line 68 of file block_reduce_raking.cuh.
Data Structures | |
union | _TempStorage |
Shared memory storage layout type. | |
struct | TempStorage |
Alias wrapper allowing storage to be unioned. | |
Public Types | |
enum | { BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z } |
Constants. | |
enum | { RAKING_THREADS = BlockRakingLayout::RAKING_THREADS , SEGMENT_LENGTH = BlockRakingLayout::SEGMENT_LENGTH , WARP_SYNCHRONOUS = (RAKING_THREADS == BLOCK_THREADS) , WARP_SYNCHRONOUS_UNGUARDED = PowerOfTwo<RAKING_THREADS>::VALUE , RAKING_UNGUARDED = BlockRakingLayout::UNGUARDED } |
Constants. | |
typedef BlockRakingLayout< T, BLOCK_THREADS, PTX_ARCH > | BlockRakingLayout |
Layout type for padded thread block raking grid. | |
typedef WarpReduce< T, BlockRakingLayout::RAKING_THREADS, PTX_ARCH >::InternalWarpReduce | WarpReduce |
WarpReduce utility type. | |
Public Member Functions | |
__device__ __forceinline__ | BlockReduceRaking (TempStorage &temp_storage) |
Constructor. | |
template<bool IS_FULL_TILE, typename ReductionOp , int ITERATION> | |
__device__ __forceinline__ T | RakingReduction (ReductionOp reduction_op, T *raking_segment, T partial, int num_valid, Int2Type< ITERATION >) |
template<bool IS_FULL_TILE, typename ReductionOp > | |
__device__ __forceinline__ T | RakingReduction (ReductionOp, T *, T partial, int, Int2Type< SEGMENT_LENGTH >) |
template<bool IS_FULL_TILE, typename ReductionOp > | |
__device__ __forceinline__ T | Reduce (T partial, int num_valid, ReductionOp reduction_op) |
Computes a thread block-wide reduction using the specified reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread 0. | |
template<bool IS_FULL_TILE> | |
__device__ __forceinline__ T | Sum (T partial, int num_valid) |
Computes a thread block-wide reduction using addition (+) as the reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread 0. | |
Data Fields | |
_TempStorage & | temp_storage |
unsigned int | linear_tid |
typedef BlockRakingLayout<T, BLOCK_THREADS, PTX_ARCH> cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockRakingLayout |
Layout type for padded thread block raking grid.
Definition at line 78 of file block_reduce_raking.cuh.
typedef WarpReduce<T,BlockRakingLayout::RAKING_THREADS,PTX_ARCH>::InternalWarpReduce cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::WarpReduce |
WarpReduce utility type.
Definition at line 81 of file block_reduce_raking.cuh.
anonymous enum |
Constants.
Enumerator | |
---|---|
BLOCK_THREADS | The thread block size in threads. |
Definition at line 71 of file block_reduce_raking.cuh.
anonymous enum |
Constants.
Definition at line 84 of file block_reduce_raking.cuh.
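The relationships between these constants can be sketched with plain compile-time arithmetic. The formulas inside `RakingLayoutSketch` are assumptions made for illustration (a 32-thread warp, at most one warp of raking threads, segments sized by rounding up); the page itself only states that RAKING_THREADS, SEGMENT_LENGTH and RAKING_UNGUARDED are taken from BlockRakingLayout. All type names ending in `Sketch` are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for BlockRakingLayout. The formulas below are
// assumptions for illustration, not taken from this page.
template <int BLOCK_THREADS, int WARP_THREADS = 32>
struct RakingLayoutSketch
{
    static_assert(BLOCK_THREADS > 0 && WARP_THREADS > 0, "sizes must be positive");
    enum
    {
        // Never use more raking threads than one warp (assumed).
        MAX_RAKING_THREADS = (BLOCK_THREADS < WARP_THREADS) ? BLOCK_THREADS : WARP_THREADS,
        // Items each raking thread reduces serially (rounded up).
        SEGMENT_LENGTH = (BLOCK_THREADS + MAX_RAKING_THREADS - 1) / MAX_RAKING_THREADS,
        // Raking threads actually needed for that segment length.
        RAKING_THREADS = (BLOCK_THREADS + SEGMENT_LENGTH - 1) / SEGMENT_LENGTH,
        // True when the raking threads divide the block evenly.
        UNGUARDED = (BLOCK_THREADS % RAKING_THREADS == 0)
    };
};

// Hypothetical stand-in for PowerOfTwo<N>::VALUE.
template <int N>
struct PowerOfTwoSketch
{
    enum { VALUE = ((N & (N - 1)) == 0) };
};

// Mirrors the enumerators listed for the second anonymous enum.
template <int BLOCK_THREADS>
struct ReduceConstantsSketch
{
    typedef RakingLayoutSketch<BLOCK_THREADS> Layout;
    enum
    {
        RAKING_THREADS = Layout::RAKING_THREADS,
        SEGMENT_LENGTH = Layout::SEGMENT_LENGTH,
        // Every thread is a raking thread, so a warp-level reduction suffices.
        WARP_SYNCHRONOUS = (RAKING_THREADS == BLOCK_THREADS),
        // The warp-level step needs no bounds checks for power-of-two widths.
        WARP_SYNCHRONOUS_UNGUARDED = PowerOfTwoSketch<RAKING_THREADS>::VALUE,
        RAKING_UNGUARDED = Layout::UNGUARDED
    };
};
```

Under these assumed formulas, a 128-thread block yields 32 raking threads with a segment length of 4, and a 32-thread block is warp-synchronous.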
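__device__ __forceinline__ cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockReduceRaking (TempStorage &temp_storage) |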
inline |
Constructor.
Definition at line 122 of file block_reduce_raking.cuh.
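template<bool IS_FULL_TILE, typename ReductionOp , int ITERATION>
__device__ __forceinline__ T cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::RakingReduction (ReductionOp reduction_op, T *raking_segment, T partial, int num_valid, Int2Type< ITERATION >) |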
inline |
[in] | reduction_op | Binary reduction operator |
[in] | partial | [lane 0 only] Warp-wide aggregate reduction of input items |
[in] | num_valid | Number of valid elements (may be less than BLOCK_THREADS) |
Definition at line 131 of file block_reduce_raking.cuh.
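template<bool IS_FULL_TILE, typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::RakingReduction (ReductionOp, T *, T partial, int, Int2Type< SEGMENT_LENGTH >) |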
inline |
[in] | partial | [lane 0 only] Warp-wide aggregate reduction of input items |
Definition at line 148 of file block_reduce_raking.cuh.
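The two RakingReduction overloads follow a tag-dispatch pattern: the general overload handles step ITERATION and calls itself with Int2Type< ITERATION + 1 >, and the Int2Type< SEGMENT_LENGTH > overload ends the recursion. The following host-side sketch shows how such a pair can unroll a serial, in-order segment reduction at compile time. `RakingSketch` and `ReduceSegment` are hypothetical, and the bounds check (skipped entirely for a full tile) is a simplifying assumption rather than the library's exact condition.

```cpp
#include <cassert>

// Stand-in for the Int2Type tag used in the signatures above.
template <int N>
struct Int2Type
{
    enum { VALUE = N };
};

// Hypothetical host-side model of one raking thread.
template <typename T, int SEGMENT_LENGTH>
struct RakingSketch
{
    unsigned int linear_tid;  // rank of this raking thread

    // General step: fold in item ITERATION of the segment, then recurse.
    template <bool IS_FULL_TILE, typename ReductionOp, int ITERATION>
    T RakingReduction(ReductionOp reduction_op, T* raking_segment, T partial,
                      int num_valid, Int2Type<ITERATION>)
    {
        // Global index of the item this step would consume (assumed layout).
        const int item_index =
            static_cast<int>(linear_tid) * SEGMENT_LENGTH + ITERATION;
        if (IS_FULL_TILE || item_index < num_valid)
            partial = reduction_op(partial, raking_segment[ITERATION]);
        return RakingReduction<IS_FULL_TILE>(reduction_op, raking_segment, partial,
                                             num_valid, Int2Type<ITERATION + 1>());
    }

    // Terminating step: selected once ITERATION reaches SEGMENT_LENGTH.
    template <bool IS_FULL_TILE, typename ReductionOp>
    T RakingReduction(ReductionOp, T*, T partial, int, Int2Type<SEGMENT_LENGTH>)
    {
        return partial;
    }

    // Hypothetical entry point: seed with the first item, start at step 1.
    template <bool IS_FULL_TILE, typename ReductionOp>
    T ReduceSegment(ReductionOp reduction_op, T* raking_segment, int num_valid)
    {
        return RakingReduction<IS_FULL_TILE>(reduction_op, raking_segment,
                                             raking_segment[0], num_valid,
                                             Int2Type<1>());
    }
};
```

Because the operands are consumed strictly left to right, the sketch gives the expected result for a non-commutative operator such as subtraction: the segment {1, 2, 3, 4} reduces to ((1 - 2) - 3) - 4.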
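template<bool IS_FULL_TILE, typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce (T partial, int num_valid, ReductionOp reduction_op) |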
inline |
Computes a thread block-wide reduction using the specified reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread 0.
[in] | partial | Calling thread's input partial reductions |
[in] | num_valid | Number of valid elements (may be less than BLOCK_THREADS) |
[in] | reduction_op | Binary reduction operator |
Definition at line 164 of file block_reduce_raking.cuh.
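template<bool IS_FULL_TILE>
__device__ __forceinline__ T cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum (T partial, int num_valid) |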
inline |
Computes a thread block-wide reduction using addition (+) as the reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread 0.
[in] | partial | Calling thread's input partial reductions |
[in] | num_valid | Number of valid elements (may be less than BLOCK_THREADS) |
Definition at line 211 of file block_reduce_raking.cuh.
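The contract shared by Reduce and Sum (only the first num_valid threads contribute, and the aggregate is meaningful only in thread 0) can be written down as a sequential reference model. This is a sketch for checking expectations on the host, not the device implementation. `BlockReduceModel`, `BlockSumModel` and `SumOp` are hypothetical names, the precondition `num_valid >= 1` is an assumption, and expressing Sum as Reduce with an addition functor is likewise an assumption consistent with the description above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical reference model: partials[tid] is the value thread `tid`
// passes as `partial`. The return value is what thread 0 would receive.
template <typename T, typename ReductionOp>
T BlockReduceModel(const std::vector<T>& partials, int num_valid,
                   ReductionOp reduction_op)
{
    // Assumed precondition: at least one and at most all threads are valid.
    assert(num_valid >= 1 && num_valid <= static_cast<int>(partials.size()));
    T aggregate = partials[0];
    // Threads at or beyond num_valid are ignored; order follows thread rank.
    for (int tid = 1; tid < num_valid; ++tid)
        aggregate = reduction_op(aggregate, partials[tid]);
    return aggregate;
}

// Hypothetical addition functor.
struct SumOp
{
    template <typename T>
    T operator()(const T& a, const T& b) const
    {
        return a + b;
    }
};

// Sum modeled as Reduce with addition (assumption).
template <typename T>
T BlockSumModel(const std::vector<T>& partials, int num_valid)
{
    return BlockReduceModel(partials, num_valid, SumOp());
}
```

For the per-thread partials {5, 1, 2, 3, 100, 100}, the model returns 11 with num_valid = 4 and 211 with num_valid = 6. A device-side test could compare thread 0's result against this model.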
unsigned int cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::linear_tid |
Definition at line 118 of file block_reduce_raking.cuh.
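This page does not say how linear_tid is computed. A common convention, assumed here, is a row-major flattening of the three-dimensional thread index with x varying fastest, which yields ranks in the range [0, BLOCK_THREADS). The sketch below models that convention on the host; `BlockThreads` and `LinearTid` are hypothetical helpers, and the x, y, z arguments stand in for the thread's index within the block.

```cpp
#include <cassert>

// BLOCK_THREADS as defined above: the product of the three block dimensions.
inline int BlockThreads(int dim_x, int dim_y, int dim_z)
{
    assert(dim_x > 0 && dim_y > 0 && dim_z > 0);
    return dim_x * dim_y * dim_z;
}

// Assumed row-major linearization: x varies fastest, then y, then z.
inline unsigned int LinearTid(unsigned int x, unsigned int y, unsigned int z,
                              unsigned int dim_x, unsigned int dim_y)
{
    assert(x < dim_x && y < dim_y);
    return (z * dim_y + y) * dim_x + x;
}
```

For an 8 x 4 x 2 block, the thread at (7, 3, 1) maps to rank 63, the last of the 64 threads.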
_TempStorage& cub::BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::temp_storage |
Definition at line 117 of file block_reduce_raking.cuh.
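The constructor takes a TempStorage reference while the temp_storage field is a _TempStorage reference. One way such a pairing can work is sketched below: the wrapper holds suitably aligned raw bytes and exposes the underlying union through an alias accessor, so the wrapper itself needs no constructor and can be placed inside a caller's own union. Every name ending in `Sketch`, the `Alias` accessor, and the two union members are assumptions for illustration; the page does not list the members of _TempStorage.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical storage layouts for two phases that never overlap in time.
struct WarpStorageSketch { int lanes[32]; };
struct RakingGridSketch  { int grid[36]; };

// Stand-in for _TempStorage: the phases share the same bytes.
union TempStorageUnionSketch
{
    WarpStorageSketch warp_storage;
    RakingGridSketch  raking_grid;
};

// Hypothetical alias wrapper: raw, aligned bytes viewed as T on demand.
template <typename T>
struct UninitializedSketch
{
    alignas(T) unsigned char bytes[sizeof(T)];

    T& Alias() { return *reinterpret_cast<T*>(bytes); }
};

// Stand-in for TempStorage.
struct TempStorageSketch : UninitializedSketch<TempStorageUnionSketch> {};

// Stand-in for the class: accepts the wrapper, keeps a reference to the union.
struct BlockReduceSketch
{
    TempStorageUnionSketch& temp_storage;
    unsigned int linear_tid;

    BlockReduceSketch(TempStorageSketch& storage, unsigned int tid)
        : temp_storage(storage.Alias()), linear_tid(tid) {}
};
```

In this sketch the wrapper adds no space: its size equals that of the union, which in turn is the size of its largest member.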