The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. More...
The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp.
T | The reduction input/output element type |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM20). |
PTX_ARCH | [optional] \ptxversion |
LOGICAL_WARP_THREADS
SHFL
instructions)LOGICAL_WARP_THREADS
thread_data
across the block of threads is {0, 1, 2, 3, ..., 127}
. The corresponding output aggregate
in threads 0, 32, 64, and 96 will 496
, 1520
, 2544
, and 3568
, respectively (and is undefined in other threads).thread_data
across the warp of threads is {0, 1, 2, 3, ..., 31}
. The corresponding output aggregate
in thread0 will be 496
(and is undefined in other threads). Definition at line 141 of file warp_reduce.cuh.
Data Structures | |
struct | TempStorage |
\smemstorage{WarpReduce} More... | |
Public Types | |
typedef If<(PTX_ARCH >=300)&&(IS_POW_OF_TWO), WarpReduceShfl< T, LOGICAL_WARP_THREADS, PTX_ARCH >, WarpReduceSmem< T, LOGICAL_WARP_THREADS, PTX_ARCH > >::Type | InternalWarpReduce |
Internal specialization. Use SHFL-based reduction if (architecture is >= SM30) and (LOGICAL_WARP_THREADS is a power-of-two) | |
Public Member Functions | |
Collective constructors | |
__device__ __forceinline__ | WarpReduce (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x . | |
Summation reductions | |
__device__ __forceinline__ T | Sum (T input) |
Computes a warp-wide sum in the calling warp. The output is valid in warp lane0. | |
__device__ __forceinline__ T | Sum (T input, int valid_items) |
Computes a partially-full warp-wide sum in the calling warp. The output is valid in warp lane0. | |
template<typename FlagT > | |
__device__ __forceinline__ T | HeadSegmentedSum (T input, FlagT head_flag) |
Computes a segmented sum in the calling warp where segments are defined by head-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0). | |
template<typename FlagT > | |
__device__ __forceinline__ T | TailSegmentedSum (T input, FlagT tail_flag) |
Computes a segmented sum in the calling warp where segments are defined by tail-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0). | |
Generic reductions | |
template<typename ReductionOp > | |
__device__ __forceinline__ T | Reduce (T input, ReductionOp reduction_op) |
Computes a warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0. | |
template<typename ReductionOp > | |
__device__ __forceinline__ T | Reduce (T input, ReductionOp reduction_op, int valid_items) |
Computes a partially-full warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0. | |
template<typename ReductionOp , typename FlagT > | |
__device__ __forceinline__ T | HeadSegmentedReduce (T input, FlagT head_flag, ReductionOp reduction_op) |
Computes a segmented reduction in the calling warp where segments are defined by head-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0). | |
template<typename ReductionOp , typename FlagT > | |
__device__ __forceinline__ T | TailSegmentedReduce (T input, FlagT tail_flag, ReductionOp reduction_op) |
Computes a segmented reduction in the calling warp where segments are defined by tail-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0). | |
Private Types | |
enum | { IS_ARCH_WARP = (LOGICAL_WARP_THREADS == CUB_WARP_THREADS(PTX_ARCH)) , IS_POW_OF_TWO = PowerOfTwo<LOGICAL_WARP_THREADS>::VALUE } |
typedef InternalWarpReduce::TempStorage | _TempStorage |
Shared memory storage layout type for WarpReduce. | |
Private Attributes | |
_TempStorage & | temp_storage |
Shared storage reference. | |
|
private |
Shared memory storage layout type for WarpReduce.
Definition at line 173 of file warp_reduce.cuh.
typedef If<(PTX_ARCH>=300)&&(IS_POW_OF_TWO),WarpReduceShfl<T,LOGICAL_WARP_THREADS,PTX_ARCH>,WarpReduceSmem<T,LOGICAL_WARP_THREADS,PTX_ARCH>>::Type cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::InternalWarpReduce |
Internal specialization. Use SHFL-based reduction if (architecture is >= SM30) and (LOGICAL_WARP_THREADS is a power-of-two)
Definition at line 165 of file warp_reduce.cuh.
|
private |
Enumerator | |
---|---|
IS_ARCH_WARP | Whether the logical warp size and the PTX warp size coincide. |
IS_POW_OF_TWO | Whether the logical warp size is a power-of-two. |
Definition at line 149 of file warp_reduce.cuh.
|
inline |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x
.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
Definition at line 203 of file warp_reduce.cuh.
|
inline |
Computes a segmented reduction in the calling warp where segments are defined by head-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0).
Supports non-commutative reduction operators
\smemreuse
thread_data
and head_flag
across the block of threads is {0, 1, 2, 3, ..., 31
and is {1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0
, respectively. The corresponding output aggregate
in threads 0, 4, 8, etc. will be 3
, 7
, 11
, etc. (and is undefined in other threads).ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input |
[in] | head_flag | Head flag denoting whether or not input is the start of a new segment |
[in] | reduction_op | Reduction operator |
Definition at line 545 of file warp_reduce.cuh.
|
inline |
Computes a segmented sum in the calling warp where segments are defined by head-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0).
\smemreuse
thread_data
and head_flag
across the block of threads is {0, 1, 2, 3, ..., 31
and is {1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0
, respectively. The corresponding output aggregate
in threads 0, 4, 8, etc. will be 6
, 22
, 38
, etc. (and is undefined in other threads).ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input |
[in] | head_flag | Head flag denoting whether or not input is the start of a new segment |
Definition at line 344 of file warp_reduce.cuh.
|
inline |
Computes a warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0.
Supports non-commutative reduction operators
\smemreuse
thread_data
across the block of threads is {0, 1, 2, 3, ..., 127}
. The corresponding output aggregate
in threads 0, 32, 64, and 96 will 31
, 63
, 95
, and 127
, respectively (and is undefined in other threads).ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input |
[in] | reduction_op | Binary reduction operator |
Definition at line 445 of file warp_reduce.cuh.
|
inline |
Computes a partially-full warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0.
All threads across the calling warp must agree on the same value for valid_items
. Otherwise the result is undefined.
Supports non-commutative reduction operators
\smemreuse
d_data
is {0, 1, 2, 3, 4, ...
and valid_items
is 4
. The corresponding output aggregate
in thread0 is 3
(and is undefined in other threads).ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input |
[in] | reduction_op | Binary reduction operator |
[in] | valid_items | Total number of valid items in the calling thread's logical warp (may be less than LOGICAL_WARP_THREADS ) |
Definition at line 494 of file warp_reduce.cuh.
|
inline |
Computes a warp-wide sum in the calling warp. The output is valid in warp lane0.
\smemreuse
thread_data
across the block of threads is {0, 1, 2, 3, ..., 127}
. The corresponding output aggregate
in threads 0, 32, 64, and 96 will 496
, 1520
, 2544
, and 3568
, respectively (and is undefined in other threads). [in] | input | Calling thread's input |
Definition at line 251 of file warp_reduce.cuh.
|
inline |
Computes a partially-full warp-wide sum in the calling warp. The output is valid in warp lane0.
All threads across the calling warp must agree on the same value for valid_items
. Otherwise the result is undefined.
\smemreuse
d_data
is {0, 1, 2, 3, 4, ...
and valid_items
is 4
. The corresponding output aggregate
in thread0 is 6
(and is undefined in other threads). [in] | input | Calling thread's input |
[in] | valid_items | Total number of valid items in the calling thread's logical warp (may be less than LOGICAL_WARP_THREADS ) |
Definition at line 295 of file warp_reduce.cuh.
|
inline |
Computes a segmented reduction in the calling warp where segments are defined by tail-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0).
Supports non-commutative reduction operators
\smemreuse
thread_data
and tail_flag
across the block of threads is {0, 1, 2, 3, ..., 31
and is {0, 0, 0, 1, 0, 0, 0, 1, ..., 0, 0, 0, 1
, respectively. The corresponding output aggregate
in threads 0, 4, 8, etc. will be 3
, 7
, 11
, etc. (and is undefined in other threads).ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input |
[in] | tail_flag | Tail flag denoting whether or not input is the end of the current segment |
[in] | reduction_op | Reduction operator |
Definition at line 596 of file warp_reduce.cuh.
|
inline |
Computes a segmented sum in the calling warp where segments are defined by tail-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0).
\smemreuse
thread_data
and tail_flag
across the block of threads is {0, 1, 2, 3, ..., 31
and is {0, 0, 0, 1, 0, 0, 0, 1, ..., 0, 0, 0, 1
, respectively. The corresponding output aggregate
in threads 0, 4, 8, etc. will be 6
, 22
, 38
, etc. (and is undefined in other threads).ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input |
[in] | tail_flag | Head flag denoting whether or not input is the start of a new segment |
Definition at line 391 of file warp_reduce.cuh.
|
private |
Shared storage reference.
Definition at line 181 of file warp_reduce.cuh.