The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.
T | The scan input/output element type |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20). |
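PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). |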
For example, for an exclusive prefix sum within each warp of a block, if the input thread_data across the block of threads is {1, 1, 1, 1, ...}, the corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, 3, ..., 31}.
Likewise, for a single warp, if the input thread_data across the warp of threads is {1, 1, 1, 1, ...}, the corresponding output thread_data will be {0, 1, 2, 3, ..., 31}.
Definition at line 146 of file warp_scan.cuh.
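The per-warp behavior in the example above can be modeled on the host. The following is a plain C++ sketch of the semantics only, not CUB code and not how the library computes the result; the function name `warp_exclusive_sum_model` and its `warp_threads` parameter are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Host-side model, not CUB code: each run of `warp_threads` consecutive
// items is treated as one logical warp and scanned independently.
// Precondition: warp_threads must be nonzero.
std::vector<int> warp_exclusive_sum_model(const std::vector<int>& block_items,
                                          std::size_t warp_threads = 32) {
    std::vector<int> out(block_items.size());
    int running = 0;
    for (std::size_t tid = 0; tid < block_items.size(); ++tid) {
        if (tid % warp_threads == 0) {
            running = 0;               // lane 0 restarts the sum for its warp
        }
        out[tid] = running;            // sum of the preceding lanes only
        running += block_items[tid];
    }
    return out;
}
```

Feeding this model 128 ones yields the sequence 0 through 31 four times, once per modeled warp.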
Data Structures | |
struct | TempStorage |
The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. | |
Public Member Functions | |
Collective constructors | |
__device__ __forceinline__ | WarpScan (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x. | |
Inclusive prefix sums | |
__device__ __forceinline__ void | InclusiveSum (T input, T &inclusive_output) |
Computes an inclusive prefix sum across the calling warp. | |
__device__ __forceinline__ void | InclusiveSum (T input, T &inclusive_output, T &warp_aggregate) |
Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Exclusive prefix sums | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &exclusive_output) |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in warp-lane 0. | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &exclusive_output, T &warp_aggregate) |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in warp-lane 0. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Inclusive prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &inclusive_output, ScanOp scan_op) |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &inclusive_output, ScanOp scan_op, T &warp_aggregate) |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Exclusive prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for warp-lane 0 is undefined. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, T initial_value, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for warp-lane 0 is undefined. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, T initial_value, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Combination (inclusive & exclusive) prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | Scan (T input, T &inclusive_output, T &exclusive_output, ScanOp scan_op) |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no initial value is supplied, the exclusive_output computed for warp-lane 0 is undefined. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | Scan (T input, T &inclusive_output, T &exclusive_output, T initial_value, ScanOp scan_op) |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. | |
Data exchange | |
__device__ __forceinline__ T | Broadcast (T input, unsigned int src_lane) |
Broadcast the value input from warp-lane src_lane to all lanes in the warp. | |
Private Types | |
enum | { IS_ARCH_WARP = (LOGICAL_WARP_THREADS == CUB_WARP_THREADS(PTX_ARCH)) , IS_POW_OF_TWO = ((LOGICAL_WARP_THREADS & (LOGICAL_WARP_THREADS - 1)) == 0) , IS_INTEGER = ((Traits<T>::CATEGORY == SIGNED_INTEGER) || (Traits<T>::CATEGORY == UNSIGNED_INTEGER)) } |
typedef If<(PTX_ARCH >=300)&&(IS_POW_OF_TWO), WarpScanShfl< T, LOGICAL_WARP_THREADS, PTX_ARCH >, WarpScanSmem< T, LOGICAL_WARP_THREADS, PTX_ARCH > >::Type | InternalWarpScan |
Internal specialization. Use SHFL-based scan if (architecture is >= SM30) and (LOGICAL_WARP_THREADS is a power-of-two) | |
typedef InternalWarpScan::TempStorage | _TempStorage |
Shared memory storage layout type for WarpScan. | |
Private Attributes | |
_TempStorage & | temp_storage |
Shared storage reference. | |
unsigned int | lane_id |
typedef InternalWarpScan::TempStorage _TempStorage | private |
Shared memory storage layout type for WarpScan.
Definition at line 172 of file warp_scan.cuh.
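typedef If<(PTX_ARCH >=300)&&(IS_POW_OF_TWO), WarpScanShfl< T, LOGICAL_WARP_THREADS, PTX_ARCH >, WarpScanSmem< T, LOGICAL_WARP_THREADS, PTX_ARCH > >::Type InternalWarpScan | private |
Internal specialization. Uses the SHFL-based scan (WarpScanShfl) if the architecture is SM30 or newer and LOGICAL_WARP_THREADS is a power of two; otherwise uses the shared-memory-based scan (WarpScanSmem).
Definition at line 169 of file warp_scan.cuh.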
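The compile-time selection rule stated for InternalWarpScan can be sketched in standard C++. This is an illustration under assumptions: `ShflScanTag`, `SmemScanTag`, and `SelectWarpScan` are hypothetical stand-ins, and `std::conditional` is used in place of the library's own `If` template.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the two internal specializations.
struct ShflScanTag {};
struct SmemScanTag {};

// Picks the shuffle-based tag when PTX_ARCH >= 300 and the logical warp
// size is a power of two; otherwise picks the shared-memory tag.
template <int LOGICAL_WARP_THREADS, int PTX_ARCH>
struct SelectWarpScan {
    // Same bit trick as the IS_POW_OF_TWO enum constant.
    static constexpr bool is_pow_of_two =
        (LOGICAL_WARP_THREADS & (LOGICAL_WARP_THREADS - 1)) == 0;

    using type = typename std::conditional<
        (PTX_ARCH >= 300) && is_pow_of_two,
        ShflScanTag,
        SmemScanTag>::type;
};
```

Under this sketch a 32-thread logical warp on PTX_ARCH 350 selects the shuffle tag, while a 24-thread logical warp, or any warp size on PTX_ARCH 200, selects the shared-memory tag.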
anonymous enum | private |
Definition at line 154 of file warp_scan.cuh.
__device__ __forceinline__ WarpScan (TempStorage &temp_storage) | inline |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
Definition at line 203 of file warp_scan.cuh.
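The text only states that the logical warp and lane identifiers are derived from threadIdx.x. One plausible mapping, assumed here and not confirmed by the source, is integer division and remainder by the logical warp size. The names `WarpCoords` and `warp_coords` are hypothetical.

```cpp
// Assumed mapping (not taken from the source): consecutive thread indices
// are grouped into logical warps of `logical_warp_threads` threads each.
struct WarpCoords {
    unsigned warp_id;  // which logical warp the thread belongs to
    unsigned lane_id;  // the thread's position inside that warp
};

// Precondition: logical_warp_threads must be nonzero.
inline WarpCoords warp_coords(unsigned thread_idx_x,
                              unsigned logical_warp_threads) {
    return WarpCoords{thread_idx_x / logical_warp_threads,
                      thread_idx_x % logical_warp_threads};
}
```

With this mapping, thread 37 is lane 5 of warp 1 for 32-thread logical warps, and lane 5 of warp 2 for 16-thread logical warps.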
__device__ __forceinline__ T Broadcast (T input, unsigned int src_lane) | inline |
Broadcast the value input from warp-lane src_lane to all lanes in the warp.
For example, when broadcasting from lane 0 of each warp, if the input thread_data across the block of threads is {0, 1, 2, 3, ..., 127}, the corresponding output thread_data will be {0, 0, ..., 0} in warp 0, {32, 32, ..., 32} in warp 1, {64, 64, ..., 64} in warp 2, etc.
[in] | input | The value to broadcast |
[in] | src_lane | Which warp lane is to do the broadcasting |
Definition at line 922 of file warp_scan.cuh.
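The broadcast semantics can be modeled on the host as a per-warp copy from one lane. This is a sketch of the observable result only, not the library's mechanism; `warp_broadcast_model` and `warp_threads` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model, not CUB code: every item receives the value held by
// lane `src_lane` of its own logical warp.
std::vector<int> warp_broadcast_model(const std::vector<int>& block_items,
                                      std::size_t src_lane,
                                      std::size_t warp_threads = 32) {
    assert(warp_threads > 0 && src_lane < warp_threads);
    assert(block_items.size() % warp_threads == 0);  // whole warps only

    std::vector<int> out(block_items.size());
    for (std::size_t tid = 0; tid < block_items.size(); ++tid) {
        std::size_t warp_base = (tid / warp_threads) * warp_threads;
        out[tid] = block_items[warp_base + src_lane];
    }
    return out;
}
```

For the input {0, 1, 2, ..., 127} and a source lane of 0, the model returns 0 for every item of the first warp, 32 for the second, 64 for the third, and 96 for the fourth.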
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &exclusive_output, ScanOp scan_op) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for warp-lane 0 is undefined.
For example, with a maximum scan operator, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding output thread_data in the first warp would be ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. (The output thread_data in warp-lane 0 is undefined.)
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
Definition at line 551 of file warp_scan.cuh.
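To make the "undefined in warp-lane 0" rule concrete, here is a host-side model of a single warp. It is a sketch of the semantics, not CUB code; `exclusive_scan_no_seed_model` is a hypothetical name, and the model simply never assigns a scan result to lane 0.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of one warp, not CUB code:
// out[i] = op(in[0], ..., in[i-1]) for every lane i >= 1.
// Lane 0 has no predecessors and no seed, so it is never assigned a scan
// result here; callers must not rely on out[0].
template <typename T, typename ScanOp>
void exclusive_scan_no_seed_model(const std::vector<T>& in,
                                  std::vector<T>& out,
                                  ScanOp op) {
    out.resize(in.size());
    if (in.size() < 2) {
        return;                        // nothing defined to write
    }
    T running = in[0];
    for (std::size_t lane = 1; lane < in.size(); ++lane) {
        out[lane] = running;           // reduction of the earlier lanes
        running = op(running, in[lane]);
    }
}
```

With a maximum operator and the first warp's input {0, -1, 2, -3, ..., 30, -31}, lanes 1 through 31 receive 0, 0, 2, ..., 28, 30.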
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &exclusive_output, ScanOp scan_op, T &warp_aggregate) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for warp-lane 0 is undefined. Also provides every thread with the warp-wide warp_aggregate of all inputs.
For example, with a maximum scan operator, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding output thread_data in the first warp would be ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. (The output thread_data in warp-lane 0 is undefined.) Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 668 of file warp_scan.cuh.
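template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &exclusive_output, T initial_value, ScanOp scan_op) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp.
For example, with a maximum scan operator and an initial value of INT_MIN, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding output thread_data in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30, the output for the second warp would be INT_MIN, 32, 32, 34, ..., 60, 62, etc. Each warp is scanned independently, so warp-lane 0 of every warp receives the initial value.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |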
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | initial_value | Initial value to seed the exclusive scan |
[in] | scan_op | Binary scan operator |
Definition at line 607 of file warp_scan.cuh.
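The seeded variant can be modeled on the host across a whole block to show that each warp is seeded afresh. This is a sketch of the semantics, not CUB code; `warp_exclusive_scan_model` and `warp_threads` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Host-side model, not CUB code: an independent exclusive scan inside each
// logical warp, with lane 0 of every warp receiving `initial_value`.
// Precondition: warp_threads must be nonzero.
template <typename T, typename ScanOp>
std::vector<T> warp_exclusive_scan_model(const std::vector<T>& block_items,
                                         T initial_value,
                                         ScanOp op,
                                         std::size_t warp_threads = 32) {
    std::vector<T> out(block_items.size());
    T running = initial_value;
    for (std::size_t tid = 0; tid < block_items.size(); ++tid) {
        if (tid % warp_threads == 0) {
            running = initial_value;   // each warp starts from the seed
        }
        out[tid] = running;            // seed combined with earlier lanes
        running = op(running, block_items[tid]);
    }
    return out;
}
```

With a maximum operator, a seed of INT_MIN, and the input {0, -1, 2, -3, ..., 126, -127}, the first modeled warp yields INT_MIN, 0, 0, 2, ..., 28, 30 and the second yields INT_MIN, 32, 32, 34, ..., 60, 62.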
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &exclusive_output, T initial_value, ScanOp scan_op, T &warp_aggregate) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.
For example, with a maximum scan operator and an initial value of INT_MIN, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding output thread_data in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30, the output for the second warp would be INT_MIN, 32, 32, 34, ..., 60, 62, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | initial_value | Initial value to seed the exclusive scan |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 729 of file warp_scan.cuh.
__device__ __forceinline__ void ExclusiveSum (T input, T &exclusive_output) | inline |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in warp-lane 0.
For example, if the input thread_data across the block of threads is {1, 1, 1, 1, ...}, the corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}.
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
Definition at line 349 of file warp_scan.cuh.
__device__ __forceinline__ void ExclusiveSum (T input, T &exclusive_output, T &warp_aggregate) | inline |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in warp-lane 0. Also provides every thread with the warp-wide warp_aggregate of all inputs.
For example, if the input thread_data across the block of threads is {1, 1, 1, 1, ...}, the corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}. Furthermore, warp_aggregate for all threads in all warps will be 32.
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 394 of file warp_scan.cuh.
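The relationship between the exclusive outputs and warp_aggregate can be modeled on the host for one warp. This is a sketch of the semantics, not CUB code; `WarpSumResult` and `exclusive_sum_with_aggregate_model` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Result of the host-side model: per-lane outputs plus the warp-wide total.
struct WarpSumResult {
    std::vector<int> exclusive_output;
    int warp_aggregate;
};

// Host-side model of one warp, not CUB code. The aggregate is the sum of
// all inputs, which is what remains in `running` after the last lane.
WarpSumResult exclusive_sum_with_aggregate_model(
        const std::vector<int>& warp_items) {
    WarpSumResult result;
    result.exclusive_output.resize(warp_items.size());
    int running = 0;                               // initial value of 0
    for (std::size_t lane = 0; lane < warp_items.size(); ++lane) {
        result.exclusive_output[lane] = running;   // earlier lanes only
        running += warp_items[lane];
    }
    result.warp_aggregate = running;               // same for every lane
    return result;
}
```

For 32 inputs of 1, the model returns outputs 0 through 31 and an aggregate of 32.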
template<typename ScanOp >
__device__ __forceinline__ void InclusiveScan (T input, T &inclusive_output, ScanOp scan_op) | inline |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp.
For example, with a maximum scan operator, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding output thread_data in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
Definition at line 447 of file warp_scan.cuh.
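An inclusive scan differs from an exclusive one in that each lane's own input takes part in its result. The host-side model below sketches that rule for one warp; it is not CUB code, and `inclusive_scan_model` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of one warp, not CUB code:
// out[i] = op(in[0], ..., in[i]), so lane i includes its own input.
template <typename T, typename ScanOp>
std::vector<T> inclusive_scan_model(const std::vector<T>& warp_items,
                                    ScanOp op) {
    std::vector<T> out(warp_items.size());
    for (std::size_t lane = 0; lane < warp_items.size(); ++lane) {
        out[lane] = (lane == 0)
                        ? warp_items[0]                        // lane 0: own input
                        : op(out[lane - 1], warp_items[lane]); // extend the prefix
    }
    return out;
}
```

With a maximum operator and the first warp's input {0, -1, 2, -3, ..., 30, -31}, the model returns 0, 0, 2, 2, ..., 30, 30.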
template<typename ScanOp >
__device__ __forceinline__ void InclusiveScan (T input, T &inclusive_output, ScanOp scan_op, T &warp_aggregate) | inline |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.
For example, with a maximum scan operator, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding output thread_data in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 497 of file warp_scan.cuh.
__device__ __forceinline__ void InclusiveSum (T input, T &inclusive_output) | inline |
Computes an inclusive prefix sum across the calling warp.
For example, if the input thread_data across the block of threads is {1, 1, 1, 1, ...}, the corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}.
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
Definition at line 254 of file warp_scan.cuh.
__device__ __forceinline__ void InclusiveSum (T input, T &inclusive_output, T &warp_aggregate) | inline |
Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.
For example, if the input thread_data across the block of threads is {1, 1, 1, 1, ...}, the corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}. Furthermore, warp_aggregate for all threads in all warps will be 32.
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 297 of file warp_scan.cuh.
template<typename ScanOp >
__device__ __forceinline__ void Scan (T input, T &inclusive_output, T &exclusive_output, ScanOp scan_op) | inline |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no initial value is supplied, the exclusive_output computed for warp-lane 0 is undefined.
For example, with a maximum scan operator, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding inclusive_output in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. The corresponding exclusive_output in the first warp would be ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. (The exclusive_output in warp-lane 0 is undefined.)
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's inclusive-scan output item. |
[out] | exclusive_output | Calling thread's exclusive-scan output item. |
[in] | scan_op | Binary scan operator |
Definition at line 799 of file warp_scan.cuh.
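When no initial value is supplied, the two outputs are related in a simple way: each lane's exclusive result equals the inclusive result of the lane before it. The host-side model below sketches this for one warp; it is not CUB code, and `combined_scan_model` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of one warp, not CUB code.
// inclusive[i] = op(in[0], ..., in[i])
// exclusive[i] = inclusive[i - 1] for i >= 1; exclusive[0] is never
// assigned a scan result, mirroring the "undefined" lane-0 output.
template <typename T, typename ScanOp>
void combined_scan_model(const std::vector<T>& in,
                         std::vector<T>& inclusive,
                         std::vector<T>& exclusive,
                         ScanOp op) {
    inclusive.resize(in.size());
    exclusive.resize(in.size());
    for (std::size_t lane = 0; lane < in.size(); ++lane) {
        if (lane > 0) {
            exclusive[lane] = inclusive[lane - 1];
        }
        inclusive[lane] = (lane == 0) ? in[0]
                                      : op(inclusive[lane - 1], in[lane]);
    }
}
```

With a maximum operator and the second warp's input {32, -33, 34, -35, ..., 62, -63}, the inclusive results are 32, 32, 34, 34, ..., 62, 62 and the exclusive results for lanes 1 through 31 are 32, 32, 34, ..., 60, 62.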
template<typename ScanOp >
__device__ __forceinline__ void Scan (T input, T &inclusive_output, T &exclusive_output, T initial_value, ScanOp scan_op) | inline |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp.
For example, with a maximum scan operator and an initial value of INT_MIN, if the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}, the corresponding inclusive_output in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. The corresponding exclusive_output in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30, the output for the second warp would be INT_MIN, 32, 32, 34, ..., 60, 62, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's inclusive-scan output item. |
[out] | exclusive_output | Calling thread's exclusive-scan output item. |
[in] | initial_value | Initial value to seed the exclusive scan |
[in] | scan_op | Binary scan operator |
Definition at line 858 of file warp_scan.cuh.
unsigned int lane_id | private |
Definition at line 181 of file warp_scan.cuh.
_TempStorage & temp_storage | private |
Shared storage reference.
Definition at line 180 of file warp_scan.cuh.