Data Structures | |
| class | cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH > |
| The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. | |
| class | cub::WarpScan< T, LOGICAL_WARP_THREADS, PTX_ARCH > |
| The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. | |
Functions | |
| template<int LOGICAL_WARP_THREADS, typename T > | |
| __device__ __forceinline__ T | cub::ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask) |
Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread. | |
| template<int LOGICAL_WARP_THREADS, typename T > | |
| __device__ __forceinline__ T | cub::ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask) |
Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i + src_offset). For thread lanes where i + src_offset > last_thread, the thread's own input is returned to the thread. | |
| template<int LOGICAL_WARP_THREADS, typename T > | |
| __device__ __forceinline__ T | cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask) |
Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. | |
| __device__ __forceinline__ T | cub::ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask) |
Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i + src_offset). For thread lanes where i + src_offset > last_thread, the thread's own input is returned to the thread.

| LOGICAL_WARP_THREADS | The number of threads per "logical" warp. Must be a power-of-two <= 32. |
| T | [inferred] The input/output element type |
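Example: each thread obtains a double value from the successor of its successor (src_offset = 2). Suppose thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {3.0, 4.0, 5.0, 6.0, 7.0, ..., 32.0, 31.0, 32.0}; the last two lanes have no successor two lanes away, so they keep their own input.
Implementation note: the 5-bit SHFL mask for logically splitting warps into sub-segments starts 8 bits up in the packed control operand of the shuffle instruction.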
| [in] | input | The value this thread contributes to the shuffle |
| [in] | src_offset | The relative up-offset of the peer to read from (the peer is warp-lane i + src_offset) |
| [in] | last_thread | Index of last thread in logical warp (typically 31 for a 32-thread warp) |
| [in] | member_mask | 32-bit mask of participating warp lanes |
Definition at line 585 of file util_ptx.cuh.
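The lane arithmetic described for ShuffleDown can be sketched on the CPU. This is a host-side C++ model, not CUB code: `shuffle_down_model` and `demo_shuffle_down` are hypothetical names, `member_mask` is left out because every lane is assumed to participate, and the rule that a lane keeps its own input once lane + src_offset exceeds last_thread is an assumption about how the clamp behaves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Host-side model only (not CUB code, runs on the CPU).
// Lane i reads the input of lane i + src_offset. Assumption: when that source
// lane lies beyond last_thread, the lane keeps its own input.
template <std::size_t WARP_THREADS, typename T>
std::array<T, WARP_THREADS> shuffle_down_model(
    const std::array<T, WARP_THREADS>& input, int src_offset, int last_thread)
{
    static_assert(WARP_THREADS > 0 && WARP_THREADS <= 32 &&
                      (WARP_THREADS & (WARP_THREADS - 1)) == 0,
                  "logical warp size must be a power of two <= 32");
    assert(src_offset >= 0);
    assert(last_thread >= 0 && last_thread < static_cast<int>(WARP_THREADS));

    std::array<T, WARP_THREADS> output{};
    for (int lane = 0; lane < static_cast<int>(WARP_THREADS); ++lane) {
        const int src = lane + src_offset;
        const int chosen = (src <= last_thread) ? src : lane;
        output[static_cast<std::size_t>(lane)] =
            input[static_cast<std::size_t>(chosen)];
    }
    return output;
}

// Reproduces the documented example: thread_data = {1.0, ..., 32.0}, offset 2.
inline std::array<double, 32> demo_shuffle_down()
{
    std::array<double, 32> thread_data{};
    for (int lane = 0; lane < 32; ++lane) {
        thread_data[static_cast<std::size_t>(lane)] = lane + 1.0;
    }
    const auto peer_data = shuffle_down_model(thread_data, 2, 31);
    assert(peer_data[0] == 3.0);   // lane 0 reads lane 2
    assert(peer_data[31] == 32.0); // lane 31 has no peer and keeps its input
    return peer_data;
}
```

Calling `demo_shuffle_down()` from a test or from `main` yields the values for the example above under these assumptions; on a device the equivalent call would go through `cub::ShuffleDown` with the signature listed in this section.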
| __device__ __forceinline__ T | cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask) |
Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.

| LOGICAL_WARP_THREADS | The number of threads per "logical" warp. Must be a power-of-two <= 32. |
| T | [inferred] The input/output element type |
Example: each thread obtains a double value from warp-lane 0. Suppose thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 1.0, 1.0, 1.0, 1.0, ..., 1.0}.
Implementation note: the 5-bit SHFL mask for logically splitting warps into sub-segments starts 8 bits up in the packed control operand of the shuffle instruction.
| [in] | input | The value to broadcast |
| [in] | src_lane | Which warp lane is to do the broadcasting |
| [in] | member_mask | 32-bit mask of participating warp lanes |
Definition at line 656 of file util_ptx.cuh.
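The broadcast rule for ShuffleIndex can be sketched in the same way. This is a host-side C++ model, not CUB code: `shuffle_index_model` and `demo_shuffle_index` are hypothetical names, `member_mask` is omitted on the assumption that all lanes participate, and the out-of-range case simply follows the wording of the description above rather than measured hardware behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Host-side model only (not CUB code, runs on the CPU).
// Every lane receives the input of lane src_lane. Following the documented
// rule, an out-of-range src_lane leaves each lane with its own input.
template <std::size_t WARP_THREADS, typename T>
std::array<T, WARP_THREADS> shuffle_index_model(
    const std::array<T, WARP_THREADS>& input, int src_lane)
{
    static_assert(WARP_THREADS > 0 && WARP_THREADS <= 32 &&
                      (WARP_THREADS & (WARP_THREADS - 1)) == 0,
                  "logical warp size must be a power of two <= 32");

    const bool in_range =
        src_lane >= 0 && src_lane < static_cast<int>(WARP_THREADS);

    std::array<T, WARP_THREADS> output{};
    for (std::size_t lane = 0; lane < WARP_THREADS; ++lane) {
        output[lane] =
            in_range ? input[static_cast<std::size_t>(src_lane)] : input[lane];
    }
    return output;
}

// Reproduces the documented example: every lane reads lane 0's value.
inline std::array<double, 32> demo_shuffle_index()
{
    std::array<double, 32> thread_data{};
    for (int lane = 0; lane < 32; ++lane) {
        thread_data[static_cast<std::size_t>(lane)] = lane + 1.0;
    }
    const auto peer_data = shuffle_index_model(thread_data, 0);
    assert(peer_data[0] == 1.0 && peer_data[31] == 1.0);
    return peer_data;
}
```

Under these assumptions `demo_shuffle_index()` returns an array in which every element equals the value held by lane 0, matching the example above.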
| __device__ __forceinline__ T | cub::ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask) |
Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread.

| LOGICAL_WARP_THREADS | The number of threads per "logical" warp. Must be a power-of-two <= 32. |
| T | [inferred] The input/output element type |
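Example: each thread obtains a double value from the predecessor of its predecessor (src_offset = 2). Suppose thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 2.0, 1.0, 2.0, 3.0, ..., 30.0}; the first two lanes have no predecessor two lanes away, so they keep their own input.
Implementation note: the 5-bit SHFL mask for logically splitting warps into sub-segments starts 8 bits up in the packed control operand of the shuffle instruction.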
| [in] | input | The value this thread contributes to the shuffle |
| [in] | src_offset | The relative down-offset of the peer to read from (the peer is warp-lane i - src_offset) |
| [in] | first_thread | Index of first lane in logical warp (typically 0) |
| [in] | member_mask | 32-bit mask of participating warp lanes |
Definition at line 517 of file util_ptx.cuh.
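The lane arithmetic described for ShuffleUp can also be sketched on the CPU. This is a host-side C++ model, not CUB code: `shuffle_up_model` and `demo_shuffle_up` are hypothetical names, `member_mask` is omitted on the assumption that all lanes participate, and generalizing the "i < src_offset" rule to "i - src_offset < first_thread" is an assumption about how the lower clamp behaves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Host-side model only (not CUB code, runs on the CPU).
// Lane i reads the input of lane i - src_offset. Assumption: when that source
// lane lies below first_thread, the lane keeps its own input.
template <std::size_t WARP_THREADS, typename T>
std::array<T, WARP_THREADS> shuffle_up_model(
    const std::array<T, WARP_THREADS>& input, int src_offset, int first_thread)
{
    static_assert(WARP_THREADS > 0 && WARP_THREADS <= 32 &&
                      (WARP_THREADS & (WARP_THREADS - 1)) == 0,
                  "logical warp size must be a power of two <= 32");
    assert(src_offset >= 0);
    assert(first_thread >= 0 && first_thread < static_cast<int>(WARP_THREADS));

    std::array<T, WARP_THREADS> output{};
    for (int lane = 0; lane < static_cast<int>(WARP_THREADS); ++lane) {
        const int src = lane - src_offset;
        const int chosen = (src >= first_thread) ? src : lane;
        output[static_cast<std::size_t>(lane)] =
            input[static_cast<std::size_t>(chosen)];
    }
    return output;
}

// Reproduces the documented example: thread_data = {1.0, ..., 32.0}, offset 2.
inline std::array<double, 32> demo_shuffle_up()
{
    std::array<double, 32> thread_data{};
    for (int lane = 0; lane < 32; ++lane) {
        thread_data[static_cast<std::size_t>(lane)] = lane + 1.0;
    }
    const auto peer_data = shuffle_up_model(thread_data, 2, 0);
    assert(peer_data[0] == 1.0);   // lane 0 has no peer and keeps its input
    assert(peer_data[31] == 30.0); // lane 31 reads lane 29
    return peer_data;
}
```

Under these assumptions `demo_shuffle_up()` returns the values for the example above; on a device the equivalent call would go through `cub::ShuffleUp` with the signature listed in this section.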