|
template<int LOGICAL_WARP_THREADS, typename T > |
__device__ __forceinline__ T | cub::ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask) |
| Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread.
|
|
template<int LOGICAL_WARP_THREADS, typename T > |
__device__ __forceinline__ T | cub::ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask) |
| Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i + src_offset). For thread lanes where i + src_offset >= WARP_THREADS, the thread's own input is returned to the thread.
|
|
template<int LOGICAL_WARP_THREADS, typename T > |
__device__ __forceinline__ T | cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask) |
| Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.
|
|
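The up and down variants summarized above can be modeled on the host to make the lane arithmetic concrete. The sketch below is not CUB code: `ref_shuffle_up`, `ref_shuffle_down`, and `kWarpThreads` are hypothetical names, and it assumes `first_thread` and `last_thread` are the inclusive lower and upper lane bounds of the logical warp and that `src_offset` is non-negative.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpThreads = 32;  // hypothetical constant: lanes in one warp

// Host-side model of shuffle-up (assumption, not CUB's implementation):
// lane i reads from lane i - src_offset, or keeps its own value when that
// source lane lies below first_thread.
template <typename T>
std::array<T, kWarpThreads> ref_shuffle_up(const std::array<T, kWarpThreads>& input,
                                           int src_offset, int first_thread)
{
    std::array<T, kWarpThreads> output = input;  // default: own input
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        const int src = lane - src_offset;
        if (src >= first_thread && src >= 0 && src < kWarpThreads) {
            output[lane] = input[src];
        }
    }
    return output;
}

// Host-side model of shuffle-down (assumption, not CUB's implementation):
// lane i reads from lane i + src_offset, or keeps its own value when that
// source lane lies above last_thread.
template <typename T>
std::array<T, kWarpThreads> ref_shuffle_down(const std::array<T, kWarpThreads>& input,
                                             int src_offset, int last_thread)
{
    std::array<T, kWarpThreads> output = input;  // default: own input
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        const int src = lane + src_offset;
        if (src <= last_thread && src >= 0 && src < kWarpThreads) {
            output[lane] = input[src];
        }
    }
    return output;
}
```

With an input where lane i holds the value i, `ref_shuffle_up(in, 2, 0)` leaves lanes 0 and 1 unchanged and gives lane 2 the value from lane 0, while `ref_shuffle_down(in, 2, 31)` leaves lanes 30 and 31 unchanged.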
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask)
Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.
- Template Parameters
-
LOGICAL_WARP_THREADS | The number of threads per "logical" warp. Must be a power-of-two <= 32. |
T | [inferred] The input/output element type |
- Available only for SM3.0 or newer
- Snippet
- The code snippet below illustrates each thread obtaining a double value from warp-lane 0.
#include <cub/cub.cuh>
__global__ void ExampleKernel(...)
{
    // Obtain one input item per thread
    double thread_data = ...
    // Obtain the item held by warp-lane 0
    double peer_data = cub::ShuffleIndex<32>(thread_data, 0, 0xffffffff);
}
- Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 1.0, 1.0, 1.0, 1.0, ..., 1.0}.
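The broadcast result described above can be reproduced with a small host-side model. This is a sketch of the documented behavior only, not the device implementation: `ref_shuffle_index`, `make_example_input`, and `kLanes` are hypothetical names, and the out-of-range rule simply follows the description given for `src_lane` in this entry.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 32;  // hypothetical constant: lanes in one warp

// Host-side model of the documented broadcast (assumption, not CUB code):
// every lane receives input[src_lane]; when src_lane is out of range,
// each lane keeps its own input, as the description states.
template <typename T>
std::array<T, kLanes> ref_shuffle_index(const std::array<T, kLanes>& input, int src_lane)
{
    std::array<T, kLanes> output = input;
    if (src_lane >= 0 && src_lane < kLanes) {
        output.fill(input[src_lane]);
    }
    return output;
}

// Builds the example input {1.0, 2.0, ..., 32.0} used in the text.
inline std::array<double, kLanes> make_example_input()
{
    std::array<double, kLanes> data{};
    for (int lane = 0; lane < kLanes; ++lane) {
        data[lane] = static_cast<double>(lane + 1);
    }
    return data;
}
```

Calling `ref_shuffle_index(make_example_input(), 0)` yields 1.0 in every lane, matching the peer_data values listed above.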
Implementation note: the 5-bit SHFL mask that logically splits a warp into sub-segments starts 8 bits up in the shuffle control word.
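To illustrate what "8 bits up" means, the sketch below packs a shuffle control word under the assumption that it follows the PTX `shfl` operand layout: a 5-bit segment mask in bits 8 to 12 and a 5-bit clamp value in bits 0 to 4. The function name `shfl_control_word` is hypothetical, and the formula is one plausible packing for a power-of-two logical warp width, not a quotation of the library source.

```cpp
#include <cassert>

// Packs a shuffle control word for a logical warp of the given width
// (assumed to be a power of two <= 32). Assumed layout:
//   bits 8..12 : segment mask (32 - width) that splits the warp into sub-segments
//   bits 0..4  : clamp value (width - 1), the highest lane offset in a segment
constexpr unsigned int shfl_control_word(unsigned int logical_warp_threads)
{
    return ((32u - logical_warp_threads) << 8) | (logical_warp_threads - 1u);
}
```

For a full 32-thread warp the segment mask is zero and only the clamp bits are set; for a 16-thread logical warp the segment mask becomes 16, which lands at bit 12 after the shift.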
- Parameters
-
[in] | input | The value to broadcast |
[in] | src_lane | Which warp lane is to do the broadcasting |
[in] | member_mask | 32-bit mask of participating warp lanes |
Definition at line 656 of file util_ptx.cuh.