The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block.
T | The data type to be exchanged. |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
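PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35) (default: the value of __CUDA_ARCH__ during the current compiler pass) |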
Definition at line 70 of file block_shuffle.cuh.
Data Structures | |
struct | _TempStorage |
Shared memory storage layout type (last element from each thread's input) | |
struct | TempStorage |
The operations exposed by BlockShuffle require a temporary memory allocation of this nested type for thread communication. | |
Public Member Functions | |
Collective constructors | |
__device__ __forceinline__ | BlockShuffle () |
Collective constructor using a private static allocation of shared memory as temporary storage. | |
__device__ __forceinline__ | BlockShuffle (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. | |
Shuffle movement | |
__device__ __forceinline__ void | Offset (T input, T &output, int distance=1) |
Each thread_i obtains the input provided by thread_{i+distance}. The offset distance may be negative. | |
__device__ __forceinline__ void | Rotate (T input, T &output, unsigned int distance=1) |
Each thread_i obtains the input provided by thread_{i+distance}. | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | Up (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD]) |
The thread block rotates its blocked arrangement of input items, shifting it up by one item. | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | Up (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD], T &block_suffix) |
The thread block rotates its blocked arrangement of input items, shifting it up by one item. All threads receive input[ITEMS_PER_THREAD-1] provided by thread_{BLOCK_THREADS-1}. | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | Down (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD]) |
The thread block rotates its blocked arrangement of input items, shifting it down by one item. | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | Down (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD], T &block_prefix) |
The thread block rotates its blocked arrangement of input items, shifting it down by one item. All threads receive input[0] provided by thread_0. | |
Private Member Functions | |
__device__ __forceinline__ _TempStorage & | PrivateStorage () |
Internal storage allocator. | |
Private Attributes | |
_TempStorage & | temp_storage |
Shared storage reference. | |
unsigned int | linear_tid |
Linear thread-id. | |
|
private |
Definition at line 78 of file block_shuffle.cuh.
__device__ __forceinline__ BlockShuffle () | inline |
Collective constructor using a private static allocation of shared memory as temporary storage.
Definition at line 140 of file block_shuffle.cuh.
__device__ __forceinline__ BlockShuffle (TempStorage &temp_storage) | inline |
Collective constructor using the specified memory allocation as temporary storage.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
Definition at line 150 of file block_shuffle.cuh.
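template<int ITEMS_PER_THREAD> __device__ __forceinline__ void Down (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD]) | inline |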
The thread block rotates its blocked arrangement of input items, shifting it down by one item.
[in] | input | The calling thread's input items |
[out] | prev | The corresponding predecessor items (may be aliased to input). The value prev[0] is not updated for thread_{BLOCK_THREADS-1}. |
Definition at line 263 of file block_shuffle.cuh.
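template<int ITEMS_PER_THREAD> __device__ __forceinline__ void Down (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD], T &block_prefix) | inline |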
The thread block rotates its blocked arrangement of input items, shifting it down by one item. All threads receive input[0] provided by thread_0.
[in] | input | The calling thread's input items |
[out] | prev | The corresponding predecessor items (may be aliased to input). The value prev[0] is not updated for thread_{BLOCK_THREADS-1}. |
[out] | block_prefix | The item input[0] from thread_0, provided to all threads |
Definition at line 289 of file block_shuffle.cuh.
__device__ __forceinline__ void Offset (T input, T &output, int distance=1) | inline |
Each thread_i obtains the input provided by thread_{i+distance}. The offset distance may be negative.
[in] | input | The input item from the calling thread (thread_i) |
[out] | output | The input item from the successor (or predecessor) thread thread_{i+distance} (may be aliased to input). This value is only updated for thread_i when 0 <= (i + distance) <= BLOCK_THREADS-1 |
[in] | distance | Offset distance (may be negative) |
Definition at line 171 of file block_shuffle.cuh.
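The behaviour described for Offset can be pictured with a sequential host-side model. This is only a sketch of the documented semantics, not CUB code: the helper name `offset_model` is hypothetical, `inputs[i]` stands in for the input of thread_i, and the sketch assumes that every existing thread (indices 0 through BLOCK_THREADS-1) is a valid source.

```cpp
#include <cassert>
#include <vector>

// Sequential host-side model of the documented Offset semantics (not CUB code).
// inputs[i] plays the role of thread_i's input, outputs[i] of its output.
// The name offset_model is hypothetical.
template <typename T>
void offset_model(const std::vector<T>& inputs, std::vector<T>& outputs, int distance = 1)
{
    assert(inputs.size() == outputs.size());
    const int block_threads = static_cast<int>(inputs.size());

    // Snapshot the inputs before any output is written. This stands in for
    // the "write, synchronize, read" order of a block-wide exchange and
    // keeps the model correct when outputs aliases inputs.
    const std::vector<T> staged(inputs);

    for (int i = 0; i < block_threads; ++i) {
        const int src = i + distance;
        // Assumption: any existing thread is a valid source. When the source
        // index falls outside the block, the output slot keeps its old value.
        if (src >= 0 && src < block_threads) {
            outputs[i] = staged[src];
        }
    }
}
```

With four threads holding 10, 11, 12, 13 and `distance = 1`, the first three outputs become 11, 12, 13 and the last output is left untouched, which is why callers should initialise `output` before the call.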
__device__ __forceinline__ _TempStorage & PrivateStorage () | inline private |
Internal storage allocator.
Definition at line 123 of file block_shuffle.cuh.
__device__ __forceinline__ void Rotate (T input, T &output, unsigned int distance=1) | inline |
Each thread_i obtains the input provided by thread_{i+distance}.
[in] | input | The calling thread's input item |
[out] | output | The input item from thread_{(i+distance) % BLOCK_THREADS} (may be aliased to input). This value is not updated for thread_{BLOCK_THREADS-1} |
[in] | distance | Offset distance (0 < distance < BLOCK_THREADS ) |
Definition at line 191 of file block_shuffle.cuh.
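The wraparound indexing in the output description can be sketched as a sequential host-side model. This is an illustration under stated assumptions rather than CUB code: the helper name `rotate_model` is hypothetical, and the sketch assumes that the modulo gives every thread a source, so it writes every output slot and does not model the note about thread_{BLOCK_THREADS-1}.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-side model of the documented Rotate indexing (not CUB code).
// inputs[i] plays the role of thread_i's input. The name rotate_model is
// hypothetical.
template <typename T>
void rotate_model(const std::vector<T>& inputs, std::vector<T>& outputs,
                  unsigned int distance = 1)
{
    assert(inputs.size() == outputs.size());
    const std::size_t block_threads = inputs.size();

    // Documented precondition: 0 < distance < BLOCK_THREADS.
    assert(distance > 0 && distance < block_threads);

    // Snapshot first so that outputs may alias inputs.
    const std::vector<T> staged(inputs);

    for (std::size_t i = 0; i < block_threads; ++i) {
        // Assumption: every thread reads from (i + distance) % BLOCK_THREADS,
        // so every output slot is written in this model.
        outputs[i] = staged[(i + distance) % block_threads];
    }
}
```

Unlike Offset, the distance is unsigned, so a rotation in the opposite direction by k would be expressed in this model as a distance of BLOCK_THREADS - k.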
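template<int ITEMS_PER_THREAD> __device__ __forceinline__ void Up (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD]) | inline |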
The thread block rotates its blocked arrangement of input items, shifting it up by one item.
[in] | input | The calling thread's input items |
[out] | prev | The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread_0. |
Definition at line 217 of file block_shuffle.cuh.
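A sequential host-side model may help show what "shifting up by one item" means for a blocked arrangement, in which each thread owns ITEMS_PER_THREAD consecutive items. This is a sketch of the documented semantics, not CUB code: the helper name `up_model` is hypothetical and `input[tid]` stands in for the item array of thread_tid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-side model of the documented Up semantics (not CUB code).
// input[tid] plays the role of thread_tid's items in a blocked arrangement.
// The name up_model is hypothetical.
template <typename T, std::size_t ITEMS_PER_THREAD>
void up_model(const std::vector<std::array<T, ITEMS_PER_THREAD>>& input,
              std::vector<std::array<T, ITEMS_PER_THREAD>>& prev)
{
    static_assert(ITEMS_PER_THREAD > 0, "each thread needs at least one item");
    assert(input.size() == prev.size());

    // Snapshot first so that prev may alias input.
    const auto staged = input;

    for (std::size_t tid = 0; tid < staged.size(); ++tid) {
        // Within a thread, each item receives its in-thread predecessor.
        for (std::size_t item = ITEMS_PER_THREAD - 1; item > 0; --item) {
            prev[tid][item] = staged[tid][item - 1];
        }
        // The first slot receives the last item of the previous thread.
        // Thread 0 has no predecessor, so its prev[0] is left untouched.
        if (tid > 0) {
            prev[tid][0] = staged[tid - 1][ITEMS_PER_THREAD - 1];
        }
    }
}
```

For three threads holding {0, 1}, {2, 3}, {4, 5}, the model produces {?, 0}, {1, 2}, {3, 4}, where ? is whatever thread_0 had in prev[0] before the call.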
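template<int ITEMS_PER_THREAD> __device__ __forceinline__ void Up (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD], T &block_suffix) | inline |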
The thread block rotates its blocked arrangement of input items, shifting it up by one item. All threads receive input[ITEMS_PER_THREAD-1] provided by thread_{BLOCK_THREADS-1}.
[in] | input | The calling thread's input items |
[out] | prev | The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread_0. |
[out] | block_suffix | The item input[ITEMS_PER_THREAD-1] from thread_{BLOCK_THREADS-1}, provided to all threads |
Definition at line 244 of file block_shuffle.cuh.
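The block_suffix overload can be modelled the same way on the host. This is a sketch under stated assumptions, not CUB code: the helper name `up_with_suffix_model` is hypothetical, and a single `block_suffix` variable stands in for the per-thread output that, in a real block, every thread would receive with the same value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-side model of Up with a block-wide suffix (not CUB code).
// The name up_with_suffix_model is hypothetical.
template <typename T, std::size_t ITEMS_PER_THREAD>
void up_with_suffix_model(const std::vector<std::array<T, ITEMS_PER_THREAD>>& input,
                          std::vector<std::array<T, ITEMS_PER_THREAD>>& prev,
                          T& block_suffix)
{
    static_assert(ITEMS_PER_THREAD > 0, "each thread needs at least one item");
    assert(!input.empty() && input.size() == prev.size());

    // Snapshot first so that prev may alias input.
    const auto staged = input;

    for (std::size_t tid = 0; tid < staged.size(); ++tid) {
        for (std::size_t item = ITEMS_PER_THREAD - 1; item > 0; --item) {
            prev[tid][item] = staged[tid][item - 1];
        }
        if (tid > 0) {
            prev[tid][0] = staged[tid - 1][ITEMS_PER_THREAD - 1];
        }
    }

    // The last item of the last thread is the value that the shift pushes
    // out of the block; it is reported to every thread as block_suffix.
    block_suffix = staged.back()[ITEMS_PER_THREAD - 1];
}
```

The suffix is the one input item that no `prev` slot receives, so a caller processing consecutive tiles could carry it into the next tile.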
unsigned int linear_tid | private |
Linear thread-id.
Definition at line 115 of file block_shuffle.cuh.
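The source only describes this member as a linear thread-id. A minimal host-side sketch of one common linearization follows, assuming row-major order with x varying fastest and BLOCK_THREADS equal to the product of the three block dimensions. `ThreadIndex`, `block_threads_model` and `linear_tid_model` are hypothetical names, with `ThreadIndex` standing in for CUDA's `threadIdx`.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's threadIdx so the sketch runs on the host.
struct ThreadIndex {
    unsigned int x, y, z;
};

// Assumption: the block's thread count is the product of its dimensions.
template <unsigned int BLOCK_DIM_X, unsigned int BLOCK_DIM_Y = 1, unsigned int BLOCK_DIM_Z = 1>
constexpr unsigned int block_threads_model()
{
    return BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z;
}

// Assumption: row-major linearization, x fastest, then y, then z.
template <unsigned int BLOCK_DIM_X, unsigned int BLOCK_DIM_Y = 1, unsigned int BLOCK_DIM_Z = 1>
inline unsigned int linear_tid_model(ThreadIndex idx)
{
    assert(idx.x < BLOCK_DIM_X && idx.y < BLOCK_DIM_Y && idx.z < BLOCK_DIM_Z);
    return (idx.z * BLOCK_DIM_Y + idx.y) * BLOCK_DIM_X + idx.x;
}
```

Under these assumptions a 4 x 2 x 2 block has 16 threads, and the thread at (x, y, z) = (3, 1, 1) maps to the last linear id, 15.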
_TempStorage & temp_storage | private |
Shared storage reference.
Definition at line 112 of file block_shuffle.cuh.