The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.

Template Parameters | |
| InputT | The data type to read into (which must be convertible from the input iterator's value type). |
| BLOCK_DIM_X | The thread block length in threads along the X dimension. |
| ITEMS_PER_THREAD | The number of consecutive items partitioned onto each thread. |
| ALGORITHM | [optional] cub::BlockLoadAlgorithm tuning policy (default: cub::BLOCK_LOAD_DIRECT) |
| WARP_TIME_SLICING | [optional] Whether a single warp's worth of shared memory should be allocated and time-sliced among the block's warps during any load-related data transpositions, rather than each warp having its own storage (default: false) |
| BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
| BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
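| PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35) (default: the value of __CUDA_ARCH__ during the current compiler pass) |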
Example: a linear segment of 512 integers is loaded into a blocked arrangement across 128 threads, with each thread owning 4 consecutive items. The load is specialized for BLOCK_LOAD_WARP_TRANSPOSE, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads). Suppose the input d_data is 0, 1, 2, 3, 4, 5, .... The set of thread_data across the block of threads will then be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.
Definition at line 640 of file block_load.cuh.
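The example described above can be sketched as a small, self-contained program. This is an illustrative sketch rather than the library's own sample: the kernel name `BlockedLoadKernel`, the host helper `RunBlockedLoad`, and the output buffer `d_out` are assumptions introduced here so the result can be inspected on the host. It also assumes a CUB version in which the first template argument of `cub::BlockLoad` is the item type, as the parameter table above indicates.

```cuda
#include <cub/cub.cuh>
#include <vector>

// Hypothetical kernel: 128 threads, 4 consecutive items per thread.
__global__ void BlockedLoadKernel(const int *d_data, int *d_out)
{
    // Specialize BlockLoad for a 1D block of 128 threads owning 4 ints each
    typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;

    // Shared memory required by the collective for the transposition
    __shared__ typename BlockLoad::TempStorage temp_storage;

    // Load the segment of 512 items into a blocked arrangement
    int thread_data[4];
    BlockLoad(temp_storage).Load(d_data, thread_data);

    // Write each thread's items back in blocked order for inspection
    for (int i = 0; i < 4; ++i)
        d_out[threadIdx.x * 4 + i] = thread_data[i];
}

// Hypothetical host helper: fills the input with 0, 1, 2, ... and
// returns what the threads hold after the load.
std::vector<int> RunBlockedLoad()
{
    const int n = 128 * 4;
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    BlockedLoadKernel<<<1, 128>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With this setup thread 0 holds items 0 to 3, thread 1 holds items 4 to 7, and so on, so the buffer returned by `RunBlockedLoad` matches the input order.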
Data Structures | |
| struct | LoadInternal |
| Load helper. | |
| struct | LoadInternal< BLOCK_LOAD_DIRECT, DUMMY > |
| struct | LoadInternal< BLOCK_LOAD_TRANSPOSE, DUMMY > |
| struct | LoadInternal< BLOCK_LOAD_VECTORIZE, DUMMY > |
| struct | LoadInternal< BLOCK_LOAD_WARP_TRANSPOSE, DUMMY > |
| struct | LoadInternal< BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED, DUMMY > |
| struct | TempStorage |
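| The operations exposed by BlockLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. | |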
Public Member Functions | |
Collective constructors | |
| __device__ __forceinline__ | BlockLoad () |
| Collective constructor using a private static allocation of shared memory as temporary storage. | |
| __device__ __forceinline__ | BlockLoad (TempStorage &temp_storage) |
| Collective constructor using the specified memory allocation as temporary storage. | |
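The two constructors differ only in where the temporary storage comes from. The following is a minimal sketch of both, assuming a block of 64 threads with 2 items per thread; the alias `BlockLoadT`, the kernel names, and the host helper `RunLoad` are hypothetical names introduced for illustration.

```cuda
#include <cub/cub.cuh>
#include <vector>

// Illustrative specialization (the alias name is an assumption)
typedef cub::BlockLoad<int, 64, 2, cub::BLOCK_LOAD_TRANSPOSE> BlockLoadT;

// Variant A: default constructor. The collective uses a private static
// allocation of shared memory as its temporary storage.
__global__ void LoadWithPrivateStorage(const int *d_in, int *d_out)
{
    int items[2];
    BlockLoadT().Load(d_in, items);
    for (int i = 0; i < 2; ++i)
        d_out[threadIdx.x * 2 + i] = items[i];
}

// Variant B: the caller allocates TempStorage and passes it in.
__global__ void LoadWithCallerStorage(const int *d_in, int *d_out)
{
    __shared__ BlockLoadT::TempStorage temp_storage;
    int items[2];
    BlockLoadT(temp_storage).Load(d_in, items);
    for (int i = 0; i < 2; ++i)
        d_out[threadIdx.x * 2 + i] = items[i];
}

// Hypothetical host helper: runs one of the kernels on 0, 1, 2, ...
std::vector<int> RunLoad(void (*kernel)(const int *, int *))
{
    const int n = 64 * 2;
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    kernel<<<1, 64>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Both variants should produce the same blocked arrangement. Passing the storage explicitly is mainly useful when the same shared memory is to be reused by another collective, in which case a `__syncthreads()` barrier is needed between the two uses.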
Data movement | |
| template<typename InputIteratorT > | |
| __device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
| Load a linear segment of items from memory. | |
| template<typename InputIteratorT > | |
| __device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
| Load a linear segment of items from memory, guarded by range. | |
| template<typename InputIteratorT , typename DefaultT > | |
| __device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
| Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements. | |
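The guarded overloads can be illustrated with a short sketch. It uses the four-argument form, which assigns a fall-back value to every item at or beyond `valid_items`; the three-argument form leaves those items unassigned instead. The kernel name `GuardedLoadKernel`, the helper `RunGuardedLoad`, and the choice of -1 as the fall-back value are assumptions made for illustration.

```cuda
#include <cub/cub.cuh>
#include <vector>

// Hypothetical kernel: 128 threads, 4 items per thread, guarded by valid_items.
__global__ void GuardedLoadKernel(const int *d_data, int valid_items, int *d_out)
{
    typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    // Only the first valid_items elements are read from memory;
    // every other item is assigned the out-of-bounds default -1.
    int thread_data[4];
    BlockLoad(temp_storage).Load(d_data, thread_data, valid_items, -1);

    // Write each thread's items back in blocked order for inspection
    for (int i = 0; i < 4; ++i)
        d_out[threadIdx.x * 4 + i] = thread_data[i];
}

// Hypothetical host helper: the input holds only valid_items elements.
std::vector<int> RunGuardedLoad(int valid_items)
{
    const int n = 128 * 4;
    std::vector<int> h_in(valid_items), h_out(n, 0);
    for (int i = 0; i < valid_items; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, valid_items * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), valid_items * sizeof(int), cudaMemcpyHostToDevice);

    GuardedLoadKernel<<<1, 128>>>(d_in, valid_items, d_out);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For example, with an input of 0, 1, 2, ... and `valid_items` equal to 5, the threads hold { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }.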
Private Types | |
| enum | { BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z } |
| Constants. | |
| typedef LoadInternal< ALGORITHM, 0 > | InternalLoad |
| Internal load implementation to use. | |
| typedef InternalLoad::TempStorage | _TempStorage |
| Shared memory storage layout type. | |
Private Member Functions | |
| __device__ __forceinline__ _TempStorage & | PrivateStorage () |
| Internal storage allocator. | |
Private Attributes | |
| _TempStorage & | temp_storage |
| Thread reference to shared storage. | |
| int | linear_tid |
| Linear thread-id. | |