The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.
More...
template<typename InputT, int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >
The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.
- Template Parameters
-
InputT | The data type to read into (which must be convertible from the input iterator's value type). |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
ITEMS_PER_THREAD | The number of consecutive items partitioned onto each thread. |
ALGORITHM | [optional] cub::BlockLoadAlgorithm tuning policy. default: cub::BLOCK_LOAD_DIRECT. |
WARP_TIME_SLICING | [optional] Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (default: false) — NOTE(review): this parameter does not appear in the template signature above; it looks retained from an earlier interface — verify against block_load.cuh. |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUB_PTX_ARCH, i.e., the PTX version of the current compiler pass.) |
- Overview
-
- A Simple Example
- Every thread in the block uses the BlockLoad class by first specializing the BlockLoad type, then instantiating an instance with parameterized temporary storage, and finally invoking one or more collective member functions.
- The code snippet below illustrates the loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for
BLOCK_LOAD_WARP_TRANSPOSE
, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
int thread_data[4];
- Suppose the input
d_data
is 0, 1, 2, 3, 4, 5, ...
. The set of thread_data
across the block of threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }
.
Definition at line 640 of file block_load.cuh.
|
|
__device__ __forceinline__ | BlockLoad () |
| Collective constructor using a private static allocation of shared memory as temporary storage.
|
|
__device__ __forceinline__ | BlockLoad (TempStorage &temp_storage) |
| Collective constructor using the specified memory allocation as temporary storage. More...
|
|
|
template<typename InputIteratorT > |
__device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
| Load a linear segment of items from memory. More...
|
|
template<typename InputIteratorT > |
__device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
| Load a linear segment of items from memory, guarded by range. More...
|
|
template<typename InputIteratorT , typename DefaultT > |
__device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
| Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements. More...
|
|
◆ anonymous enum
template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
Constants.
Enumerator |
---|
BLOCK_THREADS | The thread block size in threads.
|
Definition at line 649 of file block_load.cuh.
◆ BlockLoad()
template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockLoad |
( |
TempStorage & |
temp_storage | ) |
|
|
inline |
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
-
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
Definition at line 1076 of file block_load.cuh.
◆ Load() [1/3]
template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT >
__device__ __forceinline__ void cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Load |
( |
InputIteratorT |
block_itr, |
|
|
InputT(&) |
items[ITEMS_PER_THREAD] |
|
) |
| |
|
inline |
Load a linear segment of items from memory.
-
- Snippet
- The code snippet below illustrates the loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for
BLOCK_LOAD_WARP_TRANSPOSE
, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
int thread_data[4];
- Suppose the input
d_data
is 0, 1, 2, 3, 4, 5, ...
. The set of thread_data
across the block of threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }
.
- Parameters
-
[in] | block_itr | The thread block's base input iterator for loading from |
[out] | items | Data to load |
Definition at line 1130 of file block_load.cuh.
◆ Load() [2/3]
template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT >
__device__ __forceinline__ void cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Load |
( |
InputIteratorT |
block_itr, |
|
|
InputT(&) |
items[ITEMS_PER_THREAD], |
|
|
int |
valid_items |
|
) |
| |
|
inline |
Load a linear segment of items from memory, guarded by range.
-
- Snippet
- The code snippet below illustrates the guarded loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for
BLOCK_LOAD_WARP_TRANSPOSE
, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh>
__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
int thread_data[4];
- Suppose the input
d_data
is 0, 1, 2, 3, 4, 5, 6...
and valid_items
is 5
. The set of thread_data
across the block of threads will be { [0,1,2,3], [4,?,?,?], ..., [?,?,?,?] }
, with only the first two threads being unmasked to load portions of valid data (and the remaining items left unassigned).
- Parameters
-
[in] | block_itr | The thread block's base input iterator for loading from |
[out] | items | Data to load |
[in] | valid_items | Number of valid items to load |
Definition at line 1176 of file block_load.cuh.
◆ Load() [3/3]
template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT , typename DefaultT >
__device__ __forceinline__ void cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Load |
( |
InputIteratorT |
block_itr, |
|
|
InputT(&) |
items[ITEMS_PER_THREAD], |
|
|
int |
valid_items, |
|
|
DefaultT |
oob_default |
|
) |
| |
|
inline |
Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements.
-
- Snippet
- The code snippet below illustrates the guarded loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for
BLOCK_LOAD_WARP_TRANSPOSE
, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh>
__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
int thread_data[4];
- Suppose the input
d_data
is 0, 1, 2, 3, 4, 5, 6...
, valid_items
is 5
, and the out-of-bounds default is -1
. The set of thread_data
across the block of threads in those threads will be { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }
, with only the first two threads being unmasked to load portions of valid data (and other items are assigned -1
)
- Parameters
-
[in] | block_itr | The thread block's base input iterator for loading from |
[out] | items | Data to load |
[in] | valid_items | Number of valid items to load |
[in] | oob_default | Default value to assign out-of-bound items |
Definition at line 1224 of file block_load.cuh.
The documentation for this class was generated from the following file: