OpenFPM_pdata  4.1.0
Project that contains the implementation of distributed structures
 
cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.

Detailed Description

template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.

Template Parameters
T                  The type of data to be written.
BLOCK_DIM_X        The thread block length in threads along the X dimension.
ITEMS_PER_THREAD   The number of consecutive items partitioned onto each thread.
ALGORITHM          [optional] cub::BlockStoreAlgorithm tuning policy enumeration (default: cub::BLOCK_STORE_DIRECT)
WARP_TIME_SLICING  [optional] Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage) (default: false). This parameter does not appear in the template signature above.
BLOCK_DIM_Y        [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z        [optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH           [optional] The PTX compute capability for which to specialize this collective (default: CUB_PTX_ARCH)
Overview
A Simple Example
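Every thread in the block uses the BlockStore class by first specializing the BlockStore type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.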
The code snippet below illustrates the storing of a "blocked" arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for BLOCK_STORE_WARP_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
    typedef cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStore;

    // Allocate shared memory for BlockStore
    __shared__ typename BlockStore::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    BlockStore(temp_storage).Store(d_data, thread_data);
}
Suppose the set of thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....
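The mapping from the blocked arrangement to linear memory in the example above can be checked on the host. The following is a minimal sketch in plain C++, not CUB code: the names blocked_store_model and blocked_store_example are hypothetical, and the loop only models where each item lands, not how the warp-transposed store moves data on the device.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not part of CUB): emulates where a blocked
// store places each item, visiting one simulated thread at a time.
template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
std::vector<T> blocked_store_model(
    const std::array<std::array<T, ITEMS_PER_THREAD>, BLOCK_THREADS>& thread_data)
{
    std::vector<T> out(static_cast<std::size_t>(BLOCK_THREADS) * ITEMS_PER_THREAD);
    for (int tid = 0; tid < BLOCK_THREADS; ++tid) {
        for (int item = 0; item < ITEMS_PER_THREAD; ++item) {
            // Thread tid owns the consecutive range starting at tid * ITEMS_PER_THREAD.
            const std::size_t offset =
                static_cast<std::size_t>(tid) * ITEMS_PER_THREAD + item;
            out[offset] = thread_data[tid][item];
        }
    }
    return out;
}

// Builds the input from the example: thread t owns {4t, 4t+1, 4t+2, 4t+3}.
inline std::vector<int> blocked_store_example()
{
    std::array<std::array<int, 4>, 128> thread_data{};
    for (int tid = 0; tid < 128; ++tid)
        for (int item = 0; item < 4; ++item)
            thread_data[tid][item] = tid * 4 + item;
    return blocked_store_model<int, 128, 4>(thread_data);
}
```

Calling blocked_store_example() yields a 512-element vector in which element k holds the value k, matching the d_data sequence described above.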

Definition at line 515 of file block_store.cuh.

Data Structures

struct  StoreInternal
 Store helper.
 
struct  StoreInternal< BLOCK_STORE_DIRECT, DUMMY >
 
struct  StoreInternal< BLOCK_STORE_TRANSPOSE, DUMMY >
 
struct  StoreInternal< BLOCK_STORE_VECTORIZE, DUMMY >
 
struct  StoreInternal< BLOCK_STORE_WARP_TRANSPOSE, DUMMY >
 
struct  StoreInternal< BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED, DUMMY >
 
struct  TempStorage
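 The operations exposed by BlockStore require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse.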
 

Public Member Functions

Collective constructors
__device__ __forceinline__ BlockStore ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ BlockStore (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage.
 
Data movement
template<typename OutputIteratorT >
__device__ __forceinline__ void Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store items into a linear segment of memory.
 
template<typename OutputIteratorT >
__device__ __forceinline__ void Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store items into a linear segment of memory, guarded by range.
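The range guard in the second Store overload can be illustrated on the host. This is a minimal sketch in plain C++, not CUB code: guarded_blocked_store_model is a hypothetical name, and the sketch assumes that an item is written only when its linear offset in the blocked arrangement is less than valid_items, and that every simulated thread holds the same number of items.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not part of CUB) of a range-guarded blocked store.
// Assumption: an item is written only if its linear offset is below valid_items.
template <typename T>
void guarded_blocked_store_model(std::vector<T>& out,
                                 const std::vector<std::vector<T>>& thread_data,
                                 int valid_items)
{
    // Treat a non-positive valid_items as "store nothing".
    const std::size_t limit =
        valid_items > 0 ? static_cast<std::size_t>(valid_items) : 0;

    for (std::size_t tid = 0; tid < thread_data.size(); ++tid) {
        const std::size_t items_per_thread = thread_data[tid].size();
        for (std::size_t item = 0; item < items_per_thread; ++item) {
            const std::size_t offset = tid * items_per_thread + item;
            // Skip offsets past the guard (and past the end of the output buffer).
            if (offset < limit && offset < out.size())
                out[offset] = thread_data[tid][item];
        }
    }
}
```

With thread_data of { [0,1,2,3], [4,5,6,7], ... } and valid_items equal to 5, only offsets 0 through 4 are written; the remaining elements of the output keep their previous contents.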
 

Private Types

enum  { BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z }
 Constants.
 
typedef StoreInternal< ALGORITHM, 0 > InternalStore
 Internal store implementation to use.
 
typedef InternalStore::TempStorage _TempStorage
 Shared memory storage layout type.
 

Private Member Functions

__device__ __forceinline__ _TempStorage & PrivateStorage ()
 Internal storage allocator.
 

Private Attributes

_TempStorage & temp_storage
 Thread reference to shared storage.
 
int linear_tid
 Linear thread-id.
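For multi-dimensional blocks, BLOCK_THREADS and the linear thread-id can be modelled on the host as below. This is a minimal sketch in plain C++, not CUB code: block_threads and linear_tid_model are hypothetical names, and the row-major linearization (x varies fastest, then y, then z) is an assumption, since this page does not state how linear_tid is computed.

```cpp
// Hypothetical host-side helpers (not part of CUB).

// Mirrors BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z.
constexpr int block_threads(int dim_x, int dim_y = 1, int dim_z = 1)
{
    return dim_x * dim_y * dim_z;
}

// Assumed row-major linearization: x varies fastest, then y, then z.
constexpr int linear_tid_model(int x, int y, int z, int dim_x, int dim_y)
{
    return (z * dim_y + y) * dim_x + x;
}

// A 16 x 8 block holds as many threads as a 1D block of 128.
static_assert(block_threads(128) == 128, "1D block of 128 threads");
static_assert(block_threads(16, 8) == 128, "2D block of 16 x 8 threads");
```

Under this assumption, the last thread of a 16 x 8 block, at (x, y, z) = (15, 7, 0), has linear id 127, so it would own the final ITEMS_PER_THREAD items of the blocked arrangement.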
 

The documentation for this class was generated from the following file: