OpenFPM_pdata  3.0.0
Project that contain the implementation of distributed structures
cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block. More...

Detailed Description

template<typename InputT, int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block.

Template Parameters
TThe data type to be exchanged.
BLOCK_DIM_XThe thread block length in threads along the X dimension
ITEMS_PER_THREADThe number of items partitioned onto each thread.
WARP_TIME_SLICING[optional] When true, only use enough shared memory for a single warp's worth of tile data, time-slicing the block-wide exchange over multiple synchronized rounds. Yields a smaller memory footprint at the expense of decreased parallelism. (Default: false)
BLOCK_DIM_Y[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH[optional] \ptxversion
Overview
  • It is commonplace for blocks of threads to rearrange data items between threads. For example, the device-accessible memory subsystem prefers access patterns where data items are "striped" across threads (where consecutive threads access consecutive items), yet most block-wide operations prefer a "blocked" partitioning of items across threads (where consecutive items belong to a single thread).
  • BlockExchange supports the following types of data exchanges:
  • \rowmajor
A Simple Example
\blockcollective{BlockExchange}
The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement of 512 integer items partitioned across 128 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockExchange
__shared__ typename BlockExchange::TempStorage temp_storage;
// Load a tile of data striped across threads
int thread_data[4];
cub::LoadDirectStriped<128>(threadIdx.x, d_data, thread_data);
// Collectively exchange data into a blocked arrangement across threads
BlockExchange(temp_storage).StripedToBlocked(thread_data);
Suppose the set of striped input thread_data across the block of threads is { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
Performance Considerations
  • Proper device-specific padding ensures zero bank conflicts for most types.

Definition at line 116 of file block_exchange.cuh.

Data Structures

struct  TempStorage
 \smemstorage{BlockExchange} More...
 

Public Member Functions

__device__ __forceinline__ void StripedToBlocked (InputT items[ITEMS_PER_THREAD])
 
__device__ __forceinline__ void BlockedToStriped (InputT items[ITEMS_PER_THREAD])
 
__device__ __forceinline__ void WarpStripedToBlocked (InputT items[ITEMS_PER_THREAD])
 
__device__ __forceinline__ void BlockedToWarpStriped (InputT items[ITEMS_PER_THREAD])
 
template<typename OffsetT >
__device__ __forceinline__ void ScatterToBlocked (InputT items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD])
 
template<typename OffsetT >
__device__ __forceinline__ void ScatterToStriped (InputT items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD])
 
template<typename OffsetT >
__device__ __forceinline__ void ScatterToStripedGuarded (InputT items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD])
 
template<typename OffsetT , typename ValidFlag >
__device__ __forceinline__ void ScatterToStripedFlagged (InputT items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD], ValidFlag is_valid[ITEMS_PER_THREAD])
 
Collective constructors
__device__ __forceinline__ BlockExchange ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ BlockExchange (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Structured exchanges
template<typename OutputT >
__device__ __forceinline__ void StripedToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD])
 Transposes data items from striped arrangement to blocked arrangement. More...
 
template<typename OutputT >
__device__ __forceinline__ void BlockedToStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD])
 Transposes data items from blocked arrangement to striped arrangement. More...
 
template<typename OutputT >
__device__ __forceinline__ void WarpStripedToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD])
 Transposes data items from warp-striped arrangement to blocked arrangement. More...
 
template<typename OutputT >
__device__ __forceinline__ void BlockedToWarpStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD])
 Transposes data items from blocked arrangement to warp-striped arrangement. More...
 
Scatter exchanges
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD])
 Exchanges data items annotated by rank into blocked arrangement. More...
 
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD])
 Exchanges data items annotated by rank into striped arrangement. More...
 
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToStripedGuarded (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD])
 Exchanges data items annotated by rank into striped arrangement. Items with rank -1 are not exchanged. More...
 
template<typename OutputT , typename OffsetT , typename ValidFlag >
__device__ __forceinline__ void ScatterToStripedFlagged (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD], ValidFlag is_valid[ITEMS_PER_THREAD])
 Exchanges valid data items annotated by rank into striped arrangement. More...
 

Private Types

enum  {
  BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z, LOG_WARP_THREADS = CUB_LOG_WARP_THREADS(PTX_ARCH), WARP_THREADS = 1 << LOG_WARP_THREADS, WARPS = (BLOCK_THREADS + WARP_THREADS - 1) / WARP_THREADS,
  LOG_SMEM_BANKS = CUB_LOG_SMEM_BANKS(PTX_ARCH), SMEM_BANKS = 1 << LOG_SMEM_BANKS, TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD, TIME_SLICES = (WARP_TIME_SLICING) ? WARPS : 1,
  TIME_SLICED_THREADS = (WARP_TIME_SLICING) ? CUB_MIN(BLOCK_THREADS, WARP_THREADS) : BLOCK_THREADS, TIME_SLICED_ITEMS = TIME_SLICED_THREADS * ITEMS_PER_THREAD, WARP_TIME_SLICED_THREADS = CUB_MIN(BLOCK_THREADS, WARP_THREADS), WARP_TIME_SLICED_ITEMS = WARP_TIME_SLICED_THREADS * ITEMS_PER_THREAD,
  INSERT_PADDING = (ITEMS_PER_THREAD > 4) && (PowerOfTwo<ITEMS_PER_THREAD>::VALUE), PADDING_ITEMS = (INSERT_PADDING) ? (TIME_SLICED_ITEMS >> LOG_SMEM_BANKS) : 0
}
 Constants. More...
 

Private Member Functions

struct __align__ (16) _TempStorage
 Shared memory storage layout type.
 
__device__ __forceinline__ _TempStorage & PrivateStorage ()
 Internal storage allocator.
 
template<typename OutputT >
__device__ __forceinline__ void BlockedToStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< false >)
 
template<typename OutputT >
__device__ __forceinline__ void BlockedToStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< true >)
 
template<typename OutputT >
__device__ __forceinline__ void BlockedToWarpStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< false >)
 
template<typename OutputT >
__device__ __forceinline__ void BlockedToWarpStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< true >)
 
template<typename OutputT >
__device__ __forceinline__ void StripedToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< false >)
 
template<typename OutputT >
__device__ __forceinline__ void StripedToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< true >)
 
template<typename OutputT >
__device__ __forceinline__ void WarpStripedToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< false >)
 
template<typename OutputT >
__device__ __forceinline__ void WarpStripedToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], Int2Type< true >)
 
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD], Int2Type< false >)
 
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToBlocked (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD], Int2Type< true >)
 
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD], Int2Type< false >)
 
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToStriped (InputT input_items[ITEMS_PER_THREAD], OutputT output_items[ITEMS_PER_THREAD], OffsetT ranks[ITEMS_PER_THREAD], Int2Type< true >)
 

Private Attributes

_TempStorage & temp_storage
 Shared storage reference.
 
unsigned int linear_tid
 Linear thread-id.
 
unsigned int lane_id
 
unsigned int warp_id
 
unsigned int warp_offset
 

Member Enumeration Documentation

◆ anonymous enum

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
anonymous enum
private

Constants.

Enumerator
BLOCK_THREADS 

The thread block size in threads.

Definition at line 125 of file block_exchange.cuh.

Constructor & Destructor Documentation

◆ BlockExchange()

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockExchange ( TempStorage temp_storage)
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in]temp_storageReference to memory allocation having layout type TempStorage

Definition at line 726 of file block_exchange.cuh.

Member Function Documentation

◆ BlockedToStriped() [1/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< false >   
)
inlineprivate

Transposes data items from blocked arrangement to striped arrangement. Specialized for no timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 200 of file block_exchange.cuh.

◆ BlockedToStriped() [2/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< true >   
)
inlineprivate

Transposes data items from blocked arrangement to striped arrangement. Specialized for warp-timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 229 of file block_exchange.cuh.

◆ BlockedToStriped() [3/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD] 
)
inline

Transposes data items from blocked arrangement to striped arrangement.

  • \smemreuse
Snippet
The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement of 512 integer items partitioned across 128 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockExchange
__shared__ typename BlockExchange::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively exchange data into a striped arrangement across threads
BlockExchange(temp_storage).BlockedToStriped(thread_data, thread_data);
// Store data striped across block threads into an ordered tile
cub::StoreDirectStriped<STORE_DEFAULT, 128>(threadIdx.x, d_data, thread_data);
Suppose the set of blocked input thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. The corresponding output thread_data in those threads will be { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] } in preparation for storing to device-accessible memory.
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.

Definition at line 829 of file block_exchange.cuh.

◆ BlockedToStriped() [4/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToStriped ( InputT  items[ITEMS_PER_THREAD])
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.

Definition at line 1069 of file block_exchange.cuh.

◆ BlockedToWarpStriped() [1/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToWarpStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< false >   
)
inlineprivate

Transposes data items from blocked arrangement to warp-striped arrangement. Specialized for no timeslicing

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 289 of file block_exchange.cuh.

◆ BlockedToWarpStriped() [2/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToWarpStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< true >   
)
inlineprivate

Transposes data items from blocked arrangement to warp-striped arrangement. Specialized for warp-timeslicing

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 317 of file block_exchange.cuh.

◆ BlockedToWarpStriped() [3/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToWarpStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD] 
)
inline

Transposes data items from blocked arrangement to warp-striped arrangement.

  • \smemreuse
Snippet
The code snippet below illustrates the conversion from a "blocked" to a "warp-striped" arrangement of 512 integer items partitioned across 128 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockExchange
__shared__ typename BlockExchange::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Collectively exchange data into a warp-striped arrangement across threads
BlockExchange(temp_storage).BlockedToWarpStriped(thread_data, thread_data);
// Store data striped across warp threads into an ordered tile
cub::StoreDirectStriped<STORE_DEFAULT, 128>(threadIdx.x, d_data, thread_data);
Suppose the set of blocked input thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. The corresponding output thread_data in those threads will be { [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] } in preparation for storing to device-accessible memory. (The first 128 items are striped across the first warp of 32 threads, the second 128 items are striped across the second warp, etc.)
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.

Definition at line 928 of file block_exchange.cuh.

◆ BlockedToWarpStriped() [4/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockedToWarpStriped ( InputT  items[ITEMS_PER_THREAD])
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.

Definition at line 1081 of file block_exchange.cuh.

◆ ScatterToBlocked() [1/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD],
Int2Type< false >   
)
inlineprivate

Exchanges data items annotated by rank into blocked arrangement. Specialized for no timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 534 of file block_exchange.cuh.

◆ ScatterToBlocked() [2/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD],
Int2Type< true >   
)
inlineprivate

Exchanges data items annotated by rank into blocked arrangement. Specialized for warp-timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 563 of file block_exchange.cuh.

◆ ScatterToBlocked() [3/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD] 
)
inline

Exchanges data items annotated by rank into blocked arrangement.

  • \smemreuse
Template Parameters
OffsetT[inferred] Signed integer type for local offsets
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 953 of file block_exchange.cuh.

◆ ScatterToBlocked() [4/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToBlocked ( InputT  items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD] 
)
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 1088 of file block_exchange.cuh.

◆ ScatterToStriped() [1/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD],
Int2Type< false >   
)
inlineprivate

Exchanges data items annotated by rank into striped arrangement. Specialized for no timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 616 of file block_exchange.cuh.

◆ ScatterToStriped() [2/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD],
Int2Type< true >   
)
inlineprivate

Exchanges data items annotated by rank into striped arrangement. Specialized for warp-timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 646 of file block_exchange.cuh.

◆ ScatterToStriped() [3/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStriped ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD] 
)
inline

Exchanges data items annotated by rank into striped arrangement.

  • \smemreuse
Template Parameters
OffsetT[inferred] Signed integer type for local offsets
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 972 of file block_exchange.cuh.

◆ ScatterToStriped() [4/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStriped ( InputT  items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD] 
)
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 1096 of file block_exchange.cuh.

◆ ScatterToStripedFlagged() [1/2]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT , typename ValidFlag >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStripedFlagged ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD],
ValidFlag  is_valid[ITEMS_PER_THREAD] 
)
inline

Exchanges valid data items annotated by rank into striped arrangement.

  • \smemreuse
Template Parameters
OffsetT[inferred] Signed integer type for local offsets
ValidFlag[inferred] FlagT type denoting which items are valid
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks
[in]is_validCorresponding flag denoting item validity

Definition at line 1029 of file block_exchange.cuh.

◆ ScatterToStripedFlagged() [2/2]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OffsetT , typename ValidFlag >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStripedFlagged ( InputT  items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD],
ValidFlag  is_valid[ITEMS_PER_THREAD] 
)
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks
[in]is_validCorresponding flag denoting item validity

Definition at line 1112 of file block_exchange.cuh.

◆ ScatterToStripedGuarded() [1/2]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStripedGuarded ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD] 
)
inline

Exchanges data items annotated by rank into striped arrangement. Items with rank -1 are not exchanged.

  • \smemreuse
Template Parameters
OffsetT[inferred] Signed integer type for local offsets
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 991 of file block_exchange.cuh.

◆ ScatterToStripedGuarded() [2/2]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OffsetT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ScatterToStripedGuarded ( InputT  items[ITEMS_PER_THREAD],
OffsetT  ranks[ITEMS_PER_THREAD] 
)
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.
[in]ranksCorresponding scatter ranks

Definition at line 1104 of file block_exchange.cuh.

◆ StripedToBlocked() [1/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::StripedToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< false >   
)
inlineprivate

Transposes data items from striped arrangement to blocked arrangement. Specialized for no timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 376 of file block_exchange.cuh.

◆ StripedToBlocked() [2/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::StripedToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< true >   
)
inlineprivate

Transposes data items from striped arrangement to blocked arrangement. Specialized for warp-timeslicing.

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 406 of file block_exchange.cuh.

◆ StripedToBlocked() [3/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::StripedToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD] 
)
inline

Transposes data items from striped arrangement to blocked arrangement.

  • \smemreuse
Snippet
The code snippet below illustrates the conversion from a "striped" to a "blocked" arrangement of 512 integer items partitioned across 128 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockExchange
__shared__ typename BlockExchange::TempStorage temp_storage;
// Load a tile of ordered data into a striped arrangement across block threads
int thread_data[4];
cub::LoadDirectStriped<128>(threadIdx.x, d_data, thread_data);
// Collectively exchange data into a blocked arrangement across threads
BlockExchange(temp_storage).StripedToBlocked(thread_data, thread_data);
Suppose the set of striped input thread_data across the block of threads is { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] } after loading from device-accessible memory. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.

Definition at line 780 of file block_exchange.cuh.

◆ StripedToBlocked() [4/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::StripedToBlocked ( InputT  items[ITEMS_PER_THREAD])
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.

Definition at line 1063 of file block_exchange.cuh.

◆ WarpStripedToBlocked() [1/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::WarpStripedToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< false >   
)
inlineprivate

Transposes data items from warp-striped arrangement to blocked arrangement. Specialized for no timeslicing

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 467 of file block_exchange.cuh.

◆ WarpStripedToBlocked() [2/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::WarpStripedToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD],
Int2Type< true >   
)
inlineprivate

Transposes data items from warp-striped arrangement to blocked arrangement. Specialized for warp-timeslicing

Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems to exchange, converting between blocked and striped arrangements.

Definition at line 496 of file block_exchange.cuh.

◆ WarpStripedToBlocked() [3/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::WarpStripedToBlocked ( InputT  input_items[ITEMS_PER_THREAD],
OutputT  output_items[ITEMS_PER_THREAD] 
)
inline

Transposes data items from warp-striped arrangement to blocked arrangement.

  • \smemreuse
Snippet
The code snippet below illustrates the conversion from a "warp-striped" to a "blocked" arrangement of 512 integer items partitioned across 128 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockExchange
__shared__ typename BlockExchange::TempStorage temp_storage;
// Load a tile of ordered data into a warp-striped arrangement across warp threads
int thread_data[4];
cub::LoadSWarptriped<LOAD_DEFAULT>(threadIdx.x, d_data, thread_data);
// Collectively exchange data into a blocked arrangement across threads
BlockExchange(temp_storage).WarpStripedToBlocked(thread_data);
Suppose the set of warp-striped input thread_data across the block of threads is { [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] } after loading from device-accessible memory. (The first 128 items are striped across the first warp of 32 threads, the second 128 items are striped across the second warp, etc.) The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
Parameters
[in]input_itemsItems to exchange, converting between striped and blocked arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements.

Definition at line 877 of file block_exchange.cuh.

◆ WarpStripedToBlocked() [4/4]

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockExchange< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, WARP_TIME_SLICING, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::WarpStripedToBlocked ( InputT  items[ITEMS_PER_THREAD])
inline
Parameters
[in,out]itemsItems to exchange, converting between striped and blocked arrangements.

Definition at line 1075 of file block_exchange.cuh.


The documentation for this class was generated from the following file: