OpenFPM_pdata  4.1.0
Project that contain the implementation of distributed structures
 
Loading...
Searching...
No Matches
cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method. More...

Detailed Description

template<typename KeyT, int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method.

Template Parameters
KeyTKeyT type
BLOCK_DIM_XThe thread block length in threads along the X dimension
ITEMS_PER_THREADThe number of items per thread
ValueT[optional] ValueT type (default: cub::NullType, which indicates a keys-only sort)
RADIX_BITS[optional] The number of radix bits per digit place (default: 4 bits)
MEMOIZE_OUTER_SCAN[optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise).
INNER_SCAN_ALGORITHM[optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
SMEM_CONFIG[optional] Shared memory bank mode (default: cudaSharedMemBankSizeFourByte)
BLOCK_DIM_Y[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH[optional] \ptxversion
Overview
  • The radix sorting method arranges items into ascending order. It relies upon a positional representation for keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.
  • BlockRadixSort can sort all of the built-in C++ numeric primitive types (unsigned char, int, double, etc.) as well as CUDA's __half half-precision floating-point type. Within each key, the implementation treats fixed-length bit-sequences of RADIX_BITS as radix digit places. Although the direct radix sorting method can only be applied to unsigned integral types, BlockRadixSort is able to sort signed and floating-point types via simple bit-wise transformations that ensure lexicographic key ordering.
  • \rowmajor
Performance Considerations
  • \granularity
A Simple Example
\blockcollective{BlockRadixSort}
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
...
// Collectively sort the keys
BlockRadixSort(temp_storage).Sort(thread_keys);
...
The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thre...
_TempStorage & temp_storage
Shared storage reference.
__device__ __forceinline__ BlockRadixSort()
Collective constructor using a private static allocation of shared memory as temporary storage.
\smemstorage{BlockRadixSort}
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.

Definition at line 132 of file block_radix_sort.cuh.

Data Structures

union  _TempStorage
 Shared memory storage layout type. More...
 
struct  TempStorage
 \smemstorage{BlockRadixSort} More...
 

Public Member Functions

template<int DESCENDING, int KEYS_ONLY>
__device__ __forceinline__ void SortBlockedToStriped (KeyT(&keys)[ITEMS_PER_THREAD], ValueT(&values)[ITEMS_PER_THREAD], int begin_bit, int end_bit, Int2Type< DESCENDING > is_descending, Int2Type< KEYS_ONLY > is_keys_only)
 Sort blocked -> striped arrangement.
 
Collective constructors
__device__ __forceinline__ BlockRadixSort ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ BlockRadixSort (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage.
 
Sorting (blocked arrangements)
__device__ __forceinline__ void Sort (KeyT(&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs an ascending block-wide radix sort over a blocked arrangement of keys.
 
__device__ __forceinline__ void Sort (KeyT(&keys)[ITEMS_PER_THREAD], ValueT(&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs an ascending block-wide radix sort across a blocked arrangement of keys and values.
 
__device__ __forceinline__ void SortDescending (KeyT(&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs a descending block-wide radix sort over a blocked arrangement of keys.
 
__device__ __forceinline__ void SortDescending (KeyT(&keys)[ITEMS_PER_THREAD], ValueT(&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs a descending block-wide radix sort across a blocked arrangement of keys and values.
 
Sorting (blocked arrangement -> striped arrangement)
__device__ __forceinline__ void SortBlockedToStriped (KeyT(&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs an ascending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.
 
__device__ __forceinline__ void SortBlockedToStriped (KeyT(&keys)[ITEMS_PER_THREAD], ValueT(&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs an ascending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.
 
__device__ __forceinline__ void SortDescendingBlockedToStriped (KeyT(&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs a descending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.
 
__device__ __forceinline__ void SortDescendingBlockedToStriped (KeyT(&keys)[ITEMS_PER_THREAD], ValueT(&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT) *8)
 Performs a descending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.
 

Private Types

enum  { BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z , KEYS_ONLY = Equals<ValueT, NullType>::VALUE }
 
typedef Traits< KeyT > KeyTraits
 
typedef KeyTraits::UnsignedBits UnsignedBits
 
typedef BlockRadixRank< BLOCK_DIM_X, RADIX_BITS, false, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > AscendingBlockRadixRank
 Ascending BlockRadixRank utility type.
 
typedef BlockRadixRank< BLOCK_DIM_X, RADIX_BITS, true, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > DescendingBlockRadixRank
 Descending BlockRadixRank utility type.
 
typedef BlockExchange< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, false, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > BlockExchangeKeys
 BlockExchange utility type for keys.
 
typedef BlockExchange< ValueT, BLOCK_DIM_X, ITEMS_PER_THREAD, false, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > BlockExchangeValues
 BlockExchange utility type for values.
 

Private Member Functions

__device__ __forceinline__ _TempStoragePrivateStorage ()
 Internal storage allocator.
 
__device__ __forceinline__ void RankKeys (UnsignedBits(&unsigned_keys)[ITEMS_PER_THREAD], int(&ranks)[ITEMS_PER_THREAD], int begin_bit, int pass_bits, Int2Type< false >)
 Rank keys (specialized for ascending sort)
 
__device__ __forceinline__ void RankKeys (UnsignedBits(&unsigned_keys)[ITEMS_PER_THREAD], int(&ranks)[ITEMS_PER_THREAD], int begin_bit, int pass_bits, Int2Type< true >)
 Rank keys (specialized for descending sort)
 
__device__ __forceinline__ void ExchangeValues (ValueT(&values)[ITEMS_PER_THREAD], int(&ranks)[ITEMS_PER_THREAD], Int2Type< false >, Int2Type< true >)
 ExchangeValues (specialized for key-value sort, to-blocked arrangement)
 
__device__ __forceinline__ void ExchangeValues (ValueT(&values)[ITEMS_PER_THREAD], int(&ranks)[ITEMS_PER_THREAD], Int2Type< false >, Int2Type< false >)
 ExchangeValues (specialized for key-value sort, to-striped arrangement)
 
template<int IS_BLOCKED>
__device__ __forceinline__ void ExchangeValues (ValueT(&)[ITEMS_PER_THREAD], int(&)[ITEMS_PER_THREAD], Int2Type< true >, Int2Type< IS_BLOCKED >)
 ExchangeValues (specialized for keys-only sort)
 
template<int DESCENDING, int KEYS_ONLY>
__device__ __forceinline__ void SortBlocked (KeyT(&keys)[ITEMS_PER_THREAD], ValueT(&values)[ITEMS_PER_THREAD], int begin_bit, int end_bit, Int2Type< DESCENDING > is_descending, Int2Type< KEYS_ONLY > is_keys_only)
 Sort blocked arrangement.
 

Private Attributes

_TempStoragetemp_storage
 Shared storage reference.
 
unsigned int linear_tid
 Linear thread-id.
 

Member Typedef Documentation

◆ AscendingBlockRadixRank

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
typedef BlockRadixRank< BLOCK_DIM_X, RADIX_BITS, false, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::AscendingBlockRadixRank
private

Ascending BlockRadixRank utility type.

Definition at line 164 of file block_radix_sort.cuh.

◆ BlockExchangeKeys

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
typedef BlockExchange<KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, false, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockExchangeKeys
private

BlockExchange utility type for keys.

Definition at line 180 of file block_radix_sort.cuh.

◆ BlockExchangeValues

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
typedef BlockExchange<ValueT, BLOCK_DIM_X, ITEMS_PER_THREAD, false, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockExchangeValues
private

BlockExchange utility type for values.

Definition at line 183 of file block_radix_sort.cuh.

◆ DescendingBlockRadixRank

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
typedef BlockRadixRank< BLOCK_DIM_X, RADIX_BITS, true, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::DescendingBlockRadixRank
private

Descending BlockRadixRank utility type.

Definition at line 177 of file block_radix_sort.cuh.

◆ KeyTraits

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
typedef Traits<KeyT> cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::KeyTraits
private

Definition at line 150 of file block_radix_sort.cuh.

◆ UnsignedBits

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
typedef KeyTraits::UnsignedBits cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::UnsignedBits
private

Definition at line 151 of file block_radix_sort.cuh.

Member Enumeration Documentation

◆ anonymous enum

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
anonymous enum
private

Definition at line 140 of file block_radix_sort.cuh.

Constructor & Destructor Documentation

◆ BlockRadixSort() [1/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockRadixSort ( )
inline

Collective constructor using a private static allocation of shared memory as temporary storage.

Definition at line 413 of file block_radix_sort.cuh.

◆ BlockRadixSort() [2/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockRadixSort ( TempStorage temp_storage)
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in]temp_storageReference to memory allocation having layout type TempStorage

Definition at line 423 of file block_radix_sort.cuh.

Member Function Documentation

◆ ExchangeValues() [1/3]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int IS_BLOCKED>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExchangeValues ( ValueT(&)  [ITEMS_PER_THREAD],
int(&)  [ITEMS_PER_THREAD],
Int2Type< true >  ,
Int2Type< IS_BLOCKED >   
)
inlineprivate

ExchangeValues (specialized for keys-only sort)

Definition at line 274 of file block_radix_sort.cuh.

◆ ExchangeValues() [2/3]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExchangeValues ( ValueT(&)  values[ITEMS_PER_THREAD],
int(&)  ranks[ITEMS_PER_THREAD],
Int2Type< false >  ,
Int2Type< false >   
)
inlineprivate

ExchangeValues (specialized for key-value sort, to-striped arrangement)

Definition at line 260 of file block_radix_sort.cuh.

◆ ExchangeValues() [3/3]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::ExchangeValues ( ValueT(&)  values[ITEMS_PER_THREAD],
int(&)  ranks[ITEMS_PER_THREAD],
Int2Type< false >  ,
Int2Type< true >   
)
inlineprivate

ExchangeValues (specialized for key-value sort, to-blocked arrangement)

Definition at line 247 of file block_radix_sort.cuh.

◆ PrivateStorage()

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ _TempStorage & cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::PrivateStorage ( )
inlineprivate

Internal storage allocator.

Definition at line 210 of file block_radix_sort.cuh.

◆ RankKeys() [1/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::RankKeys ( UnsignedBits(&)  unsigned_keys[ITEMS_PER_THREAD],
int(&)  ranks[ITEMS_PER_THREAD],
int  begin_bit,
int  pass_bits,
Int2Type< false >   
)
inlineprivate

Rank keys (specialized for ascending sort)

Definition at line 217 of file block_radix_sort.cuh.

◆ RankKeys() [2/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::RankKeys ( UnsignedBits(&)  unsigned_keys[ITEMS_PER_THREAD],
int(&)  ranks[ITEMS_PER_THREAD],
int  begin_bit,
int  pass_bits,
Int2Type< true >   
)
inlineprivate

Rank keys (specialized for descending sort)

Definition at line 232 of file block_radix_sort.cuh.

◆ Sort() [1/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sort ( KeyT(&)  keys[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs an ascending block-wide radix sort over a blocked arrangement of keys.

  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
...
// Collectively sort the keys
BlockRadixSort(temp_storage).Sort(thread_keys);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
Parameters
keys[in-out] Keys to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 474 of file block_radix_sort.cuh.

◆ Sort() [2/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sort ( KeyT(&)  keys[ITEMS_PER_THREAD],
ValueT(&)  values[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs an ascending block-wide radix sort across a blocked arrangement of keys and values.

  • BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys and values that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
int thread_values[4];
...
// Collectively sort the keys and values among block threads
BlockRadixSort(temp_storage).Sort(thread_keys, thread_values);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
Parameters
keys[in-out] Keys to sort
values[in-out] Values to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 529 of file block_radix_sort.cuh.

◆ SortBlocked()

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int DESCENDING, int KEYS_ONLY>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortBlocked ( KeyT(&)  keys[ITEMS_PER_THREAD],
ValueT(&)  values[ITEMS_PER_THREAD],
int  begin_bit,
int  end_bit,
Int2Type< DESCENDING >  is_descending,
Int2Type< KEYS_ONLY >  is_keys_only 
)
inlineprivate

Sort blocked arrangement.

Parameters
keysKeys to sort
valuesValues to sort
begin_bitThe beginning (least-significant) bit index needed for key comparison
end_bitThe past-the-end (most-significant) bit index needed for key comparison
is_descendingTag whether is a descending-order sort
is_keys_onlyTag whether is keys-only sort

Definition at line 283 of file block_radix_sort.cuh.

◆ SortBlockedToStriped() [1/3]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortBlockedToStriped ( KeyT(&)  keys[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs an ascending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.

  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. The final partitioning is striped.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
...
// Collectively sort the keys
BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }.
Parameters
keys[in-out] Keys to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 685 of file block_radix_sort.cuh.

◆ SortBlockedToStriped() [2/3]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int DESCENDING, int KEYS_ONLY>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortBlockedToStriped ( KeyT(&)  keys[ITEMS_PER_THREAD],
ValueT(&)  values[ITEMS_PER_THREAD],
int  begin_bit,
int  end_bit,
Int2Type< DESCENDING >  is_descending,
Int2Type< KEYS_ONLY >  is_keys_only 
)
inline

Sort blocked -> striped arrangement.

Parameters
keysKeys to sort
valuesValues to sort
begin_bitThe beginning (least-significant) bit index needed for key comparison
end_bitThe past-the-end (most-significant) bit index needed for key comparison
is_descendingTag whether is a descending-order sort
is_keys_onlyTag whether is keys-only sort

Definition at line 339 of file block_radix_sort.cuh.

◆ SortBlockedToStriped() [3/3]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortBlockedToStriped ( KeyT(&)  keys[ITEMS_PER_THREAD],
ValueT(&)  values[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs an ascending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.

  • BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys and values that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs. The final partitioning is striped.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
int thread_values[4];
...
// Collectively sort the keys and values among block threads
BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }.
Parameters
keys[in-out] Keys to sort
values[in-out] Values to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 740 of file block_radix_sort.cuh.

◆ SortDescending() [1/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescending ( KeyT(&)  keys[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs a descending block-wide radix sort over a blocked arrangement of keys.

  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
...
// Collectively sort the keys
BlockRadixSort(temp_storage).Sort(thread_keys);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,510,509,508], [11,10,9,8], [7,6,5,4], ..., [3,2,1,0] }.
Parameters
keys[in-out] Keys to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 575 of file block_radix_sort.cuh.

◆ SortDescending() [2/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescending ( KeyT(&)  keys[ITEMS_PER_THREAD],
ValueT(&)  values[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs a descending block-wide radix sort across a blocked arrangement of keys and values.

  • BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys and values that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
int thread_values[4];
...
// Collectively sort the keys and values among block threads
BlockRadixSort(temp_storage).Sort(thread_keys, thread_values);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,510,509,508], [11,10,9,8], [7,6,5,4], ..., [3,2,1,0] }.
Parameters
keys[in-out] Keys to sort
values[in-out] Values to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 630 of file block_radix_sort.cuh.

◆ SortDescendingBlockedToStriped() [1/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescendingBlockedToStriped ( KeyT(&)  keys[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs a descending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.

  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. The final partitioning is striped.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
...
// Collectively sort the keys
BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,383,255,127], [386,258,130,2], [385,257,128,1], ..., [384,256,128,0] }.
Parameters
keys[in-out] Keys to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 788 of file block_radix_sort.cuh.

◆ SortDescendingBlockedToStriped() [2/2]

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescendingBlockedToStriped ( KeyT(&)  keys[ITEMS_PER_THREAD],
ValueT(&)  values[ITEMS_PER_THREAD],
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8 
)
inline

Performs a descending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.

  • BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
  • \granularity
  • \smemreuse
Snippet
The code snippet below illustrates a sort of 512 integer keys and values that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs. The final partitioning is striped.
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
// Allocate shared memory for BlockRadixSort
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
int thread_values[4];
...
// Collectively sort the keys and values among block threads
BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,383,255,127], [386,258,130,2], [385,257,128,1], ..., [384,256,128,0] }.
Parameters
keys[in-out] Keys to sort
values[in-out] Values to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 843 of file block_radix_sort.cuh.

Field Documentation

◆ linear_tid

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
unsigned int cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::linear_tid
private

Linear thread-id.

Definition at line 203 of file block_radix_sort.cuh.

◆ temp_storage

template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
_TempStorage& cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::temp_storage
private

Shared storage reference.

Definition at line 200 of file block_radix_sort.cuh.


The documentation for this class was generated from the following file: