OpenFPM_pdata  4.1.0
Project that contain the implementation of distributed structures
 
Loading...
Searching...
No Matches
cub::DeviceSegmentedRadixSort Struct Reference

DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory. More...

Detailed Description

DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.

Overview
The radix sorting method arranges items into ascending (or descending) order. The algorithm relies upon a positional representation for keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.
DeviceSegmentedRadixSort can sort all of the built-in C++ numeric primitive types (unsigned char, int, double, etc.) as well as CUDA's __half half-precision floating-point type. Although the direct radix sorting method can only be applied to unsigned integral types, DeviceSegmentedRadixSort is able to sort signed and floating-point types via simple bit-wise transformations that ensure lexicographic key ordering.
Usage Considerations
\cdp_class{DeviceSegmentedRadixSort}

Definition at line 76 of file device_segmented_radix_sort.cuh.

Static Public Member Functions

Key-value pairs
template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortPairs (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of key-value pairs into ascending order. (~2N auxiliary storage required)
 
template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortPairs (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, DoubleBuffer< ValueT > &d_values, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of key-value pairs into ascending order. (~N auxiliary storage required)
 
template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortPairsDescending (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of key-value pairs into descending order. (~2N auxiliary storage required).
 
template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortPairsDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, DoubleBuffer< ValueT > &d_values, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of key-value pairs into descending order. (~N auxiliary storage required).
 
Keys-only
template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortKeys (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of keys into ascending order. (~2N auxiliary storage required)
 
template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortKeys (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of keys into ascending order. (~N auxiliary storage required).
 
template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortKeysDescending (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of keys into descending order. (~2N auxiliary storage required).
 
template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t SortKeysDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts segments of keys into descending order. (~N auxiliary storage required).
 

Member Function Documentation

◆ SortKeys() [1/2]

template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortKeys ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of keys into ascending order. (~2N auxiliary storage required)

  • The contents of the input data are not altered by the sorting operation
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • \devicestorageNP For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [-, -, -, -, -, -, -]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys_out <-- [6, 7, 8, 0, 3, 5, 9]
KeyT const ValueT ValueT OffsetT OffsetT num_items
[in] Total number of input data items
KeyT * d_keys_out
< [in] Input keys buffer
static CUB_RUNTIME_FUNCTION cudaError_t SortKeys(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
Sorts segments of keys into ascending order. (~2N auxiliary storage required)
Template Parameters
KeyT[inferred] Key type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inDevice-accessible pointer to the input data of key data to sort
[out]d_keys_outDevice-accessible pointer to the sorted output sequence of key data
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 546 of file device_segmented_radix_sort.cuh.

◆ SortKeys() [2/2]

template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortKeys ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of keys into ascending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers may be altered by the sorting operation.
  • Upon completion, the sorting operation will update the "current" indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • \devicestorageP
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys.Current() <-- [6, 7, 8, 0, 3, 5, 9]
Template Parameters
KeyT[inferred] Key type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 645 of file device_segmented_radix_sort.cuh.

◆ SortKeysDescending() [1/2]

template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortKeysDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of keys into descending order. (~2N auxiliary storage required).

  • The contents of the input data are not altered by the sorting operation
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • \devicestorageNP For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [-, -, -, -, -, -, -]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys_out <-- [8, 7, 6, 9, 5, 3, 0]
static CUB_RUNTIME_FUNCTION cudaError_t SortKeysDescending(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
Sorts segments of keys into descending order. (~2N auxiliary storage required).
Template Parameters
KeyT[inferred] Key type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inDevice-accessible pointer to the input data of key data to sort
[out]d_keys_outDevice-accessible pointer to the sorted output sequence of key data
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 734 of file device_segmented_radix_sort.cuh.

◆ SortKeysDescending() [2/2]

template<typename KeyT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortKeysDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of keys into descending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers may be altered by the sorting operation.
  • Upon completion, the sorting operation will update the "current" indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • \devicestorageP
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys.Current() <-- [8, 7, 6, 9, 5, 3, 0]
Template Parameters
KeyT[inferred] Key type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 832 of file device_segmented_radix_sort.cuh.

◆ SortPairs() [1/2]

template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortPairs ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
const ValueT *  d_values_in,
ValueT *  d_values_out,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of key-value pairs into ascending order. (~2N auxiliary storage required)

  • The contents of the input data are not altered by the sorting operation
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • \devicestorageNP For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [-, -, -, -, -, -, -]
int *d_values_in; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_values_out; // e.g., [-, -, -, -, -, -, -]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys_out <-- [6, 7, 8, 0, 3, 5, 9]
// d_values_out <-- [1, 2, 0, 5, 4, 3, 6]
KeyT const ValueT ValueT * d_values_out
[in] Output values buffer
KeyT const ValueT * d_values_in
[in] Input values buffer
static CUB_RUNTIME_FUNCTION cudaError_t SortPairs(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
Sorts segments of key-value pairs into ascending order. (~2N auxiliary storage required)
Template Parameters
KeyT[inferred] Key type
ValueT[inferred] Value type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inDevice-accessible pointer to the input data of key data to sort
[out]d_keys_outDevice-accessible pointer to the sorted output sequence of key data
[in]d_values_inDevice-accessible pointer to the corresponding input sequence of associated value items
[out]d_values_outDevice-accessible pointer to the correspondingly-reordered output sequence of associated value items
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 143 of file device_segmented_radix_sort.cuh.

◆ SortPairs() [2/2]

template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortPairs ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
DoubleBuffer< ValueT > &  d_values,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of key-value pairs into ascending order. (~N auxiliary storage required)

  • The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers within each pair may be altered by the sorting operation.
  • Upon completion, the sorting operation will update the "current" indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • \devicestorageP
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -]
int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_value_alt_buf; // e.g., [-, -, -, -, -, -, -]
...
// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys.Current() <-- [6, 7, 8, 0, 3, 5, 9]
// d_values.Current() <-- [5, 4, 3, 1, 2, 0, 6]
Double-buffer storage wrapper for multi-pass stream transformations that require more than one storag...
Template Parameters
KeyT[inferred] Key type
ValueT[inferred] Value type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in,out]d_valuesDouble-buffer of values whose "current" device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 252 of file device_segmented_radix_sort.cuh.

◆ SortPairsDescending() [1/2]

template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortPairsDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
const ValueT *  d_values_in,
ValueT *  d_values_out,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of key-value pairs into descending order. (~2N auxiliary storage required).

  • The contents of the input data are not altered by the sorting operation
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • \devicestorageNP For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [-, -, -, -, -, -, -]
int *d_values_in; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_values_out; // e.g., [-, -, -, -, -, -, -]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys_out <-- [8, 7, 6, 9, 5, 3, 0]
// d_values_out <-- [0, 2, 1, 6, 3, 4, 5]
static CUB_RUNTIME_FUNCTION cudaError_t SortPairsDescending(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, int begin_bit=0, int end_bit=sizeof(KeyT) *8, cudaStream_t stream=0, bool debug_synchronous=false)
Sorts segments of key-value pairs into descending order. (~2N auxiliary storage required).
Template Parameters
KeyT[inferred] Key type
ValueT[inferred] Value type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inDevice-accessible pointer to the input data of key data to sort
[out]d_keys_outDevice-accessible pointer to the sorted output sequence of key data
[in]d_values_inDevice-accessible pointer to the corresponding input sequence of associated value items
[out]d_values_outDevice-accessible pointer to the correspondingly-reordered output sequence of associated value items
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 345 of file device_segmented_radix_sort.cuh.

◆ SortPairsDescending() [2/2]

template<typename KeyT , typename ValueT , typename OffsetIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedRadixSort::SortPairsDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
DoubleBuffer< ValueT > &  d_values,
int  num_items,
int  num_segments,
OffsetIteratorT  d_begin_offsets,
OffsetIteratorT  d_end_offsets,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts segments of key-value pairs into descending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers within each pair may be altered by the sorting operation.
  • Upon completion, the sorting operation will update the "current" indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • \devicestorageP
  • \devicestorage
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmentd_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -]
int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_value_alt_buf; // e.g., [-, -, -, -, -, -, -]
...
// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, d_keys, d_values,
num_items, num_segments, d_offsets, d_offsets + 1);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, d_keys, d_values,
num_items, num_segments, d_offsets, d_offsets + 1);
// d_keys.Current() <-- [8, 7, 6, 9, 5, 3, 0]
// d_values.Current() <-- [0, 2, 1, 6, 3, 4, 5]
Template Parameters
KeyT[inferred] Key type
ValueT[inferred] Value type
OffsetIteratorT[inferred] Random-access input iterator type for reading segment offsets \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in,out]d_valuesDouble-buffer of values whose "current" device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
[in]num_itemsThe total number of items to sort (across all segments)
[in]num_segmentsThe number of segments that comprise the sorting data
[in]d_begin_offsetsPointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
[in]d_end_offsetsPointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 454 of file device_segmented_radix_sort.cuh.


The documentation for this struct was generated from the following file: