OpenFPM_pdata  4.1.0
Project that contains the implementation of distributed structures
 
cub::DeviceReduce Struct Reference

DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. More...

Detailed Description

DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.

Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.
Usage Considerations
Dynamic parallelism: DeviceReduce methods can be called within kernel code on devices that support CUDA dynamic parallelism.
Performance
The work-complexity of reduction, reduce-by-key, and run-length encode is linear in the input size, so throughput plateaus once the problem is large enough to saturate the GPU.
The following chart illustrates DeviceReduce::Sum performance across different CUDA architectures for int32 keys.
The following chart illustrates DeviceReduce::ReduceByKey (summation) performance across different CUDA architectures for fp32 values. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].
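All of the entry points below follow the same two-pass pattern: a first call with d_temp_storage == NULL only computes the required scratch size, and a second call with the allocated buffer performs the reduction. The sketch below is an illustrative summary of that pattern using DeviceReduce::Sum (not part of the generated reference); it assumes d_in and d_out already reside in device memory and omits error checking.
#include <cub/cub.cuh>
void sum_example(int *d_in, int *d_out, int num_items)
{
    // First call: d_temp_storage is NULL, so only temp_storage_bytes is computed.
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    // Second call: same arguments, now with the scratch buffer allocated.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}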

Definition at line 84 of file device_reduce.cuh.

Static Public Member Functions

template<typename InputIteratorT , typename OutputIteratorT , typename ReductionOpT , typename T >
static CUB_RUNTIME_FUNCTION cudaError_t Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide reduction using the specified binary reduction_op functor and initial value init.
 
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide sum using the addition (+) operator.
 
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide minimum using the less-than ('<') operator.
 
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.
 
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide maximum using the greater-than ('>') operator.
 
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.
 
template<typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ReductionOpT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t ReduceByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.
 

Member Function Documentation

◆ ArgMax()

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ArgMax ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.

  • The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)
    • The maximum is written to d_out.value and its offset in the input array is written to d_out.key.
    • The {1, std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs
  • Does not support > operators that are non-commutative.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
KeyValuePair<int, int> *d_out; // e.g., [{-,-}]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::ArgMax(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run argmax-reduction
cub::DeviceReduce::ArgMax(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// d_out <-- [{6, 9}]
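As a continuation of the snippet above (a hedged sketch that reuses its d_out pointer and assumes the reduction has completed), the aggregate can be copied back and unpacked on the host; the host-side variable name is illustrative only.
cub::KeyValuePair<int, int> h_argmax;
cudaMemcpy(&h_argmax, d_out, sizeof(h_argmax), cudaMemcpyDeviceToHost);
// h_argmax.key <-- 6 (offset of the maximum), h_argmax.value <-- 9 (the maximum itself)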
Template Parameters
InputIteratorT  [inferred] Random-access input iterator type for reading input items (of some type T)
OutputIteratorT  [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>)
Parameters
[in]  d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]  temp_storage_bytes  Reference to size in bytes of d_temp_storage allocation
[in]  d_in  Pointer to the input sequence of data items
[out]  d_out  Pointer to the output aggregate
[in]  num_items  Total number of input items (i.e., length of d_in)
[in]  stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in]  debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 550 of file device_reduce.cuh.

◆ ArgMin()

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ArgMin ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.

  • The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)
    • The minimum is written to d_out.value and its offset in the input array is written to d_out.key.
    • The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs
  • Does not support < operators that are non-commutative.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the argmin-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
KeyValuePair<int, int> *d_out; // e.g., [{-,-}]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run argmin-reduction
cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// d_out <-- [{5, 0}]
Template Parameters
InputIteratorT  [inferred] Random-access input iterator type for reading input items (of some type T)
OutputIteratorT  [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>)
Parameters
[in]  d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]  temp_storage_bytes  Reference to size in bytes of d_temp_storage allocation
[in]  d_in  Pointer to the input sequence of data items
[out]  d_out  Pointer to the output aggregate
[in]  num_items  Total number of input items (i.e., length of d_in)
[in]  stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in]  debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 383 of file device_reduce.cuh.

◆ Max()

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Max ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Computes a device-wide maximum using the greater-than ('>') operator.

  • Uses std::numeric_limits<T>::lowest() as the initial value of the reduction.
  • Does not support > operators that are non-commutative.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the max-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [-]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run max-reduction
cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// d_out <-- [9]
Template Parameters
InputIteratorT  [inferred] Random-access input iterator type for reading input items
OutputIteratorT  [inferred] Output iterator type for recording the reduced aggregate
Parameters
[in]  d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]  temp_storage_bytes  Reference to size in bytes of d_temp_storage allocation
[in]  d_in  Pointer to the input sequence of data items
[out]  d_out  Pointer to the output aggregate
[in]  num_items  Total number of input items (i.e., length of d_in)
[in]  stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in]  debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 473 of file device_reduce.cuh.

◆ Min()

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Min ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Computes a device-wide minimum using the less-than ('<') operator.

  • Uses std::numeric_limits<T>::max() as the initial value of the reduction.
  • Does not support < operators that are non-commutative.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [-]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Min(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run min-reduction
cub::DeviceReduce::Min(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// d_out <-- [0]
Template Parameters
InputIteratorT  [inferred] Random-access input iterator type for reading input items
OutputIteratorT  [inferred] Output iterator type for recording the reduced aggregate
Parameters
[in]  d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]  temp_storage_bytes  Reference to size in bytes of d_temp_storage allocation
[in]  d_in  Pointer to the input sequence of data items
[out]  d_out  Pointer to the output aggregate
[in]  num_items  Total number of input items (i.e., length of d_in)
[in]  stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in]  debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 306 of file device_reduce.cuh.

◆ Reduce()

template<typename InputIteratorT , typename OutputIteratorT , typename ReductionOpT , typename T >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Reduce ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
ReductionOpT  reduction_op,
T  init,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Computes a device-wide reduction using the specified binary reduction_op functor and initial value init.

  • Does not support binary reduction operators that are non-commutative.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates a user-defined min-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// CustomMin functor
struct CustomMin
{
template <typename T>
__device__ __forceinline__
T operator()(const T &a, const T &b) const {
return (b < a) ? b : a;
}
};
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [-]
CustomMin min_op;
int init; // e.g., INT_MAX
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run reduction
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init);
// d_out <-- [0]
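The same call pattern works with any commutative binary functor. The sketch below is an illustration, not part of the generated reference: it reuses the d_in, d_out, and num_items declared in the snippet above and swaps in CUB's built-in cub::Max() functor with INT_MIN (from <climits>) as the initial value.
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, cub::Max(), INT_MIN);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, cub::Max(), INT_MIN);
// d_out <-- [9]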
Template Parameters
InputIteratorT  [inferred] Random-access input iterator type for reading input items
OutputIteratorT  [inferred] Output iterator type for recording the reduced aggregate
ReductionOpT  [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T  [inferred] Data element type that is convertible to the value type of InputIteratorT
Parameters
[in]  d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]  temp_storage_bytes  Reference to size in bytes of d_temp_storage allocation
[in]  d_in  Pointer to the input sequence of data items
[out]  d_out  Pointer to the output aggregate
[in]  num_items  Total number of input items (i.e., length of d_in)
[in]  reduction_op  Binary reduction functor
[in]  init  Initial value of the reduction
[in]  stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in]  debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 148 of file device_reduce.cuh.

◆ ReduceByKey()

template<typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ReductionOpT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceReduce::ReduceByKey ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
KeysInputIteratorT  d_keys_in,
UniqueOutputIteratorT  d_unique_out,
ValuesInputIteratorT  d_values_in,
AggregatesOutputIteratorT  d_aggregates_out,
NumRunsOutputIteratorT  d_num_runs_out,
ReductionOpT  reduction_op,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.

This operation computes segmented reductions within d_values_in using the specified binary reduction_op functor. The segments are identified by "runs" of corresponding keys in d_keys_in, where runs are maximal ranges of consecutive, identical keys. For the ith run encountered, the first key of the run and the corresponding value aggregate of that run are written to d_unique_out[i] and d_aggregates_out[i], respectively. The total number of runs encountered is written to d_num_runs_out.
  • The == equality operator is used to determine whether keys are equivalent
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following chart illustrates reduction-by-key (sum) performance across different CUDA architectures for fp32 and fp64 values, respectively. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].
The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
Snippet
The code snippet below illustrates the segmented reduction of int values grouped by runs of associated int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// CustomMin functor
struct CustomMin
{
template <typename T>
CUB_RUNTIME_FUNCTION __forceinline__
T operator()(const T &a, const T &b) const {
return (b < a) ? b : a;
}
};
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_keys_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_values_in; // e.g., [0, 7, 1, 6, 2, 5, 3, 4]
int *d_unique_out; // e.g., [-, -, -, -, -, -, -, -]
int *d_aggregates_out; // e.g., [-, -, -, -, -, -, -, -]
int *d_num_runs_out; // e.g., [-]
CustomMin reduction_op;
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run reduce-by-key
cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);
// d_unique_out <-- [0, 2, 9, 5, 8]
// d_aggregates_out <-- [0, 1, 6, 2, 4]
// d_num_runs_out <-- [5]
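To consume these results on the host, the run count must be read back first, because the meaningful lengths of d_unique_out and d_aggregates_out depend on it. The sketch below is an illustrative continuation of the snippet above (it reuses the snippet's pointers, needs <vector>, and omits error handling).
int h_num_runs;
cudaMemcpy(&h_num_runs, d_num_runs_out, sizeof(int), cudaMemcpyDeviceToHost); // h_num_runs <-- 5
std::vector<int> h_unique(h_num_runs), h_aggregates(h_num_runs);
cudaMemcpy(h_unique.data(), d_unique_out, h_num_runs * sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(h_aggregates.data(), d_aggregates_out, h_num_runs * sizeof(int), cudaMemcpyDeviceToHost);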
Template Parameters
KeysInputIteratorT  [inferred] Random-access input iterator type for reading input keys
UniqueOutputIteratorT  [inferred] Random-access output iterator type for writing unique output keys
ValuesInputIteratorT  [inferred] Random-access input iterator type for reading input values
AggregatesOutputIteratorT  [inferred] Random-access output iterator type for writing output value aggregates
NumRunsOutputIteratorT  [inferred] Output iterator type for recording the number of runs encountered
ReductionOpT  [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
[in]  d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]  temp_storage_bytes  Reference to size in bytes of d_temp_storage allocation
[in]  d_keys_in  Pointer to the input sequence of keys
[out]  d_unique_out  Pointer to the output sequence of unique keys (one key per run)
[in]  d_values_in  Pointer to the input sequence of corresponding values
[out]  d_aggregates_out  Pointer to the output sequence of value aggregates (one aggregate per run)
[out]  d_num_runs_out  Pointer to total number of runs encountered (i.e., the length of d_unique_out)
[in]  reduction_op  Binary reduction functor
[in]  num_items  Total number of associated key+value pairs (i.e., the length of d_keys_in and d_values_in)
[in]  stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in]  debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 687 of file device_reduce.cuh.

◆ Sum()

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Sum ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Computes a device-wide sum using the addition (+) operator.

  • Uses 0 as the initial value of the reduction.
  • Does not support + operators that are non-commutative.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated sum-reduction performance across different CUDA architectures for int32 and int64 items, respectively.
Snippet
The code snippet below illustrates the sum-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [-]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sum-reduction
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// d_out <-- [38]
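Because InputIteratorT can be any random-access input iterator, a fancy iterator may stand in for the raw device pointer. The sketch below is an illustration, not part of the generated reference: it assumes cub/cub.cuh is included and reuses d_out and num_items from the snippet above, and sums the integers 0..num_items-1 with cub::CountingInputIterator, so no input array is needed.
cub::CountingInputIterator<int> d_counting(0); // yields 0, 1, 2, ...
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_counting, d_out, num_items);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_counting, d_out, num_items);
// e.g., with num_items == 7: d_out <-- [21]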
Template Parameters
InputIteratorT  [inferred] Random-access input iterator type for reading input items
OutputIteratorT  [inferred] Output iterator type for recording the reduced aggregate
Parameters
[in]  d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]  temp_storage_bytes  Reference to size in bytes of d_temp_storage allocation
[in]  d_in  Pointer to the input sequence of data items
[out]  d_out  Pointer to the output aggregate
[in]  num_items  Total number of input items (i.e., length of d_in)
[in]  stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in]  debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 229 of file device_reduce.cuh.


The documentation for this struct was generated from the following file: cub/device/device_reduce.cuh