DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. More...

Detailed Description

DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.

Overview: A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.

Usage Considerations: \cdp_class{DeviceReduce}

Performance: \linear_performance{reduction, reduce-by-key, and run-length encode}

: The following chart illustrates DeviceReduce::Sum performance across different CUDA architectures for int32 keys.

: The following chart illustrates DeviceReduce::ReduceByKey (summation) performance across different CUDA architectures for fp32 values. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].

: \plots_below

Definition at line 84 of file device_reduce.cuh.

Static Public Member Functions
template<typename InputIteratorT , typename OutputIteratorT , typename ReductionOpT , typename T >
static CUB_RUNTIME_FUNCTION cudaError_t	Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide reduction using the specified binary `reduction_op` functor and initial value `init`.

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t	Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide sum using the addition (`+`) operator.

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t	Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide minimum using the less-than ('<') operator.

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t	ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t	Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a device-wide maximum using the greater-than ('>') operator.

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t	ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.

template<typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ReductionOpT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t	ReduceByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.

Member Function Documentation

◆ ArgMax()

template<typename InputIteratorT , typename OutputIteratorT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ArgMax	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.

The output value type of d_out is cub::KeyValuePair <int, T> (assuming the value type of d_in is T)
- The maximum is written to d_out.value and its offset in the input array is written to d_out.key.
- The {1, std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs
Does not support > operators that are non-commutative.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates the argmax-reduction of a device vector of int data elements.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

KeyValuePair<int, int> *d_out; // e.g., [{-,-}]

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceReduce::ArgMax(d_temp_storage, temp_storage_bytes, d_in, d_argmax, num_items);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run argmax-reduction

cub::DeviceReduce::ArgMax(d_temp_storage, temp_storage_bytes, d_in, d_argmax, num_items);

// d_out <-- [{6, 9}]

cub::num_items
KeyT const ValueT ValueT OffsetT OffsetT num_items
[in] Total number of input data items
Definition dispatch_radix_sort.cuh:168

cub::d_out
OutputIteratorT d_out
< [in] Pointer to the input sequence of data items
Definition dispatch_reduce.cuh:71

cub::DeviceReduce::ArgMax
static CUB_RUNTIME_FUNCTION cudaError_t ArgMax(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index o...
Definition device_reduce.cuh:550

cub::KeyValuePair
A key identifier paired with a corresponding value.
Definition util_type.cuh:667

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items (of some type `T`) \iterator
OutputIteratorT	[inferred] Output iterator type for recording the reduced aggregate (having value type `cub::KeyValuePair<int, T>`) \iterator

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output aggregate
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is `false`.

Definition at line 550 of file device_reduce.cuh.

◆ ArgMin()

template<typename InputIteratorT , typename OutputIteratorT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ArgMin	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.

The output value type of d_out is cub::KeyValuePair <int, T> (assuming the value type of d_in is T)
- The minimum is written to d_out.value and its offset in the input array is written to d_out.key.
- The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs
Does not support < operators that are non-commutative.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates the argmin-reduction of a device vector of int data elements.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

KeyValuePair<int, int> *d_out; // e.g., [{-,-}]

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_argmin, num_items);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run argmin-reduction

cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_argmin, num_items);

// d_out <-- [{5, 0}]

cub::DeviceReduce::ArgMin
static CUB_RUNTIME_FUNCTION cudaError_t ArgMin(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of t...
Definition device_reduce.cuh:383

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items (of some type `T`) \iterator
OutputIteratorT	[inferred] Output iterator type for recording the reduced aggregate (having value type `cub::KeyValuePair<int, T>`) \iterator

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output aggregate
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is `false`.

Definition at line 383 of file device_reduce.cuh.

◆ Max()

template<typename InputIteratorT , typename OutputIteratorT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Max	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide maximum using the greater-than ('>') operator.

Uses std::numeric_limits<T>::lowest() as the initial value of the reduction.
Does not support > operators that are non-commutative.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates the max-reduction of a device vector of int data elements.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [-]

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_max, num_items);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run max-reduction

cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_max, num_items);

// d_out <-- [9]

cub::DeviceReduce::Max
static CUB_RUNTIME_FUNCTION cudaError_t Max(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide maximum using the greater-than ('>') operator.
Definition device_reduce.cuh:473

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items \iterator
OutputIteratorT	[inferred] Output iterator type for recording the reduced aggregate \iterator

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output aggregate
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is `false`.

Definition at line 473 of file device_reduce.cuh.

◆ Min()

template<typename InputIteratorT , typename OutputIteratorT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Min	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide minimum using the less-than ('<') operator.

Uses std::numeric_limits<T>::max() as the initial value of the reduction.
Does not support < operators that are non-commutative.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates the min-reduction of a device vector of int data elements.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [-]

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceReduce::Min(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run min-reduction

cub::DeviceReduce::Min(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [0]

cub::DeviceReduce::Min
static CUB_RUNTIME_FUNCTION cudaError_t Min(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide minimum using the less-than ('<') operator.
Definition device_reduce.cuh:306

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items \iterator
OutputIteratorT	[inferred] Output iterator type for recording the reduced aggregate \iterator

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output aggregate
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is `false`.

Definition at line 306 of file device_reduce.cuh.

◆ Reduce()

template<typename InputIteratorT , typename OutputIteratorT , typename ReductionOpT , typename T >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Reduce	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		ReductionOpT	reduction_op,
		T	init,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide reduction using the specified binary reduction_op functor and initial value init.

Does not support binary reduction operators that are non-commutative.
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Snippet: The code snippet below illustrates a user-defined min-reduction of a device vector of int data elements.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// CustomMin functor

struct CustomMin

{

template <typename T>

__device__ __forceinline__

T operator()(const T &a, const T &b) const {

return (b < a) ? b : a;

}

};

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [-]

CustomMin min_op;

int init; // e.g., INT_MAX

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run reduction

cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init);

// d_out <-- [0]

cub::init
OutputIteratorT OffsetT ReductionOpT OuputT init
< [in] The initial value of the reduction
Definition dispatch_reduce.cuh:119

cub::DeviceReduce::Reduce
static CUB_RUNTIME_FUNCTION cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide reduction using the specified binary reduction_op functor and initial value in...
Definition device_reduce.cuh:148

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items \iterator
OutputIteratorT	[inferred] Output iterator type for recording the reduced aggregate \iterator
ReductionOpT	[inferred] Binary reduction functor type having member `T operator()(const T &a, const T &b)`
T	[inferred] Data element type that is convertible to the `value` type of `InputIteratorT`

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output aggregate
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	reduction_op	Binary reduction functor
[in]	init	Initial value of the reduction
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is `false`.

Definition at line 148 of file device_reduce.cuh.

◆ ReduceByKey()

template<typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ReductionOpT >

CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceReduce::ReduceByKey	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		KeysInputIteratorT	d_keys_in,
		UniqueOutputIteratorT	d_unique_out,
		ValuesInputIteratorT	d_values_in,
		AggregatesOutputIteratorT	d_aggregates_out,
		NumRunsOutputIteratorT	d_num_runs_out,
		ReductionOpT	reduction_op,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.

: This operation computes segmented reductions within d_values_in using the specified binary reduction_op functor. The segments are identified by "runs" of corresponding keys in d_keys_in, where runs are maximal ranges of consecutive, identical keys. For the i^th run encountered, the first key of the run and the corresponding value aggregate of that run are written to d_unique_out[i] and d_aggregates_out[i], respectively. The total number of runs encountered is written to d_num_runs_out.

The == equality operator is used to determine whether keys are equivalent
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Performance: The following chart illustrates reduction-by-key (sum) performance across different CUDA architectures for fp32 and fp64 values, respectively. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].

: The following charts are similar, but with segment lengths uniformly sampled from [1,10]:

Snippet: The code snippet below illustrates the segmented reduction of int values grouped by runs of associated int keys.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>

// CustomMin functor

struct CustomMin

{

template <typename T>

CUB_RUNTIME_FUNCTION __forceinline__

T operator()(const T &a, const T &b) const {

return (b < a) ? b : a;

}

};

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 8

int *d_keys_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]

int *d_values_in; // e.g., [0, 7, 1, 6, 2, 5, 3, 4]

int *d_unique_out; // e.g., [-, -, -, -, -, -, -, -]

int *d_aggregates_out; // e.g., [-, -, -, -, -, -, -, -]

int *d_num_runs_out; // e.g., [-]

CustomMin reduction_op;

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run reduce-by-key

cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);

// d_unique_out <-- [0, 2, 9, 5, 8]

// d_aggregates_out <-- [0, 1, 6, 2, 4]

// d_num_runs_out <-- [5]

cub::d_unique_out
UniqueOutputIteratorT d_unique_out
< Pointer to the input sequence of keys
Definition dispatch_reduce_by_key.cuh:74

cub::d_num_runs_out
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT d_num_runs_out
Pointer to total number of runs encountered (i.e., the length of d_unique_out)
Definition dispatch_reduce_by_key.cuh:77

cub::reduction_op
OutputIteratorT OffsetT GridEvenShare< OffsetT > ReductionOpT reduction_op
< [in] Binary reduction functor
Definition dispatch_reduce.cuh:75

cub::d_values_in
KeyT const ValueT * d_values_in
[in] Input values buffer
Definition dispatch_radix_sort.cuh:165

cub::d_aggregates_out
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT d_aggregates_out
Pointer to the output sequence of value aggregates (one aggregate per run)
Definition dispatch_reduce_by_key.cuh:76

cub::DeviceReduce::ReduceByKey
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t ReduceByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.
Definition device_reduce.cuh:687

Template Parameters

KeysInputIteratorT	[inferred] Random-access input iterator type for reading input keys \iterator
UniqueOutputIteratorT	[inferred] Random-access output iterator type for writing unique output keys \iterator
ValuesInputIteratorT	[inferred] Random-access input iterator type for reading input values \iterator
AggregatesOutputIterator	[inferred] Random-access output iterator type for writing output value aggregates \iterator
NumRunsOutputIteratorT	[inferred] Output iterator type for recording the number of runs encountered \iterator
ReductionOpT	[inferred] Binary reduction functor type having member `T operator()(const T &a, const T &b)`

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_keys_in	Pointer to the input sequence of keys
[out]	d_unique_out	Pointer to the output sequence of unique keys (one key per run)
[in]	d_values_in	Pointer to the input sequence of corresponding values
[out]	d_aggregates_out	Pointer to the output sequence of value aggregates (one aggregate per run)
[out]	d_num_runs_out	Pointer to total number of runs encountered (i.e., the length of d_unique_out)
[in]	reduction_op	Binary reduction functor
[in]	num_items	Total number of associated key+value pairs (i.e., the length of `d_in_keys` and `d_in_values`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 687 of file device_reduce.cuh.

◆ Sum()

template<typename InputIteratorT , typename OutputIteratorT >

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Sum	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OutputIteratorT	d_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Computes a device-wide sum using the addition (+) operator.

Uses 0 as the initial value of the reduction.
Does not support + operators that are non-commutative..
Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
\devicestorage

Performance: The following charts illustrate saturated sum-reduction performance across different CUDA architectures for int32 and int64 items, respectively.

Snippet: The code snippet below illustrates the sum-reduction of a device vector of int data elements.

: #include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output

int num_items; // e.g., 7

int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]

int *d_out; // e.g., [-]

...

// Determine temporary device storage requirements

void *d_temp_storage = NULL;

size_t temp_storage_bytes = 0;

cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sum-reduction

cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [38]

cub::DeviceReduce::Sum
static CUB_RUNTIME_FUNCTION cudaError_t Sum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide sum using the addition (+) operator.
Definition device_reduce.cuh:229

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items \iterator
OutputIteratorT	[inferred] Output iterator type for recording the reduced aggregate \iterator

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output aggregate
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is `false`.

Definition at line 229 of file device_reduce.cuh.

The documentation for this struct was generated from the following file:

openfpm_data/src/util/cuda/cub_old/device/device_reduce.cuh

Detailed Description

Static Public Member Functions

Member Function Documentation

◆ ArgMax()

◆ ArgMin()

◆ Max()

◆ Min()

◆ Reduce()

◆ ReduceByKey()

◆ Sum()