DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.
[Performance chart omitted: int32 keys.]
[Performance chart omitted: fp32 values. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].]
Definition at line 84 of file device_reduce.cuh.
Static Public Member Functions | |
template<typename InputIteratorT , typename OutputIteratorT , typename ReductionOpT , typename T > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide reduction using the specified binary reduction_op functor and initial value init. | |
template<typename InputIteratorT , typename OutputIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide sum using the addition (+) operator. | |
template<typename InputIteratorT , typename OutputIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide minimum using the less-than ('<') operator. | |
template<typename InputIteratorT , typename OutputIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item. | |
template<typename InputIteratorT , typename OutputIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide maximum using the greater-than ('>') operator. | |
template<typename InputIteratorT , typename OutputIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item. | |
template<typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ReductionOpT > | |
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t | ReduceByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
Reduces segments of values, where segments are demarcated by corresponding runs of identical keys. | |
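Every member function above takes d_temp_storage and temp_storage_bytes first, which implies a two-call pattern: query the scratch size with a NULL pointer, allocate, then call again to do the work. The sketch below imitates that pattern on the host only. MockSum and TwoPhaseSum are hypothetical names invented for this illustration, they are not part of CUB, and the int return code merely stands in for cudaError_t.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical host-side stand-in that mimics the DeviceReduce calling
// convention. It is NOT a CUB function and performs no GPU work.
int MockSum(void* d_temp_storage, std::size_t& temp_storage_bytes,
            const int* d_in, int* d_out, int num_items) {
    if (d_temp_storage == nullptr) {
        // Size query: report the scratch size and return without doing work.
        temp_storage_bytes = sizeof(int);  // arbitrary size chosen for this sketch
        return 0;
    }
    if (temp_storage_bytes < sizeof(int)) return 1;  // scratch buffer too small
    int* scratch = static_cast<int*>(d_temp_storage);
    *scratch = 0;
    for (int i = 0; i < num_items; ++i) *scratch += d_in[i];
    *d_out = *scratch;
    return 0;
}

// The two-call pattern: query, allocate, run.
int TwoPhaseSum(const std::vector<int>& in) {
    void* temp = nullptr;
    std::size_t temp_bytes = 0;
    int out = 0;
    const int n = static_cast<int>(in.size());
    MockSum(temp, temp_bytes, in.data(), &out, n);  // call 1: size query only
    temp = std::malloc(temp_bytes);                 // allocate the scratch space
    if (temp == nullptr) std::abort();
    MockSum(temp, temp_bytes, in.data(), &out, n);  // call 2: performs the sum
    std::free(temp);
    return out;
}
```

With the real API the allocation would be device-accessible memory rather than std::malloc, and each call's return code would normally be checked.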
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) [inline, static]
Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.
- The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
  - The maximum is written to d_out.value and its offset in the input array is written to d_out.key.
  - The {1, std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
- Does not support > operators that are non-commutative.
Template Parameters
InputIteratorT | [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type) |
Parameters
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_items | Total number of input items (i.e., length of d_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 550 of file device_reduce.cuh.
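The documented ArgMax result (first maximum, offset in key, and a fixed tuple for empty input) can be modelled with a sequential host loop. This is a minimal sketch of the semantics only, not CUB's parallel implementation. KeyValue is a hypothetical stand-in for cub::KeyValuePair<int, T>, and ArgMaxReference is a name invented here.

```cpp
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, T>:
// key holds the offset, value holds the item.
template <typename T>
struct KeyValue {
    int key;
    T value;
};

// Sequential reference model of the documented ArgMax result.
template <typename T>
KeyValue<T> ArgMaxReference(const std::vector<T>& in) {
    // Documented result for zero-length inputs.
    KeyValue<T> best{1, std::numeric_limits<T>::lowest()};
    for (int i = 0; i < static_cast<int>(in.size()); ++i) {
        // Strict '>' keeps the earliest offset when the maximum repeats.
        if (i == 0 || in[i] > best.value) {
            best = KeyValue<T>{i, in[i]};
        }
    }
    return best;
}
```

For the input {4, 9, 2, 9} this model reports offset 1 rather than 3, matching the "first maximum" wording above.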
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) [inline, static]
Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.
- The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
  - The minimum is written to d_out.value and its offset in the input array is written to d_out.key.
  - The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
- Does not support < operators that are non-commutative.
Template Parameters
InputIteratorT | [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type) |
Parameters
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_items | Total number of input items (i.e., length of d_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 383 of file device_reduce.cuh.
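On the host, the same "first minimum plus offset" result is what std::min_element yields, which gives a compact way to cross-check ArgMin output in a test. The sketch below assumes that approach. ArgMinReference is a hypothetical helper name, and std::pair<int, T> is used here in place of cub::KeyValuePair<int, T>.

```cpp
#include <algorithm>
#include <iterator>
#include <limits>
#include <utility>
#include <vector>

// Returns {offset, value}, mirroring the key/value layout of the documented output.
template <typename T>
std::pair<int, T> ArgMinReference(const std::vector<T>& in) {
    if (in.empty()) {
        // Documented result for zero-length inputs.
        return {1, std::numeric_limits<T>::max()};
    }
    // std::min_element returns an iterator to the first smallest element.
    auto it = std::min_element(in.begin(), in.end());
    return {static_cast<int>(std::distance(in.begin(), it)), *it};
}
```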
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) [inline, static]
Computes a device-wide maximum using the greater-than ('>') operator.
- Uses std::numeric_limits<T>::lowest() as the initial value of the reduction.
- Does not support > operators that are non-commutative.
Template Parameters
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
Parameters
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_items | Total number of input items (i.e., length of d_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 473 of file device_reduce.cuh.
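A sequential fold seeded with std::numeric_limits<T>::lowest() reproduces the documented Max semantics, including the result for an empty input. This is a host-only sketch under that assumption, with MaxReference as a hypothetical name. It also shows why lowest() is the right seed for floating-point types, where std::numeric_limits<T>::min() is the smallest positive normal value rather than the most negative one.

```cpp
#include <limits>
#include <numeric>
#include <vector>

// Sequential reference model: fold with '>' starting from lowest().
template <typename T>
T MaxReference(const std::vector<T>& in) {
    return std::accumulate(in.begin(), in.end(), std::numeric_limits<T>::lowest(),
                           [](const T& a, const T& b) { return (b > a) ? b : a; });
}
```

An all-negative double input such as {-3.5, -1.25} still yields -1.25 with this seed, whereas seeding with min() would return a positive value that is not in the input.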
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) [inline, static]
Computes a device-wide minimum using the less-than ('<') operator.
- Uses std::numeric_limits<T>::max() as the initial value of the reduction.
- Does not support < operators that are non-commutative.
Template Parameters
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
Parameters
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_items | Total number of input items (i.e., length of d_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 306 of file device_reduce.cuh.
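One way to see why std::numeric_limits<T>::max() suits as the initial value: it is the identity for minimum, so a partial reduction over an empty range does not disturb the combined result. The host-only sketch below splits the input into two halves and combines the partial minima. The split into halves is an assumption made for illustration and says nothing about how CUB actually partitions work. MinOfRange and MinByHalves are hypothetical names.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Minimum of in[begin, end), seeded with the documented initial value.
template <typename T>
T MinOfRange(const std::vector<T>& in, std::size_t begin, std::size_t end) {
    T result = std::numeric_limits<T>::max();
    for (std::size_t i = begin; i < end; ++i) {
        if (in[i] < result) result = in[i];
    }
    return result;
}

// Reduce two halves independently, then combine the partial results.
template <typename T>
T MinByHalves(const std::vector<T>& in) {
    const std::size_t mid = in.size() / 2;
    const T left = MinOfRange(in, 0, mid);           // may cover an empty range
    const T right = MinOfRange(in, mid, in.size());
    return (right < left) ? right : left;
}
```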
template<typename InputIteratorT , typename OutputIteratorT , typename ReductionOpT , typename T >
static CUB_RUNTIME_FUNCTION cudaError_t Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false) [inline, static]
Computes a device-wide reduction using the specified binary reduction_op functor and initial value init.
Template Parameters
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
ReductionOpT | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) |
T | [inferred] Data element type that is convertible to the value type of InputIteratorT |
Parameters
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_items | Total number of input items (i.e., length of d_in) |
[in] | reduction_op | Binary reduction functor |
[in] | init | Initial value of the reduction |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 148 of file device_reduce.cuh.
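The ReductionOpT requirement, a functor with member T operator()(const T &a, const T &b), together with the role of init can be sketched as a sequential host fold. This models the semantics only and is not the device implementation. CustomMin and ReduceReference are illustrative names introduced here.

```cpp
#include <vector>

// Functor with the documented shape: T operator()(const T &a, const T &b).
struct CustomMin {
    template <typename T>
    T operator()(const T& a, const T& b) const {
        return (b < a) ? b : a;
    }
};

// Sequential reference model: start from init and fold every item in.
template <typename T, typename ReductionOpT>
T ReduceReference(const std::vector<T>& in, ReductionOpT reduction_op, T init) {
    T aggregate = init;
    for (const T& item : in) {
        aggregate = reduction_op(aggregate, item);
    }
    return aggregate;
}
```

With an empty input the model returns init unchanged, which is why init should be an identity for reduction_op when the input may be empty.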
template<typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ReductionOpT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t ReduceByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) [inline, static]
Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.
This operation computes segmented reductions within d_values_in using the specified binary reduction_op functor. The segments are identified by "runs" of corresponding keys in d_keys_in, where runs are maximal ranges of consecutive, identical keys. For the ith run encountered, the first key of the run and the corresponding value aggregate of that run are written to d_unique_out[i] and d_aggregates_out[i], respectively. The total number of runs encountered is written to d_num_runs_out.
- The == equality operator is used to determine whether keys are equivalent.
[Performance charts omitted: fp32 and fp64 values, respectively. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].]
Template Parameters
KeysInputIteratorT | [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type) |
UniqueOutputIteratorT | [inferred] Random-access output iterator type for writing unique output keys (may be a simple pointer type) |
ValuesInputIteratorT | [inferred] Random-access input iterator type for reading input values (may be a simple pointer type) |
AggregatesOutputIteratorT | [inferred] Random-access output iterator type for writing output value aggregates (may be a simple pointer type) |
NumRunsOutputIteratorT | [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type) |
ReductionOpT | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) |
Parameters
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_keys_in | Pointer to the input sequence of keys |
[out] | d_unique_out | Pointer to the output sequence of unique keys (one key per run) |
[in] | d_values_in | Pointer to the input sequence of corresponding values |
[out] | d_aggregates_out | Pointer to the output sequence of value aggregates (one aggregate per run) |
[out] | d_num_runs_out | Pointer to total number of runs encountered (i.e., the length of d_unique_out) |
[in] | reduction_op | Binary reduction functor |
[in] | num_items | Total number of associated key+value pairs (i.e., the length of d_keys_in and d_values_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 687 of file device_reduce.cuh.
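The run semantics described above (maximal ranges of consecutive, identical keys, compared with ==) can be modelled by a single sequential pass on the host. This is a sketch of the documented outputs for int keys and values, not the parallel algorithm CUB uses. RunsResult and ReduceByKeyReference are hypothetical names, and the three fields mirror d_unique_out, d_aggregates_out and d_num_runs_out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the three documented outputs.
struct RunsResult {
    std::vector<int> unique_out;      // first key of each run
    std::vector<int> aggregates_out;  // reduced value of each run
    int num_runs_out = 0;             // number of runs encountered
};

// Sequential reference model. Assumes keys_in and values_in have equal length.
template <typename ReductionOpT>
RunsResult ReduceByKeyReference(const std::vector<int>& keys_in,
                                const std::vector<int>& values_in,
                                ReductionOpT reduction_op) {
    RunsResult r;
    for (std::size_t i = 0; i < keys_in.size(); ++i) {
        if (i == 0 || !(keys_in[i] == keys_in[i - 1])) {
            // Key differs from its predecessor: a new run starts here.
            r.unique_out.push_back(keys_in[i]);
            r.aggregates_out.push_back(values_in[i]);
            ++r.num_runs_out;
        } else {
            // Same run: fold the value into the current aggregate.
            r.aggregates_out.back() = reduction_op(r.aggregates_out.back(), values_in[i]);
        }
    }
    return r;
}
```

Because runs are consecutive, keys {1, 1, 2, 1} form three runs, not two: the trailing 1 starts a new run even though the same key appeared earlier.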
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) [inline, static]
Computes a device-wide sum using the addition (+) operator.
- Uses 0 as the initial value of the reduction.
- Does not support + operators that are non-commutative.
[Performance charts omitted: int32 and int64 items, respectively.]
Template Parameters
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
Parameters
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_items | Total number of input items (i.e., length of d_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 229 of file device_reduce.cuh.
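As a rough illustration of why the + operator must be commutative, the host-only sketch below sums fixed-size tiles independently and then combines the partial sums in reverse tile order. The result still matches a left-to-right sum because integer addition does not depend on operand order. The tiling scheme, the tile_size parameter, the long long accumulator and the name TiledSum are all assumptions of this sketch and do not describe CUB's actual scheduling.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sum tiles independently, then combine partial sums in reverse tile order.
// Precondition: tile_size > 0.
long long TiledSum(const std::vector<int>& in, std::size_t tile_size) {
    std::vector<long long> partials;
    for (std::size_t start = 0; start < in.size(); start += tile_size) {
        long long partial = 0;  // 0 is the documented initial value
        const std::size_t stop = std::min(start + tile_size, in.size());
        for (std::size_t i = start; i < stop; ++i) partial += in[i];
        partials.push_back(partial);
    }
    long long total = 0;
    // Deliberately walk the tiles back to front.
    for (std::size_t t = partials.size(); t-- > 0;) total += partials[t];
    return total;
}
```

An operator whose result changes when its operands are swapped would give different totals depending on the combination order, which is the situation the restriction above rules out.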