DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.
Definition at line 65 of file device_segmented_reduce.cuh.
Static Public Member Functions | |
template<typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT , typename ReductionOp , typename T > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, ReductionOp reduction_op, T initial_value, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented reduction using the specified binary reduction_op functor. | |
template<typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented sum using the addition ('+') operator. | |
template<typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented minimum using the less-than ('<') operator. | |
template<typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Finds the first device-wide minimum in each segment using the less-than ('<') operator, also returning the in-segment index of that item. | |
template<typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented maximum using the greater-than ('>') operator. | |
template<typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, OffsetIteratorT d_begin_offsets, OffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Finds the first device-wide maximum in each segment using the greater-than ('>') operator, also returning the in-segment index of that item. | |
ArgMax() [inline, static]
Finds the first device-wide maximum in each segment using the greater-than ('>') operator, also returning the in-segment index of that item.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support > operators that are non-commutative.
Illustrative use: the argmax-reduction of a device vector of int data elements.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (of some type T) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) |
OffsetIteratorT | [inferred] Random-access input iterator type for reading segment offsets |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in |
[in] | d_end_offsets | Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
Definition at line 568 of file device_segmented_reduce.cuh.
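The ArgMax semantics described above (first maximum, in-segment index, and the tuple produced for empty segments) can be checked against a sequential host-side reference. The following is a minimal C++ sketch under stated assumptions, not CUB's implementation: KeyValue is a hypothetical stand-in for cub::KeyValuePair<int, T>, and segmented_argmax_ref is an illustrative name that does not exist in CUB.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, T>.
template <typename T>
struct KeyValue {
    int key;   // in-segment index of the selected item
    T value;   // the selected item
};

// Sequential reference for the documented ArgMax semantics (illustrative only).
template <typename T>
std::vector<KeyValue<T>> segmented_argmax_ref(const T* in, int num_segments,
                                              const int* begin_offsets,
                                              const int* end_offsets) {
    assert(num_segments >= 0);
    std::vector<KeyValue<T>> out;
    for (int s = 0; s < num_segments; ++s) {
        // Zero-length segments keep the documented {1, lowest()} tuple.
        KeyValue<T> best{1, std::numeric_limits<T>::lowest()};
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            // Strict '>' keeps the first maximum; the key is relative to the segment start.
            if (i == begin_offsets[s] || in[i] > best.value) {
                best = KeyValue<T>{i - begin_offsets[s], in[i]};
            }
        }
        out.push_back(best);
    }
    return out;
}
```

For the input [8, 6, 7, 5, 3, 0, 9] with segment_offsets = [0, 3, 3, 7], passing segment_offsets and segment_offsets+1 as the two offset arrays gives {0, 8} for the first segment, the {1, lowest()} tuple for the empty second segment, and {3, 9} for the third.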
ArgMin() [inline, static]
Finds the first device-wide minimum in each segment using the less-than ('<') operator, also returning the in-segment index of that item.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support < operators that are non-commutative.
Illustrative use: the argmin-reduction of a device vector of int data elements.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (of some type T) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) |
OffsetIteratorT | [inferred] Random-access input iterator type for reading segment offsets |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in |
[in] | d_end_offsets | Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
Definition at line 385 of file device_segmented_reduce.cuh.
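The "first minimum" wording matters when a segment contains ties. The sequential C++ sketch below illustrates that tie-breaking rule on the host; it is a reference for the documented semantics only, not CUB's code. KeyValue (a hypothetical stand-in for cub::KeyValuePair<int, T>) and segmented_argmin_ref are names assumed for illustration.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, T>.
template <typename T>
struct KeyValue {
    int key;   // in-segment index of the selected item
    T value;   // the selected item
};

// Sequential reference for the documented ArgMin semantics (illustrative only).
template <typename T>
std::vector<KeyValue<T>> segmented_argmin_ref(const T* in, int num_segments,
                                              const int* begin_offsets,
                                              const int* end_offsets) {
    assert(num_segments >= 0);
    std::vector<KeyValue<T>> out;
    for (int s = 0; s < num_segments; ++s) {
        // Zero-length segments keep the documented {1, max()} tuple.
        KeyValue<T> best{1, std::numeric_limits<T>::max()};
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            // Strict '<' means a later equal item never replaces an earlier one.
            if (i == begin_offsets[s] || in[i] < best.value) {
                best = KeyValue<T>{i - begin_offsets[s], in[i]};
            }
        }
        out.push_back(best);
    }
    return out;
}
```

For the input [4, 2, 2, 9, 7, 1, 1] with segment_offsets = [0, 4, 4, 7], the first segment yields {1, 2} (the earlier of the two 2s), the empty second segment yields the {1, max()} tuple, and the third yields {1, 1}.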
Max() [inline, static]
Computes a device-wide segmented maximum using the greater-than ('>') operator.
Uses std::numeric_limits<T>::lowest() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support > operators that are non-commutative.
Illustrative use: the max-reduction of a device vector of int data elements.
InputIteratorT | [inferred] Random-access input iterator type for reading input items |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate |
OffsetIteratorT | [inferred] Random-access input iterator type for reading segment offsets |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in |
[in] | d_end_offsets | Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
Definition at line 483 of file device_segmented_reduce.cuh.
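Because the reduction starts from std::numeric_limits<T>::lowest(), an empty segment reports that value rather than an error. The host-side C++ sketch below makes this concrete; it is a sequential reference for the documented behaviour, and segmented_max_ref is an illustrative name, not a CUB function.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Sequential reference for the documented Max semantics (illustrative only).
template <typename T>
std::vector<T> segmented_max_ref(const T* in, int num_segments,
                                 const int* begin_offsets,
                                 const int* end_offsets) {
    assert(num_segments >= 0);
    std::vector<T> out;
    for (int s = 0; s < num_segments; ++s) {
        // Each segment starts from lowest(), so an empty segment returns it unchanged.
        T acc = std::numeric_limits<T>::lowest();
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            if (in[i] > acc) {
                acc = in[i];
            }
        }
        out.push_back(acc);
    }
    return out;
}
```

For the input [8, 6, 7, 5, 3, 0, 9] with segment_offsets = [0, 3, 3, 7], the three results are 8, lowest(), and 9.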
Min() [inline, static]
Computes a device-wide segmented minimum using the less-than ('<') operator.
Uses std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support < operators that are non-commutative.
Illustrative use: the min-reduction of a device vector of int data elements.
InputIteratorT | [inferred] Random-access input iterator type for reading input items |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate |
OffsetIteratorT | [inferred] Random-access input iterator type for reading segment offsets |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in |
[in] | d_end_offsets | Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
Definition at line 300 of file device_segmented_reduce.cuh.
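The mirror-image rule applies to Min: each segment starts from std::numeric_limits<T>::max(), which is therefore what an empty segment reports. A minimal sequential C++ sketch of that behaviour follows; it is a host-side reference only, and segmented_min_ref is an illustrative name, not a CUB function.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Sequential reference for the documented Min semantics (illustrative only).
template <typename T>
std::vector<T> segmented_min_ref(const T* in, int num_segments,
                                 const int* begin_offsets,
                                 const int* end_offsets) {
    assert(num_segments >= 0);
    std::vector<T> out;
    for (int s = 0; s < num_segments; ++s) {
        // Each segment starts from max(), so an empty segment returns it unchanged.
        T acc = std::numeric_limits<T>::max();
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            if (in[i] < acc) {
                acc = in[i];
            }
        }
        out.push_back(acc);
    }
    return out;
}
```

For the input [8, 6, 7, 5, 3, 0, 9] with segment_offsets = [0, 3, 3, 7], the three results are 6, max(), and 0.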
Reduce() [inline, static]
Computes a device-wide segmented reduction using the specified binary reduction_op functor.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Illustrative use: a segmented reduction of a device vector of int data elements.
InputIteratorT | [inferred] Random-access input iterator type for reading input items |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate |
OffsetIteratorT | [inferred] Random-access input iterator type for reading segment offsets |
ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) |
T | [inferred] Data element type that is convertible to the value type of InputIteratorT |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in |
[in] | d_end_offsets | Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | reduction_op | Binary reduction functor |
[in] | initial_value | Initial value of the reduction for each segment |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
Definition at line 133 of file device_segmented_reduce.cuh.
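The shape required of ReductionOp and the role of initial_value can be illustrated with a sequential host-side reference. The C++ sketch below is not CUB's implementation: CustomMin is a hypothetical functor, segmented_reduce_ref is an illustrative name, and treating initial_value as the seed of every segment's fold (including non-empty ones) is an assumption made for this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical functor exposing the member the ReductionOp parameter expects:
// T operator()(const T &a, const T &b).
struct CustomMin {
    template <typename T>
    T operator()(const T& a, const T& b) const {
        return (b < a) ? b : a;
    }
};

// Sequential reference for a segmented reduction (illustrative only).
template <typename T, typename ReductionOp>
std::vector<T> segmented_reduce_ref(const T* in, int num_segments,
                                    const int* begin_offsets,
                                    const int* end_offsets,
                                    ReductionOp reduction_op,
                                    T initial_value) {
    assert(num_segments >= 0);
    std::vector<T> out;
    for (int s = 0; s < num_segments; ++s) {
        // Assumption: every segment is folded starting from initial_value,
        // so an empty segment returns initial_value unchanged.
        T acc = initial_value;
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            acc = reduction_op(acc, in[i]);
        }
        out.push_back(acc);
    }
    return out;
}
```

Calling segmented_reduce_ref with CustomMin and an initial value of 100 on the input [8, 6, 7, 5, 3, 0, 9] with segment_offsets = [0, 3, 3, 7] gives 6, 100, and 0. An initial value that is the identity of the operator (for a minimum, the largest representable value) leaves non-empty segments unaffected.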
Sum() [inline, static]
Computes a device-wide segmented sum using the addition ('+') operator.
Uses 0 as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support + operators that are non-commutative.
Illustrative use: the sum-reduction of a device vector of int data elements.
InputIteratorT | [inferred] Random-access input iterator type for reading input items |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate |
OffsetIteratorT | [inferred] Random-access input iterator type for reading segment offsets |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in |
[in] | d_end_offsets | Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
Definition at line 215 of file device_segmented_reduce.cuh.
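The offset-aliasing rule is easiest to see with a concrete layout: for contiguous segments, one array of num_segments+1 offsets serves as both the begin and the end sequence. The sequential C++ sketch below shows this on the host; it is a reference for the documented semantics, not CUB's code, and segmented_sum_ref is an illustrative name.

```cpp
#include <cassert>
#include <vector>

// Sequential reference for the documented Sum semantics (illustrative only).
template <typename T>
std::vector<T> segmented_sum_ref(const T* in, int num_segments,
                                 const int* begin_offsets,
                                 const int* end_offsets) {
    assert(num_segments >= 0);
    std::vector<T> out;
    for (int s = 0; s < num_segments; ++s) {
        // Each segment starts from 0, so an empty segment sums to 0.
        T acc = T(0);
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            acc = acc + in[i];
        }
        out.push_back(acc);
    }
    return out;
}
```

With segment_offsets = [0, 3, 3, 7] and the input [8, 6, 7, 5, 3, 0, 9], the call segmented_sum_ref(in, 3, segment_offsets, segment_offsets + 1) reads begin offsets [0, 3, 3] and end offsets [3, 3, 7] from the same array and returns 21, 0, and 17. Passing two separate arrays with those contents gives the same result.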