DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory.
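- Overview
- These operations apply a selection criterion to construct a partitioned output sequence from items selected/unselected from a specified input sequence.
- Usage Considerations
- Dynamic parallelism. DevicePartition methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
- Performance
- The work-complexity of partition as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
- The following chart illustrates DevicePartition::If performance across different CUDA architectures for int32 items, where 50% of the items are randomly selected for the first partition. Performance plots for other scenarios can be found in the detailed method descriptions below.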
Definition at line 73 of file device_partition.cuh.
template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t Flagged(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Uses the d_flags sequence to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t If(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream=0, bool debug_synchronous=false)
Uses the select_op functor to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

◆ Flagged()
template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DevicePartition::Flagged(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    FlagIterator d_flags,
    OutputIteratorT d_out,
    NumSelectedIteratorT d_num_selected_out,
    int num_items,
    cudaStream_t stream = 0,
    bool debug_synchronous = false)
inline static
Uses the d_flags sequence to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.
- The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).
- Copies of the selected items are compacted into the front of d_out and maintain their original relative ordering; copies of the unselected items are compacted into the rear of d_out in reverse order.
- When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
- Snippet
- The code snippet below illustrates partitioning an int device vector according to a sequence of selection flags.
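#include <cub/cub.cuh>
int num_items;                           // number of input items
int *d_in, *d_out, *d_num_selected_out;  // device-accessible buffers
char *d_flags;                           // device-accessible selection flags
...
// Query the temporary storage size (d_temp_storage is NULL, so no work is done)
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DevicePartition::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);
// Allocate temporary storage, then run the partition
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DevicePartition::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);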
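The output layout described in the notes above (selected items first in their original order, unselected items packed at the rear in reverse order) can be checked against a small host-side model. The sketch below is only an illustration of that layout, not CUB's implementation: `reference_partition_flagged` is a hypothetical helper name, it runs sequentially on the CPU, and it uses `std::vector` in place of device buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference model (not a CUB function).
// Fills `out` with the documented layout and returns the number of
// selected items, which is also the offset of the unselected partition.
template <typename T, typename Flag>
std::size_t reference_partition_flagged(const std::vector<T>& in,
                                        const std::vector<Flag>& flags,
                                        std::vector<T>& out)
{
    assert(in.size() == flags.size());
    out.resize(in.size());
    std::size_t front = 0;          // next free slot in the first partition
    std::size_t back  = in.size();  // one past the next free slot at the rear
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (static_cast<bool>(flags[i])) {
            out[front++] = in[i];   // selected: relative order is preserved
        } else {
            out[--back] = in[i];    // unselected: filled from the rear backwards
        }
    }
    return front;
}
```

With the input `[1, 2, 3, 4, 5, 6, 7, 8]` and flags `[1, 0, 0, 1, 0, 1, 1, 0]`, this model returns 4 and produces `[1, 4, 6, 7, 8, 5, 3, 2]`. A result copied back from the device could be compared against it in a unit test.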
- Template Parameters
-
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
FlagIterator | [inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type) |
OutputIteratorT | [inferred] Random-access output iterator type for writing output items (may be a simple pointer type) |
NumSelectedIteratorT | [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type) |
- Parameters
-
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[in] | d_flags | Pointer to the input sequence of selection flags |
[out] | d_out | Pointer to the output sequence of partitioned data items |
[out] | d_num_selected_out | Pointer to the output total number of items selected (i.e., the offset of the unselected partition) |
[in] | num_items | Total number of items to select from |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 126 of file device_partition.cuh.
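The d_temp_storage and temp_storage_bytes parameters imply a two-call convention: a first call with a NULL pointer only reports the required size, and a second call with a real allocation does the work. The following host-only sketch mimics that convention so it can be run without a GPU. Everything in it is an assumption made for illustration: `mock_op` is a made-up stand-in (it merely reverses its input), `run_two_phase` is a hypothetical driver, and a `std::vector` takes the place of `cudaMalloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in that follows the same convention as the methods
// documented here: a null temp pointer means "report the size and return".
// It is not a CUB function and does no GPU work.
inline int mock_op(void* d_temp_storage, std::size_t& temp_storage_bytes,
                   const int* in, int* out, int num_items)
{
    assert(num_items >= 0);
    const std::size_t required = static_cast<std::size_t>(num_items) * sizeof(int);
    if (d_temp_storage == nullptr) {
        temp_storage_bytes = required;     // size query only, no work is done
        return 0;
    }
    if (temp_storage_bytes < required) {
        return 1;                          // caller supplied too small a buffer
    }
    int* scratch = static_cast<int*>(d_temp_storage);
    for (int i = 0; i < num_items; ++i) scratch[i] = in[i];
    for (int i = 0; i < num_items; ++i) out[i] = scratch[num_items - 1 - i];
    return 0;
}

// Generic two-call driver: query the size, allocate, then run.
template <typename Op>
int run_two_phase(Op op)
{
    void* temp = nullptr;
    std::size_t bytes = 0;
    int err = op(temp, bytes);             // pass 1: size query
    if (err != 0) return err;
    // Host allocation standing in for cudaMalloc; never empty, int-aligned.
    std::vector<int> buffer(bytes / sizeof(int) + 1);
    return op(buffer.data(), bytes);       // pass 2: the actual work
}
```

A call site wraps the operation in a lambda taking `(void*, std::size_t&)`, so the same arguments are passed to both calls by construction. This is the property the convention depends on.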
◆ If()
template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DevicePartition::If(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    NumSelectedIteratorT d_num_selected_out,
    int num_items,
    SelectOp select_op,
    cudaStream_t stream = 0,
    bool debug_synchronous = false)
inline static
Uses the select_op functor to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.
- Copies of the selected items are compacted into the front of d_out and maintain their original relative ordering; copies of the unselected items are compacted into the rear of d_out in reverse order.
- When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
- Performance
- The following charts illustrate saturated partition-if performance across different CUDA architectures for int32 and int64 items, respectively. Items are selected for the first partition with 50% probability.
- The following charts are similar, but with a 5% selection probability for the first partition.
- Snippet
- The code snippet below illustrates partitioning an int device vector with a selection functor.
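#include <cub/cub.cuh>
// Functor type for selecting values less than some criterion
struct LessThan
{
    int compare;
    CUB_RUNTIME_FUNCTION __forceinline__
    LessThan(int compare) : compare(compare) {}
    CUB_RUNTIME_FUNCTION __forceinline__
    bool operator()(const int &a) const {
        return (a < compare);
    }
};
int num_items;                           // number of input items
int *d_in, *d_out, *d_num_selected_out;  // device-accessible buffers
LessThan select_op(7);                   // threshold value is illustrative only
...
// Query the temporary storage size (d_temp_storage is NULL, so no work is done)
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
// Allocate temporary storage, then run the partition
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);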
- Template Parameters
-
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Random-access output iterator type for writing output items (may be a simple pointer type) |
NumSelectedIteratorT | [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type) |
SelectOp | [inferred] Selection functor type having member bool operator()(const T &a) |
- Parameters
-
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output sequence of partitioned data items |
[out] | d_num_selected_out | Pointer to the output total number of items selected (i.e., the offset of the unselected partition) |
[in] | num_items | Total number of items to select from |
[in] | select_op | Unary selection operator |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 234 of file device_partition.cuh.
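Because d_num_selected_out is also the offset of the unselected partition, and that partition is stored in reverse order, a caller who wants the unselected items in their original order has to read the rear of d_out backwards. The sketch below models this on the host under stated assumptions. `LessThanHost`, `reference_partition_if`, and `unselected_in_input_order` are hypothetical names introduced here; the code is a sequential CPU illustration of the documented layout, not how CUB computes it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side counterpart of a "less than" selection functor.
struct LessThanHost {
    int compare;
    bool operator()(const int& a) const { return a < compare; }
};

// Hypothetical reference model of the documented output layout:
// selected items in front (stable), unselected items reversed at the rear.
template <typename T, typename SelectOp>
std::size_t reference_partition_if(const std::vector<T>& in,
                                   std::vector<T>& out,
                                   SelectOp select_op)
{
    out.resize(in.size());
    std::size_t front = 0;
    std::size_t back  = in.size();
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (select_op(in[i])) {
            out[front++] = in[i];
        } else {
            out[--back] = in[i];
        }
    }
    return front;  // number selected == offset of the unselected partition
}

// Reads the rear partition backwards to restore the input order.
template <typename T>
std::vector<T> unselected_in_input_order(const std::vector<T>& out,
                                         std::size_t num_selected)
{
    assert(num_selected <= out.size());
    return std::vector<T>(out.rbegin(),
                          out.rend() - static_cast<std::ptrdiff_t>(num_selected));
}
```

For the input `[0, 2, 3, 9, 5, 2, 81, 8]` and a threshold of 7, the model returns 5 and yields `[0, 2, 3, 5, 2, 8, 81, 9]`. Reading the rear three items backwards gives `[9, 81, 8]`, their original order.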
The documentation for this struct was generated from the following file: