DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.

Detailed Description

Overview: These operations apply a selection criterion to selectively copy items from a specified input sequence to a compact output sequence.

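Usage Considerations: Dynamic parallelism. DeviceSelect methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.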

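Performance: The work-complexity of select-flagged, select-if, and select-unique as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.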

[Chart: DeviceSelect::If performance across different CUDA architectures for int32 items, where 50% of the items are randomly selected.]

[Chart: DeviceSelect::Unique performance across different CUDA architectures for int32 items, where segments have lengths uniformly sampled from [1,1000].]

Performance plots for other scenarios can be found in the detailed method descriptions below.

Definition at line 82 of file device_select.cuh.

Static Public Member Functions
template<typename InputIteratorT , typename FlagIterator , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t	Flagged (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Uses the `d_flags` sequence to selectively copy the corresponding items from `d_in` into `d_out`. The total number of items selected is written to `d_num_selected_out`.

template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT , typename SelectOp >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t	If (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream=0, bool debug_synchronous=false)
	Uses the `select_op` functor to selectively copy items from `d_in` into `d_out`. The total number of items selected is written to `d_num_selected_out`.

template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t	Unique (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Given an input sequence `d_in` having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to `d_out`. The total number of items selected is written to `d_num_selected_out`.

Member Function Documentation

◆ Flagged()

template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::Flagged(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    FlagIterator d_flags,
    OutputIteratorT d_out,
    NumSelectedIteratorT d_num_selected_out,
    int num_items,
    cudaStream_t stream = 0,
    bool debug_synchronous = false)

inline static

Uses the `d_flags` sequence to selectively copy the corresponding items from `d_in` into `d_out`. The total number of items selected is written to `d_num_selected_out`.

The value type of `d_flags` must be castable to `bool` (e.g., `bool`, `char`, `int`, etc.).
Copies of the selected items are compacted into `d_out` and maintain their original relative ordering.
When `d_temp_storage` is NULL, no work is done and the required allocation size is returned in `temp_storage_bytes`.

Snippet: The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers for input, flags, and output
int num_items;              // e.g., 8
int *d_in;                  // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags;              // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);

// d_out                 <-- [1, 4, 6, 7]
// d_num_selected_out    <-- [4]
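
When validating device results, it can help to have a sequential reference for what select-flagged is expected to produce. The following is a minimal host-side sketch of the semantics described above, not CUB's implementation; the name `select_flagged_reference` is hypothetical and does not exist in CUB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of CUB): keeps in[i] whenever
// flags[i] converts to true, preserving the original relative order.
template <typename T, typename FlagT>
std::vector<T> select_flagged_reference(const std::vector<T> &in,
                                        const std::vector<FlagT> &flags)
{
    assert(in.size() == flags.size()); // one flag per input item

    std::vector<T> out;
    out.reserve(in.size()); // worst case: every item is selected
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (static_cast<bool>(flags[i]))
        {
            out.push_back(in[i]);
        }
    }
    return out; // out.size() plays the role of *d_num_selected_out
}
```

Calling it with the input and flags from the snippet above should reproduce the same selection; after copying `d_out` and `d_num_selected_out` back to the host, the two results can be compared element by element.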

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
FlagIterator	[inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)
OutputIteratorT	[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT	[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[in]	d_flags	Pointer to the input sequence of selection flags
[out]	d_out	Pointer to the output sequence of selected data items
[out]	d_num_selected_out	Pointer to the output total number of items selected (i.e., length of `d_out`)
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 133 of file device_select.cuh.

◆ If()

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::If(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    NumSelectedIteratorT d_num_selected_out,
    int num_items,
    SelectOp select_op,
    cudaStream_t stream = 0,
    bool debug_synchronous = false)

inline static

Uses the `select_op` functor to selectively copy items from `d_in` into `d_out`. The total number of items selected is written to `d_num_selected_out`.

Copies of the selected items are compacted into `d_out` and maintain their original relative ordering.
When `d_temp_storage` is NULL, no work is done and the required allocation size is returned in `temp_storage_bytes`.

Performance: [Charts: saturated select-if performance across different CUDA architectures for int32 and int64 items, respectively, with items selected at 50% probability.]

[Charts: the same measurements with 5% selection probability.]

Snippet: The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>

// Functor type for selecting values less than some criterion
struct LessThan
{
    int compare;

    CUB_RUNTIME_FUNCTION __forceinline__
    LessThan(int compare) : compare(compare) {}

    CUB_RUNTIME_FUNCTION __forceinline__
    bool operator()(const int &a) const {
        return (a < compare);
    }
};

// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items;              // e.g., 8
int *d_in;                  // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int *d_num_selected_out;    // e.g., [ ]
LessThan select_op(7);
...

// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);

// d_out                 <-- [0, 2, 3, 5, 2]
// d_num_selected_out    <-- [5]
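
For checking results on the host, select-if has a close sequential analogue in the standard library's `std::copy_if`, which also preserves relative order. The sketch below assumes that analogy and is not how CUB implements the operation; `LessThanHost` and `select_if_reference` are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical host-only counterpart of the LessThan functor shown above
// (no CUB macros, so it compiles without the CUDA toolkit).
struct LessThanHost
{
    int compare;

    explicit LessThanHost(int compare) : compare(compare) {}

    bool operator()(const int &a) const { return a < compare; }
};

// Hypothetical host-side reference (not part of CUB): copies every item for
// which select_op returns true, keeping the original relative order.
template <typename T, typename SelectOp>
std::vector<T> select_if_reference(const std::vector<T> &in, SelectOp select_op)
{
    std::vector<T> out;
    out.reserve(in.size()); // worst case: every item is selected
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), select_op);
    return out; // out.size() plays the role of *d_num_selected_out
}
```

With the input from the snippet above and `LessThanHost(7)`, the reference is expected to agree with the device output once `d_out` has been copied back to the host.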

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT	[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT	[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp	[inferred] Selection operator type having member `bool operator()(const T &a)`

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output sequence of selected data items
[out]	d_num_selected_out	Pointer to the output total number of items selected (i.e., length of `d_out`)
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	select_op	Unary selection operator
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 239 of file device_select.cuh.

◆ Unique()

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::Unique(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    NumSelectedIteratorT d_num_selected_out,
    int num_items,
    cudaStream_t stream = 0,
    bool debug_synchronous = false)

inline static

Given an input sequence `d_in` having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to `d_out`. The total number of items selected is written to `d_num_selected_out`.

The `==` equality operator is used to determine whether keys are equivalent.
Copies of the selected items are compacted into `d_out` and maintain their original relative ordering.
When `d_temp_storage` is NULL, no work is done and the required allocation size is returned in `temp_storage_bytes`.

Performance: [Charts: saturated select-unique performance across different CUDA architectures for int32 and int64 items, respectively, with segment lengths uniformly sampled from [1,1000].]

[Charts: the same measurements with segment lengths uniformly sampled from [1,10].]

Snippet: The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items;              // e.g., 8
int *d_in;                  // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::Unique(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Unique(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items);

// d_out                 <-- [0, 2, 9, 5, 8]
// d_num_selected_out    <-- [5]
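
Because only runs of consecutive equal keys are collapsed, equal keys that are not adjacent all survive. The standard library's `std::unique_copy` has the same run-based behavior and also compares with `==` by default, so it can serve as a host-side reference. The sketch below relies on that analogy and is not CUB's implementation; `select_unique_reference` is a hypothetical name.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical host-side reference (not part of CUB): keeps the first key of
// each run of consecutive equal keys, compared with operator==.
template <typename T>
std::vector<T> select_unique_reference(const std::vector<T> &in)
{
    std::vector<T> out;
    out.reserve(in.size()); // worst case: no two neighbors are equal
    std::unique_copy(in.begin(), in.end(), std::back_inserter(out));
    return out; // out.size() plays the role of *d_num_selected_out
}
```

An input such as `[1, 2, 1]` passes through unchanged because its two `1` keys are not adjacent; to remove all duplicates, the input would need to be sorted first.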

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT	[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT	[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of data items
[out]	d_out	Pointer to the output sequence of selected data items
[out]	d_num_selected_out	Pointer to the output total number of items selected (i.e., length of `d_out`)
[in]	num_items	Total number of input items (i.e., length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 329 of file device_select.cuh.

The documentation for this struct was generated from the following file:

openfpm_data/src/util/cuda/cub_old/device/device_select.cuh
