DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.

Static Public Member Functions
template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t	Encode (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Computes a run-length encoding of the sequence `d_in`.

template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t	NonTrivialRuns (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
	Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence `d_in`.

Detailed Description

DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.

Overview: A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal "run" of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.

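Usage Considerations: Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance: The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

: The following chart illustrates DeviceRunLengthEncode::Encode performance across different CUDA architectures for int32 items. Segments have lengths uniformly sampled from [1,1000].

: Performance plots for other scenarios can be found in the detailed method descriptions below.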

Definition at line 78 of file device_run_length_encode.cuh.

Member Function Documentation

◆ Encode()

template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >

CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::Encode	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		UniqueOutputIteratorT	d_unique_out,
		LengthsOutputIteratorT	d_counts_out,
		NumRunsOutputIteratorT	d_num_runs_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inline static

Computes a run-length encoding of the sequence d_in.

For the i^th run encountered, the first key of the run and its length are written to d_unique_out[i] and d_counts_out[i], respectively.
The total number of runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
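The output layout described above can be sketched with a sequential host-side reference. This is only an illustration of the semantics, not the CUB implementation: `RunLengthEncoding` and `reference_encode` are hypothetical names, and `std::vector` stands in for the device buffers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for illustration; the fields mirror the roles of
// d_unique_out, d_counts_out, and *d_num_runs_out.
template <typename T>
struct RunLengthEncoding {
    std::vector<T> unique;    // first key of each run
    std::vector<int> counts;  // length of each run
    int num_runs = 0;         // total number of runs
};

// Sequential reference for the documented semantics (assumption: not how
// CUB computes it on the device, only what the outputs should contain).
template <typename T>
RunLengthEncoding<T> reference_encode(const std::vector<T>& in) {
    RunLengthEncoding<T> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        // A new run starts at the first item and wherever an item is not
        // equal (under operator==) to its predecessor.
        if (i == 0 || !(in[i] == in[i - 1])) {
            out.unique.push_back(in[i]);
            out.counts.push_back(1);
            ++out.num_runs;
        } else {
            ++out.counts.back();  // extend the current run
        }
    }
    return out;
}
```

Because only `operator==` is used, any key type that defines it can be encoded this way. A reference like this may be handy for checking device results in a unit test.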

Performance: The following charts illustrate saturated encode performance across different CUDA architectures for int32 and int64 items, respectively. Segments have lengths uniformly sampled from [1,1000].

: The following charts are similar, but with segment lengths uniformly sampled from [1,10]:

Snippet: The code snippet below illustrates the run-length encoding of a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items;       // e.g., 8
int *d_in;           // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_unique_out;   // e.g., [ , , , , , , , ]
int *d_counts_out;   // e.g., [ , , , , , , , ]
int *d_num_runs_out; // e.g., [ ]
...

// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::Encode(d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceRunLengthEncode::Encode(d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// d_unique_out   <-- [0, 2, 9, 5, 8]
// d_counts_out   <-- [1, 2, 1, 3, 1]
// d_num_runs_out <-- [5]

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT	[inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT	[inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT	[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to the input sequence of keys
[out]	d_unique_out	Pointer to the output sequence of unique keys (one key per run)
[out]	d_counts_out	Pointer to the output sequence of run-lengths (one count per run)
[out]	d_num_runs_out	Pointer to total number of runs
[in]	num_items	Total number of input items (i.e., the length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 148 of file device_run_length_encode.cuh.

◆ NonTrivialRuns()

template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >

CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::NonTrivialRuns	(	void *	d_temp_storage,
		size_t &	temp_storage_bytes,
		InputIteratorT	d_in,
		OffsetsOutputIteratorT	d_offsets_out,
		LengthsOutputIteratorT	d_lengths_out,
		NumRunsOutputIteratorT	d_num_runs_out,
		int	num_items,
		cudaStream_t	stream = `0`,
		bool	debug_synchronous = `false`
	)

inlinestatic

Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.

For the i^th non-trivial run, the run's starting offset and its length are written to d_offsets_out[i] and d_lengths_out[i], respectively.
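The total number of non-trivial runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.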
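As a rough illustration of these semantics, the sketch below enumerates non-trivial runs sequentially on the host. It is not the CUB implementation: `NonTrivialRunsResult` and `reference_non_trivial_runs` are hypothetical names, and `std::vector` stands in for the device buffers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for illustration; the fields mirror the roles of
// d_offsets_out, d_lengths_out, and *d_num_runs_out.
struct NonTrivialRunsResult {
    std::vector<int> offsets;  // starting offset of each non-trivial run
    std::vector<int> lengths;  // length of each non-trivial run
    int num_runs = 0;          // number of non-trivial runs
};

// Sequential reference for the documented semantics (assumption: only the
// expected outputs are modeled, not the device algorithm).
template <typename T>
NonTrivialRunsResult reference_non_trivial_runs(const std::vector<T>& in) {
    NonTrivialRunsResult out;
    std::size_t start = 0;
    while (start < in.size()) {
        // Advance to the end of the maximal run beginning at `start`,
        // comparing each item with its predecessor via operator==.
        std::size_t end = start + 1;
        while (end < in.size() && in[end] == in[end - 1]) {
            ++end;
        }
        const std::size_t length = end - start;
        if (length > 1) {  // runs of length 1 are trivial and skipped
            out.offsets.push_back(static_cast<int>(start));
            out.lengths.push_back(static_cast<int>(length));
            ++out.num_runs;
        }
        start = end;
    }
    return out;
}
```

Unlike `Encode`, runs of length 1 produce no output at all, so the number of entries written can be zero even for a non-empty input.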

Snippet: The code snippet below illustrates the identification of non-trivial runs within a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items;       // e.g., 8
int *d_in;           // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_offsets_out;  // e.g., [ , , , , , , , ]
int *d_lengths_out;  // e.g., [ , , , , , , , ]
int *d_num_runs_out; // e.g., [ ]
...

// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::NonTrivialRuns(d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceRunLengthEncode::NonTrivialRuns(d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// d_offsets_out  <-- [1, 4]
// d_lengths_out  <-- [2, 3]
// d_num_runs_out <-- [2]

Template Parameters

InputIteratorT	[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT	[inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT	[inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT	[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)

Parameters

[in]	d_temp_storage	Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to `temp_storage_bytes` and no work is done.
[in,out]	temp_storage_bytes	Reference to size in bytes of `d_temp_storage` allocation
[in]	d_in	Pointer to input sequence of data items
[out]	d_offsets_out	Pointer to output sequence of run-offsets (one offset per non-trivial run)
[out]	d_lengths_out	Pointer to output sequence of run-lengths (one count per non-trivial run)
[out]	d_num_runs_out	Pointer to total number of runs (i.e., length of `d_offsets_out`)
[in]	num_items	Total number of input items (i.e., the length of `d_in`)
[in]	stream	[optional] CUDA stream to launch kernels within. Default is stream₀.
[in]	debug_synchronous	[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is `false`.

Definition at line 244 of file device_run_length_encode.cuh.

The documentation for this struct was generated from the following file:

openfpm_data/src/util/cuda/cub_old/device/device_run_length_encode.cuh
