OpenFPM_pdata  4.1.0
Project that contain the implementation of distributed structures
 
Loading...
Searching...
No Matches
cub::DeviceScan Struct Reference

DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. More...

Detailed Description

DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.

Overview
Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence where each element is computed to be the reduction of the elements occurring earlier in the input sequence. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction.
As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]. The central idea is to leverage a small, constant factor of redundant work in order to overlap the latencies of global prefix propagation with local computation. As such, our algorithm requires only ~2n data movement (n inputs are read, n outputs are written), and typically proceeds at "memcpy" speeds.
[1] Duane Merrill and Michael Garland. "Single-pass Parallel Prefix Scan with Decoupled Look-back", NVIDIA Technical Report NVR-2016-002, 2016.
Usage Considerations
\cdp_class{DeviceScan}
Performance
\linear_performance{prefix scan}
The following chart illustrates DeviceScan::ExclusiveSum performance across different CUDA architectures for int32 keys. \plots_below

Definition at line 89 of file device_scan.cuh.

Static Public Member Functions

Exclusive scans
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveSum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide exclusive prefix sum. The value of 0 is applied as the initial value, and is assigned to *d_out.
 
template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT , typename InitValueT >
static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveScan (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. The init_value value is applied as the initial value, and is assigned to *d_out.
 
Inclusive scans
template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t InclusiveSum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide inclusive prefix sum.
 
template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT >
static CUB_RUNTIME_FUNCTION cudaError_t InclusiveScan (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a device-wide inclusive prefix scan using the specified binary scan_op functor.
 

Member Function Documentation

◆ ExclusiveScan()

template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT , typename InitValueT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveScan ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
ScanOpT  scan_op,
InitValueT  init_value,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. The init_value value is applied as the initial value, and is assigned to *d_out.

  • Supports non-commutative scan operators.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
  • \devicestorage
Snippet
The code snippet below illustrates the exclusive prefix min-scan of an int device vector
#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>
// CustomMin functor
struct CustomMin
{
template <typename T>
CUB_RUNTIME_FUNCTION __forceinline__
T operator()(const T &a, const T &b) const {
return (b < a) ? b : a;
}
};
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [ , , , , , , ]
CustomMin min_op
...
// Determine temporary device storage requirements for exclusive prefix scan
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, (int) MAX_INT, num_items);
// Allocate temporary storage for exclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run exclusive prefix min-scan
cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, (int) MAX_INT, num_items);
// d_out <-- [2147483647, 8, 6, 6, 5, 3, 0]
KeyT const ValueT ValueT OffsetT OffsetT num_items
[in] Total number of input data items
OutputIteratorT d_out
< [in] Pointer to the input sequence of data items
static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide exclusive prefix scan using the specified binary scan_op functor....
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs \iterator
ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Identity[inferred] Type of the identity value used Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of data items
[in]scan_opBinary scan functor
[in]init_valueInitial value to seed the exclusive scan (and is assigned to *d_out)
[in]num_itemsTotal number of input items (i.e., the length of d_in)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 243 of file device_scan.cuh.

◆ ExclusiveSum()

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveSum ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Computes a device-wide exclusive prefix sum. The value of 0 is applied as the initial value, and is assigned to *d_out.

  • Supports non-commutative sum operators.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
  • \devicestorage
Performance
The following charts illustrate saturated exclusive sum performance across different CUDA architectures for int32 and int64 items, respectively.
Snippet
The code snippet below illustrates the exclusive prefix sum of an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [ , , , , , , ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run exclusive prefix sum
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// d_out s<-- [0, 8, 14, 21, 26, 29, 29]
static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide exclusive prefix sum. The value of 0 is applied as the initial value,...
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of data items
[in]num_itemsTotal number of input items (i.e., the length of d_in)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 149 of file device_scan.cuh.

◆ InclusiveScan()

template<typename InputIteratorT , typename OutputIteratorT , typename ScanOpT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::InclusiveScan ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
ScanOpT  scan_op,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Computes a device-wide inclusive prefix scan using the specified binary scan_op functor.

  • Supports non-commutative scan operators.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
  • \devicestorage
Snippet
The code snippet below illustrates the inclusive prefix min-scan of an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>
// CustomMin functor
struct CustomMin
{
template <typename T>
CUB_RUNTIME_FUNCTION __forceinline__
T operator()(const T &a, const T &b) const {
return (b < a) ? b : a;
}
};
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [ , , , , , , ]
CustomMin min_op;
...
// Determine temporary device storage requirements for inclusive prefix scan
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, num_items);
// Allocate temporary storage for inclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run inclusive prefix min-scan
cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, num_items);
// d_out <-- [8, 6, 6, 5, 3, 0, 0]
static CUB_RUNTIME_FUNCTION cudaError_t InclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide inclusive prefix scan using the specified binary scan_op functor.
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs \iterator
ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of data items
[in]scan_opBinary scan functor
[in]num_itemsTotal number of input items (i.e., the length of d_in)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 407 of file device_scan.cuh.

◆ InclusiveSum()

template<typename InputIteratorT , typename OutputIteratorT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::InclusiveSum ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Computes a device-wide inclusive prefix sum.

  • Supports non-commutative sum operators.
  • Provides "run-to-run" determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
  • \devicestorage
Snippet
The code snippet below illustrates the inclusive prefix sum of an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_out; // e.g., [ , , , , , , ]
...
// Determine temporary device storage requirements for inclusive prefix sum
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// Allocate temporary storage for inclusive prefix sum
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run inclusive prefix sum
cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
// d_out <-- [8, 14, 21, 26, 29, 29, 38]
static CUB_RUNTIME_FUNCTION cudaError_t InclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Computes a device-wide inclusive prefix sum.
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading scan inputs \iterator
OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs \iterator
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of data items
[in]num_itemsTotal number of input items (i.e., the length of d_in)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 323 of file device_scan.cuh.


The documentation for this struct was generated from the following file: