OpenFPM_pdata  3.0.0
Project that contains the implementation of distributed structures
cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.

Detailed Description

template<typename T, int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.

Template Parameters
T            Data type being reduced
BLOCK_DIM_X  The thread block length in threads along the X dimension
ALGORITHM    [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_WARP_REDUCTIONS)
BLOCK_DIM_Y  [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z  [optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH     [optional] The PTX compute capability for which to specialize this collective (default: the value of __CUDA_ARCH__ during the current compiler pass). Useful for determining the collective's storage requirements for a given device from the host.
Overview
  • A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • BlockReduce can be optionally specialized by algorithm to accommodate different latency/throughput workload profiles:
    1. cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY. An efficient "raking" reduction algorithm that only supports commutative reduction operators.
    2. cub::BLOCK_REDUCE_RAKING. An efficient "raking" reduction algorithm that supports commutative and non-commutative reduction operators.
    3. cub::BLOCK_REDUCE_WARP_REDUCTIONS. A quick "tiled warp-reductions" reduction algorithm that supports commutative and non-commutative reduction operators.
Performance Considerations
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low.
  • Very efficient (only one synchronization barrier).
  • Incurs zero bank conflicts for most types.
  • Computation is slightly more efficient (i.e., having lower instruction overhead) for:
    • Summation (vs. generic reduction)
    • BLOCK_THREADS is a multiple of the architecture's warp size
    • Every thread has a valid input (i.e., full vs. partial-tiles)
  • See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives
A Simple Example
Every thread in the block uses the BlockReduce class by first specializing the BlockReduce type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}

Definition at line 221 of file block_reduce.cuh.

Data Structures

struct  TempStorage
 The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly with the __shared__ keyword, aliased to externally allocated memory, or union'd with other storage types to facilitate memory reuse.
 

Public Member Functions

Collective constructors
__device__ __forceinline__ BlockReduce ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ BlockReduce (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage.
 
Generic reductions
template<typename ReductionOp >
__device__ __forceinline__ T Reduce (T input, ReductionOp reduction_op)
 Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
 
template<int ITEMS_PER_THREAD, typename ReductionOp >
__device__ __forceinline__ T Reduce (T(&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op)
 Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
 
template<typename ReductionOp >
__device__ __forceinline__ T Reduce (T input, ReductionOp reduction_op, int num_valid)
 Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.
 
Summation reductions
__device__ __forceinline__ T Sum (T input)
 Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ T Sum (T(&inputs)[ITEMS_PER_THREAD])
 Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
 
__device__ __forceinline__ T Sum (T input, int num_valid)
 Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.
 

Private Types

enum  { BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z }
 Constants.
 
typedef BlockReduceWarpReductions< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > WarpReductions
 
typedef BlockReduceRakingCommutativeOnly< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > RakingCommutativeOnly
 
typedef BlockReduceRaking< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Raking
 
typedef If<(ALGORITHM==BLOCK_REDUCE_WARP_REDUCTIONS), WarpReductions, typename If<(ALGORITHM==BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY), RakingCommutativeOnly, Raking >::Type >::Type InternalBlockReduce
 Internal specialization type.
 
typedef InternalBlockReduce::TempStorage _TempStorage
 Shared memory storage layout type for BlockReduce.
 

Private Member Functions

__device__ __forceinline__ _TempStorage & PrivateStorage ()
 Internal storage allocator.
 

Private Attributes

_TempStorage & temp_storage
 Shared storage reference.
 
unsigned int linear_tid
 Linear thread-id.
 

Member Enumeration Documentation

◆ anonymous enum

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
anonymous enum
private

Constants.

Enumerator
BLOCK_THREADS 

The thread block size in threads.

Definition at line 230 of file block_reduce.cuh.

Constructor & Destructor Documentation

◆ BlockReduce()

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockReduce ( TempStorage &  temp_storage )
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in] temp_storage  Reference to memory allocation having layout type TempStorage

Definition at line 298 of file block_reduce.cuh.

Member Function Documentation

◆ Reduce() [1/3]

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T  input,
ReductionOp  reduction_op 
)
inline

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Each thread obtains an input item
    int thread_data;
    ...
    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}
Template Parameters
ReductionOp  [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input         Calling thread's input
[in] reduction_op  Binary reduction functor

Definition at line 348 of file block_reduce.cuh.

◆ Reduce() [2/3]

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T  (&inputs)[ITEMS_PER_THREAD],
ReductionOp  reduction_op 
)
inline

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low.
  • A subsequent __syncthreads() block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...
    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
ReductionOp       [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
[in] inputs        Calling thread's input segment
[in] reduction_op  Binary reduction functor

Definition at line 395 of file block_reduce.cuh.

◆ Reduce() [3/3]

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T  input,
ReductionOp  reduction_op,
int  num_valid 
)
inline

Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of a partially-full tile of integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>
__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Each thread obtains an input item
    int thread_data;
    if (threadIdx.x < num_valid) thread_data = ...
    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid);
}
Template Parameters
ReductionOp  [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input         Calling thread's input
[in] reduction_op  Binary reduction functor
[in] num_valid     Number of threads containing valid elements (may be less than BLOCK_THREADS)

Definition at line 440 of file block_reduce.cuh.

◆ Sum() [1/3]

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T  input )
inline

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Each thread obtains an input item
    int thread_data;
    ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
Parameters
[in] input  Calling thread's input

Definition at line 497 of file block_reduce.cuh.

◆ Sum() [2/3]

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T  (&inputs)[ITEMS_PER_THREAD] )
inline

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low.
  • A subsequent __syncthreads() block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
Template Parameters
ITEMS_PER_THREAD  [inferred] The number of consecutive items partitioned onto each thread.
Parameters
[in] inputs  Calling thread's input segment

Definition at line 539 of file block_reduce.cuh.

◆ Sum() [3/3]

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T  input,
int  num_valid 
)
inline

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of a partially-full tile of integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>
__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Each thread obtains an input item (up to num_valid)
    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid);
}
Parameters
[in] input      Calling thread's input
[in] num_valid  Number of threads containing valid elements (may be less than BLOCK_THREADS)

Definition at line 582 of file block_reduce.cuh.


The documentation for this class was generated from the following file:
  • block_reduce.cuh