Data Structures
class	cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >
	The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp.

class	cub::WarpScan< T, LOGICAL_WARP_THREADS, PTX_ARCH >
	The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.

Functions
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T	cub::ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask)
	Shuffle-up for any data type. Each warp-lane_i obtains the value `input` contributed by warp-lane_{i-src_offset}. For thread lanes i < src_offset, the thread's own `input` is returned to the thread.

template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T	cub::ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask)
	Shuffle-down for any data type. Each warp-lane_i obtains the value `input` contributed by warp-lane_{i+src_offset}. For thread lanes i >= WARP_THREADS - src_offset, the thread's own `input` is returned to the thread.

template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T	cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask)
	Shuffle-broadcast for any data type. Each warp-lane_i obtains the value `input` contributed by warp-lane_{src_lane}. If `src_lane` < 0 or `src_lane` >= WARP_THREADS, the thread's own `input` is returned to the thread.

Detailed Description

Function Documentation

◆ ShuffleDown()

template<int LOGICAL_WARP_THREADS, typename T >

__device__ __forceinline__ T cub::ShuffleDown	(	T	input,
		int	src_offset,
		int	last_thread,
		unsigned int	member_mask
	)

Shuffle-down for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_{i+src_offset}. For thread lanes i >= WARP_THREADS - src_offset (that is, lanes whose source i+src_offset would lie beyond last_thread), the thread's own input is returned to the thread.

Template Parameters

LOGICAL_WARP_THREADS	The number of threads per "logical" warp. Must be a power-of-two <= 32.
T	[inferred] The input/output element type

Available only for SM3.0 or newer

Snippet: The code snippet below illustrates each thread obtaining a double value from the successor of its successor.

    #include <cub/cub.cuh>   // or equivalently <cub/util_ptx.cuh>

    __global__ void ExampleKernel(...)
    {
        // Obtain one input item per thread
        double thread_data = ...

        // Obtain item from two ranks above
        double peer_data = cub::ShuffleDown<32>(thread_data, 2, 31, 0xffffffff);
    }

Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {3.0, 4.0, 5.0, 6.0, 7.0, ..., 32.0, 31.0, 32.0}: the last two lanes have no successor two ranks above, so they keep their own values.

Implementation note: the 5-bit SHFL mask that logically splits a warp into sub-segments starts 8 bits up in the shuffle control operand.
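
To make that bit layout concrete, here is a minimal sketch that assumes the PTX `shfl` control-operand layout: clamp lane in bits 0 to 4, segment mask in bits 8 to 12. The helper name `pack_shfl_control` is hypothetical, and deriving the segment mask as `32 - logical_warp_threads` is an assumption of this sketch rather than a statement about CUB's exact code.

```cpp
#include <cstdint>

// Hypothetical helper (not part of CUB): packs a shuffle control word.
//   bits 0-4  : clamp lane (first or last lane of the logical warp)
//   bits 8-12 : segment mask, set for the lane-index bits that stay fixed
// Assumes logical_warp_threads is a power of two <= 32.
constexpr std::uint32_t pack_shfl_control(int logical_warp_threads, int clamp_lane)
{
    return ((static_cast<std::uint32_t>(32 - logical_warp_threads) & 0x1fu) << 8)
         | (static_cast<std::uint32_t>(clamp_lane) & 0x1fu);
}
```

Under these assumptions, a full 32-thread warp clamped at lane 31 yields 0x1f (no segment bits set), while 16-thread logical warps clamped at lane 15 yield 0x100f.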

Parameters

[in]	input	The value to broadcast
[in]	src_offset	The relative up-offset of the peer to read from
[in]	last_thread	Index of last thread in logical warp (typically 31 for a 32-thread warp)
[in]	member_mask	32-bit mask of participating warp lanes
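
The edge-lane rule is easy to misread, so a host-side reference model may help. The sketch below is plain C++ that imitates only the lane arithmetic for a single full 32-thread warp. It is not CUB's implementation, it ignores member_mask, and `WarpValues`, `example_input`, and `model_shuffle_down` are hypothetical names introduced for illustration.

```cpp
#include <array>

constexpr int kWarpThreads = 32;                      // one full hardware warp
using WarpValues = std::array<double, kWarpThreads>;  // one value per lane

// Builds the example input {1.0, 2.0, ..., 32.0}.
WarpValues example_input()
{
    WarpValues values{};
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        values[lane] = static_cast<double>(lane + 1);
    }
    return values;
}

// Models shuffle-down: lane i reads lane i + src_offset. A lane whose source
// would lie beyond last_thread keeps its own value.
// Assumes src_offset >= 0 and 0 <= last_thread < kWarpThreads.
WarpValues model_shuffle_down(const WarpValues& input, int src_offset, int last_thread)
{
    WarpValues output{};
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        const int src = lane + src_offset;
        const bool in_range = (src <= last_thread) && (src < kWarpThreads);
        output[lane] = in_range ? input[src] : input[lane];
    }
    return output;
}
```

In this model, `model_shuffle_down(example_input(), 2, 31)` gives lane 0 the value 3.0 and lane 29 the value 32.0, while lanes 30 and 31 keep their own 31.0 and 32.0.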

Definition at line 585 of file util_ptx.cuh.

◆ ShuffleIndex()

template<int LOGICAL_WARP_THREADS, typename T >

__device__ __forceinline__ T cub::ShuffleIndex	(	T	input,
		int	src_lane,
		unsigned int	member_mask
	)

Shuffle-broadcast for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_{src_lane}. If src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.

Template Parameters

LOGICAL_WARP_THREADS	The number of threads per "logical" warp. Must be a power-of-two <= 32.
T	[inferred] The input/output element type

Available only for SM3.0 or newer

Snippet: The code snippet below illustrates each thread obtaining a double value from warp-lane₀.

    #include <cub/cub.cuh>   // or equivalently <cub/util_ptx.cuh>

    __global__ void ExampleKernel(...)
    {
        // Obtain one input item per thread
        double thread_data = ...

        // Obtain item from thread 0
        double peer_data = cub::ShuffleIndex<32>(thread_data, 0, 0xffffffff);
    }

Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 1.0, 1.0, 1.0, 1.0, ..., 1.0}.

Implementation note: the 5-bit SHFL mask that logically splits a warp into sub-segments starts 8 bits up in the shuffle control operand.

Parameters

[in]	input	The value to broadcast
[in]	src_lane	Which warp lane is to do the broadcasting
[in]	member_mask	32-bit mask of participating warp lanes
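
The broadcast rule, including the out-of-range case, can be sketched as a host-side reference model. This is plain C++ covering a single full 32-thread warp. It is not CUB's implementation, it ignores member_mask, and `WarpValues`, `example_input`, and `model_shuffle_index` are hypothetical names introduced for illustration.

```cpp
#include <array>

constexpr int kWarpThreads = 32;                      // one full hardware warp
using WarpValues = std::array<double, kWarpThreads>;  // one value per lane

// Builds the example input {1.0, 2.0, ..., 32.0}.
WarpValues example_input()
{
    WarpValues values{};
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        values[lane] = static_cast<double>(lane + 1);
    }
    return values;
}

// Models shuffle-broadcast: every lane reads the value held by src_lane.
// If src_lane is outside [0, kWarpThreads), each lane keeps its own value,
// following the rule stated in the description above.
WarpValues model_shuffle_index(const WarpValues& input, int src_lane)
{
    const bool in_range = (src_lane >= 0) && (src_lane < kWarpThreads);
    WarpValues output{};
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        output[lane] = in_range ? input[src_lane] : input[lane];
    }
    return output;
}
```

In this model, `model_shuffle_index(example_input(), 0)` fills every lane with 1.0, and an out-of-range `src_lane` such as 32 or -1 returns the input unchanged.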

Definition at line 656 of file util_ptx.cuh.

◆ ShuffleUp()

template<int LOGICAL_WARP_THREADS, typename T >

__device__ __forceinline__ T cub::ShuffleUp	(	T	input,
		int	src_offset,
		int	first_thread,
		unsigned int	member_mask
	)

Shuffle-up for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_{i-src_offset}. For thread lanes i < src_offset, the thread's own input is returned to the thread.

Template Parameters

LOGICAL_WARP_THREADS	The number of threads per "logical" warp. Must be a power-of-two <= 32.
T	[inferred] The input/output element type

Available only for SM3.0 or newer

Snippet: The code snippet below illustrates each thread obtaining a double value from the predecessor of its predecessor.

    #include <cub/cub.cuh>   // or equivalently <cub/util_ptx.cuh>

    __global__ void ExampleKernel(...)
    {
        // Obtain one input item per thread
        double thread_data = ...

        // Obtain item from two ranks below
        double peer_data = cub::ShuffleUp<32>(thread_data, 2, 0, 0xffffffff);
    }

Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 2.0, 1.0, 2.0, 3.0, ..., 30.0}.

Implementation note: the 5-bit SHFL mask that logically splits a warp into sub-segments starts 8 bits up in the shuffle control operand.

Parameters

[in]	input	The value to broadcast
[in]	src_offset	The relative down-offset of the peer to read from
[in]	first_thread	Index of first lane in logical warp (typically 0)
[in]	member_mask	32-bit mask of participating warp lanes
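
As a cross-check on the example output, the lane arithmetic of shuffle-up can be sketched as a host-side reference model. This is plain C++ covering a single full 32-thread warp. It is not CUB's implementation, it ignores member_mask, and `WarpValues`, `example_input`, and `model_shuffle_up` are hypothetical names introduced for illustration.

```cpp
#include <array>

constexpr int kWarpThreads = 32;                      // one full hardware warp
using WarpValues = std::array<double, kWarpThreads>;  // one value per lane

// Builds the example input {1.0, 2.0, ..., 32.0}.
WarpValues example_input()
{
    WarpValues values{};
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        values[lane] = static_cast<double>(lane + 1);
    }
    return values;
}

// Models shuffle-up: lane i reads lane i - src_offset. A lane whose source
// would lie before first_thread keeps its own value.
// Assumes src_offset >= 0 and 0 <= first_thread < kWarpThreads.
WarpValues model_shuffle_up(const WarpValues& input, int src_offset, int first_thread)
{
    WarpValues output{};
    for (int lane = 0; lane < kWarpThreads; ++lane) {
        const int src = lane - src_offset;
        const bool in_range = (src >= first_thread) && (src >= 0);
        output[lane] = in_range ? input[src] : input[lane];
    }
    return output;
}
```

In this model, `model_shuffle_up(example_input(), 2, 0)` leaves lanes 0 and 1 holding their own 1.0 and 2.0, gives lane 2 the value 1.0, and gives lane 31 the value 30.0.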

Definition at line 517 of file util_ptx.cuh.
