OpenFPM_pdata  3.0.0
Project that contains the implementation of distributed structures
cub Namespace Reference

Optional outer namespace(s) More...

Namespaces

 internal
 Internal namespace (to prevent ADL mishaps between static functions when mixing different CUB installations)
 

Data Structures

struct  AgentHistogram
 AgentHistogram implements a stateful abstraction of CUDA thread blocks for participating in device-wide histogram. More...
 
struct  AgentHistogramPolicy
 < Whether to dequeue tiles from a global work queue More...
 
struct  AgentRadixSortDownsweep
 AgentRadixSortDownsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort downsweep. More...
 
struct  AgentRadixSortDownsweepPolicy
 < The number of radix bits, i.e., log2(bins) More...
 
struct  AgentRadixSortUpsweep
 AgentRadixSortUpsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort upsweep. More...
 
struct  AgentRadixSortUpsweepPolicy
 < The number of radix bits, i.e., log2(bins) More...
 
struct  AgentReduce
 AgentReduce implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduction. More...
 
struct  AgentReduceByKey
 AgentReduceByKey implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key. More...
 
struct  AgentReduceByKeyPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentReducePolicy
 < Cache load modifier for reading input elements More...
 
struct  AgentRle
 AgentRle implements a stateful abstraction of CUDA thread blocks for participating in device-wide run-length-encode. More...
 
struct  AgentRlePolicy
 < The BlockScan algorithm to use More...
 
struct  AgentScan
 AgentScan implements a stateful abstraction of CUDA thread blocks for participating in device-wide prefix scan. More...
 
struct  AgentScanPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentSegmentFixup
 AgentSegmentFixup implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key. More...
 
struct  AgentSegmentFixupPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentSelectIf
 AgentSelectIf implements a stateful abstraction of CUDA thread blocks for participating in device-wide selection. More...
 
struct  AgentSelectIfPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentSpmv
 AgentSpmv implements a stateful abstraction of CUDA thread blocks for participating in device-wide SpMV. More...
 
struct  AgentSpmvPolicy
 < The BlockScan algorithm to use More...
 
struct  AlignBytes
 Structure alignment. More...
 
struct  AlignBytes< const T >
 
struct  AlignBytes< const volatile T >
 
struct  AlignBytes< volatile T >
 
class  ArgIndexInputIterator
 A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples). More...
 
struct  ArgMax
 Arg max functor (keeps the value and offset of the first occurrence of the larger item) More...
 
struct  ArgMin
 Arg min functor (keeps the value and offset of the first occurrence of the smallest item) More...
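The first-occurrence rule shared by cub::ArgMax and cub::ArgMin can be sketched on the host as follows (names such as KV and ArgMaxSketch are hypothetical stand-ins for cub::KeyValuePair and cub::ArgMax, with `__device__` qualifiers dropped):

```cpp
#include <cassert>

// Hypothetical host-side stand-ins for cub::KeyValuePair and cub::ArgMax.
// On equal values, the pair with the lower offset (key) wins, so the
// FIRST occurrence of the larger item is kept.
struct KV { int key; int value; };  // key = offset, value = item

struct ArgMaxSketch {
    KV operator()(const KV &a, const KV &b) const {
        if ((b.value > a.value) || ((b.value == a.value) && (b.key < a.key)))
            return b;
        return a;
    }
};
```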
 
struct  ArrayWrapper
 A wrapper for passing simple static arrays as kernel parameters. More...
 
struct  BaseTraits
 Basic type traits. More...
 
struct  BaseTraits< FLOATING_POINT, true, false, _UnsignedBits, T >
 
struct  BaseTraits< SIGNED_INTEGER, true, false, _UnsignedBits, T >
 
struct  BaseTraits< UNSIGNED_INTEGER, true, false, _UnsignedBits, T >
 
struct  BinaryOpHasIdxParam
 Determine whether or not BinaryOp's functor is of the form bool operator()(const T& a, const T& b) or bool operator()(const T& a, const T& b, unsigned int idx) More...
 
class  BlockAdjacentDifference
 
class  BlockDiscontinuity
 The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. More...
 
class  BlockExchange
 The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block. More...
 
class  BlockHistogram
 The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More...
 
struct  BlockHistogramAtomic
 The BlockHistogramAtomic class provides atomic-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More...
 
struct  BlockHistogramSort
 The BlockHistogramSort class provides sorting-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More...
 
class  BlockLoad
 The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block. More...
 
class  BlockRadixRank
 BlockRadixRank provides operations for ranking unsigned integer types within a CUDA thread block. More...
 
class  BlockRadixRankMatch
 
class  BlockRadixSort
 The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method. More...
 
struct  BlockRakingLayout
 BlockRakingLayout provides a conflict-free shared memory layout abstraction for 1D raking across thread block data. More...
 
class  BlockReduce
 The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block. More...
 
struct  BlockReduceRaking
 BlockReduceRaking provides raking-based methods of parallel reduction across a CUDA thread block. Supports non-commutative reduction operators. More...
 
struct  BlockReduceRakingCommutativeOnly
 BlockReduceRakingCommutativeOnly provides raking-based methods of parallel reduction across a CUDA thread block. Does not support non-commutative reduction operators. Does not support block sizes that are not a multiple of the warp size. More...
 
struct  BlockReduceWarpReductions
 BlockReduceWarpReductions provides variants of warp-reduction-based parallel reduction across a CUDA thread block. Supports non-commutative reduction operators. More...
 
class  BlockScan
 The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. More...
 
struct  BlockScanRaking
 BlockScanRaking provides variants of raking-based parallel prefix scan across a CUDA thread block. More...
 
struct  BlockScanRunningPrefixOp
 < Wrapped scan operator type More...
 
struct  BlockScanWarpScans
 BlockScanWarpScans provides warpscan-based variants of parallel prefix scan across a CUDA thread block. More...
 
class  BlockShuffle
 The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block. More...
 
class  BlockStore
 The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory. More...
 
class  CacheModifiedInputIterator
 A random-access input wrapper for dereferencing array values using a PTX cache load modifier. More...
 
class  CacheModifiedOutputIterator
 A random-access output wrapper for storing array values using a PTX cache-modifier. More...
 
struct  CachingDeviceAllocator
 A simple caching allocator for device memory allocations. More...
 
struct  CastOp
 Default cast functor. More...
 
struct  ChainedPolicy
 Helper for dispatching into a policy chain. More...
 
struct  ChainedPolicy< PTX_VERSION, PolicyT, PolicyT >
 Helper for dispatching into a policy chain (end-of-chain specialization) More...
 
class  ConstantInputIterator
 A random-access input generator for dereferencing a sequence of homogeneous values. More...
 
class  CountingInputIterator
 A random-access input generator for dereferencing a sequence of incrementing integer values. More...
 
struct  CubVector
 Exposes a member typedef Type that names the corresponding CUDA vector type if one exists. Otherwise Type refers to the CubVector structure itself, which will wrap the corresponding x, y, etc. vector fields. More...
 
struct  CubVector< T, 1 >
 
struct  CubVector< T, 2 >
 
struct  CubVector< T, 3 >
 
struct  CubVector< T, 4 >
 
struct  DeviceHistogram
 DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of samples data residing within device-accessible memory. More...
 
struct  DevicePartition
 DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory. More...
 
struct  DeviceRadixSort
 DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory. More...
 
struct  DeviceRadixSortPolicy
 < Signed integer type for global offsets More...
 
struct  DeviceReduce
 DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. More...
 
struct  DeviceReducePolicy
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
struct  DeviceRleDispatch
 < Signed integer type for global offsets More...
 
struct  DeviceRunLengthEncode
 DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory. More...
 
struct  DeviceScan
 DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. More...
 
struct  DeviceSegmentedRadixSort
 DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory. More...
 
struct  DeviceSegmentedReduce
 DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory. More...
 
struct  DeviceSelect
 DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory. More...
 
struct  DeviceSpmv
 DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV). More...
 
struct  DipatchHistogram
 < Signed integer type for global offsets More...
 
class  DiscardOutputIterator
 A discard iterator. More...
 
struct  DispatchRadixSort
 < Signed integer type for global offsets More...
 
struct  DispatchReduce
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
struct  DispatchReduceByKey
 < Signed integer type for global offsets More...
 
struct  DispatchScan
 < Signed integer type for global offsets More...
 
struct  DispatchSegmentedRadixSort
 < Signed integer type for global offsets More...
 
struct  DispatchSegmentedReduce
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
struct  DispatchSelectIf
 < Whether or not we push rejected items to the back of the output More...
 
struct  DispatchSpmv
 < Signed integer type for global offsets More...
 
struct  DoubleBuffer
 Double-buffer storage wrapper for multi-pass stream transformations that require more than one storage array for streaming intermediate results back and forth. More...
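The double-buffer idea can be sketched in plain C++ (a minimal sketch; DoubleBufferSketch and its Flip method are hypothetical, whereas in cub::DoubleBuffer the selector is updated by the algorithm itself):

```cpp
#include <cassert>

// Hypothetical sketch of the cub::DoubleBuffer idea: two storage arrays plus
// a selector; each multi-pass stage reads Current() and writes Alternate(),
// then flips the selector instead of copying intermediate results back.
template <typename T>
struct DoubleBufferSketch {
    T *d_buffers[2];
    int selector;                            // index of the "current" buffer
    DoubleBufferSketch(T *a, T *b) : d_buffers{a, b}, selector(0) {}
    T *Current()   { return d_buffers[selector]; }
    T *Alternate() { return d_buffers[selector ^ 1]; }
    void Flip()    { selector ^= 1; }        // performed after each pass
};
```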
 
struct  EnableIf
 Simple enable-if (similar to Boost) More...
 
struct  EnableIf< false, T >
 
struct  Equality
 Default equality functor. More...
 
struct  Equals
 Type equality test. More...
 
struct  Equals< A, A >
 
struct  FpLimits
 
struct  FpLimits< double >
 
struct  FpLimits< float >
 
class  GridBarrier
 GridBarrier implements a software global barrier among thread blocks within a CUDA grid. More...
 
class  GridBarrierLifetime
 GridBarrierLifetime extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation. More...
 
struct  GridEvenShare
 GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles. More...
 
class  GridQueue
 GridQueue is a descriptor utility for dynamic queue management. More...
 
struct  If
 Type selection (IF ? ThenType : ElseType) More...
 
struct  If< false, ThenType, ElseType >
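The selection mechanism can be sketched as follows (IfSketch is a hypothetical stand-in for cub::If; it is equivalent to std::conditional):

```cpp
#include <cassert>
#include <type_traits>

// Minimal sketch of cub::If: compile-time type selection via partial
// specialization. The primary template handles the true case ...
template <bool IF, typename ThenType, typename ElseType>
struct IfSketch { typedef ThenType Type; };

// ... and the specialization handles the false case.
template <typename ThenType, typename ElseType>
struct IfSketch<false, ThenType, ElseType> { typedef ElseType Type; };
```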
 
struct  Inequality
 Default inequality functor. More...
 
struct  InequalityWrapper
 Inequality functor (wraps equality functor) More...
 
struct  Int2Type
 Allows for the treatment of an integral constant as a type at compile-time (e.g., to achieve static call dispatch based on constant integral values) More...
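Static call dispatch with an Int2Type-style wrapper can be sketched as follows (Int2TypeSketch and Variant are hypothetical illustration names):

```cpp
#include <cassert>

// Sketch of cub::Int2Type: wraps an integral constant in a distinct type so
// overload resolution can select a code path at compile time.
template <int A>
struct Int2TypeSketch { enum { VALUE = A }; };

// Hypothetical example: statically dispatch between two implementations
// (e.g., keys-only vs. key-value variants of a sort).
inline int Variant(Int2TypeSketch<0>) { return 0; }  // keys-only path
inline int Variant(Int2TypeSketch<1>) { return 1; }  // key-value path
```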
 
struct  IsPointer
 Pointer vs. iterator. More...
 
struct  IsPointer< Tp * >
 
struct  IsVolatile
 Volatile modifier test. More...
 
struct  IsVolatile< Tp volatile >
 
struct  IterateThreadLoad
 Helper structure for templated load iteration (inductive case) More...
 
struct  IterateThreadLoad< MAX, MAX >
 Helper structure for templated load iteration (termination case) More...
 
struct  IterateThreadStore
 Helper structure for templated store iteration (inductive case) More...
 
struct  IterateThreadStore< MAX, MAX >
 Helper structure for templated store iteration (termination case) More...
 
struct  KernelConfig
 
struct  KeyValuePair
 A key identifier paired with a corresponding value. More...
 
struct  Log2
 Statically determine log2(N), rounded up. More...
 
struct  Log2< N, 0, COUNT >
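The inductive/termination pair above can be sketched in plain C++ (Log2Sketch is a hypothetical stand-in for cub::Log2):

```cpp
#include <cassert>

// Sketch of cub::Log2<N>: statically computes log2(N), rounded up, by
// counting how many right-shifts it takes to empty N.
template <int N, int CURRENT_VAL = N, int COUNT = 0>
struct Log2Sketch {
    enum { VALUE = Log2Sketch<N, (CURRENT_VAL >> 1), COUNT + 1>::VALUE };
};

// Termination case: round up when N is not a power of two.
template <int N, int COUNT>
struct Log2Sketch<N, 0, COUNT> {
    enum { VALUE = ((1 << (COUNT - 1)) < N) ? COUNT : COUNT - 1 };
};
```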
 
struct  Max
 Default max functor. More...
 
struct  Min
 Default min functor. More...
 
struct  Mutex
 
struct  NullType
 A simple "NULL" marker type. More...
 
struct  NumericTraits
 Numeric type traits. More...
 
struct  NumericTraits< bool >
 
struct  NumericTraits< char >
 
struct  NumericTraits< double >
 
struct  NumericTraits< float >
 
struct  NumericTraits< int >
 
struct  NumericTraits< long >
 
struct  NumericTraits< long long >
 
struct  NumericTraits< NullType >
 
struct  NumericTraits< short >
 
struct  NumericTraits< signed char >
 
struct  NumericTraits< unsigned char >
 
struct  NumericTraits< unsigned int >
 
struct  NumericTraits< unsigned long >
 
struct  NumericTraits< unsigned long long >
 
struct  NumericTraits< unsigned short >
 
struct  PowerOfTwo
 Statically determine if N is a power-of-two. More...
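The power-of-two test can be sketched with the usual bit trick (PowerOfTwoSketch is a hypothetical stand-in for cub::PowerOfTwo):

```cpp
#include <cassert>

// Sketch of cub::PowerOfTwo<N>: a positive power of two has exactly one bit
// set, so N & (N - 1) clears that bit and leaves zero.
template <int N>
struct PowerOfTwoSketch {
    enum { VALUE = ((N & (N - 1)) == 0) };
};
```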
 
struct  ReduceByKeyOp
 < Binary reduction operator to apply to values More...
 
struct  ReduceByKeyScanTileState
 
struct  ReduceByKeyScanTileState< ValueT, KeyT, false >
 
struct  ReduceByKeyScanTileState< ValueT, KeyT, true >
 
struct  ReduceBySegmentOp
 Reduce-by-segment functor. More...
 
struct  RemoveQualifiers
 Removes const and volatile qualifiers from type Tp. More...
 
struct  RemoveQualifiers< Tp, const Up >
 
struct  RemoveQualifiers< Tp, const volatile Up >
 
struct  RemoveQualifiers< Tp, volatile Up >
 
struct  ScanTileState
 
struct  ScanTileState< T, false >
 
struct  ScanTileState< T, true >
 
struct  SpmvParams
 < Signed integer type for sequence offsets More...
 
struct  Sum
 Default sum functor. More...
 
class  SwizzleScanOp
 Binary operator wrapper for switching non-commutative scan arguments. More...
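Argument swizzling for a non-commutative operator can be sketched as follows (SwizzleSketch and Concat are hypothetical illustration names; string concatenation stands in for any non-commutative scan operator):

```cpp
#include <cassert>
#include <string>

// Sketch of cub::SwizzleScanOp: wraps a binary operator and swaps its
// arguments, which lets non-commutative scans run in the opposite direction.
template <typename ScanOp>
struct SwizzleSketch {
    ScanOp op;
    template <typename T>
    T operator()(const T &a, const T &b) const { return op(b, a); }
};

// Hypothetical non-commutative operator: string concatenation.
struct Concat {
    std::string operator()(const std::string &a, const std::string &b) const {
        return a + b;
    }
};
```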
 
union  TempStorage
 
class  TexObjInputIterator
 A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects. More...
 
struct  TilePrefixCallbackOp
 
struct  Traits
 Type traits. More...
 
class  TransformInputIterator
 A random-access input wrapper for transforming dereferenced values. More...
 
struct  Uninitialized
 A storage-backing wrapper that allows types with non-trivial constructors to be aliased in unions. More...
 
struct  UnitWord
 Unit-words of data movement. More...
 
struct  UnitWord< char2 >
 
struct  UnitWord< const T >
 
struct  UnitWord< const volatile T >
 
struct  UnitWord< float2 >
 
struct  UnitWord< float4 >
 
struct  UnitWord< volatile T >
 
class  WarpExchange
 
class  WarpReduce
 The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. More...
 
struct  WarpReduceShfl
 WarpReduceShfl provides SHFL-based variants of parallel reduction of items partitioned across a CUDA thread warp. More...
 
struct  WarpReduceSmem
 WarpReduceSmem provides smem-based variants of parallel reduction of items partitioned across a CUDA thread warp. More...
 
class  WarpScan
 The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. More...
 
struct  WarpScanShfl
 WarpScanShfl provides SHFL-based variants of parallel prefix scan of items partitioned across a CUDA thread warp. More...
 
struct  WarpScanSmem
 WarpScanSmem provides smem-based variants of parallel prefix scan of items partitioned across a CUDA thread warp. More...
 

Typedefs

typedef AgentRadixSortUpsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltUpsweepPolicy, typename ChainedPolicyT::ActivePolicy::UpsweepPolicy >::Type, KeyT, OffsetT > AgentRadixSortUpsweepT
 
typedef AgentRadixSortDownsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltDownsweepPolicy, typename ChainedPolicyT::ActivePolicy::DownsweepPolicy >::Type, IS_DESCENDING, KeyT, ValueT, OffsetT > AgentRadixSortDownsweepT
 
typedef BlockRadixSort< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ValueT, ChainedPolicyT::ActivePolicy::SingleTilePolicy::RADIX_BITS,(ChainedPolicyT::ActivePolicy::SingleTilePolicy::RANK_ALGORITHM==RADIX_RANK_MEMOIZE), ChainedPolicyT::ActivePolicy::SingleTilePolicy::SCAN_ALGORITHM > BlockRadixSortT
 
typedef BlockLoad< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > BlockLoadKeys
 
typedef BlockLoad< ValueT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > BlockLoadValues
 
typedef Traits< KeyT >::UnsignedBits UnsignedBitsT
 
typedef AgentRadixSortUpsweep< AgentRadixSortUpsweepPolicy< BLOCK_THREADS, ITEMS_PER_THREAD, SegmentedPolicyT::LOAD_MODIFIER, RADIX_BITS >, KeyT, OffsetT > BlockUpsweepT
 
typedef BlockScan< OffsetT, BLOCK_THREADS > DigitScanT
 
typedef AgentRadixSortDownsweep< SegmentedPolicyT, IS_DESCENDING, KeyT, ValueT, OffsetT > BlockDownsweepT
 
typedef AgentReduce< typename ChainedPolicyT::ActivePolicy::ReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT > AgentReduceT
 

Enumerations

enum  BlockHistogramMemoryPreference { GMEM, SMEM, BLEND }
 
enum  RadixRankAlgorithm { RADIX_RANK_BASIC, RADIX_RANK_MEMOIZE, RADIX_RANK_MATCH }
 
enum  ScanTileStatus { SCAN_TILE_OOB, SCAN_TILE_INVALID = 99, SCAN_TILE_PARTIAL, SCAN_TILE_INCLUSIVE }
 
enum  BlockHistogramAlgorithm { BLOCK_HISTO_SORT, BLOCK_HISTO_ATOMIC }
 BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms. More...
 
enum  BlockLoadAlgorithm {
  BLOCK_LOAD_DIRECT, BLOCK_LOAD_VECTORIZE, BLOCK_LOAD_TRANSPOSE, BLOCK_LOAD_WARP_TRANSPOSE,
  BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED
}
 cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. More...
 
enum  BlockReduceAlgorithm { BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, BLOCK_REDUCE_RAKING, BLOCK_REDUCE_WARP_REDUCTIONS }
 
enum  BlockScanAlgorithm { BLOCK_SCAN_RAKING, BLOCK_SCAN_RAKING_MEMOIZE, BLOCK_SCAN_WARP_SCANS }
 BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block. More...
 
enum  BlockStoreAlgorithm {
  BLOCK_STORE_DIRECT, BLOCK_STORE_VECTORIZE, BLOCK_STORE_TRANSPOSE, BLOCK_STORE_WARP_TRANSPOSE,
  BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED
}
 cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory. More...
 
enum  {
  BLOCK_THREADS = SegmentedPolicyT::BLOCK_THREADS, ITEMS_PER_THREAD = SegmentedPolicyT::ITEMS_PER_THREAD, RADIX_BITS = SegmentedPolicyT::RADIX_BITS, TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
  RADIX_DIGITS = 1 << RADIX_BITS, KEYS_ONLY = Equals<ValueT, NullType>::VALUE
}
 
enum  { BINS_TRACKED_PER_THREAD = BlockDownsweepT::BINS_TRACKED_PER_THREAD }
 
enum  GridMappingStrategy { GRID_MAPPING_RAKE, GRID_MAPPING_STRIP_MINE, GRID_MAPPING_DYNAMIC }
 cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks. More...
 
enum  CacheLoadModifier {
  LOAD_DEFAULT, LOAD_CA, LOAD_CG, LOAD_CS,
  LOAD_CV, LOAD_LDG, LOAD_VOLATILE
}
 Enumeration of cache modifiers for memory load operations. More...
 
enum  CacheStoreModifier {
  STORE_DEFAULT, STORE_WB, STORE_CG, STORE_CS,
  STORE_WT, STORE_VOLATILE
}
 Enumeration of cache modifiers for memory store operations. More...
 
enum  { MAX_VEC_ELEMENTS = 4 }
 
enum  Category { NOT_A_NUMBER, SIGNED_INTEGER, UNSIGNED_INTEGER, FLOATING_POINT }
 Basic type traits categories.
 

Functions

template<int NUM_ACTIVE_CHANNELS, typename CounterT , typename OffsetT >
__global__ void DeviceHistogramInitKernel (ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_output_bins_wrapper, ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_output_histograms_wrapper, GridQueue< int > tile_queue)
 < Signed integer type for global offsets More...
 
template<typename AgentHistogramPolicyT , int PRIVATIZED_SMEM_BINS, int NUM_CHANNELS, int NUM_ACTIVE_CHANNELS, typename SampleIteratorT , typename CounterT , typename PrivatizedDecodeOpT , typename OutputDecodeOpT , typename OffsetT >
 __launch_bounds__ (int(AgentHistogramPolicyT::BLOCK_THREADS)) __global__ void DeviceHistogramSweepKernel(SampleIteratorT d_samples
 < Signed integer type for global offsets More...
 
AgentHistogramT agent (temp_storage, d_samples, num_output_bins_wrapper.array, num_privatized_bins_wrapper.array, d_output_histograms_wrapper.array, d_privatized_histograms_wrapper.array, output_decode_op_wrapper.array, privatized_decode_op_wrapper.array)
 
agent InitBinCounters ()
 
agent ConsumeTiles (num_row_pixels, num_rows, row_stride_samples, tiles_per_row, tile_queue)
 
agent StoreOutput ()
 
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename OffsetT >
 __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortUpsweepKernel(const KeyT *d_keys
 < Signed integer type for global offsets More...
 
even_share template BlockInit< TILE_ITEMS, GRID_MAPPING_RAKE > ()
 
upsweep ProcessRegion (even_share.block_offset, even_share.block_end)
 
 CTA_SYNC ()
 
upsweep template ExtractCounts< IS_DESCENDING > (d_spine, gridDim.x, blockIdx.x)
 
template<typename ChainedPolicyT , typename OffsetT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS), 1) __global__ void RadixSortScanBinsKernel(OffsetT *d_spine
 < Signed integer type for global offsets More...
 
AgentScanT block_scan (temp_storage, d_spine, d_spine, cub::Sum(), OffsetT(0))
 
 while (block_offset+AgentScanT::TILE_ITEMS<=num_counts)
 
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
 __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortDownsweepKernel(const KeyT *d_keys_in
 < Signed integer type for global offsets More...
 
 AgentRadixSortDownsweepT (temp_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, num_bits).ProcessRegion(even_share.block_offset
 
template<typename ChainedPolicyT , bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceRadixSortSingleTileKernel(const KeyT *d_keys_in
 < Signed integer type for global offsets More...
 
 BlockLoadKeys (temp_storage.load_keys).Load(d_keys_in
 
 if (!KEYS_ONLY)
 
 BlockRadixSortT (temp_storage.sort).SortBlockedToStriped(keys
 
 Int2Type< IS_DESCENDING > ()
 
 Int2Type< KEYS_ONLY > ())
 
 for (int ITEM=0;ITEM< ITEMS_PER_THREAD;++ITEM)
 
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetIteratorT , typename OffsetT >
 __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedRadixSortKernel(const KeyT *d_keys_in
 < Signed integer type for global offsets More...
 
 if (num_items<=0) return
 
BlockUpsweepT upsweep (temp_storage.upsweep, d_keys_in, current_bit, pass_bits)
 
upsweep ProcessRegion (segment_begin, segment_end)
 
upsweep ExtractCounts (bin_count)
 
 if (IS_DESCENDING)
 
 DigitScanT (temp_storage.scan).ExclusiveSum(bin_count
 
BlockDownsweepT downsweep (temp_storage.downsweep, bin_offset, num_items, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, pass_bits)
 
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceReduceKernel(InputIteratorT d_in
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
 if (threadIdx.x==0) d_out[blockIdx.x] = reduction_op(init, block_aggregate)
 
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT , typename OuputT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceReduceSingleTileKernel(InputIteratorT d_in
 < Data element type that is convertible to the value type of OutputIteratorT More...
 
 if (num_items==0)
 
template<typename T , typename OffsetT , typename IteratorT >
__device__ __forceinline__ void NormalizeReductionOutput (T &, OffsetT, IteratorT)
 Normalize input iterator to segment offset.
 
template<typename KeyValuePairT , typename OffsetT , typename WrappedIteratorT , typename OutputValueT >
__device__ __forceinline__ void NormalizeReductionOutput (KeyValuePairT &val, OffsetT base_offset, ArgIndexInputIterator< WrappedIteratorT, OffsetT, OutputValueT >)
 Normalize input iterator to segment offset (specialized for arg-index)
 
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedReduceKernel(InputIteratorT d_in
 < Data element type that is convertible to the value type of OutputIteratorT More...
 
 if (segment_begin==segment_end)
 
 NormalizeReductionOutput (block_aggregate, segment_begin, d_in)
 
template<typename AgentReduceByKeyPolicyT , typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename ReductionOpT , typename OffsetT >
 __launch_bounds__ (int(AgentReduceByKeyPolicyT::BLOCK_THREADS)) __global__ void DeviceReduceByKeyKernel(KeysInputIteratorT d_keys_in
 < Signed integer type for global offsets More...
 
 AgentReduceByKeyT (temp_storage, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, equality_op, reduction_op).ConsumeRange(num_items
 
template<typename AgentRlePolicyT , typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename OffsetT >
 __launch_bounds__ (int(AgentRlePolicyT::BLOCK_THREADS)) __global__ void DeviceRleSweepKernel(InputIteratorT d_in
 < Signed integer type for global offsets More...
 
 AgentRleT (temp_storage, d_in, d_offsets_out, d_lengths_out, equality_op, num_items).ConsumeRange(num_tiles
 
template<typename ScanTileStateT >
__global__ void DeviceScanInitKernel (ScanTileStateT tile_state, int num_tiles)
 < Tile status interface type More...
 
template<typename ScanTileStateT , typename NumSelectedIteratorT >
__global__ void DeviceCompactInitKernel (ScanTileStateT tile_state, int num_tiles, NumSelectedIteratorT d_num_selected_out)
 < Output iterator type for recording the number of items selected More...
 
template<typename ScanPolicyT , typename InputIteratorT , typename OutputIteratorT , typename ScanTileStateT , typename ScanOpT , typename InitValueT , typename OffsetT >
 __launch_bounds__ (int(ScanPolicyT::BLOCK_THREADS)) __global__ void DeviceScanKernel(InputIteratorT d_in
 < Signed integer type for global offsets More...
 
 AgentScanT (temp_storage, d_in, d_out, scan_op, init_value).ConsumeRange(num_items
 
template<typename AgentSelectIfPolicyT , typename InputIteratorT , typename FlagsInputIteratorT , typename SelectedOutputIteratorT , typename NumSelectedIteratorT , typename ScanTileStateT , typename SelectOpT , typename EqualityOpT , typename OffsetT , bool KEEP_REJECTS>
 __launch_bounds__ (int(AgentSelectIfPolicyT::BLOCK_THREADS)) __global__ void DeviceSelectSweepKernel(InputIteratorT d_in
 < Whether or not we push rejected items to the back of the output More...
 
 AgentSelectIfT (temp_storage, d_in, d_flags, d_selected_out, select_op, equality_op, num_items).ConsumeRange(num_tiles
 
template<typename AgentSpmvPolicyT , typename ValueT , typename OffsetT >
__global__ void DeviceSpmv1ColKernel (SpmvParams< ValueT, OffsetT > spmv_params)
 < Signed integer type for sequence offsets More...
 
template<typename SpmvPolicyT , typename OffsetT , typename CoordinateT , typename SpmvParamsT >
__global__ void DeviceSpmvSearchKernel (int num_merge_tiles, CoordinateT *d_tile_coordinates, SpmvParamsT spmv_params)
 < SpmvParams type More...
 
template<typename SpmvPolicyT , typename ScanTileStateT , typename ValueT , typename OffsetT , typename CoordinateT , bool HAS_ALPHA, bool HAS_BETA>
 __launch_bounds__ (int(SpmvPolicyT::BLOCK_THREADS)) __global__ void DeviceSpmvKernel(SpmvParams< ValueT
 < Whether the input parameter Beta is 0 More...
 
 AgentSpmvT (temp_storage, spmv_params).ConsumeTile(d_tile_coordinates
 
tile_state InitializeStatus (num_segment_fixup_tiles)
 
template<typename AgentSegmentFixupPolicyT , typename PairsInputIteratorT , typename AggregatesOutputIteratorT , typename OffsetT , typename ScanTileStateT >
 __launch_bounds__ (int(AgentSegmentFixupPolicyT::BLOCK_THREADS)) __global__ void DeviceSegmentFixupKernel(PairsInputIteratorT d_pairs_in
 < Tile status interface type More...
 
 AgentSegmentFixupT (temp_storage, d_pairs_in, d_aggregates_out, cub::Equality(), cub::Sum()).ConsumeRange(num_items
 
template<typename OffsetT >
__global__ void FillAndResetDrainKernel (GridQueue< OffsetT > grid_queue, OffsetT num_items)
 
template<typename InputIteratorT >
__device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type ThreadLoad (InputIteratorT itr, Int2Type< LOAD_DEFAULT >, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoad (T *ptr, Int2Type< LOAD_DEFAULT >, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoadVolatilePointer (T *ptr, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoadVolatilePointer (T *ptr, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoad (T *ptr, Int2Type< LOAD_VOLATILE >, Int2Type< true >)
 
template<typename T , int MODIFIER>
__device__ __forceinline__ T ThreadLoad (T const *ptr, Int2Type< MODIFIER >, Int2Type< true >)
 
template<typename AIteratorT , typename BIteratorT , typename OffsetT , typename CoordinateT >
__host__ __device__ __forceinline__ void MergePathSearch (OffsetT diagonal, AIteratorT a, BIteratorT b, OffsetT a_len, OffsetT b_len, CoordinateT &path_coordinate)
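MergePathSearch locates, for a given cross-diagonal of the implicit merge grid of two sorted sequences, the coordinate at which the merge path crosses that diagonal. A minimal host-side sketch of the same binary-search idea (the `Coord` struct and function name are illustrative stand-ins, not CUB's types):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative stand-in for CUB's CoordinateT (x indexes into a, y into b).
struct Coord { int x; int y; };

// Host sketch of the merge-path search: find the split point (x, y) with
// x + y == diagonal such that a[0..x) and b[0..y) together form the first
// `diagonal` elements of the merged sequence.
Coord merge_path_search(int diagonal, const std::vector<int>& a, const std::vector<int>& b)
{
    int x_min = std::max(diagonal - (int)b.size(), 0);
    int x_max = std::min(diagonal, (int)a.size());

    // Binary search along the diagonal for the crossing point.
    while (x_min < x_max) {
        int pivot = (x_min + x_max) / 2;
        if (a[pivot] <= b[diagonal - pivot - 1])
            x_min = pivot + 1;   // path lies further along a
        else
            x_max = pivot;       // path lies further along b
    }
    return Coord{x_min, diagonal - x_min};
}
```

In CUB this search runs per thread block (e.g., in DeviceSpmvSearchKernel) so that each block consumes an equal-sized slice of the merged work.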
 
template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT LowerBound (InputIteratorT input, OffsetT num_items, T val)
 Returns the offset of the first value within input which does not compare less than val. More...
 
template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT UpperBound (InputIteratorT input, OffsetT num_items, T val)
 Returns the offset of the first value within input which compares greater than val. More...
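These mirror the contracts of std::lower_bound and std::upper_bound, returning an offset rather than an iterator. A sequential host-side sketch of the semantics (not the device implementation):

```cpp
#include <cassert>

// Offset of the first element that does NOT compare less than val
// (leftmost insertion point keeping the sequence sorted).
template <typename It, typename Offset, typename T>
Offset lower_bound(It input, Offset num_items, T val)
{
    Offset retval = 0;
    while (num_items > 0) {
        Offset half = num_items / 2;
        if (input[retval + half] < val) {
            retval = retval + half + 1;   // discard the lower half
            num_items = num_items - half - 1;
        } else {
            num_items = half;             // keep searching the lower half
        }
    }
    return retval;
}

// Offset of the first element that compares GREATER than val
// (rightmost insertion point).
template <typename It, typename Offset, typename T>
Offset upper_bound(It input, Offset num_items, T val)
{
    Offset retval = 0;
    while (num_items > 0) {
        Offset half = num_items / 2;
        if (val < input[retval + half]) {
            num_items = half;
        } else {
            retval = retval + half + 1;
            num_items = num_items - half - 1;
        }
    }
    return retval;
}
```

If val is absent, both return the same offset; if every element compares less than val, both return num_items.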
 
template<typename OutputIteratorT , typename T >
__device__ __forceinline__ void ThreadStore (OutputIteratorT itr, T val, Int2Type< STORE_DEFAULT >, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ void ThreadStore (T *ptr, T val, Int2Type< STORE_DEFAULT >, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ void ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ void ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ void ThreadStore (T *ptr, T val, Int2Type< STORE_VOLATILE >, Int2Type< true >)
 
template<typename T , int MODIFIER>
__device__ __forceinline__ void ThreadStore (T *ptr, T val, Int2Type< MODIFIER >, Int2Type< true >)
 
__host__ __device__ __forceinline__ cudaError_t Debug (cudaError_t error, const char *filename, int line)
 CUB error reporting macro (prints error messages to stderr) More...
 
template<int ALLOCATIONS>
__host__ __device__ __forceinline__ cudaError_t AliasTemporaries (void *d_temp_storage, size_t &temp_storage_bytes, void *(&allocations)[ALLOCATIONS], size_t(&allocation_sizes)[ALLOCATIONS])
 
template<typename T >
__global__ void EmptyKernel (void)
 
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t PtxVersion (int &ptx_version)
 Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10) More...
 
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t SmVersion (int &sm_version, int device_ordinal)
 Retrieves the SM version (major * 100 + minor * 10)
 
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t SyncStream (cudaStream_t stream)
 
template<typename KernelPtr >
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t MaxSmOccupancy (int &max_sm_occupancy, KernelPtr kernel_ptr, int block_threads, int dynamic_smem_bytes=0)
 Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block. More...
 
__device__ __forceinline__ unsigned int SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend)
 Shift-right then add. Returns (x >> shift) + addend.
 
__device__ __forceinline__ unsigned int SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend)
 Shift-left then add. Returns (x << shift) + addend.
 
template<typename UnsignedBits , int BYTE_LEN>
__device__ __forceinline__ unsigned int BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< BYTE_LEN >)
 
template<typename UnsignedBits >
__device__ __forceinline__ unsigned int BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< 8 >)
 
template<typename UnsignedBits >
__device__ __forceinline__ unsigned int BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits)
 Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type.
 
__device__ __forceinline__ void BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
 Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start.
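On supporting hardware BFE and BFI map to the PTX bfe/bfi instructions; their portable meaning can be sketched with shifts and masks (host-side sketch, assuming num_bits < 32):

```cpp
#include <cassert>

// Bitfield extract: num_bits of source starting at bit-offset bit_start.
unsigned int bfe(unsigned int source, unsigned int bit_start, unsigned int num_bits)
{
    unsigned int mask = (1u << num_bits) - 1;   // assumes num_bits < 32
    return (source >> bit_start) & mask;
}

// Bitfield insert: the num_bits least-significant bits of y into x at bit_start.
unsigned int bfi(unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
{
    unsigned int mask = ((1u << num_bits) - 1) << bit_start;
    return (x & ~mask) | ((y << bit_start) & mask);
}
```

BFE of this form is the workhorse of the radix sort kernels above, which extract the current digit from each key with `bfe(key_bits, current_bit, num_bits)`.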
 
__device__ __forceinline__ unsigned int IADD3 (unsigned int x, unsigned int y, unsigned int z)
 Three-operand add. Returns x + y + z.
 
__device__ __forceinline__ int PRMT (unsigned int a, unsigned int b, unsigned int index)
 Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later. More...
 
__device__ __forceinline__ void BAR (int count)
 
__device__ __forceinline__ int CTA_SYNC_AND (int p)
 
__device__ __forceinline__ void WARP_SYNC (unsigned int member_mask)
 
__device__ __forceinline__ int WARP_ANY (int predicate, unsigned int member_mask)
 
__device__ __forceinline__ int WARP_ALL (int predicate, unsigned int member_mask)
 
__device__ __forceinline__ int WARP_BALLOT (int predicate, unsigned int member_mask)
 
__device__ __forceinline__ unsigned int SHFL_UP_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask)
 
__device__ __forceinline__ unsigned int SHFL_DOWN_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask)
 
__device__ __forceinline__ unsigned int SHFL_IDX_SYNC (unsigned int word, int src_lane, int flags, unsigned int member_mask)
 
__device__ __forceinline__ float FMUL_RZ (float a, float b)
 
__device__ __forceinline__ float FFMA_RZ (float a, float b, float c)
 
__device__ __forceinline__ void ThreadExit ()
 Terminates the calling thread.
 
__device__ __forceinline__ void ThreadTrap ()
 Abort execution and generate an interrupt to the host CPU.
 
__device__ __forceinline__ int RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z)
 Returns the row-major linear thread identifier for a multidimensional thread block.
 
__device__ __forceinline__ unsigned int LaneId ()
 Returns the warp lane ID of the calling thread.
 
__device__ __forceinline__ unsigned int WarpId ()
 Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.
 
__device__ __forceinline__ unsigned int LaneMaskLt ()
 Returns the warp lane mask of all lanes less than the calling thread.
 
__device__ __forceinline__ unsigned int LaneMaskLe ()
 Returns the warp lane mask of all lanes less than or equal to the calling thread.
 
__device__ __forceinline__ unsigned int LaneMaskGt ()
 Returns the warp lane mask of all lanes greater than the calling thread.
 
__device__ __forceinline__ unsigned int LaneMaskGe ()
 Returns the warp lane mask of all lanes greater than or equal to the calling thread.
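These correspond to the %lanemask_lt / %lanemask_le / %lanemask_gt / %lanemask_ge special registers; given a lane id, each mask can be expressed directly (host sketch, 32-thread warps assumed):

```cpp
#include <cassert>

// Masks of warp lanes relative to lane `lane` (0..31), as produced by
// LaneMaskLt/LaneMaskLe/LaneMaskGt/LaneMaskGe for a 32-thread warp.
unsigned int lane_mask_lt(unsigned int lane) { return (1u << lane) - 1; }
unsigned int lane_mask_le(unsigned int lane) { return (2u << lane) - 1; }  // avoids 1 << 32 at lane 31
unsigned int lane_mask_gt(unsigned int lane) { return ~lane_mask_le(lane); }
unsigned int lane_mask_ge(unsigned int lane) { return ~lane_mask_lt(lane); }
```

A common use is ranking within a warp ballot: `__popc(ballot & lane_mask_lt(lane))` counts how many lower-numbered lanes voted true.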
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask)
 Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane i-src_offset. For thread lanes i < src_offset, the thread's own input is returned to the thread. More...
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask)
 Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane i+src_offset. For thread lanes i >= WARP_THREADS, the thread's own input is returned to the thread. More...
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleIndex (T input, int src_lane, unsigned int member_mask)
 Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. More...
 
template<int LABEL_BITS>
__device__ unsigned int MatchAny (unsigned int label)
 
Blocked arrangement I/O (direct)
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a blocked arrangement across the thread block. More...
 
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a blocked arrangement across the thread block, guarded by range. More...
 
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
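In a blocked arrangement, thread t owns the contiguous run of ITEMS_PER_THREAD items starting at offset t * ITEMS_PER_THREAD. A host-side sketch of the guarded variant's addressing (one simulated thread at a time; the real function executes per thread on the device):

```cpp
#include <cassert>

// Host emulation of guarded LoadDirectBlocked with an out-of-bounds default:
// thread `linear_tid` loads items [tid*IPT, tid*IPT + IPT) from block_itr,
// substituting oob_default for offsets past valid_items.
template <int ITEMS_PER_THREAD, typename It, typename T>
void load_direct_blocked(int linear_tid, It block_itr, T (&items)[ITEMS_PER_THREAD],
                         int valid_items, T oob_default)
{
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int offset = linear_tid * ITEMS_PER_THREAD + i;  // blocked addressing
        items[i] = (offset < valid_items) ? block_itr[offset] : oob_default;
    }
}
```

With this layout each thread reads a contiguous span, which is what makes the vectorized variant below possible when the span is suitably aligned.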
 
template<CacheLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void InternalLoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 
template<typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a blocked arrangement across the thread block. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a blocked arrangement of items across a thread block into a linear segment of items. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range. More...
 
template<typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 Store a blocked arrangement of items across a thread block into a linear segment of items. More...
 
Striped arrangement I/O (direct)
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a striped arrangement across the thread block. More...
 
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a striped arrangement across the thread block, guarded by range. More...
 
template<int BLOCK_THREADS, typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
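In a striped arrangement, thread t's i-th item lives at offset i * BLOCK_THREADS + t, so on each pass consecutive threads touch consecutive addresses; this is the coalescing-friendly layout. A host-side sketch of the guarded addressing (simulating one thread; names are illustrative):

```cpp
#include <cassert>

// Host emulation of guarded LoadDirectStriped: thread `linear_tid` loads its
// i-th item from block_itr[i * BLOCK_THREADS + linear_tid], substituting
// oob_default for offsets past valid_items.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename It, typename T>
void load_direct_striped(int linear_tid, It block_itr, T (&items)[ITEMS_PER_THREAD],
                         int valid_items, T oob_default)
{
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int offset = i * BLOCK_THREADS + linear_tid;  // striped addressing
        items[i] = (offset < valid_items) ? block_itr[offset] : oob_default;
    }
}
```

Contrast with the blocked variant above: striped loads coalesce regardless of ITEMS_PER_THREAD, which is why BLOCK_LOAD_TRANSPOSE reads striped and then transposes to blocked in shared memory.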
 
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a striped arrangement of data across the thread block into a linear segment of items. More...
 
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range. More...
 
Warp-striped arrangement I/O (direct)
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a warp-striped arrangement across the thread block. More...
 
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range. More...
 
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a warp-striped arrangement of data across the thread block into a linear segment of items. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range. More...
 
Thread I/O (cache modified)
template<CacheLoadModifier MODIFIER, typename InputIteratorT >
__device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type ThreadLoad (InputIteratorT itr)
 Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type. More...
 
template<CacheStoreModifier MODIFIER, typename OutputIteratorT , typename T >
__device__ __forceinline__ void ThreadStore (OutputIteratorT itr, T val)
 Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. More...
 

Variables

ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_output_bins_wrapper
 < Input data to reduce More...
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_privatized_bins_wrapper
 The number of bins per privatized histogram.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_output_histograms_wrapper
 Reference to final output histograms.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_privatized_histograms_wrapper
 Reference to privatized histograms.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > output_decode_op_wrapper
 The transform operator for determining output bin-ids from privatized counter indices, one for each channel.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > privatized_decode_op_wrapper
 The transform operator for determining privatized counter indices from samples, one for each channel.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT num_row_pixels
 The number of multi-channel pixels per row in the region of interest.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT num_rows
 The number of rows in the region of interest.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT row_stride_samples
 The number of samples between starts of consecutive rows in the region of interest.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT int tiles_per_row
 Number of image tiles per row.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT int GridQueue< int > tile_queue
 < Drain queue descriptor for dynamically mapping tile data onto thread blocks More...
 
__shared__ AgentHistogramT::TempStorage temp_storage
 
OffsetT * d_spine
 < [in] Input keys buffer More...
 
OffsetT OffsetT
 [in] Total number of input data items
 
OffsetT int current_bit
 [in] Bit position of current radix digit
 
OffsetT int int num_bits
 [in] Number of bits of current radix digit
 
OffsetT int int GridEvenShare< OffsetT > even_share
 < [in] Even-share descriptor for mapping an equal number of tiles onto each thread block More...
 
AgentRadixSortUpsweepT upsweep (temp_storage, d_keys, current_bit, num_bits)
 
int num_counts
 < [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.) More...
 
int block_offset = 0
 
BlockScanRunningPrefixOp< OffsetT, Sum > prefix_op (0, Sum())
 
KeyT * d_keys_out
 < [in] Input keys buffer More...
 
KeyT const ValueT * d_values_in
 [in] Input values buffer More...
 
KeyT const ValueT ValueT * d_values_out
 [out] Output values buffer
 
KeyT const ValueT ValueT OffsetT OffsetT num_items = segment_end - segment_begin
 [in] Total number of input data items More...
 
even_share block_end
 
KeyT const ValueT ValueT OffsetT int int end_bit
 < [in] The past-the-end (most-significant) bit index needed for key comparison More...
 
KeyT keys [ITEMS_PER_THREAD]
 
ValueT values [ITEMS_PER_THREAD]
 
UnsignedBitsT default_key_bits = (IS_DESCENDING) ? Traits<KeyT>::LOWEST_KEY : Traits<KeyT>::MAX_KEY
 
KeyT default_key = reinterpret_cast<KeyT&>(default_key_bits)
 
KeyT const ValueT ValueT OffsetIteratorT d_begin_offsets
 [in] Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
 
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT d_end_offsets
 [in] Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith segment is considered empty.
 
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int num_segments
 [in] The number of segments that comprise the sorting data
 
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int int pass_bits
 < [in] Number of bits of current radix digit More...
 
OffsetT segment_begin = d_begin_offsets[blockIdx.x]
 
OffsetT segment_end = d_end_offsets[blockIdx.x]
 
OffsetT bin_count [BINS_TRACKED_PER_THREAD]
 
OffsetT bin_offset [BINS_TRACKED_PER_THREAD]
 
OutputIteratorT d_out
 < [in] Pointer to the input sequence of data items More...
 
OutputIteratorT OffsetT GridEvenShare< OffsetT > ReductionOpT reduction_op
 < [in] Binary reduction functor More...
 
OutputT block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share)
 
OutputIteratorT OffsetT ReductionOpT OutputT init
 < [in] The initial value of the reduction More...
 
UniqueOutputIteratorT d_unique_out
 < Pointer to the input sequence of keys More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT d_aggregates_out
 Pointer to the output sequence of value aggregates (one aggregate per run) More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT d_num_runs_out
 Pointer to total number of runs encountered (i.e., the length of d_unique_out) More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT tile_state
 Tile status interface. More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int start_tile
 The starting tile for the current grid.
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int EqualityOpT equality_op
 KeyT equality operator. More...
 
OffsetsOutputIteratorT d_offsets_out
 < [in] Pointer to input sequence of data items More...
 
OffsetsOutputIteratorT LengthsOutputIteratorT d_lengths_out
 [out] Pointer to output sequence of run-lengths
 
OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT ScanTileStateT tile_status
 [in] Tile status interface
 
OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT ScanTileStateT EqualityOpT OffsetT int num_tiles
 < [in] Total number of tiles for the entire problem More...
 
OutputIteratorT ScanTileStateT int ScanOpT scan_op
 Binary scan functor.
 
OutputIteratorT ScanTileStateT int ScanOpT InitValueT init_value
 Initial value to seed the exclusive scan.
 
FlagsInputIteratorT d_flags
 < [in] Pointer to the input sequence of data items More...
 
FlagsInputIteratorT SelectedOutputIteratorT d_selected_out
 [out] Pointer to the output sequence of selected data items
 
FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT d_num_selected_out
 [out] Pointer to the total number of items selected (i.e., length of d_selected_out)
 
FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT select_op
 [in] Selection operator
 
OffsetT spmv_params
 [in] SpMV input parameter bundle
 
OffsetT CoordinateT * d_tile_coordinates
 [in] Pointer to the temporary array of tile starting coordinates
 
OffsetT CoordinateT KeyValuePair< OffsetT, ValueT > * d_tile_carry_pairs
 [out] Pointer to the temporary array carry-out dot product row-ids, one per block
 
OffsetT CoordinateT KeyValuePair< OffsetT, ValueT > int ScanTileStateT int num_segment_fixup_tiles
 < [in] Number of reduce-by-key tiles (fixup grid size) More...
 

Detailed Description

Optional outer namespace(s)

CUB namespace

Enumeration Type Documentation

◆ anonymous enum

anonymous enum
Enumerator
BINS_TRACKED_PER_THREAD 

Number of bin-starting offsets tracked per thread.

Definition at line 371 of file dispatch_radix_sort.cuh.

◆ BlockHistogramAlgorithm

BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.

Enumerator
BLOCK_HISTO_SORT 
Overview
Sorting followed by differentiation. Execution is comprised of two phases:
  1. Sort the data using efficient radix sort
  2. Look for "runs" of same-valued keys by detecting discontinuities; the run-lengths are histogram bin counts.
Performance Considerations
Delivers consistent throughput regardless of sample bin distribution.
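The sort-then-differentiate strategy has a simple sequential analogue: sort the samples, then every run of equal bin ids contributes its length as that bin's count. A host-side sketch (a sequential stand-in for the block-parallel radix sort plus discontinuity detection; samples are assumed to already be valid bin ids):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host sketch of BLOCK_HISTO_SORT: histogram by sorting the samples and
// measuring run lengths between discontinuities.
std::vector<int> histo_by_sort(std::vector<int> samples, int num_bins)
{
    std::sort(samples.begin(), samples.end());          // phase 1: sort
    std::vector<int> counts(num_bins, 0);
    for (size_t i = 0; i < samples.size(); ) {          // phase 2: find runs
        size_t run_end = i;
        while (run_end < samples.size() && samples[run_end] == samples[i])
            ++run_end;
        counts[samples[i]] = (int)(run_end - i);        // run length = bin count
        i = run_end;
    }
    return counts;
}
```

Because the cost is dominated by the sort, throughput is independent of how the samples are distributed across bins, unlike the atomic variant below.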
BLOCK_HISTO_ATOMIC 
Overview
Use atomic addition to update byte counts directly
Performance Considerations
Performance is strongly tied to the hardware implementation of atomic addition, and may be significantly degraded for non-uniformly-random input distributions where many concurrent updates are likely to be made to the same bin counter.

Definition at line 56 of file block_histogram.cuh.

◆ BlockHistogramMemoryPreference

◆ BlockLoadAlgorithm

cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.


Enumerator
BLOCK_LOAD_DIRECT 
Overview

A blocked arrangement of data is read directly from memory.

Performance Considerations
  • The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
BLOCK_LOAD_VECTORIZE 
Overview

A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).
  • The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
    • ITEMS_PER_THREAD is odd
    • The InputIteratorT is not a simple pointer type
    • The block input offset is not quadword-aligned
    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
BLOCK_LOAD_TRANSPOSE 
Overview

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • The local reordering incurs slightly longer latencies and throughput than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
BLOCK_LOAD_WARP_TRANSPOSE 
Overview

A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • The local reordering incurs slightly larger latencies than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
  • Provisions more shared storage, but incurs smaller latencies than the BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.
BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED 
Overview

Like BLOCK_LOAD_WARP_TRANSPOSE, a warp-striped arrangement of data is read directly from memory and then locally transposed into a blocked arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.

Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.

Definition at line 473 of file block_load.cuh.

◆ BlockReduceAlgorithm

BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.

Enumerator
BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY 
Overview
An efficient "raking" reduction algorithm that only supports commutative reduction operators (true for most operations, e.g., addition).
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Threads in warps other than the first warp place their partial reductions into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within the first warp continue to accumulate by raking across segments of shared partial reductions
  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.
block_reduce.png
BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant performs less communication than BLOCK_REDUCE_RAKING_NON_COMMUTATIVE and is preferable when the reduction operator is commutative. This variant applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making it less desirable when the GPU is under-occupied.
BLOCK_REDUCE_RAKING 
Overview
An efficient "raking" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.
block_reduce.png
BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant performs more communication than BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY and is only preferable when the reduction operator is non-commutative. This variant applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making it less desirable when the GPU is under-occupied.
BLOCK_REDUCE_WARP_REDUCTIONS 
Overview
A quick "tiled warp-reductions" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Compute a shallow, but inefficient warp-synchronous Kogge-Stone style reduction within each warp.
  3. A propagation phase where the warp reduction outputs in each warp are updated with the aggregate from each preceding warp.
block_scan_warpscans.png
BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant applies more reduction operators than BLOCK_REDUCE_RAKING or BLOCK_REDUCE_RAKING_NON_COMMUTATIVE, which may result in lower overall throughput across the GPU. However, turn-around latency may be lower, and thus this variant is useful when the GPU is under-occupied.

Definition at line 60 of file block_reduce.cuh.

◆ BlockScanAlgorithm

BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.

Enumerator
BLOCK_SCAN_RAKING 
Overview
An efficient "raking reduce-then-scan" prefix scan algorithm. Execution is comprised of five phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
  3. A warp-synchronous Kogge-Stone style exclusive scan within the raking warp.
  4. Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output.
  5. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
block_scan_raking.png
BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • Although this variant may suffer longer turnaround latencies when the GPU is under-occupied, it can often provide higher overall throughput across the GPU when suitably occupied.
BLOCK_SCAN_RAKING_MEMOIZE 
Overview
Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at the expense of higher register pressure. Raking threads preserve their "upsweep" segment of values in registers while performing warp-synchronous scan, allowing the "downsweep" not to re-read them from shared memory.
BLOCK_SCAN_WARP_SCANS 
Overview
A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. A shallow but inefficient warp-synchronous Kogge-Stone style scan within each warp.
  3. A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.
  4. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the warp-scan output.
block_scan_warpscans.png
BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • Although this variant may suffer lower overall throughput across the GPU due to its heavy reliance on inefficient warpscans, it can often provide lower turnaround latencies when the GPU is under-occupied.

Definition at line 57 of file block_scan.cuh.

◆ BlockStoreAlgorithm

cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.

Enumerator
BLOCK_STORE_DIRECT 
Overview

A blocked arrangement of data is written directly to memory.

Performance Considerations
  • The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
BLOCK_STORE_VECTORIZE 
Overview

A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. For example, st.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).
  • The following conditions will prevent vectorization and writing will fall back to cub::BLOCK_STORE_DIRECT:
    • ITEMS_PER_THREAD is odd
    • The OutputIteratorT is not a simple pointer type
    • The block output offset is not quadword-aligned
    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
BLOCK_STORE_TRANSPOSE 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • The local reordering incurs slightly longer latencies and lower throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
BLOCK_STORE_WARP_TRANSPOSE 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement.
Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • The local reordering incurs slightly longer latencies and lower throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.
Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.

Definition at line 354 of file block_store.cuh.

◆ RadixRankAlgorithm

Radix ranking algorithm

Definition at line 62 of file agent_radix_sort_downsweep.cuh.

◆ ScanTileStatus

Enumerations of tile status

Definition at line 105 of file single_pass_scan_operators.cuh.

Function Documentation

◆ __launch_bounds__() [1/15]

template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)  )

< Binary reduction functor type having member T operator()(const T &a, const T &b)

Reduce region kernel entry point (multi-block). Computes privatized reductions, one per thread block.

◆ __launch_bounds__() [2/15]

template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename OffsetT >
cub::__launch_bounds__ ( int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS)  ) const

< Signed integer type for global offsets

Upsweep digit-counting kernel entry point (multi-block). Computes privatized digit histograms, one per block.

◆ __launch_bounds__() [3/15]

template<typename AgentReduceByKeyPolicyT , typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename ReductionOpT , typename OffsetT >
cub::__launch_bounds__ ( int(AgentReduceByKeyPolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Multi-block reduce-by-key sweep kernel entry point

◆ __launch_bounds__() [4/15]

template<typename AgentRlePolicyT , typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename OffsetT >
cub::__launch_bounds__ ( int(AgentRlePolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Select kernel entry point (multi-block)

Performs functor-based selection if the SelectOp functor type != NullType. Otherwise, performs flag-based selection if FlagIterator's value type != NullType. Otherwise, performs discontinuity selection (keep unique).

◆ __launch_bounds__() [5/15]

template<typename AgentSelectIfPolicyT , typename InputIteratorT , typename FlagsInputIteratorT , typename SelectedOutputIteratorT , typename NumSelectedIteratorT , typename ScanTileStateT , typename SelectOpT , typename EqualityOpT , typename OffsetT , bool KEEP_REJECTS>
cub::__launch_bounds__ ( int(AgentSelectIfPolicyT::BLOCK_THREADS)  )

< Whether or not we push rejected items to the back of the output

Select kernel entry point (multi-block)

Performs functor-based selection if the SelectOpT functor type != NullType. Otherwise, performs flag-based selection if FlagsInputIterator's value type != NullType. Otherwise, performs discontinuity selection (keep unique).

◆ __launch_bounds__() [6/15]

template<typename AgentHistogramPolicyT , int PRIVATIZED_SMEM_BINS, int NUM_CHANNELS, int NUM_ACTIVE_CHANNELS, typename SampleIteratorT , typename CounterT , typename PrivatizedDecodeOpT , typename OutputDecodeOpT , typename OffsetT >
cub::__launch_bounds__ ( int(AgentHistogramPolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Histogram privatized sweep kernel entry point (multi-block). Computes privatized histograms, one per thread block.

◆ __launch_bounds__() [7/15]

template<typename ScanPolicyT , typename InputIteratorT , typename OutputIteratorT , typename ScanTileStateT , typename ScanOpT , typename InitValueT , typename OffsetT >
cub::__launch_bounds__ ( int(ScanPolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Scan kernel entry point (multi-block)

◆ __launch_bounds__() [8/15]

template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS)  ,
 
)

< Data element type that is convertible to the value type of OutputIteratorT

Reduce a single tile kernel entry point (single-block). Can be used to aggregate privatized thread block reductions from a previous multi-block reduction pass.

◆ __launch_bounds__() [9/15]

template<typename ChainedPolicyT , typename OffsetT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS)  ,
 
)

< Signed integer type for global offsets

Spine scan kernel entry point (single-block). Computes an exclusive prefix sum over the privatized digit histograms

◆ __launch_bounds__() [10/15]

template<typename SpmvPolicyT , typename ScanTileStateT , typename ValueT , typename OffsetT , typename CoordinateT , bool HAS_ALPHA, bool HAS_BETA>
cub::__launch_bounds__ ( int(SpmvPolicyT::BLOCK_THREADS)  )

< Whether the input parameter Beta is 0

Spmv agent entry point

◆ __launch_bounds__() [11/15]

template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
cub::__launch_bounds__ ( int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS)  ) const

< Signed integer type for global offsets

Downsweep pass kernel entry point (multi-block). Scatters keys (and values) into corresponding bins for the current digit place.

◆ __launch_bounds__() [12/15]

template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)  )

< Data element type that is convertible to the value type of OutputIteratorT

Segmented reduction (one block per segment)

◆ __launch_bounds__() [13/15]

template<typename AgentSegmentFixupPolicyT , typename PairsInputIteratorT , typename AggregatesOutputIteratorT , typename OffsetT , typename ScanTileStateT >
cub::__launch_bounds__ ( int(AgentSegmentFixupPolicyT::BLOCK_THREADS)  )

< Tile status interface type

Multi-block reduce-by-key sweep kernel entry point

◆ __launch_bounds__() [14/15]

template<typename ChainedPolicyT , bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS)  ,
 
) const

< Signed integer type for global offsets

Single pass kernel entry point (single-block). Fully sorts a tile of input.

◆ __launch_bounds__() [15/15]

template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetIteratorT , typename OffsetT >
cub::__launch_bounds__ ( int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS)  ) const

< Signed integer type for global offsets

Segmented radix sorting pass (one block per segment)

◆ DeviceCompactInitKernel()

template<typename ScanTileStateT , typename NumSelectedIteratorT >
__global__ void cub::DeviceCompactInitKernel ( ScanTileStateT  tile_state,
int  num_tiles,
NumSelectedIteratorT  d_num_selected_out 
)

< Output iterator type for recording the number of items selected

Initialization kernel for tile status (multi-block)

Parameters
[in]tile_stateTile status interface
[in]num_tilesNumber of tiles
[out]d_num_selected_outPointer to the total number of items selected (i.e., length of d_selected_out)

Definition at line 78 of file dispatch_scan.cuh.

◆ DeviceHistogramInitKernel()

template<int NUM_ACTIVE_CHANNELS, typename CounterT , typename OffsetT >
__global__ void cub::DeviceHistogramInitKernel ( ArrayWrapper< int, NUM_ACTIVE_CHANNELS >  num_output_bins_wrapper,
ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS >  d_output_histograms_wrapper,
GridQueue< int >  tile_queue 
)

< Signed integer type for global offsets

Histogram initialization kernel entry point

Parameters
num_output_bins_wrapperNumber of output histogram bins per channel
d_output_histograms_wrapperHistogram counter data having logical dimensions CounterT[NUM_ACTIVE_CHANNELS][num_bins.array[CHANNEL]]
tile_queueDrain queue descriptor for dynamically mapping tile data onto thread blocks

Definition at line 67 of file dispatch_histogram.cuh.

◆ DeviceScanInitKernel()

template<typename ScanTileStateT >
__global__ void cub::DeviceScanInitKernel ( ScanTileStateT  tile_state,
int  num_tiles 
)

< Tile status interface type

Initialization kernel for tile status (multi-block)

Parameters
[in]tile_stateTile status interface
[in]num_tilesNumber of tiles

Definition at line 64 of file dispatch_scan.cuh.

◆ DeviceSpmv1ColKernel()

template<typename AgentSpmvPolicyT , typename ValueT , typename OffsetT >
__global__ void cub::DeviceSpmv1ColKernel ( SpmvParams< ValueT, OffsetT spmv_params)

< Signed integer type for sequence offsets

SpMV kernel entry point specialized for the single-column case.

Parameters
[in]spmv_paramsSpMV input parameter bundle

Definition at line 68 of file dispatch_spmv_orig.cuh.

◆ DeviceSpmvSearchKernel()

template<typename SpmvPolicyT , typename OffsetT , typename CoordinateT , typename SpmvParamsT >
__global__ void cub::DeviceSpmvSearchKernel ( int  num_merge_tiles,
CoordinateT *  d_tile_coordinates,
SpmvParamsT  spmv_params 
)

< SpmvParams type

Spmv search kernel. Identifies merge path starting coordinates for each tile.


Parameters
[in]num_merge_tilesNumber of SpMV merge tiles (spmv grid size)
[out]d_tile_coordinatesPointer to the temporary array of tile starting coordinates
[in]spmv_paramsSpMV input parameter bundle

Definition at line 104 of file dispatch_spmv_orig.cuh.

◆ LowerBound()

template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT cub::LowerBound ( InputIteratorT  input,
OffsetT  num_items,
T  val 
)

Returns the offset of the first value within input which does not compare less than val.

Parameters
[in]inputInput sequence
[in]num_itemsInput sequence length
[in]valSearch key

Definition at line 95 of file thread_search.cuh.

◆ MatchAny()

template<int LABEL_BITS>
__device__ unsigned int cub::MatchAny ( unsigned int  label)
inline

Compute a 32-bit mask of threads whose label agrees with the calling thread's label in the least-significant LABEL_BITS bits.

Definition at line 703 of file util_ptx.cuh.

◆ MergePathSearch()

template<typename AIteratorT , typename BIteratorT , typename OffsetT , typename CoordinateT >
__host__ __device__ __forceinline__ void cub::MergePathSearch ( OffsetT  diagonal,
AIteratorT  a,
BIteratorT  b,
OffsetT  a_len,
OffsetT  b_len,
CoordinateT &  path_coordinate 
)

Computes the begin offsets into A and B for the specific diagonal


Definition at line 53 of file thread_search.cuh.

◆ UpperBound()

template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT cub::UpperBound ( InputIteratorT  input,
OffsetT  num_items,
T  val 
)

Returns the offset of the first value within input which compares greater than val.

Parameters
[in]inputInput sequence
[in]num_itemsInput sequence length
[in]valSearch key

Definition at line 126 of file thread_search.cuh.

Variable Documentation

◆ d_aggregates_out

AggregatesOutputIteratorT cub::d_aggregates_out

Pointer to the output sequence of value aggregates (one aggregate per run)

< [in] Pointer to the array of carry-out dot-product row-ids, one per SpMV block

[in,out] Output value aggregates

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ d_flags

FlagsInputIteratorT cub::d_flags

< [in] Pointer to the input sequence of data items

[in] Pointer to the input sequence of selection flags (if applicable)

Definition at line 78 of file dispatch_select_if.cuh.

◆ d_keys_out

KeyT * cub::d_keys_out

< [in] Input keys buffer

[in] Output keys buffer

Definition at line 164 of file dispatch_radix_sort.cuh.

◆ d_num_runs_out

OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT cub::d_num_runs_out

Pointer to total number of runs encountered (i.e., the length of d_unique_out)

[out] Pointer to total number of runs (i.e., length of d_offsets_out)

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ d_offsets_out

OffsetsOutputIteratorT cub::d_offsets_out

< [in] Pointer to input sequence of data items

[out] Pointer to output sequence of run-offsets

Definition at line 77 of file dispatch_rle.cuh.

◆ d_out

OutputIteratorT cub::d_out

< [in] Pointer to the input sequence of data items

< Input data

[out] Pointer to the output aggregate

Output data

Definition at line 71 of file dispatch_reduce.cuh.

◆ d_spine

KeyT const ValueT ValueT OffsetT * cub::d_spine

< [in] Input keys buffer

[in] Scan of privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)

[out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)

Definition at line 74 of file dispatch_radix_sort.cuh.

◆ d_unique_out

UniqueOutputIteratorT cub::d_unique_out

< Pointer to the input sequence of keys

Pointer to the output sequence of unique keys (one key per run)

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ d_values_in

UniqueOutputIteratorT ValuesInputIteratorT cub::d_values_in

[in] Input values buffer

Pointer to the input sequence of corresponding values.

Definition at line 164 of file dispatch_radix_sort.cuh.

◆ end_bit

cub::end_bit

< [in] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 220 of file dispatch_radix_sort.cuh.

◆ equality_op

FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT EqualityOpT cub::equality_op

KeyT equality operator.

[in] Equality operator

[in] Equality operator for input items

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ even_share

OutputIteratorT OffsetT GridEvenShare< OffsetT > cub::even_share

< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block

[in] Even-share descriptor for mapping an equal number of tiles onto each thread block

Definition at line 79 of file dispatch_radix_sort.cuh.

◆ init

OutputIteratorT OffsetIteratorT OffsetIteratorT ReductionOpT OutputT cub::init

< [in] The initial value of the reduction

Definition at line 119 of file dispatch_reduce.cuh.

◆ num_counts

int cub::num_counts

< [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)

< [in] Total number of bin-counts

Definition at line 121 of file dispatch_radix_sort.cuh.

◆ num_items

AggregatesOutputIteratorT OffsetT cub::num_items = segment_end - segment_begin

[in] Total number of input data items

[in] Total number of items to select from

< Total number of scan items for the entire problem

[in] Total number of input items (i.e., length of d_in)

< Total number of items to select from

Definition at line 164 of file dispatch_radix_sort.cuh.

◆ num_output_bins_wrapper

ArrayWrapper<int, NUM_ACTIVE_CHANNELS> cub::num_output_bins_wrapper

< Input data to reduce

The number of bins per final output histogram

Definition at line 102 of file dispatch_histogram.cuh.

◆ num_segment_fixup_tiles

OffsetT CoordinateT KeyValuePair<OffsetT,ValueT> int ScanTileStateT int cub::num_segment_fixup_tiles

< [in] Number of reduce-by-key tiles (fixup grid size)

Definition at line 165 of file dispatch_spmv_orig.cuh.

◆ num_tiles

AggregatesOutputIteratorT OffsetT int cub::num_tiles

< [in] Total number of tiles for the entire problem

[in] Total number of tiles for the entire problem

[in] Number of merge tiles

Definition at line 84 of file dispatch_rle.cuh.

◆ pass_bits

KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int int cub::pass_bits

< [in] Number of bits of current radix digit

Definition at line 339 of file dispatch_radix_sort.cuh.

◆ reduction_op

UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int EqualityOpT ReductionOpT cub::reduction_op

< [in] Binary reduction functor

ValueT reduction operator.

[in] Binary reduction functor

Definition at line 75 of file dispatch_reduce.cuh.

◆ tile_queue

ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT int GridQueue<int> cub::tile_queue

< Drain queue descriptor for dynamically mapping tile data onto thread blocks

Definition at line 113 of file dispatch_histogram.cuh.

◆ tile_state

AggregatesOutputIteratorT OffsetT int ScanTileStateT cub::tile_state

Tile status interface.

< [in] Tile status interface

[in] Tile status interface for fixup reduce-by-key kernel

Definition at line 74 of file dispatch_reduce_by_key.cuh.