OpenFPM_pdata  3.0.0
Project that contains the implementation of distributed structures
cub Namespace Reference

Optional outer namespace(s) More...

Namespaces

 internal
 Internal namespace (to prevent ADL mishaps between static functions when mixing different CUB installations)
 

Data Structures

struct  AgentHistogram
 AgentHistogram implements a stateful abstraction of CUDA thread blocks for participating in device-wide histogram. More...
 
struct  AgentHistogramPolicy
 < Whether to dequeue tiles from a global work queue More...
 
struct  AgentRadixSortDownsweep
 AgentRadixSortDownsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort downsweep. More...
 
struct  AgentRadixSortDownsweepPolicy
 < The number of radix bits, i.e., log2(bins) More...
 
struct  AgentRadixSortUpsweep
 AgentRadixSortUpsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort upsweep. More...
 
struct  AgentRadixSortUpsweepPolicy
 < The number of radix bits, i.e., log2(bins) More...
 
struct  AgentReduce
 AgentReduce implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduction. More...
 
struct  AgentReduceByKey
 AgentReduceByKey implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key. More...
 
struct  AgentReduceByKeyPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentReducePolicy
 < Cache load modifier for reading input elements More...
 
struct  AgentRle
 AgentRle implements a stateful abstraction of CUDA thread blocks for participating in device-wide run-length-encode. More...
 
struct  AgentRlePolicy
 < The BlockScan algorithm to use More...
 
struct  AgentScan
 AgentScan implements a stateful abstraction of CUDA thread blocks for participating in device-wide prefix scan. More...
 
struct  AgentScanPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentSegmentFixup
 AgentSegmentFixup implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key. More...
 
struct  AgentSegmentFixupPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentSelectIf
 AgentSelectIf implements a stateful abstraction of CUDA thread blocks for participating in device-wide selection. More...
 
struct  AgentSelectIfPolicy
 < The BlockScan algorithm to use More...
 
struct  AgentSpmv
 AgentSpmv implements a stateful abstraction of CUDA thread blocks for participating in device-wide SpMV. More...
 
struct  AgentSpmvPolicy
 < The BlockScan algorithm to use More...
 
struct  AlignBytes
 Structure alignment. More...
 
struct  AlignBytes< const T >
 
struct  AlignBytes< const volatile T >
 
struct  AlignBytes< volatile T >
 
class  ArgIndexInputIterator
 A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples). More...
 
struct  ArgMax
 Arg max functor (keeps the value and offset of the first occurrence of the larger item) More...
 
struct  ArgMin
 Arg min functor (keeps the value and offset of the first occurrence of the smallest item) More...
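The first-occurrence rule shared by cub::ArgMax and cub::ArgMin can be sketched on the host as follows (names such as KV and ArgMaxSketch are hypothetical stand-ins for cub::KeyValuePair and cub::ArgMax, with `__device__` qualifiers dropped):

```cpp
#include <cassert>

// Hypothetical host-side stand-ins for cub::KeyValuePair and cub::ArgMax.
// On equal values, the pair with the lower offset (key) wins, so the
// FIRST occurrence of the larger item is kept.
struct KV { int key; int value; };  // key = offset, value = item

struct ArgMaxSketch {
    KV operator()(const KV &a, const KV &b) const {
        if ((b.value > a.value) || ((b.value == a.value) && (b.key < a.key)))
            return b;
        return a;
    }
};
```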
 
struct  ArrayWrapper
 A wrapper for passing simple static arrays as kernel parameters. More...
 
struct  BaseTraits
 Basic type traits. More...
 
struct  BaseTraits< FLOATING_POINT, true, false, _UnsignedBits, T >
 
struct  BaseTraits< SIGNED_INTEGER, true, false, _UnsignedBits, T >
 
struct  BaseTraits< UNSIGNED_INTEGER, true, false, _UnsignedBits, T >
 
struct  BinaryOpHasIdxParam
 Determine whether or not BinaryOp's functor is of the form bool operator()(const T& a, const T& b) or bool operator()(const T& a, const T& b, unsigned int idx) More...
 
class  BlockAdjacentDifference
 
class  BlockDiscontinuity
 The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. More...
 
class  BlockExchange
 The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block. More...
 
class  BlockHistogram
 The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More...
 
struct  BlockHistogramAtomic
 The BlockHistogramAtomic class provides atomic-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More...
 
struct  BlockHistogramSort
 The BlockHistogramSort class provides sorting-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More...
 
class  BlockLoad
 The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block. More...
 
class  BlockRadixRank
 BlockRadixRank provides operations for ranking unsigned integer types within a CUDA thread block. More...
 
class  BlockRadixRankMatch
 
class  BlockRadixSort
 The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method. More...
 
struct  BlockRakingLayout
 BlockRakingLayout provides a conflict-free shared memory layout abstraction for 1D raking across thread block data. More...
 
class  BlockReduce
 The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block. More...
 
struct  BlockReduceRaking
 BlockReduceRaking provides raking-based methods of parallel reduction across a CUDA thread block. Supports non-commutative reduction operators. More...
 
struct  BlockReduceRakingCommutativeOnly
 BlockReduceRakingCommutativeOnly provides raking-based methods of parallel reduction across a CUDA thread block. Does not support non-commutative reduction operators. Does not support block sizes that are not a multiple of the warp size. More...
 
struct  BlockReduceWarpReductions
 BlockReduceWarpReductions provides variants of warp-reduction-based parallel reduction across a CUDA thread block. Supports non-commutative reduction operators. More...
 
class  BlockScan
 The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. More...
 
struct  BlockScanRaking
 BlockScanRaking provides variants of raking-based parallel prefix scan across a CUDA thread block. More...
 
struct  BlockScanRunningPrefixOp
 < Wrapped scan operator type More...
 
struct  BlockScanWarpScans
 BlockScanWarpScans provides warpscan-based variants of parallel prefix scan across a CUDA thread block. More...
 
class  BlockShuffle
 The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block. More...
 
class  BlockStore
 The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory. More...
 
class  CacheModifiedInputIterator
 A random-access input wrapper for dereferencing array values using a PTX cache load modifier. More...
 
class  CacheModifiedOutputIterator
 A random-access output wrapper for storing array values using a PTX cache-modifier. More...
 
struct  CachingDeviceAllocator
 A simple caching allocator for device memory allocations. More...
 
struct  CastOp
 Default cast functor. More...
 
struct  ChainedPolicy
 Helper for dispatching into a policy chain. More...
 
struct  ChainedPolicy< PTX_VERSION, PolicyT, PolicyT >
 Helper for dispatching into a policy chain (end-of-chain specialization) More...
 
class  ConstantInputIterator
 A random-access input generator for dereferencing a sequence of homogeneous values. More...
 
class  CountingInputIterator
 A random-access input generator for dereferencing a sequence of incrementing integer values. More...
 
struct  CubVector
 Exposes a member typedef Type that names the corresponding CUDA vector type if one exists. Otherwise Type refers to the CubVector structure itself, which will wrap the corresponding x, y, etc. vector fields. More...
 
struct  CubVector< T, 1 >
 
struct  CubVector< T, 2 >
 
struct  CubVector< T, 3 >
 
struct  CubVector< T, 4 >
 
struct  DeviceHistogram
 DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of samples data residing within device-accessible memory. More...
 
struct  DevicePartition
 DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory. More...
 
struct  DeviceRadixSort
 DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory. More...
 
struct  DeviceRadixSortPolicy
 < Signed integer type for global offsets More...
 
struct  DeviceReduce
 DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. More...
 
struct  DeviceReducePolicy
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
struct  DeviceRleDispatch
 < Signed integer type for global offsets More...
 
struct  DeviceRunLengthEncode
 DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory. More...
 
struct  DeviceScan
 DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. More...
 
struct  DeviceSegmentedRadixSort
 DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory. More...
 
struct  DeviceSegmentedReduce
 DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory. More...
 
struct  DeviceSelect
 DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory. More...
 
struct  DeviceSpmv
 DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV). More...
 
struct  DipatchHistogram
 < Signed integer type for global offsets More...
 
class  DiscardOutputIterator
 A discard iterator. More...
 
struct  DispatchRadixSort
 < Signed integer type for global offsets More...
 
struct  DispatchReduce
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
struct  DispatchReduceByKey
 < Signed integer type for global offsets More...
 
struct  DispatchScan
 < Signed integer type for global offsets More...
 
struct  DispatchSegmentedRadixSort
 < Signed integer type for global offsets More...
 
struct  DispatchSegmentedReduce
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
struct  DispatchSelectIf
 < Whether or not we push rejected items to the back of the output More...
 
struct  DispatchSpmv
 < Signed integer type for global offsets More...
 
struct  DoubleBuffer
 Double-buffer storage wrapper for multi-pass stream transformations that require more than one storage array for streaming intermediate results back and forth. More...
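The double-buffer idea can be sketched in plain C++ (a minimal sketch; DoubleBufferSketch and its Flip method are hypothetical, whereas in cub::DoubleBuffer the selector is updated by the algorithm itself):

```cpp
#include <cassert>

// Hypothetical sketch of the cub::DoubleBuffer idea: two storage arrays plus
// a selector; each multi-pass stage reads Current() and writes Alternate(),
// then flips the selector instead of copying intermediate results back.
template <typename T>
struct DoubleBufferSketch {
    T *d_buffers[2];
    int selector;                            // index of the "current" buffer
    DoubleBufferSketch(T *a, T *b) : d_buffers{a, b}, selector(0) {}
    T *Current()   { return d_buffers[selector]; }
    T *Alternate() { return d_buffers[selector ^ 1]; }
    void Flip()    { selector ^= 1; }        // performed after each pass
};
```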
 
struct  EnableIf
 Simple enable-if (similar to Boost) More...
 
struct  EnableIf< false, T >
 
struct  Equality
 Default equality functor. More...
 
struct  Equals
 Type equality test. More...
 
struct  Equals< A, A >
 
struct  FpLimits
 
struct  FpLimits< double >
 
struct  FpLimits< float >
 
class  GridBarrier
 GridBarrier implements a software global barrier among thread blocks within a CUDA grid. More...
 
class  GridBarrierLifetime
 GridBarrierLifetime extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation. More...
 
struct  GridEvenShare
 GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles. More...
 
class  GridQueue
 GridQueue is a descriptor utility for dynamic queue management. More...
 
struct  If
 Type selection (IF ? ThenType : ElseType) More...
 
struct  If< false, ThenType, ElseType >
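The selection mechanism can be sketched as follows (IfSketch is a hypothetical stand-in for cub::If; it is equivalent to std::conditional):

```cpp
#include <cassert>
#include <type_traits>

// Minimal sketch of cub::If: compile-time type selection via partial
// specialization. The primary template handles the true case ...
template <bool IF, typename ThenType, typename ElseType>
struct IfSketch { typedef ThenType Type; };

// ... and the specialization handles the false case.
template <typename ThenType, typename ElseType>
struct IfSketch<false, ThenType, ElseType> { typedef ElseType Type; };
```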
 
struct  Inequality
 Default inequality functor. More...
 
struct  InequalityWrapper
 Inequality functor (wraps equality functor) More...
 
struct  Int2Type
 Allows for the treatment of an integral constant as a type at compile-time (e.g., to achieve static call dispatch based on constant integral values) More...
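Static call dispatch with an Int2Type-style wrapper can be sketched as follows (Int2TypeSketch and Variant are hypothetical illustration names):

```cpp
#include <cassert>

// Sketch of cub::Int2Type: wraps an integral constant in a distinct type so
// overload resolution can select a code path at compile time.
template <int A>
struct Int2TypeSketch { enum { VALUE = A }; };

// Hypothetical example: statically dispatch between two implementations
// (e.g., keys-only vs. key-value variants of a sort).
inline int Variant(Int2TypeSketch<0>) { return 0; }  // keys-only path
inline int Variant(Int2TypeSketch<1>) { return 1; }  // key-value path
```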
 
struct  IsPointer
 Pointer vs. iterator. More...
 
struct  IsPointer< Tp * >
 
struct  IsVolatile
 Volatile modifier test. More...
 
struct  IsVolatile< Tp volatile >
 
struct  IterateThreadLoad
 Helper structure for templated load iteration (inductive case) More...
 
struct  IterateThreadLoad< MAX, MAX >
 Helper structure for templated load iteration (termination case) More...
 
struct  IterateThreadStore
 Helper structure for templated store iteration (inductive case) More...
 
struct  IterateThreadStore< MAX, MAX >
 Helper structure for templated store iteration (termination case) More...
 
struct  KernelConfig
 
struct  KeyValuePair
 A key identifier paired with a corresponding value. More...
 
struct  Log2
 Statically determine log2(N), rounded up. More...
 
struct  Log2< N, 0, COUNT >
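The inductive/termination pair above can be sketched in plain C++ (Log2Sketch is a hypothetical stand-in for cub::Log2):

```cpp
#include <cassert>

// Sketch of cub::Log2<N>: statically computes log2(N), rounded up, by
// counting how many right-shifts it takes to empty N.
template <int N, int CURRENT_VAL = N, int COUNT = 0>
struct Log2Sketch {
    enum { VALUE = Log2Sketch<N, (CURRENT_VAL >> 1), COUNT + 1>::VALUE };
};

// Termination case: round up when N is not a power of two.
template <int N, int COUNT>
struct Log2Sketch<N, 0, COUNT> {
    enum { VALUE = ((1 << (COUNT - 1)) < N) ? COUNT : COUNT - 1 };
};
```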
 
struct  Max
 Default max functor. More...
 
struct  Min
 Default min functor. More...
 
struct  Mutex
 
struct  NullType
 A simple "NULL" marker type. More...
 
struct  NumericTraits
 Numeric type traits. More...
 
struct  NumericTraits< bool >
 
struct  NumericTraits< char >
 
struct  NumericTraits< double >
 
struct  NumericTraits< float >
 
struct  NumericTraits< int >
 
struct  NumericTraits< long >
 
struct  NumericTraits< long long >
 
struct  NumericTraits< NullType >
 
struct  NumericTraits< short >
 
struct  NumericTraits< signed char >
 
struct  NumericTraits< unsigned char >
 
struct  NumericTraits< unsigned int >
 
struct  NumericTraits< unsigned long >
 
struct  NumericTraits< unsigned long long >
 
struct  NumericTraits< unsigned short >
 
struct  PowerOfTwo
 Statically determine if N is a power-of-two. More...
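The power-of-two test can be sketched with the usual bit trick (PowerOfTwoSketch is a hypothetical stand-in for cub::PowerOfTwo):

```cpp
#include <cassert>

// Sketch of cub::PowerOfTwo<N>: a positive power of two has exactly one bit
// set, so N & (N - 1) clears that bit and leaves zero.
template <int N>
struct PowerOfTwoSketch {
    enum { VALUE = ((N & (N - 1)) == 0) };
};
```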
 
struct  ReduceByKeyOp
 < Binary reduction operator to apply to values More...
 
struct  ReduceByKeyScanTileState
 
struct  ReduceByKeyScanTileState< ValueT, KeyT, false >
 
struct  ReduceByKeyScanTileState< ValueT, KeyT, true >
 
struct  ReduceBySegmentOp
 Reduce-by-segment functor. More...
 
struct  RemoveQualifiers
 Removes const and volatile qualifiers from type Tp. More...
 
struct  RemoveQualifiers< Tp, const Up >
 
struct  RemoveQualifiers< Tp, const volatile Up >
 
struct  RemoveQualifiers< Tp, volatile Up >
 
struct  ScanTileState
 
struct  ScanTileState< T, false >
 
struct  ScanTileState< T, true >
 
struct  SpmvParams
 < Signed integer type for sequence offsets More...
 
struct  Sum
 Default sum functor. More...
 
class  SwizzleScanOp
 Binary operator wrapper for switching non-commutative scan arguments. More...
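Argument swizzling for a non-commutative operator can be sketched as follows (SwizzleSketch and Concat are hypothetical illustration names; string concatenation stands in for any non-commutative scan operator):

```cpp
#include <cassert>
#include <string>

// Sketch of cub::SwizzleScanOp: wraps a binary operator and swaps its
// arguments, which lets non-commutative scans run in the opposite direction.
template <typename ScanOp>
struct SwizzleSketch {
    ScanOp op;
    template <typename T>
    T operator()(const T &a, const T &b) const { return op(b, a); }
};

// Hypothetical non-commutative operator: string concatenation.
struct Concat {
    std::string operator()(const std::string &a, const std::string &b) const {
        return a + b;
    }
};
```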
 
union  TempStorage
 
class  TexObjInputIterator
 A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects. More...
 
struct  TilePrefixCallbackOp
 
struct  Traits
 Type traits. More...
 
class  TransformInputIterator
 A random-access input wrapper for transforming dereferenced values. More...
 
struct  Uninitialized
 A storage-backing wrapper that allows types with non-trivial constructors to be aliased in unions. More...
 
struct  UnitWord
 Unit-words of data movement. More...
 
struct  UnitWord< char2 >
 
struct  UnitWord< const T >
 
struct  UnitWord< const volatile T >
 
struct  UnitWord< float2 >
 
struct  UnitWord< float4 >
 
struct  UnitWord< volatile T >
 
class  WarpExchange
 
class  WarpReduce
 The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. More...
 
struct  WarpReduceShfl
 WarpReduceShfl provides SHFL-based variants of parallel reduction of items partitioned across a CUDA thread warp. More...
 
struct  WarpReduceSmem
 WarpReduceSmem provides smem-based variants of parallel reduction of items partitioned across a CUDA thread warp. More...
 
class  WarpScan
 The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. More...
 
struct  WarpScanShfl
 WarpScanShfl provides SHFL-based variants of parallel prefix scan of items partitioned across a CUDA thread warp. More...
 
struct  WarpScanSmem
 WarpScanSmem provides smem-based variants of parallel prefix scan of items partitioned across a CUDA thread warp. More...
 

Typedefs

typedef AgentRadixSortUpsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltUpsweepPolicy, typename ChainedPolicyT::ActivePolicy::UpsweepPolicy >::Type, KeyT, OffsetT > AgentRadixSortUpsweepT
 
typedef AgentRadixSortDownsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltDownsweepPolicy, typename ChainedPolicyT::ActivePolicy::DownsweepPolicy >::Type, IS_DESCENDING, KeyT, ValueT, OffsetT > AgentRadixSortDownsweepT
 
typedef BlockRadixSort< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ValueT, ChainedPolicyT::ActivePolicy::SingleTilePolicy::RADIX_BITS,(ChainedPolicyT::ActivePolicy::SingleTilePolicy::RANK_ALGORITHM==RADIX_RANK_MEMOIZE), ChainedPolicyT::ActivePolicy::SingleTilePolicy::SCAN_ALGORITHM > BlockRadixSortT
 
typedef BlockLoad< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > BlockLoadKeys
 
typedef BlockLoad< ValueT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > BlockLoadValues
 
typedef Traits< KeyT >::UnsignedBits UnsignedBitsT
 
typedef AgentRadixSortUpsweep< AgentRadixSortUpsweepPolicy< BLOCK_THREADS, ITEMS_PER_THREAD, SegmentedPolicyT::LOAD_MODIFIER, RADIX_BITS >, KeyT, OffsetT > BlockUpsweepT
 
typedef BlockScan< OffsetT, BLOCK_THREADS > DigitScanT
 
typedef AgentRadixSortDownsweep< SegmentedPolicyT, IS_DESCENDING, KeyT, ValueT, OffsetT > BlockDownsweepT
 
typedef AgentReduce< typename ChainedPolicyT::ActivePolicy::ReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT > AgentReduceT
 

Enumerations

enum  BlockHistogramMemoryPreference { GMEM, SMEM, BLEND }
 
enum  RadixRankAlgorithm { RADIX_RANK_BASIC, RADIX_RANK_MEMOIZE, RADIX_RANK_MATCH }
 
enum  ScanTileStatus { SCAN_TILE_OOB, SCAN_TILE_INVALID = 99, SCAN_TILE_PARTIAL, SCAN_TILE_INCLUSIVE }
 
enum  BlockHistogramAlgorithm { BLOCK_HISTO_SORT, BLOCK_HISTO_ATOMIC }
 BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms. More...
 
enum  BlockLoadAlgorithm {
  BLOCK_LOAD_DIRECT, BLOCK_LOAD_VECTORIZE, BLOCK_LOAD_TRANSPOSE, BLOCK_LOAD_WARP_TRANSPOSE,
  BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED
}
 cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. More...
 
enum  BlockReduceAlgorithm { BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, BLOCK_REDUCE_RAKING, BLOCK_REDUCE_WARP_REDUCTIONS }
 
enum  BlockScanAlgorithm { BLOCK_SCAN_RAKING, BLOCK_SCAN_RAKING_MEMOIZE, BLOCK_SCAN_WARP_SCANS }
 BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block. More...
 
enum  BlockStoreAlgorithm {
  BLOCK_STORE_DIRECT, BLOCK_STORE_VECTORIZE, BLOCK_STORE_TRANSPOSE, BLOCK_STORE_WARP_TRANSPOSE,
  BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED
}
 cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory. More...
 
enum  {
  BLOCK_THREADS = SegmentedPolicyT::BLOCK_THREADS, ITEMS_PER_THREAD = SegmentedPolicyT::ITEMS_PER_THREAD, RADIX_BITS = SegmentedPolicyT::RADIX_BITS, TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
  RADIX_DIGITS = 1 << RADIX_BITS, KEYS_ONLY = Equals<ValueT, NullType>::VALUE
}
 
enum  { BINS_TRACKED_PER_THREAD = BlockDownsweepT::BINS_TRACKED_PER_THREAD }
 
enum  GridMappingStrategy { GRID_MAPPING_RAKE, GRID_MAPPING_STRIP_MINE, GRID_MAPPING_DYNAMIC }
 cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks. More...
 
enum  CacheLoadModifier {
  LOAD_DEFAULT, LOAD_CA, LOAD_CG, LOAD_CS,
  LOAD_CV, LOAD_LDG, LOAD_VOLATILE
}
 Enumeration of cache modifiers for memory load operations. More...
 
enum  CacheStoreModifier {
  STORE_DEFAULT, STORE_WB, STORE_CG, STORE_CS,
  STORE_WT, STORE_VOLATILE
}
 Enumeration of cache modifiers for memory store operations. More...
 
enum  { MAX_VEC_ELEMENTS = 4 }
 
enum  Category { NOT_A_NUMBER, SIGNED_INTEGER, UNSIGNED_INTEGER, FLOATING_POINT }
 Basic type traits categories.
 

Functions

template<int NUM_ACTIVE_CHANNELS, typename CounterT , typename OffsetT >
__global__ void DeviceHistogramInitKernel (ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_output_bins_wrapper, ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_output_histograms_wrapper, GridQueue< int > tile_queue)
 < Signed integer type for global offsets More...
 
template<typename AgentHistogramPolicyT , int PRIVATIZED_SMEM_BINS, int NUM_CHANNELS, int NUM_ACTIVE_CHANNELS, typename SampleIteratorT , typename CounterT , typename PrivatizedDecodeOpT , typename OutputDecodeOpT , typename OffsetT >
 __launch_bounds__ (int(AgentHistogramPolicyT::BLOCK_THREADS)) __global__ void DeviceHistogramSweepKernel(SampleIteratorT d_samples
 < Signed integer type for global offsets More...
 
AgentHistogramT agent (temp_storage, d_samples, num_output_bins_wrapper.array, num_privatized_bins_wrapper.array, d_output_histograms_wrapper.array, d_privatized_histograms_wrapper.array, output_decode_op_wrapper.array, privatized_decode_op_wrapper.array)
 
agent InitBinCounters ()
 
agent ConsumeTiles (num_row_pixels, num_rows, row_stride_samples, tiles_per_row, tile_queue)
 
agent StoreOutput ()
 
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename OffsetT >
 __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortUpsweepKernel(const KeyT *d_keys
 < Signed integer type for global offsets More...
 
even_share template BlockInit< TILE_ITEMS, GRID_MAPPING_RAKE > ()
 
upsweep ProcessRegion (even_share.block_offset, even_share.block_end)
 
 CTA_SYNC ()
 
upsweep template ExtractCounts< IS_DESCENDING > (d_spine, gridDim.x, blockIdx.x)
 
template<typename ChainedPolicyT , typename OffsetT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS), 1) __global__ void RadixSortScanBinsKernel(OffsetT *d_spine
 < Signed integer type for global offsets More...
 
AgentScanT block_scan (temp_storage, d_spine, d_spine, cub::Sum(), OffsetT(0))
 
 while (block_offset+AgentScanT::TILE_ITEMS<=num_counts)
 
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
 __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortDownsweepKernel(const KeyT *d_keys_in
 < Signed integer type for global offsets More...
 
 AgentRadixSortDownsweepT (temp_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, num_bits).ProcessRegion(even_share.block_offset
 
template<typename ChainedPolicyT , bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceRadixSortSingleTileKernel(const KeyT *d_keys_in
 < Signed integer type for global offsets More...
 
 BlockLoadKeys (temp_storage.load_keys).Load(d_keys_in
 
 if (!KEYS_ONLY)
 
 BlockRadixSortT (temp_storage.sort).SortBlockedToStriped(keys
 
 Int2Type< IS_DESCENDING > ()
 
 Int2Type< KEYS_ONLY > ())
 
 for (int ITEM=0;ITEM< ITEMS_PER_THREAD;++ITEM)
 
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetIteratorT , typename OffsetT >
 __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedRadixSortKernel(const KeyT *d_keys_in
 < Signed integer type for global offsets More...
 
 if (num_items<=0) return
 
BlockUpsweepT upsweep (temp_storage.upsweep, d_keys_in, current_bit, pass_bits)
 
upsweep ProcessRegion (segment_begin, segment_end)
 
upsweep ExtractCounts (bin_count)
 
 if (IS_DESCENDING)
 
 DigitScanT (temp_storage.scan).ExclusiveSum(bin_count
 
BlockDownsweepT downsweep (temp_storage.downsweep, bin_offset, num_items, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, pass_bits)
 
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceReduceKernel(InputIteratorT d_in
 < Binary reduction functor type having member T operator()(const T &a, const T &b) More...
 
 if (threadIdx.x==0) d_out[blockIdx.x] = reduction_op(init, block_aggregate)
 
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT , typename OuputT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceReduceSingleTileKernel(InputIteratorT d_in
 < Data element type that is convertible to the value type of OutputIteratorT More...
 
 if (num_items==0)
 
template<typename T , typename OffsetT , typename IteratorT >
__device__ __forceinline__ void NormalizeReductionOutput (T &, OffsetT, IteratorT)
 Normalize input iterator to segment offset.
 
template<typename KeyValuePairT , typename OffsetT , typename WrappedIteratorT , typename OutputValueT >
__device__ __forceinline__ void NormalizeReductionOutput (KeyValuePairT &val, OffsetT base_offset, ArgIndexInputIterator< WrappedIteratorT, OffsetT, OutputValueT >)
 Normalize input iterator to segment offset (specialized for arg-index)
 
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT >
 __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedReduceKernel(InputIteratorT d_in
 < Data element type that is convertible to the value type of OutputIteratorT More...
 
 if (segment_begin==segment_end)
 
 NormalizeReductionOutput (block_aggregate, segment_begin, d_in)
 
template<typename AgentReduceByKeyPolicyT , typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename ReductionOpT , typename OffsetT >
 __launch_bounds__ (int(AgentReduceByKeyPolicyT::BLOCK_THREADS)) __global__ void DeviceReduceByKeyKernel(KeysInputIteratorT d_keys_in
 < Signed integer type for global offsets More...
 
 AgentReduceByKeyT (temp_storage, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, equality_op, reduction_op).ConsumeRange(num_items
 
template<typename AgentRlePolicyT , typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename OffsetT >
 __launch_bounds__ (int(AgentRlePolicyT::BLOCK_THREADS)) __global__ void DeviceRleSweepKernel(InputIteratorT d_in
 < Signed integer type for global offsets More...
 
 AgentRleT (temp_storage, d_in, d_offsets_out, d_lengths_out, equality_op, num_items).ConsumeRange(num_tiles
 
template<typename ScanTileStateT >
__global__ void DeviceScanInitKernel (ScanTileStateT tile_state, int num_tiles)
 < Tile status interface type More...
 
template<typename ScanTileStateT , typename NumSelectedIteratorT >
__global__ void DeviceCompactInitKernel (ScanTileStateT tile_state, int num_tiles, NumSelectedIteratorT d_num_selected_out)
 < Output iterator type for recording the number of items selected More...
 
template<typename ScanPolicyT , typename InputIteratorT , typename OutputIteratorT , typename ScanTileStateT , typename ScanOpT , typename InitValueT , typename OffsetT >
 __launch_bounds__ (int(ScanPolicyT::BLOCK_THREADS)) __global__ void DeviceScanKernel(InputIteratorT d_in
 < Signed integer type for global offsets More...
 
 AgentScanT (temp_storage, d_in, d_out, scan_op, init_value).ConsumeRange(num_items
 
template<typename AgentSelectIfPolicyT , typename InputIteratorT , typename FlagsInputIteratorT , typename SelectedOutputIteratorT , typename NumSelectedIteratorT , typename ScanTileStateT , typename SelectOpT , typename EqualityOpT , typename OffsetT , bool KEEP_REJECTS>
 __launch_bounds__ (int(AgentSelectIfPolicyT::BLOCK_THREADS)) __global__ void DeviceSelectSweepKernel(InputIteratorT d_in
 < Whether or not we push rejected items to the back of the output More...
 
 AgentSelectIfT (temp_storage, d_in, d_flags, d_selected_out, select_op, equality_op, num_items).ConsumeRange(num_tiles
 
template<typename AgentSpmvPolicyT , typename ValueT , typename OffsetT >
__global__ void DeviceSpmv1ColKernel (SpmvParams< ValueT, OffsetT > spmv_params)
 < Signed integer type for sequence offsets More...
 
template<typename SpmvPolicyT , typename OffsetT , typename CoordinateT , typename SpmvParamsT >
__global__ void DeviceSpmvSearchKernel (int num_merge_tiles, CoordinateT *d_tile_coordinates, SpmvParamsT spmv_params)
 < SpmvParams type More...
 
template<typename SpmvPolicyT , typename ScanTileStateT , typename ValueT , typename OffsetT , typename CoordinateT , bool HAS_ALPHA, bool HAS_BETA>
 __launch_bounds__ (int(SpmvPolicyT::BLOCK_THREADS)) __global__ void DeviceSpmvKernel(SpmvParams< ValueT
 < Whether the input parameter Beta is 0 More...
 
 AgentSpmvT (temp_storage, spmv_params).ConsumeTile(d_tile_coordinates
 
tile_state InitializeStatus (num_segment_fixup_tiles)
 
template<typename AgentSegmentFixupPolicyT , typename PairsInputIteratorT , typename AggregatesOutputIteratorT , typename OffsetT , typename ScanTileStateT >
 __launch_bounds__ (int(AgentSegmentFixupPolicyT::BLOCK_THREADS)) __global__ void DeviceSegmentFixupKernel(PairsInputIteratorT d_pairs_in
 < Tile status interface type More...
 
 AgentSegmentFixupT (temp_storage, d_pairs_in, d_aggregates_out, cub::Equality(), cub::Sum()).ConsumeRange(num_items
 
template<typename OffsetT >
__global__ void FillAndResetDrainKernel (GridQueue< OffsetT > grid_queue, OffsetT num_items)
 
template<typename InputIteratorT >
__device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type ThreadLoad (InputIteratorT itr, Int2Type< LOAD_DEFAULT >, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoad (T *ptr, Int2Type< LOAD_DEFAULT >, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoadVolatilePointer (T *ptr, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoadVolatilePointer (T *ptr, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ T ThreadLoad (T *ptr, Int2Type< LOAD_VOLATILE >, Int2Type< true >)
 
template<typename T , int MODIFIER>
__device__ __forceinline__ T ThreadLoad (T const *ptr, Int2Type< MODIFIER >, Int2Type< true >)
 
template<typename AIteratorT , typename BIteratorT , typename OffsetT , typename CoordinateT >
__host__ __device__ __forceinline__ void MergePathSearch (OffsetT diagonal, AIteratorT a, BIteratorT b, OffsetT a_len, OffsetT b_len, CoordinateT &path_coordinate)
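MergePathSearch locates, for a given cross-diagonal of the implicit merge grid of two sorted sequences, the coordinate at which the merge path crosses that diagonal. A minimal host-side sketch of the same binary-search idea (the `Coord` struct and function name are illustrative stand-ins, not CUB's types):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative stand-in for CUB's CoordinateT (x indexes into a, y into b).
struct Coord { int x; int y; };

// Host sketch of the merge-path search: find the split point (x, y) with
// x + y == diagonal such that a[0..x) and b[0..y) together form the first
// `diagonal` elements of the merged sequence.
Coord merge_path_search(int diagonal, const std::vector<int>& a, const std::vector<int>& b)
{
    int x_min = std::max(diagonal - (int)b.size(), 0);
    int x_max = std::min(diagonal, (int)a.size());

    // Binary search along the diagonal for the crossing point.
    while (x_min < x_max) {
        int pivot = (x_min + x_max) / 2;
        if (a[pivot] <= b[diagonal - pivot - 1])
            x_min = pivot + 1;   // path lies further along a
        else
            x_max = pivot;       // path lies further along b
    }
    return Coord{x_min, diagonal - x_min};
}
```

In CUB this search runs per thread block (e.g., in DeviceSpmvSearchKernel) so that each block consumes an equal-sized slice of the merged work.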
 
template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT LowerBound (InputIteratorT input, OffsetT num_items, T val)
 Returns the offset of the first value within input which does not compare less than val. More...
 
template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT UpperBound (InputIteratorT input, OffsetT num_items, T val)
 Returns the offset of the first value within input which compares greater than val. More...
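These mirror the contracts of std::lower_bound and std::upper_bound, returning an offset rather than an iterator. A sequential host-side sketch of the semantics (not the device implementation):

```cpp
#include <cassert>

// Offset of the first element that does NOT compare less than val
// (leftmost insertion point keeping the sequence sorted).
template <typename It, typename Offset, typename T>
Offset lower_bound(It input, Offset num_items, T val)
{
    Offset retval = 0;
    while (num_items > 0) {
        Offset half = num_items / 2;
        if (input[retval + half] < val) {
            retval = retval + half + 1;   // discard the lower half
            num_items = num_items - half - 1;
        } else {
            num_items = half;             // keep searching the lower half
        }
    }
    return retval;
}

// Offset of the first element that compares GREATER than val
// (rightmost insertion point).
template <typename It, typename Offset, typename T>
Offset upper_bound(It input, Offset num_items, T val)
{
    Offset retval = 0;
    while (num_items > 0) {
        Offset half = num_items / 2;
        if (val < input[retval + half]) {
            num_items = half;
        } else {
            retval = retval + half + 1;
            num_items = num_items - half - 1;
        }
    }
    return retval;
}
```

If val is absent, both return the same offset; if every element compares less than val, both return num_items.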
 
template<typename OutputIteratorT , typename T >
__device__ __forceinline__ void ThreadStore (OutputIteratorT itr, T val, Int2Type< STORE_DEFAULT >, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ void ThreadStore (T *ptr, T val, Int2Type< STORE_DEFAULT >, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ void ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< true >)
 
template<typename T >
__device__ __forceinline__ void ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< false >)
 
template<typename T >
__device__ __forceinline__ void ThreadStore (T *ptr, T val, Int2Type< STORE_VOLATILE >, Int2Type< true >)
 
template<typename T , int MODIFIER>
__device__ __forceinline__ void ThreadStore (T *ptr, T val, Int2Type< MODIFIER >, Int2Type< true >)
 
__host__ __device__ __forceinline__ cudaError_t Debug (cudaError_t error, const char *filename, int line)
 CUB error reporting macro (prints error messages to stderr) More...
 
template<int ALLOCATIONS>
__host__ __device__ __forceinline__ cudaError_t AliasTemporaries (void *d_temp_storage, size_t &temp_storage_bytes, void *(&allocations)[ALLOCATIONS], size_t(&allocation_sizes)[ALLOCATIONS])
 
template<typename T >
__global__ void EmptyKernel (void)
 
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t PtxVersion (int &ptx_version)
 Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10) More...
 
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t SmVersion (int &sm_version, int device_ordinal)
 Retrieves the SM version (major * 100 + minor * 10)
 
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t SyncStream (cudaStream_t stream)
 
template<typename KernelPtr >
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t MaxSmOccupancy (int &max_sm_occupancy, KernelPtr kernel_ptr, int block_threads, int dynamic_smem_bytes=0)
 Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block. More...
 
__device__ __forceinline__ unsigned int SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend)
 Shift-right then add. Returns (x >> shift) + addend.
 
__device__ __forceinline__ unsigned int SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend)
 Shift-left then add. Returns (x << shift) + addend.
 
template<typename UnsignedBits , int BYTE_LEN>
__device__ __forceinline__ unsigned int BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< BYTE_LEN >)
 
template<typename UnsignedBits >
__device__ __forceinline__ unsigned int BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< 8 >)
 
template<typename UnsignedBits >
__device__ __forceinline__ unsigned int BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits)
 Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type.
 
__device__ __forceinline__ void BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
 Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start.
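On supporting hardware BFE and BFI map to the PTX bfe/bfi instructions; their portable meaning can be sketched with shifts and masks (host-side sketch, assuming num_bits < 32):

```cpp
#include <cassert>

// Bitfield extract: num_bits of source starting at bit-offset bit_start.
unsigned int bfe(unsigned int source, unsigned int bit_start, unsigned int num_bits)
{
    unsigned int mask = (1u << num_bits) - 1;   // assumes num_bits < 32
    return (source >> bit_start) & mask;
}

// Bitfield insert: the num_bits least-significant bits of y into x at bit_start.
unsigned int bfi(unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
{
    unsigned int mask = ((1u << num_bits) - 1) << bit_start;
    return (x & ~mask) | ((y << bit_start) & mask);
}
```

BFE of this form is the workhorse of the radix sort kernels above, which extract the current digit from each key with `bfe(key_bits, current_bit, num_bits)`.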
 
__device__ __forceinline__ unsigned int IADD3 (unsigned int x, unsigned int y, unsigned int z)
 Three-operand add. Returns x + y + z.
 
__device__ __forceinline__ int PRMT (unsigned int a, unsigned int b, unsigned int index)
 Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later. More...
 
__device__ __forceinline__ void BAR (int count)
 
__device__ __forceinline__ int CTA_SYNC_AND (int p)
 
__device__ __forceinline__ void WARP_SYNC (unsigned int member_mask)
 
__device__ __forceinline__ int WARP_ANY (int predicate, unsigned int member_mask)
 
__device__ __forceinline__ int WARP_ALL (int predicate, unsigned int member_mask)
 
__device__ __forceinline__ int WARP_BALLOT (int predicate, unsigned int member_mask)
 
__device__ __forceinline__ unsigned int SHFL_UP_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask)
 
__device__ __forceinline__ unsigned int SHFL_DOWN_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask)
 
__device__ __forceinline__ unsigned int SHFL_IDX_SYNC (unsigned int word, int src_lane, int flags, unsigned int member_mask)
 
__device__ __forceinline__ float FMUL_RZ (float a, float b)
 
__device__ __forceinline__ float FFMA_RZ (float a, float b, float c)
 
__device__ __forceinline__ void ThreadExit ()
 Terminates the calling thread.
 
__device__ __forceinline__ void ThreadTrap ()
 Abort execution and generate an interrupt to the host CPU.
 
__device__ __forceinline__ int RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z)
 Returns the row-major linear thread identifier for a multidimensional thread block.
 
__device__ __forceinline__ unsigned int LaneId ()
 Returns the warp lane ID of the calling thread.
 
__device__ __forceinline__ unsigned int WarpId ()
 Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.
 
__device__ __forceinline__ unsigned int LaneMaskLt ()
 Returns the warp lane mask of all lanes less than the calling thread.
 
__device__ __forceinline__ unsigned int LaneMaskLe ()
 Returns the warp lane mask of all lanes less than or equal to the calling thread.
 
__device__ __forceinline__ unsigned int LaneMaskGt ()
 Returns the warp lane mask of all lanes greater than the calling thread.
 
__device__ __forceinline__ unsigned int LaneMaskGe ()
 Returns the warp lane mask of all lanes greater than or equal to the calling thread.
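These correspond to the %lanemask_lt / %lanemask_le / %lanemask_gt / %lanemask_ge special registers; given a lane id, each mask can be expressed directly (host sketch, 32-thread warps assumed):

```cpp
#include <cassert>

// Masks of warp lanes relative to lane `lane` (0..31), as produced by
// LaneMaskLt/LaneMaskLe/LaneMaskGt/LaneMaskGe for a 32-thread warp.
unsigned int lane_mask_lt(unsigned int lane) { return (1u << lane) - 1; }
unsigned int lane_mask_le(unsigned int lane) { return (2u << lane) - 1; }  // avoids 1 << 32 at lane 31
unsigned int lane_mask_gt(unsigned int lane) { return ~lane_mask_le(lane); }
unsigned int lane_mask_ge(unsigned int lane) { return ~lane_mask_lt(lane); }
```

A common use is ranking within a warp ballot: `__popc(ballot & lane_mask_lt(lane))` counts how many lower-numbered lanes voted true.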
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask)
 Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane i-src_offset. For thread lanes i < src_offset, the thread's own input is returned to the thread. More...
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask)
 Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane i+src_offset. For thread lanes i >= WARP_THREADS, the thread's own input is returned to the thread. More...
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleIndex (T input, int src_lane, unsigned int member_mask)
 Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. More...
 
template<int LABEL_BITS>
__device__ unsigned int MatchAny (unsigned int label)
 
Blocked arrangement I/O (direct)
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a blocked arrangement across the thread block. More...
 
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a blocked arrangement across the thread block, guarded by range. More...
 
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
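In a blocked arrangement, thread t owns the contiguous run of ITEMS_PER_THREAD items starting at offset t * ITEMS_PER_THREAD. A host-side sketch of the guarded variant's addressing (one simulated thread at a time; the real function executes per thread on the device):

```cpp
#include <cassert>

// Host emulation of guarded LoadDirectBlocked with an out-of-bounds default:
// thread `linear_tid` loads items [tid*IPT, tid*IPT + IPT) from block_itr,
// substituting oob_default for offsets past valid_items.
template <int ITEMS_PER_THREAD, typename It, typename T>
void load_direct_blocked(int linear_tid, It block_itr, T (&items)[ITEMS_PER_THREAD],
                         int valid_items, T oob_default)
{
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int offset = linear_tid * ITEMS_PER_THREAD + i;  // blocked addressing
        items[i] = (offset < valid_items) ? block_itr[offset] : oob_default;
    }
}
```

With this layout each thread reads a contiguous span, which is what makes the vectorized variant below possible when the span is suitably aligned.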
 
template<CacheLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void InternalLoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 
template<typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a blocked arrangement across the thread block. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a blocked arrangement of items across a thread block into a linear segment of items. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range. More...
 
template<typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 Store a blocked arrangement of items across a thread block into a linear segment of items. More...
 
Striped arrangement I/O (direct)
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a striped arrangement across the thread block. More...
 
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a striped arrangement across the thread block, guarded by range. More...
 
template<int BLOCK_THREADS, typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
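In a striped arrangement, thread t's i-th item lives at offset i * BLOCK_THREADS + t, so on each pass consecutive threads touch consecutive addresses; this is the coalescing-friendly layout. A host-side sketch of the guarded addressing (simulating one thread; names are illustrative):

```cpp
#include <cassert>

// Host emulation of guarded LoadDirectStriped: thread `linear_tid` loads its
// i-th item from block_itr[i * BLOCK_THREADS + linear_tid], substituting
// oob_default for offsets past valid_items.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename It, typename T>
void load_direct_striped(int linear_tid, It block_itr, T (&items)[ITEMS_PER_THREAD],
                         int valid_items, T oob_default)
{
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int offset = i * BLOCK_THREADS + linear_tid;  // striped addressing
        items[i] = (offset < valid_items) ? block_itr[offset] : oob_default;
    }
}
```

Contrast with the blocked variant above: striped loads coalesce regardless of ITEMS_PER_THREAD, which is why BLOCK_LOAD_TRANSPOSE reads striped and then transposes to blocked in shared memory.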
 
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a striped arrangement of data across the thread block into a linear segment of items. More...
 
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range. More...
 
Warp-striped arrangement I/O (direct)
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a warp-striped arrangement across the thread block. More...
 
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range. More...
 
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a warp-striped arrangement of data across the thread block into a linear segment of items. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range. More...
 
Thread I/O (cache modified)
template<CacheLoadModifier MODIFIER, typename InputIteratorT >
__device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type ThreadLoad (InputIteratorT itr)
 Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type. More...
 
template<CacheStoreModifier MODIFIER, typename OutputIteratorT , typename T >
__device__ __forceinline__ void ThreadStore (OutputIteratorT itr, T val)
 Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. More...
 

Variables

ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_output_bins_wrapper
 < Input data to reduce More...
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_privatized_bins_wrapper
 The number of bins per privatized histogram.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_output_histograms_wrapper
 Reference to final output histograms.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_privatized_histograms_wrapper
 Reference to privatized histograms.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > output_decode_op_wrapper
 The transform operator for determining output bin-ids from privatized counter indices, one for each channel.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > privatized_decode_op_wrapper
 The transform operator for determining privatized counter indices from samples, one for each channel.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT num_row_pixels
 The number of multi-channel pixels per row in the region of interest.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT num_rows
 The number of rows in the region of interest.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT row_stride_samples
 The number of samples between starts of consecutive rows in the region of interest.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT int tiles_per_row
 Number of image tiles per row.
 
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT int GridQueue< int > tile_queue
 < Drain queue descriptor for dynamically mapping tile data onto thread blocks More...
 
__shared__ AgentHistogramT::TempStorage temp_storage
 
OffsetT * d_spine
 < [in] Input keys buffer More...
 
OffsetT OffsetT
 [in] Total number of input data items
 
OffsetT int current_bit
 [in] Bit position of current radix digit
 
OffsetT int int num_bits
 [in] Number of bits of current radix digit
 
OffsetT int int GridEvenShare< OffsetT > even_share
 < [in] Even-share descriptor for mapping an equal number of tiles onto each thread block More...
 
AgentRadixSortUpsweepT upsweep (temp_storage, d_keys, current_bit, num_bits)
 
int num_counts
 < [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.) More...
 
int block_offset = 0
 
BlockScanRunningPrefixOp< OffsetT, Sum > prefix_op (0, Sum())
 
KeyT * d_keys_out
 < [in] Input keys buffer More...
 
KeyT const ValueT * d_values_in
 [in] Input values buffer More...
 
KeyT const ValueT ValueT * d_values_out
 [out] Output values buffer
 
KeyT const ValueT ValueT OffsetT OffsetT num_items = segment_end - segment_begin
 [in] Total number of input data items More...
 
even_share block_end
 
KeyT const ValueT ValueT OffsetT int int end_bit
 < [in] The past-the-end (most-significant) bit index needed for key comparison More...
 
KeyT keys [ITEMS_PER_THREAD]
 
ValueT values [ITEMS_PER_THREAD]
 
UnsignedBitsT default_key_bits = (IS_DESCENDING) ? Traits<KeyT>::LOWEST_KEY : Traits<KeyT>::MAX_KEY
 
KeyT default_key = reinterpret_cast<KeyT&>(default_key_bits)
 
KeyT const ValueT ValueT OffsetIteratorT d_begin_offsets
 [in] Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
 
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT d_end_offsets
 [in] Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith segment is considered empty.
 
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int num_segments
 [in] The number of segments that comprise the sorting data
 
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int int pass_bits
 < [in] Number of bits of current radix digit More...
 
OffsetT segment_begin = d_begin_offsets[blockIdx.x]
 
OffsetT segment_end = d_end_offsets[blockIdx.x]
 
OffsetT bin_count [BINS_TRACKED_PER_THREAD]
 
OffsetT bin_offset [BINS_TRACKED_PER_THREAD]
 
OutputIteratorT d_out
 < [in] Pointer to the input sequence of data items More...
 
OutputIteratorT OffsetT GridEvenShare< OffsetT > ReductionOpT reduction_op
 < [in] Binary reduction functor More...
 
OutputT block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share)
 
OutputIteratorT OffsetT ReductionOpT OutputT init
 < [in] The initial value of the reduction More...
 
UniqueOutputIteratorT d_unique_out
 < Pointer to the input sequence of keys More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT d_aggregates_out
 Pointer to the output sequence of value aggregates (one aggregate per run) More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT d_num_runs_out
 Pointer to total number of runs encountered (i.e., the length of d_unique_out) More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT tile_state
 Tile status interface. More...
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int start_tile
 The starting tile for the current grid.
 
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int EqualityOpT equality_op
 KeyT equality operator. More...
 
OffsetsOutputIteratorT d_offsets_out
 < [in] Pointer to input sequence of data items More...
 
OffsetsOutputIteratorT LengthsOutputIteratorT d_lengths_out
 [out] Pointer to output sequence of run-lengths
 
OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT ScanTileStateT tile_status
 [in] Tile status interface
 
OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT ScanTileStateT EqualityOpT OffsetT int num_tiles
 < [in] Total number of tiles for the entire problem More...
 
OutputIteratorT ScanTileStateT int ScanOpT scan_op
 Binary scan functor.
 
OutputIteratorT ScanTileStateT int ScanOpT InitValueT init_value
 Initial value to seed the exclusive scan.
 
FlagsInputIteratorT d_flags
 < [in] Pointer to the input sequence of data items More...
 
FlagsInputIteratorT SelectedOutputIteratorT d_selected_out
 [out] Pointer to the output sequence of selected data items
 
FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT d_num_selected_out
 [out] Pointer to the total number of items selected (i.e., length of d_selected_out)
 
FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT select_op
 [in] Selection operator
 
OffsetT spmv_params
 [in] SpMV input parameter bundle
 
OffsetT CoordinateT * d_tile_coordinates
 [in] Pointer to the temporary array of tile starting coordinates
 
OffsetT CoordinateT KeyValuePair< OffsetT, ValueT > * d_tile_carry_pairs
 [out] Pointer to the temporary array carry-out dot product row-ids, one per block
 
OffsetT CoordinateT KeyValuePair< OffsetT, ValueT > int ScanTileStateT int num_segment_fixup_tiles
 < [in] Number of reduce-by-key tiles (fixup grid size) More...
 

Detailed Description

Optional outer namespace(s)

CUB namespace

Enumeration Type Documentation

◆ anonymous enum

anonymous enum
Enumerator
BINS_TRACKED_PER_THREAD 

Number of bin-starting offsets tracked per thread.

Definition at line 371 of file dispatch_radix_sort.cuh.

◆ BlockHistogramAlgorithm

BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.

Enumerator
BLOCK_HISTO_SORT 
Overview
Sorting followed by differentiation. Execution is comprised of two phases:
  1. Sort the data using efficient radix sort
  2. Look for "runs" of same-valued keys by detecting discontinuities; the run-lengths are histogram bin counts.
Performance Considerations
Delivers consistent throughput regardless of sample bin distribution.
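The sort-then-differentiate strategy has a simple sequential analogue: sort the samples, then every run of equal bin ids contributes its length as that bin's count. A host-side sketch (a sequential stand-in for the block-parallel radix sort plus discontinuity detection; samples are assumed to already be valid bin ids):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host sketch of BLOCK_HISTO_SORT: histogram by sorting the samples and
// measuring run lengths between discontinuities.
std::vector<int> histo_by_sort(std::vector<int> samples, int num_bins)
{
    std::sort(samples.begin(), samples.end());          // phase 1: sort
    std::vector<int> counts(num_bins, 0);
    for (size_t i = 0; i < samples.size(); ) {          // phase 2: find runs
        size_t run_end = i;
        while (run_end < samples.size() && samples[run_end] == samples[i])
            ++run_end;
        counts[samples[i]] = (int)(run_end - i);        // run length = bin count
        i = run_end;
    }
    return counts;
}
```

Because the cost is dominated by the sort, throughput is independent of how the samples are distributed across bins, unlike the atomic variant below.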
BLOCK_HISTO_ATOMIC 
Overview
Use atomic addition to update byte counts directly
Performance Considerations
Performance is strongly tied to the hardware implementation of atomic addition, and may be significantly degraded for non-uniformly-random input distributions where many concurrent updates are likely to be made to the same bin counter.

Definition at line 56 of file block_histogram.cuh.

◆ BlockHistogramMemoryPreference

◆ BlockLoadAlgorithm

cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.


Enumerator
BLOCK_LOAD_DIRECT 
Overview

A blocked arrangement of data is read directly from memory.

Performance Considerations
  • The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
BLOCK_LOAD_VECTORIZE 
Overview

A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).
  • The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
    • ITEMS_PER_THREAD is odd
    • The InputIteratorT is not a simple pointer type
    • The block input offset is not quadword-aligned
    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
BLOCK_LOAD_TRANSPOSE 
Overview

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • The local reordering incurs slightly longer latencies and throughput than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
BLOCK_LOAD_WARP_TRANSPOSE 
Overview

A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • The local reordering incurs slightly larger latencies than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
  • Provisions more shared storage, but incurs smaller latencies than the BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.
BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED 
Overview

Like BLOCK_LOAD_WARP_TRANSPOSE, a warp-striped arrangement of data is read directly from memory and then locally transposed into a blocked arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.

Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.

Definition at line 473 of file block_load.cuh.

◆ BlockReduceAlgorithm

BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.

Enumerator
BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY 
Overview
An efficient "raking" reduction algorithm that only supports commutative reduction operators (true for most operations, e.g., addition).
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Threads in warps other than the first warp place their partial reductions into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within the first warp continue to accumulate by raking across segments of shared partial reductions
  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.
block_reduce.png
BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant performs less communication than BLOCK_REDUCE_RAKING_NON_COMMUTATIVE and is preferable when the reduction operator is commutative. This variant applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making it less desirable when the GPU is under-occupied.
BLOCK_REDUCE_RAKING 
Overview
An efficient "raking" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.
block_reduce.png
BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant performs more communication than BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY and is only preferable when the reduction operator is non-commutative. This variant applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making it less desirable when the GPU is under-occupied.
BLOCK_REDUCE_WARP_REDUCTIONS 
Overview
A quick "tiled warp-reductions" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Compute a shallow, but inefficient warp-synchronous Kogge-Stone style reduction within each warp.
  3. A propagation phase where the warp reduction outputs in each warp are updated with the aggregate from each preceding warp.
block_scan_warpscans.png
BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant applies more reduction operators than BLOCK_REDUCE_RAKING or BLOCK_REDUCE_RAKING_NON_COMMUTATIVE, which may result in lower overall throughput across the GPU. However, turn-around latency may be lower, and thus this variant is useful when the GPU is under-occupied.

Definition at line 60 of file block_reduce.cuh.

◆ BlockScanAlgorithm

BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.

Enumerator
BLOCK_SCAN_RAKING 
Overview
An efficient "raking reduce-then-scan" prefix scan algorithm. Execution is comprised of five phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
  3. A warp-synchronous Kogge-Stone style exclusive scan within the raking warp.
  4. Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output.
  5. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
block_scan_raking.png
BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • Although this variant may suffer longer turnaround latencies when the GPU is under-occupied, it can often provide higher overall throughput across the GPU when suitably occupied.
BLOCK_SCAN_RAKING_MEMOIZE 
Overview
Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at the expense of higher register pressure. Raking threads preserve their "upsweep" segment of values in registers while performing warp-synchronous scan, allowing the "downsweep" not to re-read them from shared memory.
BLOCK_SCAN_WARP_SCANS 
Overview
A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. A shallow but inefficient warp-synchronous Kogge-Stone style scan within each warp.
  3. A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.
  4. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the warp-scan output.
block_scan_warpscans.png
BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • Although this variant may suffer lower overall throughput across the GPU due to its heavy reliance on inefficient warpscans, it can often provide lower turnaround latencies when the GPU is under-occupied.

Definition at line 57 of file block_scan.cuh.

◆ BlockStoreAlgorithm

cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.

Enumerator
BLOCK_STORE_DIRECT 
Overview

A blocked arrangement of data is written directly to memory.

Performance Considerations
  • The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
BLOCK_STORE_VECTORIZE 
Overview

A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. For example, st.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).
  • The following conditions will prevent vectorization and writing will fall back to cub::BLOCK_STORE_DIRECT:
    • ITEMS_PER_THREAD is odd
    • The OutputIteratorT is not a simple pointer type
    • The block output offset is not quadword-aligned
    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
BLOCK_STORE_TRANSPOSE 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • The local reordering incurs slightly longer latencies and lower throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
BLOCK_STORE_WARP_TRANSPOSE 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement.
Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • The local reordering incurs slightly longer latencies and lower throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.
Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.

Definition at line 354 of file block_store.cuh.

◆ RadixRankAlgorithm

Radix ranking algorithm

Definition at line 62 of file agent_radix_sort_downsweep.cuh.

◆ ScanTileStatus

Enumerations of tile status

Definition at line 105 of file single_pass_scan_operators.cuh.

Function Documentation

◆ __launch_bounds__() [1/15]

template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)  )

< Binary reduction functor type having member T operator()(const T &a, const T &b)

Reduce region kernel entry point (multi-block). Computes privatized reductions, one per thread block.

◆ __launch_bounds__() [2/15]

template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename OffsetT >
cub::__launch_bounds__ ( int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS)  ) const

< Signed integer type for global offsets

Upsweep digit-counting kernel entry point (multi-block). Computes privatized digit histograms, one per block.

◆ __launch_bounds__() [3/15]

template<typename AgentReduceByKeyPolicyT , typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename ReductionOpT , typename OffsetT >
cub::__launch_bounds__ ( int(AgentReduceByKeyPolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Multi-block reduce-by-key sweep kernel entry point

◆ __launch_bounds__() [4/15]

template<typename AgentRlePolicyT , typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename OffsetT >
cub::__launch_bounds__ ( int(AgentRlePolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Select kernel entry point (multi-block)

Performs functor-based selection if the SelectOp functor type != NullType. Otherwise, performs flag-based selection if FlagIterator's value type != NullType. Otherwise, performs discontinuity selection (keep unique).

◆ __launch_bounds__() [5/15]

template<typename AgentSelectIfPolicyT , typename InputIteratorT , typename FlagsInputIteratorT , typename SelectedOutputIteratorT , typename NumSelectedIteratorT , typename ScanTileStateT , typename SelectOpT , typename EqualityOpT , typename OffsetT , bool KEEP_REJECTS>
cub::__launch_bounds__ ( int(AgentSelectIfPolicyT::BLOCK_THREADS)  )

< Whether or not we push rejected items to the back of the output

Select kernel entry point (multi-block)

Performs functor-based selection if the SelectOpT functor type != NullType. Otherwise, performs flag-based selection if FlagsInputIterator's value type != NullType. Otherwise, performs discontinuity selection (keep unique).

◆ __launch_bounds__() [6/15]

template<typename AgentHistogramPolicyT , int PRIVATIZED_SMEM_BINS, int NUM_CHANNELS, int NUM_ACTIVE_CHANNELS, typename SampleIteratorT , typename CounterT , typename PrivatizedDecodeOpT , typename OutputDecodeOpT , typename OffsetT >
cub::__launch_bounds__ ( int(AgentHistogramPolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Histogram privatized sweep kernel entry point (multi-block). Computes privatized histograms, one per thread block.

◆ __launch_bounds__() [7/15]

template<typename ScanPolicyT , typename InputIteratorT , typename OutputIteratorT , typename ScanTileStateT , typename ScanOpT , typename InitValueT , typename OffsetT >
cub::__launch_bounds__ ( int(ScanPolicyT::BLOCK_THREADS)  )

< Signed integer type for global offsets

Scan kernel entry point (multi-block)

◆ __launch_bounds__() [8/15]

template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS)  ,
 
)

< Data element type that is convertible to the value type of OutputIteratorT

Reduce a single tile kernel entry point (single-block). Can be used to aggregate privatized thread block reductions from a previous multi-block reduction pass.

◆ __launch_bounds__() [9/15]

template<typename ChainedPolicyT , typename OffsetT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS)  ,
 
)

< Signed integer type for global offsets

Spine scan kernel entry point (single-block). Computes an exclusive prefix sum over the privatized digit histograms

◆ __launch_bounds__() [10/15]

template<typename SpmvPolicyT , typename ScanTileStateT , typename ValueT , typename OffsetT , typename CoordinateT , bool HAS_ALPHA, bool HAS_BETA>
cub::__launch_bounds__ ( int(SpmvPolicyT::BLOCK_THREADS)  )

< Whether the input parameter Beta is 0

Spmv agent entry point

◆ __launch_bounds__() [11/15]

template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
cub::__launch_bounds__ ( int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS)  ) const

< Signed integer type for global offsets

Downsweep pass kernel entry point (multi-block). Scatters keys (and values) into corresponding bins for the current digit place.

◆ __launch_bounds__() [12/15]

template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)  )

< Data element type that is convertible to the value type of OutputIteratorT

Segmented reduction (one block per segment)

◆ __launch_bounds__() [13/15]

template<typename AgentSegmentFixupPolicyT , typename PairsInputIteratorT , typename AggregatesOutputIteratorT , typename OffsetT , typename ScanTileStateT >
cub::__launch_bounds__ ( int(AgentSegmentFixupPolicyT::BLOCK_THREADS)  )

< Tile status interface type

Multi-block reduce-by-key sweep kernel entry point

◆ __launch_bounds__() [14/15]

template<typename ChainedPolicyT , bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT >
cub::__launch_bounds__ ( int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS)  ,
 
) const

< Signed integer type for global offsets

Single pass kernel entry point (single-block). Fully sorts a tile of input.

◆ __launch_bounds__() [15/15]

template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetIteratorT , typename OffsetT >
cub::__launch_bounds__ ( int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS)  ) const

< Signed integer type for global offsets

Segmented radix sorting pass (one block per segment)

◆ DeviceCompactInitKernel()

template<typename ScanTileStateT , typename NumSelectedIteratorT >
__global__ void cub::DeviceCompactInitKernel ( ScanTileStateT  tile_state,
int  num_tiles,
NumSelectedIteratorT  d_num_selected_out 
)

< Output iterator type for recording the number of items selected

Initialization kernel for tile status (multi-block)

Parameters
[in]tile_stateTile status interface
[in]num_tilesNumber of tiles
[out]d_num_selected_outPointer to the total number of items selected (i.e., length of d_selected_out)

Definition at line 78 of file dispatch_scan.cuh.

◆ DeviceHistogramInitKernel()

template<int NUM_ACTIVE_CHANNELS, typename CounterT , typename OffsetT >
__global__ void cub::DeviceHistogramInitKernel ( ArrayWrapper< int, NUM_ACTIVE_CHANNELS >  num_output_bins_wrapper,
ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS >  d_output_histograms_wrapper,
GridQueue< int >  tile_queue 
)

< Signed integer type for global offsets

Histogram initialization kernel entry point

Parameters
num_output_bins_wrapperNumber of output histogram bins per channel
d_output_histograms_wrapperHistogram counter data having logical dimensions CounterT[NUM_ACTIVE_CHANNELS][num_bins.array[CHANNEL]]
tile_queueDrain queue descriptor for dynamically mapping tile data onto thread blocks

Definition at line 67 of file dispatch_histogram.cuh.

◆ DeviceScanInitKernel()

template<typename ScanTileStateT >
__global__ void cub::DeviceScanInitKernel ( ScanTileStateT  tile_state,
int  num_tiles 
)

< Tile status interface type

Initialization kernel for tile status (multi-block)

Parameters
[in]tile_stateTile status interface
[in]num_tilesNumber of tiles

Definition at line 64 of file dispatch_scan.cuh.

◆ DeviceSpmv1ColKernel()

template<typename AgentSpmvPolicyT , typename ValueT , typename OffsetT >
__global__ void cub::DeviceSpmv1ColKernel ( SpmvParams< ValueT, OffsetT spmv_params)

< Signed integer type for sequence offsets

SpMV kernel entry point specialized for the single-column case.

Parameters
[in]spmv_paramsSpMV input parameter bundle

Definition at line 68 of file dispatch_spmv_orig.cuh.

◆ DeviceSpmvSearchKernel()

template<typename SpmvPolicyT , typename OffsetT , typename CoordinateT , typename SpmvParamsT >
__global__ void cub::DeviceSpmvSearchKernel ( int  num_merge_tiles,
CoordinateT *  d_tile_coordinates,
SpmvParamsT  spmv_params 
)

< SpmvParams type

Spmv search kernel. Identifies merge path starting coordinates for each tile.


Parameters
[in]num_merge_tilesNumber of SpMV merge tiles (spmv grid size)
[out]d_tile_coordinatesPointer to the temporary array of tile starting coordinates
[in]spmv_paramsSpMV input parameter bundle

Definition at line 104 of file dispatch_spmv_orig.cuh.

◆ LowerBound()

template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT cub::LowerBound ( InputIteratorT  input,
OffsetT  num_items,
T  val 
)

Returns the offset of the first value within input which does not compare less than val.

Parameters
[in]inputInput sequence
[in]num_itemsInput sequence length
[in]valSearch key

Definition at line 95 of file thread_search.cuh.

◆ MatchAny()

template<int LABEL_BITS>
__device__ unsigned int cub::MatchAny ( unsigned int  label)
inline

Compute a 32-bit mask of threads whose label agrees with the calling thread's label in the least-significant LABEL_BITS bits.

Definition at line 703 of file util_ptx.cuh.

◆ MergePathSearch()

template<typename AIteratorT , typename BIteratorT , typename OffsetT , typename CoordinateT >
__host__ __device__ __forceinline__ void cub::MergePathSearch ( OffsetT  diagonal,
AIteratorT  a,
BIteratorT  b,
OffsetT  a_len,
OffsetT  b_len,
CoordinateT &  path_coordinate 
)

Computes the begin offsets into A and B for the specific diagonal


Definition at line 53 of file thread_search.cuh.

◆ UpperBound()

template<typename InputIteratorT , typename OffsetT , typename T >
__device__ __forceinline__ OffsetT cub::UpperBound ( InputIteratorT  input,
OffsetT  num_items,
T  val 
)

Returns the offset of the first value within input which compares greater than val.

Parameters
[in]inputInput sequence
[in]num_itemsInput sequence length
[in]valSearch key

Definition at line 126 of file thread_search.cuh.

Variable Documentation

◆ d_aggregates_out

AggregatesOutputIteratorT cub::d_aggregates_out

Pointer to the output sequence of value aggregates (one aggregate per run)

< [in] Pointer to the array of carry-out dot-product row-ids, one per SpMV block

[in,out] Output value aggregates

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ d_flags

FlagsInputIteratorT cub::d_flags

< [in] Pointer to the input sequence of data items

[in] Pointer to the input sequence of selection flags (if applicable)

Definition at line 78 of file dispatch_select_if.cuh.

◆ d_keys_out

KeyT * cub::d_keys_out

< [in] Input keys buffer

[in] Output keys buffer

Definition at line 164 of file dispatch_radix_sort.cuh.

◆ d_num_runs_out

OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT cub::d_num_runs_out

Pointer to total number of runs encountered (i.e., the length of d_unique_out)

[out] Pointer to total number of runs (i.e., length of d_offsets_out)

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ d_offsets_out

OffsetsOutputIteratorT cub::d_offsets_out

< [in] Pointer to input sequence of data items

[out] Pointer to output sequence of run-offsets

Definition at line 77 of file dispatch_rle.cuh.

◆ d_out

OutputIteratorT cub::d_out

< [in] Pointer to the input sequence of data items

< Input data

[out] Pointer to the output aggregate

Output data

Definition at line 71 of file dispatch_reduce.cuh.

◆ d_spine

KeyT const ValueT ValueT OffsetT * cub::d_spine

< [in] Input keys buffer

[in] Scan of privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)

[out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)

Definition at line 74 of file dispatch_radix_sort.cuh.

◆ d_unique_out

UniqueOutputIteratorT cub::d_unique_out

< Pointer to the input sequence of keys

Pointer to the output sequence of unique keys (one key per run)

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ d_values_in

UniqueOutputIteratorT ValuesInputIteratorT cub::d_values_in

[in] Input values buffer

Pointer to the input sequence of corresponding values.

Definition at line 164 of file dispatch_radix_sort.cuh.

◆ end_bit

cub::end_bit

< [in] The past-the-end (most-significant) bit index needed for key comparison

Definition at line 220 of file dispatch_radix_sort.cuh.

◆ equality_op

FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT EqualityOpT cub::equality_op

KeyT equality operator.

[in] Equality operator

[in] Equality operator for input items

Definition at line 74 of file dispatch_reduce_by_key.cuh.

◆ even_share

OutputIteratorT OffsetT GridEvenShare< OffsetT > cub::even_share

< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block

[in] Even-share descriptor for mapping an equal number of tiles onto each thread block

Definition at line 79 of file dispatch_radix_sort.cuh.

◆ init

OutputIteratorT OffsetIteratorT OffsetIteratorT ReductionOpT OutputT cub::init

< [in] The initial value of the reduction

Definition at line 119 of file dispatch_reduce.cuh.

◆ num_counts

int cub::num_counts

< [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)

< [in] Total number of bin-counts

Definition at line 121 of file dispatch_radix_sort.cuh.

◆ num_items

AggregatesOutputIteratorT OffsetT cub::num_items = segment_end - segment_begin

[in] Total number of input data items

[in] Total number of items to select from

< Total number of scan items for the entire problem

[in] Total number of input items (i.e., length of d_in)

< Total number of items to select from

Definition at line 164 of file dispatch_radix_sort.cuh.

◆ num_output_bins_wrapper

ArrayWrapper<int, NUM_ACTIVE_CHANNELS> cub::num_output_bins_wrapper

< Input data to reduce

The number of bins per final output histogram

Definition at line 102 of file dispatch_histogram.cuh.

◆ num_segment_fixup_tiles

OffsetT CoordinateT KeyValuePair<OffsetT,ValueT> int ScanTileStateT int cub::num_segment_fixup_tiles

< [in] Number of reduce-by-key tiles (fixup grid size)

Definition at line 165 of file dispatch_spmv_orig.cuh.

◆ num_tiles

AggregatesOutputIteratorT OffsetT int cub::num_tiles

< [in] Total number of tiles for the entire problem

[in] Total number of tiles for the entire problem

[in] Number of merge tiles

Definition at line 84 of file dispatch_rle.cuh.

◆ pass_bits

KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int int cub::pass_bits

< [in] Number of bits of current radix digit

Definition at line 339 of file dispatch_radix_sort.cuh.

◆ reduction_op

UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int EqualityOpT ReductionOpT cub::reduction_op

< [in] Binary reduction functor

ValueT reduction operator.

[in] Binary reduction functor

Definition at line 75 of file dispatch_reduce.cuh.

◆ tile_queue

ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT int GridQueue<int> cub::tile_queue

< Drain queue descriptor for dynamically mapping tile data onto thread blocks

Definition at line 113 of file dispatch_histogram.cuh.

◆ tile_state

AggregatesOutputIteratorT OffsetT int ScanTileStateT cub::tile_state

Tile status interface.

< [in] Tile status interface

[in] Tile status interface for fixup reduce-by-key kernel

Definition at line 74 of file dispatch_reduce_by_key.cuh.