Optional outer namespace(s) More...
Namespaces | |
| namespace | internal | 
| Internal namespace (to prevent ADL mishaps between static functions when mixing different CUB installations)  | |
Data Structures | |
| struct | AgentHistogram | 
| AgentHistogram implements a stateful abstraction of CUDA thread blocks for participating in device-wide histogram .  More... | |
| struct | AgentHistogramPolicy | 
| < Whether to dequeue tiles from a global work queue  More... | |
| struct | AgentRadixSortDownsweep | 
| AgentRadixSortDownsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort downsweep .  More... | |
| struct | AgentRadixSortDownsweepPolicy | 
| < The number of radix bits, i.e., log2(bins)  More... | |
| struct | AgentRadixSortUpsweep | 
| AgentRadixSortUpsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort upsweep .  More... | |
| struct | AgentRadixSortUpsweepPolicy | 
| < The number of radix bits, i.e., log2(bins)  More... | |
| struct | AgentReduce | 
| AgentReduce implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduction .  More... | |
| struct | AgentReduceByKey | 
| AgentReduceByKey implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key.  More... | |
| struct | AgentReduceByKeyPolicy | 
| < The BlockScan algorithm to use  More... | |
| struct | AgentReducePolicy | 
| < Cache load modifier for reading input elements  More... | |
| struct | AgentRle | 
| AgentRle implements a stateful abstraction of CUDA thread blocks for participating in device-wide run-length-encode.  More... | |
| struct | AgentRlePolicy | 
| < The BlockScan algorithm to use  More... | |
| struct | AgentScan | 
| AgentScan implements a stateful abstraction of CUDA thread blocks for participating in device-wide prefix scan .  More... | |
| struct | AgentScanPolicy | 
| < The BlockScan algorithm to use  More... | |
| struct | AgentSegmentFixup | 
| AgentSegmentFixup implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key.  More... | |
| struct | AgentSegmentFixupPolicy | 
| < The BlockScan algorithm to use  More... | |
| struct | AgentSelectIf | 
| AgentSelectIf implements a stateful abstraction of CUDA thread blocks for participating in device-wide selection.  More... | |
| struct | AgentSelectIfPolicy | 
| < The BlockScan algorithm to use  More... | |
| struct | AgentSpmv | 
| AgentSpmv implements a stateful abstraction of CUDA thread blocks for participating in device-wide SpMV.  More... | |
| struct | AgentSpmvPolicy | 
| < The BlockScan algorithm to use  More... | |
| struct | AlignBytes | 
| Structure alignment.  More... | |
| struct | AlignBytes< const T > | 
| struct | AlignBytes< const volatile T > | 
| struct | AlignBytes< volatile T > | 
| class | ArgIndexInputIterator | 
A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples).  More... | |
| struct | ArgMax | 
| Arg max functor (keeps the value and offset of the first occurrence of the larger item)  More... | |
| struct | ArgMin | 
| Arg min functor (keeps the value and offset of the first occurrence of the smallest item)  More... | |
| struct | ArrayWrapper | 
| A wrapper for passing simple static arrays as kernel parameters.  More... | |
| struct | BaseTraits | 
| Basic type traits.  More... | |
| struct | BaseTraits< FLOATING_POINT, true, false, _UnsignedBits, T > | 
| struct | BaseTraits< SIGNED_INTEGER, true, false, _UnsignedBits, T > | 
| struct | BaseTraits< UNSIGNED_INTEGER, true, false, _UnsignedBits, T > | 
| struct | BinaryOpHasIdxParam | 
Determine whether or not BinaryOp's functor is of the form bool operator()(const T& a, const T&b) or bool operator()(const T& a, const T&b, unsigned int idx)  More... | |
| class | BlockAdjacentDifference | 
| class | BlockDiscontinuity | 
| The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.  More... | |
| class | BlockExchange | 
| The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block.  More... | |
| class | BlockHistogram | 
| The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.  More... | |
| struct | BlockHistogramAtomic | 
| The BlockHistogramAtomic class provides atomic-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.  More... | |
| struct | BlockHistogramSort | 
| The BlockHistogramSort class provides sorting-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.  More... | |
| class | BlockLoad | 
| The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.  More... | |
| class | BlockRadixRank | 
| BlockRadixRank provides operations for ranking unsigned integer types within a CUDA thread block.  More... | |
| class | BlockRadixRankMatch | 
| class | BlockRadixSort | 
| The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method.  More... | |
| struct | BlockRakingLayout | 
| BlockRakingLayout provides a conflict-free shared memory layout abstraction for 1D raking across thread block data.  More... | |
| class | BlockReduce | 
| The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.  More... | |
| struct | BlockReduceRaking | 
| BlockReduceRaking provides raking-based methods of parallel reduction across a CUDA thread block. Supports non-commutative reduction operators.  More... | |
| struct | BlockReduceRakingCommutativeOnly | 
| BlockReduceRakingCommutativeOnly provides raking-based methods of parallel reduction across a CUDA thread block. Does not support non-commutative reduction operators. Does not support block sizes that are not a multiple of the warp size.  More... | |
| struct | BlockReduceWarpReductions | 
| BlockReduceWarpReductions provides variants of warp-reduction-based parallel reduction across a CUDA thread block. Supports non-commutative reduction operators.  More... | |
| class | BlockScan | 
| The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.  More... | |
| struct | BlockScanRaking | 
| BlockScanRaking provides variants of raking-based parallel prefix scan across a CUDA thread block.  More... | |
| struct | BlockScanRunningPrefixOp | 
| < Wrapped scan operator type  More... | |
| struct | BlockScanWarpScans | 
| BlockScanWarpScans provides warpscan-based variants of parallel prefix scan across a CUDA thread block.  More... | |
| class | BlockShuffle | 
| The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block.  More... | |
| class | BlockStore | 
| The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.  More... | |
| class | CacheModifiedInputIterator | 
| A random-access input wrapper for dereferencing array values using a PTX cache load modifier.  More... | |
| class | CacheModifiedOutputIterator | 
| A random-access output wrapper for storing array values using a PTX cache-modifier.  More... | |
| struct | CachingDeviceAllocator | 
| A simple caching allocator for device memory allocations.  More... | |
| struct | CastOp | 
| Default cast functor.  More... | |
| struct | ChainedPolicy | 
| Helper for dispatching into a policy chain.  More... | |
| struct | ChainedPolicy< PTX_VERSION, PolicyT, PolicyT > | 
| Helper for dispatching into a policy chain (end-of-chain specialization)  More... | |
| class | ConstantInputIterator | 
| A random-access input generator for dereferencing a sequence of homogeneous values.  More... | |
| class | CountingInputIterator | 
| A random-access input generator for dereferencing a sequence of incrementing integer values.  More... | |
| struct | CubVector | 
Exposes a member typedef Type that names the corresponding CUDA vector type if one exists. Otherwise Type refers to the CubVector structure itself, which will wrap the corresponding x, y, etc. vector fields.  More... | |
| struct | CubVector< T, 1 > | 
| struct | CubVector< T, 2 > | 
| struct | CubVector< T, 3 > | 
| struct | CubVector< T, 4 > | 
| struct | DeviceHistogram | 
| DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of samples data residing within device-accessible memory.  More... | |
| struct | DevicePartition | 
| DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory.  More... | |
| struct | DeviceRadixSort | 
| DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory.  More... | |
| struct | DeviceRadixSortPolicy | 
| < Signed integer type for global offsets  More... | |
| struct | DeviceReduce | 
| DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.  More... | |
| struct | DeviceReducePolicy | 
< Binary reduction functor type having member T operator()(const T &a, const T &b)  More... | |
| struct | DeviceRleDispatch | 
| < Signed integer type for global offsets  More... | |
| struct | DeviceRunLengthEncode | 
| DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.  More... | |
| struct | DeviceScan | 
| DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.  More... | |
| struct | DeviceSegmentedRadixSort | 
| DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.  More... | |
| struct | DeviceSegmentedReduce | 
| DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.  More... | |
| struct | DeviceSelect | 
| DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.  More... | |
| struct | DeviceSpmv | 
| DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV).  More... | |
| struct | DipatchHistogram | 
| < Signed integer type for global offsets  More... | |
| class | DiscardOutputIterator | 
| A discard iterator.  More... | |
| struct | DispatchRadixSort | 
| < Signed integer type for global offsets  More... | |
| struct | DispatchReduce | 
< Binary reduction functor type having member T operator()(const T &a, const T &b)  More... | |
| struct | DispatchReduceByKey | 
| < Signed integer type for global offsets  More... | |
| struct | DispatchScan | 
| < Signed integer type for global offsets  More... | |
| struct | DispatchSegmentedRadixSort | 
| < Signed integer type for global offsets  More... | |
| struct | DispatchSegmentedReduce | 
< Binary reduction functor type having member T operator()(const T &a, const T &b)  More... | |
| struct | DispatchSelectIf | 
| < Whether or not we push rejected items to the back of the output  More... | |
| struct | DispatchSpmv | 
| < Signed integer type for global offsets  More... | |
| struct | DoubleBuffer | 
| Double-buffer storage wrapper for multi-pass stream transformations that require more than one storage array for streaming intermediate results back and forth.  More... | |
| struct | EnableIf | 
| Simple enable-if (similar to Boost)  More... | |
| struct | EnableIf< false, T > | 
| struct | Equality | 
| Default equality functor.  More... | |
| struct | Equals | 
| Type equality test.  More... | |
| struct | Equals< A, A > | 
| struct | FpLimits | 
| struct | FpLimits< double > | 
| struct | FpLimits< float > | 
| class | GridBarrier | 
| GridBarrier implements a software global barrier among thread blocks within a CUDA grid.  More... | |
| class | GridBarrierLifetime | 
| GridBarrierLifetime extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation.  More... | |
| struct | GridEvenShare | 
| GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles.  More... | |
| class | GridQueue | 
| GridQueue is a descriptor utility for dynamic queue management.  More... | |
| struct | If | 
Type selection (IF ? ThenType : ElseType)  More... | |
| struct | If< false, ThenType, ElseType > | 
| struct | Inequality | 
| Default inequality functor.  More... | |
| struct | InequalityWrapper | 
| Inequality functor (wraps equality functor)  More... | |
| struct | Int2Type | 
| Allows for the treatment of an integral constant as a type at compile-time (e.g., to achieve static call dispatch based on constant integral values)  More... | |
| struct | IsPointer | 
| Pointer vs. iterator.  More... | |
| struct | IsPointer< Tp * > | 
| struct | IsVolatile | 
| Volatile modifier test.  More... | |
| struct | IsVolatile< Tp volatile > | 
| struct | IterateThreadLoad | 
| Helper structure for templated load iteration (inductive case)  More... | |
| struct | IterateThreadLoad< MAX, MAX > | 
| Helper structure for templated load iteration (termination case)  More... | |
| struct | IterateThreadStore | 
| Helper structure for templated store iteration (inductive case)  More... | |
| struct | IterateThreadStore< MAX, MAX > | 
| Helper structure for templated store iteration (termination case)  More... | |
| struct | KernelConfig | 
| struct | KeyValuePair | 
| A key identifier paired with a corresponding value.  More... | |
| struct | Log2 | 
| Statically determine log2(N), rounded up.  More... | |
| struct | Log2< N, 0, COUNT > | 
| struct | Max | 
| Default max functor.  More... | |
| struct | Min | 
| Default min functor.  More... | |
| struct | Mutex | 
| struct | NullType | 
| A simple "NULL" marker type.  More... | |
| struct | NumericTraits | 
| Numeric type traits.  More... | |
| struct | NumericTraits< bool > | 
| struct | NumericTraits< char > | 
| struct | NumericTraits< double > | 
| struct | NumericTraits< float > | 
| struct | NumericTraits< int > | 
| struct | NumericTraits< long > | 
| struct | NumericTraits< long long > | 
| struct | NumericTraits< NullType > | 
| struct | NumericTraits< short > | 
| struct | NumericTraits< signed char > | 
| struct | NumericTraits< unsigned char > | 
| struct | NumericTraits< unsigned int > | 
| struct | NumericTraits< unsigned long > | 
| struct | NumericTraits< unsigned long long > | 
| struct | NumericTraits< unsigned short > | 
| struct | PowerOfTwo | 
| Statically determine if N is a power-of-two.  More... | |
| struct | ReduceByKeyOp | 
| < Binary reduction operator to apply to values  More... | |
| struct | ReduceByKeyScanTileState | 
| struct | ReduceByKeyScanTileState< ValueT, KeyT, false > | 
| struct | ReduceByKeyScanTileState< ValueT, KeyT, true > | 
| struct | ReduceBySegmentOp | 
| Reduce-by-segment functor.  More... | |
| struct | RemoveQualifiers | 
Removes const and volatile qualifiers from type Tp.  More... | |
| struct | RemoveQualifiers< Tp, const Up > | 
| struct | RemoveQualifiers< Tp, const volatile Up > | 
| struct | RemoveQualifiers< Tp, volatile Up > | 
| struct | ScanTileState | 
| struct | ScanTileState< T, false > | 
| struct | ScanTileState< T, true > | 
| struct | SpmvParams | 
| < Signed integer type for sequence offsets  More... | |
| struct | Sum | 
| Default sum functor.  More... | |
| class | SwizzleScanOp | 
| Binary operator wrapper for switching non-commutative scan arguments.  More... | |
| union | TempStorage | 
| class | TexObjInputIterator | 
| A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects.  More... | |
| struct | TilePrefixCallbackOp | 
| struct | Traits | 
| Type traits.  More... | |
| class | TransformInputIterator | 
| A random-access input wrapper for transforming dereferenced values.  More... | |
| struct | Uninitialized | 
| A storage-backing wrapper that allows types with non-trivial constructors to be aliased in unions.  More... | |
| struct | UnitWord | 
| Unit-words of data movement.  More... | |
| struct | UnitWord< char2 > | 
| struct | UnitWord< const T > | 
| struct | UnitWord< const volatile T > | 
| struct | UnitWord< float2 > | 
| struct | UnitWord< float4 > | 
| struct | UnitWord< volatile T > | 
| class | WarpExchange | 
| class | WarpReduce | 
| The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp.  More... | |
| struct | WarpReduceShfl | 
| WarpReduceShfl provides SHFL-based variants of parallel reduction of items partitioned across a CUDA thread warp.  More... | |
| struct | WarpReduceSmem | 
| WarpReduceSmem provides smem-based variants of parallel reduction of items partitioned across a CUDA thread warp.  More... | |
| class | WarpScan | 
| The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.  More... | |
| struct | WarpScanShfl | 
| WarpScanShfl provides SHFL-based variants of parallel prefix scan of items partitioned across a CUDA thread warp.  More... | |
| struct | WarpScanSmem | 
| WarpScanSmem provides smem-based variants of parallel prefix scan of items partitioned across a CUDA thread warp.  More... | |
Typedefs | |
| typedef AgentRadixSortUpsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltUpsweepPolicy, typename ChainedPolicyT::ActivePolicy::UpsweepPolicy >::Type, KeyT, OffsetT > | AgentRadixSortUpsweepT | 
| typedef AgentRadixSortDownsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltDownsweepPolicy, typename ChainedPolicyT::ActivePolicy::DownsweepPolicy >::Type, IS_DESCENDING, KeyT, ValueT, OffsetT > | AgentRadixSortDownsweepT | 
| typedef BlockRadixSort< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ValueT, ChainedPolicyT::ActivePolicy::SingleTilePolicy::RADIX_BITS,(ChainedPolicyT::ActivePolicy::SingleTilePolicy::RANK_ALGORITHM==RADIX_RANK_MEMOIZE), ChainedPolicyT::ActivePolicy::SingleTilePolicy::SCAN_ALGORITHM > | BlockRadixSortT | 
| typedef BlockLoad< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > | BlockLoadKeys | 
| typedef BlockLoad< ValueT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > | BlockLoadValues | 
| typedef Traits< KeyT >::UnsignedBits | UnsignedBitsT | 
| typedef AgentRadixSortUpsweep< AgentRadixSortUpsweepPolicy< BLOCK_THREADS, ITEMS_PER_THREAD, SegmentedPolicyT::LOAD_MODIFIER, RADIX_BITS >, KeyT, OffsetT > | BlockUpsweepT | 
| typedef BlockScan< OffsetT, BLOCK_THREADS > | DigitScanT | 
| typedef AgentRadixSortDownsweep< SegmentedPolicyT, IS_DESCENDING, KeyT, ValueT, OffsetT > | BlockDownsweepT | 
| typedef AgentReduce< typename ChainedPolicyT::ActivePolicy::ReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT > | AgentReduceT | 
Enumerations | |
| enum | BlockHistogramMemoryPreference { GMEM , SMEM , BLEND } | 
| enum | RadixRankAlgorithm { RADIX_RANK_BASIC , RADIX_RANK_MEMOIZE , RADIX_RANK_MATCH } | 
| enum | ScanTileStatus { SCAN_TILE_OOB , SCAN_TILE_INVALID = 99 , SCAN_TILE_PARTIAL , SCAN_TILE_INCLUSIVE } | 
| enum | BlockHistogramAlgorithm { BLOCK_HISTO_SORT , BLOCK_HISTO_ATOMIC } | 
| BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.  More... | |
| enum | BlockLoadAlgorithm {  BLOCK_LOAD_DIRECT , BLOCK_LOAD_VECTORIZE , BLOCK_LOAD_TRANSPOSE , BLOCK_LOAD_WARP_TRANSPOSE , BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED }  | 
| cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.  More... | |
| enum | BlockReduceAlgorithm { BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY , BLOCK_REDUCE_RAKING , BLOCK_REDUCE_WARP_REDUCTIONS } | 
| enum | BlockScanAlgorithm { BLOCK_SCAN_RAKING , BLOCK_SCAN_RAKING_MEMOIZE , BLOCK_SCAN_WARP_SCANS } | 
| BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.  More... | |
| enum | BlockStoreAlgorithm {  BLOCK_STORE_DIRECT , BLOCK_STORE_VECTORIZE , BLOCK_STORE_TRANSPOSE , BLOCK_STORE_WARP_TRANSPOSE , BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED }  | 
| cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.  More... | |
| enum | {  BLOCK_THREADS = SegmentedPolicyT::BLOCK_THREADS , ITEMS_PER_THREAD = SegmentedPolicyT::ITEMS_PER_THREAD , RADIX_BITS = SegmentedPolicyT::RADIX_BITS , TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD , RADIX_DIGITS = 1 << RADIX_BITS , KEYS_ONLY = Equals<ValueT, NullType>::VALUE }  | 
| enum | { BINS_TRACKED_PER_THREAD = BlockDownsweepT::BINS_TRACKED_PER_THREAD } | 
| enum | GridMappingStrategy { GRID_MAPPING_RAKE , GRID_MAPPING_STRIP_MINE , GRID_MAPPING_DYNAMIC } | 
| cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks.  More... | |
| enum | CacheLoadModifier {  LOAD_DEFAULT , LOAD_CA , LOAD_CG , LOAD_CS , LOAD_CV , LOAD_LDG , LOAD_VOLATILE }  | 
| Enumeration of cache modifiers for memory load operations.  More... | |
| enum | CacheStoreModifier {  STORE_DEFAULT , STORE_WB , STORE_CG , STORE_CS , STORE_WT , STORE_VOLATILE }  | 
| Enumeration of cache modifiers for memory store operations.  More... | |
| enum | { MAX_VEC_ELEMENTS = 4 } | 
| enum | Category { NOT_A_NUMBER , SIGNED_INTEGER , UNSIGNED_INTEGER , FLOATING_POINT } | 
| Basic type traits categories.  More... | |
Functions | |
| template<int NUM_ACTIVE_CHANNELS, typename CounterT , typename OffsetT > | |
| __global__ void | DeviceHistogramInitKernel (ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_output_bins_wrapper, ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_output_histograms_wrapper, GridQueue< int > tile_queue) | 
| < Signed integer type for global offsets   | |
| template<typename AgentHistogramPolicyT , int PRIVATIZED_SMEM_BINS, int NUM_CHANNELS, int NUM_ACTIVE_CHANNELS, typename SampleIteratorT , typename CounterT , typename PrivatizedDecodeOpT , typename OutputDecodeOpT , typename OffsetT > | |
| __launch_bounds__ (int(AgentHistogramPolicyT::BLOCK_THREADS)) __global__ void DeviceHistogramSweepKernel(SampleIteratorT d_samples | |
| < Signed integer type for global offsets   | |
| AgentHistogramT | agent (temp_storage, d_samples, num_output_bins_wrapper.array, num_privatized_bins_wrapper.array, d_output_histograms_wrapper.array, d_privatized_histograms_wrapper.array, output_decode_op_wrapper.array, privatized_decode_op_wrapper.array) | 
| agent | InitBinCounters () | 
| agent | ConsumeTiles (num_row_pixels, num_rows, row_stride_samples, tiles_per_row, tile_queue) | 
| agent | StoreOutput () | 
| template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename OffsetT > | |
| __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortUpsweepKernel(const KeyT *d_keys | |
| < Signed integer type for global offsets   | |
| even_share template | BlockInit< TILE_ITEMS, GRID_MAPPING_RAKE > () | 
| upsweep | ProcessRegion (even_share.block_offset, even_share.block_end) | 
| CTA_SYNC () | |
| upsweep template | ExtractCounts< IS_DESCENDING > (d_spine, gridDim.x, blockIdx.x) | 
| template<typename ChainedPolicyT , typename OffsetT > | |
| __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS), 1) __global__ void RadixSortScanBinsKernel(OffsetT *d_spine | |
| < Signed integer type for global offsets   | |
| AgentScanT | block_scan (temp_storage, d_spine, d_spine, cub::Sum(), OffsetT(0)) | 
| while (block_offset+AgentScanT::TILE_ITEMS<=num_counts) | |
| template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT > | |
| __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortDownsweepKernel(const KeyT *d_keys_in | |
| < Signed integer type for global offsets   | |
| AgentRadixSortDownsweepT (temp_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, num_bits).ProcessRegion(even_share.block_offset | |
| template<typename ChainedPolicyT , bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT > | |
| __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceRadixSortSingleTileKernel(const KeyT *d_keys_in | |
| < Signed integer type for global offsets   | |
| BlockLoadKeys (temp_storage.load_keys).Load(d_keys_in | |
| if (!KEYS_ONLY) | |
| BlockRadixSortT (temp_storage.sort).SortBlockedToStriped(keys | |
| Int2Type< IS_DESCENDING > () | |
| Int2Type< KEYS_ONLY > ()) | |
| for (int ITEM=0;ITEM< ITEMS_PER_THREAD;++ITEM) | |
| template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetIteratorT , typename OffsetT > | |
| __launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedRadixSortKernel(const KeyT *d_keys_in | |
| < Signed integer type for global offsets   | |
| if (num_items<=0) return | |
| BlockUpsweepT | upsweep (temp_storage.upsweep, d_keys_in, current_bit, pass_bits) | 
| upsweep | ProcessRegion (segment_begin, segment_end) | 
| upsweep | ExtractCounts (bin_count) | 
| if (IS_DESCENDING) | |
| DigitScanT (temp_storage.scan).ExclusiveSum(bin_count | |
| BlockDownsweepT | downsweep (temp_storage.downsweep, bin_offset, num_items, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, pass_bits) | 
| template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT > | |
| __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceReduceKernel(InputIteratorT d_in | |
< Binary reduction functor type having member T operator()(const T &a, const T &b)   | |
| if (threadIdx.x==0) d_out[blockIdx.x] = reduction_op(init, block_aggregate) | |
| template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT , typename OuputT > | |
| __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceReduceSingleTileKernel(InputIteratorT d_in | |
< Data element type that is convertible to the value type of OutputIteratorT   | |
| if (num_items==0) | |
| template<typename T , typename OffsetT , typename IteratorT > | |
| __device__ __forceinline__ void | NormalizeReductionOutput (T &, OffsetT, IteratorT) | 
| Normalize input iterator to segment offset.   | |
| template<typename KeyValuePairT , typename OffsetT , typename WrappedIteratorT , typename OutputValueT > | |
| __device__ __forceinline__ void | NormalizeReductionOutput (KeyValuePairT &val, OffsetT base_offset, ArgIndexInputIterator< WrappedIteratorT, OffsetT, OutputValueT >) | 
| Normalize input iterator to segment offset (specialized for arg-index)   | |
| template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT > | |
| __launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedReduceKernel(InputIteratorT d_in | |
< Data element type that is convertible to the value type of OutputIteratorT   | |
| if (segment_begin==segment_end) | |
| NormalizeReductionOutput (block_aggregate, segment_begin, d_in) | |
| template<typename AgentReduceByKeyPolicyT , typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename ReductionOpT , typename OffsetT > | |
| __launch_bounds__ (int(AgentReduceByKeyPolicyT::BLOCK_THREADS)) __global__ void DeviceReduceByKeyKernel(KeysInputIteratorT d_keys_in | |
| < Signed integer type for global offsets   | |
| AgentReduceByKeyT (temp_storage, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, equality_op, reduction_op).ConsumeRange(num_items | |
| template<typename AgentRlePolicyT , typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename OffsetT > | |
| __launch_bounds__ (int(AgentRlePolicyT::BLOCK_THREADS)) __global__ void DeviceRleSweepKernel(InputIteratorT d_in | |
| < Signed integer type for global offsets   | |
| AgentRleT (temp_storage, d_in, d_offsets_out, d_lengths_out, equality_op, num_items).ConsumeRange(num_tiles | |
| template<typename ScanTileStateT > | |
| __global__ void | DeviceScanInitKernel (ScanTileStateT tile_state, int num_tiles) | 
| < Tile status interface type   | |
| template<typename ScanTileStateT , typename NumSelectedIteratorT > | |
| __global__ void | DeviceCompactInitKernel (ScanTileStateT tile_state, int num_tiles, NumSelectedIteratorT d_num_selected_out) | 
| < Output iterator type for recording the number of items selected   | |
| template<typename ScanPolicyT , typename InputIteratorT , typename OutputIteratorT , typename ScanTileStateT , typename ScanOpT , typename InitValueT , typename OffsetT > | |
| __launch_bounds__ (int(ScanPolicyT::BLOCK_THREADS)) __global__ void DeviceScanKernel(InputIteratorT d_in | |
| < Signed integer type for global offsets   | |
| AgentScanT (temp_storage, d_in, d_out, scan_op, init_value).ConsumeRange(num_items | |
| template<typename AgentSelectIfPolicyT , typename InputIteratorT , typename FlagsInputIteratorT , typename SelectedOutputIteratorT , typename NumSelectedIteratorT , typename ScanTileStateT , typename SelectOpT , typename EqualityOpT , typename OffsetT , bool KEEP_REJECTS> | |
| __launch_bounds__ (int(AgentSelectIfPolicyT::BLOCK_THREADS)) __global__ void DeviceSelectSweepKernel(InputIteratorT d_in | |
| < Whether or not we push rejected items to the back of the output   | |
| AgentSelectIfT (temp_storage, d_in, d_flags, d_selected_out, select_op, equality_op, num_items).ConsumeRange(num_tiles | |
| template<typename AgentSpmvPolicyT , typename ValueT , typename OffsetT > | |
| __global__ void | DeviceSpmv1ColKernel (SpmvParams< ValueT, OffsetT > spmv_params) | 
| < Signed integer type for sequence offsets   | |
| template<typename SpmvPolicyT , typename OffsetT , typename CoordinateT , typename SpmvParamsT > | |
| __global__ void | DeviceSpmvSearchKernel (int num_merge_tiles, CoordinateT *d_tile_coordinates, SpmvParamsT spmv_params) | 
| < SpmvParams type   | |
| template<typename SpmvPolicyT , typename ScanTileStateT , typename ValueT , typename OffsetT , typename CoordinateT , bool HAS_ALPHA, bool HAS_BETA> | |
| __launch_bounds__ (int(SpmvPolicyT::BLOCK_THREADS)) __global__ void DeviceSpmvKernel(SpmvParams< ValueT | |
| < Whether the input parameter Beta is 0   | |
| AgentSpmvT (temp_storage, spmv_params).ConsumeTile(d_tile_coordinates | |
| tile_state | InitializeStatus (num_segment_fixup_tiles) | 
| template<typename AgentSegmentFixupPolicyT , typename PairsInputIteratorT , typename AggregatesOutputIteratorT , typename OffsetT , typename ScanTileStateT > | |
| __launch_bounds__ (int(AgentSegmentFixupPolicyT::BLOCK_THREADS)) __global__ void DeviceSegmentFixupKernel(PairsInputIteratorT d_pairs_in | |
| < Tile status interface type   | |
| AgentSegmentFixupT (temp_storage, d_pairs_in, d_aggregates_out, cub::Equality(), cub::Sum()).ConsumeRange(num_items | |
| template<typename OffsetT > | |
| __global__ void | FillAndResetDrainKernel (GridQueue< OffsetT > grid_queue, OffsetT num_items) | 
| template<typename AIteratorT , typename BIteratorT , typename OffsetT , typename CoordinateT > | |
| __host__ __device__ __forceinline__ void | MergePathSearch (OffsetT diagonal, AIteratorT a, BIteratorT b, OffsetT a_len, OffsetT b_len, CoordinateT &path_coordinate) | 
| template<typename InputIteratorT , typename OffsetT , typename T > | |
| __device__ __forceinline__ OffsetT | LowerBound (InputIteratorT input, OffsetT num_items, T val) | 
Returns the offset of the first value within input which does not compare less than val.   | |
| template<typename InputIteratorT , typename OffsetT , typename T > | |
| __device__ __forceinline__ OffsetT | UpperBound (InputIteratorT input, OffsetT num_items, T val) | 
Returns the offset of the first value within input which compares greater than val.   | |
| __host__ __device__ __forceinline__ cudaError_t | Debug (cudaError_t error, const char *filename, int line) | 
| CUB error reporting macro (prints error messages to stderr)   | |
| template<int ALLOCATIONS> | |
| __host__ __device__ __forceinline__ cudaError_t | AliasTemporaries (void *d_temp_storage, size_t &temp_storage_bytes, void *(&allocations)[ALLOCATIONS], size_t(&allocation_sizes)[ALLOCATIONS]) | 
| template<typename T > | |
| __global__ void | EmptyKernel (void) | 
| CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | PtxVersion (int &ptx_version) | 
| Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10)   | |
| CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | SmVersion (int &sm_version, int device_ordinal) | 
| Retrieves the SM version (major * 100 + minor * 10)   | |
| CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t | SyncStream (cudaStream_t stream) | 
| template<typename KernelPtr > | |
| CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | MaxSmOccupancy (int &max_sm_occupancy, KernelPtr kernel_ptr, int block_threads, int dynamic_smem_bytes=0) | 
Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block.   | |
| __device__ __forceinline__ unsigned int | SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend) | 
Shift-right then add. Returns (x >> shift) + addend.   | |
| __device__ __forceinline__ unsigned int | SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend) | 
Shift-left then add. Returns (x << shift) + addend.   | |
| template<typename UnsignedBits , int BYTE_LEN> | |
| __device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< BYTE_LEN >) | 
| template<typename UnsignedBits > | |
| __device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< 8 >) | 
| template<typename UnsignedBits > | |
| __device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits) | 
Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type.   | |
| __device__ __forceinline__ void | BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits) | 
Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start.   | |
| __device__ __forceinline__ unsigned int | IADD3 (unsigned int x, unsigned int y, unsigned int z) | 
Three-operand add. Returns x + y + z.   | |
| __device__ __forceinline__ int | PRMT (unsigned int a, unsigned int b, unsigned int index) | 
| Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later.   | |
| __device__ __forceinline__ void | BAR (int count) | 
| __device__ __forceinline__ int | CTA_SYNC_AND (int p) | 
| __device__ __forceinline__ void | WARP_SYNC (unsigned int member_mask) | 
| __device__ __forceinline__ int | WARP_ANY (int predicate, unsigned int member_mask) | 
| __device__ __forceinline__ int | WARP_ALL (int predicate, unsigned int member_mask) | 
| __device__ __forceinline__ int | WARP_BALLOT (int predicate, unsigned int member_mask) | 
| __device__ __forceinline__ unsigned int | SHFL_UP_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask) | 
| __device__ __forceinline__ unsigned int | SHFL_DOWN_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask) | 
| __device__ __forceinline__ unsigned int | SHFL_IDX_SYNC (unsigned int word, int src_lane, int flags, unsigned int member_mask) | 
| __device__ __forceinline__ float | FMUL_RZ (float a, float b) | 
| __device__ __forceinline__ float | FFMA_RZ (float a, float b, float c) | 
| __device__ __forceinline__ void | ThreadExit () | 
| Terminates the calling thread.   | |
| __device__ __forceinline__ void | ThreadTrap () | 
| Abort execution and generate an interrupt to the host CPU.   | |
| __device__ __forceinline__ int | RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z) | 
| Returns the row-major linear thread identifier for a multidimensional thread block.   | |
| __device__ __forceinline__ unsigned int | LaneId () | 
| Returns the warp lane ID of the calling thread.   | |
| __device__ __forceinline__ unsigned int | WarpId () | 
| Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.   | |
| __device__ __forceinline__ unsigned int | LaneMaskLt () | 
| Returns the warp lane mask of all lanes less than the calling thread.   | |
| __device__ __forceinline__ unsigned int | LaneMaskLe () | 
| Returns the warp lane mask of all lanes less than or equal to the calling thread.   | |
| __device__ __forceinline__ unsigned int | LaneMaskGt () | 
| Returns the warp lane mask of all lanes greater than the calling thread.   | |
| __device__ __forceinline__ unsigned int | LaneMaskGe () | 
| Returns the warp lane mask of all lanes greater than or equal to the calling thread.   | |
| template<int LOGICAL_WARP_THREADS, typename T > | |
| __device__ __forceinline__ T | ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask) | 
Shuffle-up for any data type. Each warp-lanei obtains the value input contributed by warp-lanei-src_offset. For thread lanes i < src_offset, the thread's own input is returned to the thread.   | |
| template<int LOGICAL_WARP_THREADS, typename T > | |
| __device__ __forceinline__ T | ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask) | 
Shuffle-down for any data type. Each warp-lanei obtains the value input contributed by warp-lanei+src_offset. For thread lanes i >= WARP_THREADS, the thread's own input is returned to the thread.   | |
| template<int LOGICAL_WARP_THREADS, typename T > | |
| __device__ __forceinline__ T | ShuffleIndex (T input, int src_lane, unsigned int member_mask) | 
Shuffle-broadcast for any data type. Each warp-lanei obtains the value input contributed by warp-lanesrc_lane. For src_lane < 0 or src_lane >= WARP_THREADS, then the thread's own input is returned to the thread.   | |
| template<int LABEL_BITS> | |
| __device__ unsigned int | MatchAny (unsigned int label) | 
Blocked arrangement I/O (direct)  | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) | 
| Load a linear segment of items into a blocked arrangement across the thread block.   | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) | 
| Load a linear segment of items into a blocked arrangement across the thread block, guarded by range.   | |
| template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) | 
| Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements..   | |
| template<CacheLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD> | |
| __device__ __forceinline__ void | InternalLoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) | 
| template<typename T , int ITEMS_PER_THREAD> | |
| __device__ __forceinline__ void | LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) | 
| Load a linear segment of items into a blocked arrangement across the thread block.   | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) | 
| Store a blocked arrangement of items across a thread block into a linear segment of items.   | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) | 
| Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range.   | |
| template<typename T , int ITEMS_PER_THREAD> | |
| __device__ __forceinline__ void | StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) | 
| Store a blocked arrangement of items across a thread block into a linear segment of items.   | |
Striped arrangement I/O (direct)  | |
| template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) | 
| Load a linear segment of items into a striped arrangement across the thread block.   | |
| template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) | 
| Load a linear segment of items into a striped arrangement across the thread block, guarded by range.   | |
| template<int BLOCK_THREADS, typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) | 
| Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements.   | |
| template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) | 
| Store a striped arrangement of data across the thread block into a linear segment of items.   | |
| template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) | 
| Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range.   | |
Warp-striped arrangement I/O (direct)  | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) | 
| Load a linear segment of items into a warp-striped arrangement across the thread block.   | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) | 
| Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range.   | |
| template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) | 
| Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements.   | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) | 
| Store a warp-striped arrangement of data across the thread block into a linear segment of items.   | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) | 
| Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range.   | |
Thread I/O (cache modified)  | |
| template<CacheLoadModifier MODIFIER, typename InputIteratorT > | |
| __device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type | ThreadLoad (InputIteratorT itr) | 
| Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type.   | |
| template<typename InputIteratorT > | |
| __device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type | ThreadLoad (InputIteratorT itr, Int2Type< LOAD_DEFAULT >, Int2Type< false >) | 
| template<typename T > | |
| __device__ __forceinline__ T | ThreadLoad (T *ptr, Int2Type< LOAD_DEFAULT >, Int2Type< true >) | 
| template<typename T > | |
| __device__ __forceinline__ T | ThreadLoadVolatilePointer (T *ptr, Int2Type< true >) | 
| template<typename T > | |
| __device__ __forceinline__ T | ThreadLoadVolatilePointer (T *ptr, Int2Type< false >) | 
| template<typename T > | |
| __device__ __forceinline__ T | ThreadLoad (T *ptr, Int2Type< LOAD_VOLATILE >, Int2Type< true >) | 
| template<typename T , int MODIFIER> | |
| __device__ __forceinline__ T | ThreadLoad (T const *ptr, Int2Type< MODIFIER >, Int2Type< true >) | 
| template<CacheStoreModifier MODIFIER, typename OutputIteratorT , typename T > | |
| __device__ __forceinline__ void | ThreadStore (OutputIteratorT itr, T val) | 
| Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type.   | |
| template<typename OutputIteratorT , typename T > | |
| __device__ __forceinline__ void | ThreadStore (OutputIteratorT itr, T val, Int2Type< STORE_DEFAULT >, Int2Type< false >) | 
| template<typename T > | |
| __device__ __forceinline__ void | ThreadStore (T *ptr, T val, Int2Type< STORE_DEFAULT >, Int2Type< true >) | 
| template<typename T > | |
| __device__ __forceinline__ void | ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< true >) | 
| template<typename T > | |
| __device__ __forceinline__ void | ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< false >) | 
| template<typename T > | |
| __device__ __forceinline__ void | ThreadStore (T *ptr, T val, Int2Type< STORE_VOLATILE >, Int2Type< true >) | 
| template<typename T , int MODIFIER> | |
| __device__ __forceinline__ void | ThreadStore (T *ptr, T val, Int2Type< MODIFIER >, Int2Type< true >) | 
Variables | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > | num_output_bins_wrapper | 
| < Input data to reduce   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > | num_privatized_bins_wrapper | 
| The number bins per privatized histogram.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > | d_output_histograms_wrapper | 
| Reference to final output histograms.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > | d_privatized_histograms_wrapper | 
| Reference to privatized histograms.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > | output_decode_op_wrapper | 
| The transform operator for determining output bin-ids from privatized counter indices, one for each channel.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > | privatized_decode_op_wrapper | 
| The transform operator for determining privatized counter indices from samples, one for each channel.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT | num_row_pixels | 
| The number of multi-channel pixels per row in the region of interest.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT | num_rows | 
| The number of rows in the region of interest.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT | row_stride_samples | 
| The number of samples between starts of consecutive rows in the region of interest.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT int | tiles_per_row | 
| Number of image tiles per row.   | |
| ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< int, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > OffsetT OffsetT OffsetT int GridQueue< int > | tile_queue | 
| < Drain queue descriptor for dynamically mapping tile data onto thread blocks   | |
| __shared__ AgentHistogramT::TempStorage | temp_storage | 
| OffsetT * | d_spine | 
| < [in] Input keys buffer   | |
| OffsetT | OffsetT | 
| [in] Total number of input data items   | |
| OffsetT int | current_bit | 
| [in] Bit position of current radix digit   | |
| OffsetT int int | num_bits | 
| [in] Number of bits of current radix digit   | |
| OffsetT int int GridEvenShare< OffsetT > | even_share | 
| < [in] Even-share descriptor for mapan equal number of tiles onto each thread block   | |
| AgentRadixSortUpsweepT | upsweep (temp_storage, d_keys, current_bit, num_bits) | 
| int | num_counts | 
| < [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)   | |
| int | block_offset = 0 | 
| BlockScanRunningPrefixOp< OffsetT, Sum > | prefix_op (0, Sum()) | 
| KeyT * | d_keys_out | 
| < [in] Input keys buffer   | |
| KeyT const ValueT * | d_values_in | 
| [in] Input values buffer   | |
| KeyT const ValueT ValueT * | d_values_out | 
| [in] Output values buffer   | |
| KeyT const ValueT ValueT OffsetT OffsetT | num_items = segment_end - segment_begin | 
| [in] Total number of input data items   | |
| even_share | block_end | 
| KeyT const ValueT ValueT OffsetT int int | end_bit | 
| < [in] The past-the-end (most-significant) bit index needed for key comparison   | |
| KeyT | keys [ITEMS_PER_THREAD] | 
| ValueT | values [ITEMS_PER_THREAD] | 
| UnsignedBitsT | default_key_bits = (IS_DESCENDING) ? Traits<KeyT>::LOWEST_KEY : Traits<KeyT>::MAX_KEY | 
| KeyT | default_key = reinterpret_cast<KeyT&>(default_key_bits) | 
| KeyT const ValueT ValueT OffsetIteratorT | d_begin_offsets | 
[in] Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*   | |
| KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT | d_end_offsets | 
[in] Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty.   | |
| KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT | int | 
| [in] The number of segments that comprise the sorting data   | |
| KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int int | pass_bits | 
| < [in] Number of bits of current radix digit   | |
| OffsetT | segment_begin = d_begin_offsets[blockIdx.x] | 
| OffsetT | segment_end = d_end_offsets[blockIdx.x] | 
| OffsetT | bin_count [BINS_TRACKED_PER_THREAD] | 
| OffsetT | bin_offset [BINS_TRACKED_PER_THREAD] | 
| OutputIteratorT | d_out | 
| < [in] Pointer to the input sequence of data items   | |
| OutputIteratorT OffsetT GridEvenShare< OffsetT > ReductionOpT | reduction_op | 
| < [in] Binary reduction functor   | |
| OutputT | block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share) | 
| OutputIteratorT OffsetT ReductionOpT OuputT | init | 
| < [in] The initial value of the reduction   | |
| UniqueOutputIteratorT | d_unique_out | 
| < Pointer to the input sequence of keys   | |
| UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT | d_aggregates_out | 
| Pointer to the output sequence of value aggregates (one aggregate per run)   | |
| UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT | d_num_runs_out | 
| Pointer to total number of runs encountered (i.e., the length of d_unique_out)   | |
| UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT | tile_state | 
| Tile status interface.   | |
| UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int | start_tile | 
| The starting tile for the current grid.   | |
| UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int EqualityOpT | equality_op | 
| KeyT equality operator.   | |
| OffsetsOutputIteratorT | d_offsets_out | 
| < [in] Pointer to input sequence of data items   | |
| OffsetsOutputIteratorT LengthsOutputIteratorT | d_lengths_out | 
| [out] Pointer to output sequence of run-lengths   | |
| OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT ScanTileStateT | tile_status | 
| [in] Tile status interface   | |
| OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT ScanTileStateT EqualityOpT OffsetT int | num_tiles | 
| < [in] Total number of tiles for the entire problem   | |
| OutputIteratorT ScanTileStateT int ScanOpT | scan_op | 
| Binary scan functor.   | |
| OutputIteratorT ScanTileStateT int ScanOpT InitValueT | init_value | 
| Initial value to seed the exclusive scan.   | |
| FlagsInputIteratorT | d_flags | 
| < [in] Pointer to the input sequence of data items   | |
| FlagsInputIteratorT SelectedOutputIteratorT | d_selected_out | 
| [out] Pointer to the output sequence of selected data items   | |
| FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT | d_num_selected_out | 
[out] Pointer to the total number of items selected (i.e., length of d_selected_out)   | |
| FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT | select_op | 
| [in] Selection operator   | |
| OffsetT | spmv_params | 
| [in] SpMV input parameter bundle   | |
| OffsetT CoordinateT * | d_tile_coordinates | 
| [in] Pointer to the temporary array of tile starting coordinates   | |
| OffsetT CoordinateT KeyValuePair< OffsetT, ValueT > * | d_tile_carry_pairs | 
| [out] Pointer to the temporary array carry-out dot product row-ids, one per block   | |
| OffsetT CoordinateT KeyValuePair< OffsetT, ValueT > int ScanTileStateT int | num_segment_fixup_tiles | 
| < [in] Number of reduce-by-key tiles (fixup grid size)   | |
Optional outer namespace(s)
CUB namespace
| typedef AgentRadixSortDownsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltDownsweepPolicy, typename ChainedPolicyT::ActivePolicy::DownsweepPolicy>::Type, IS_DESCENDING, KeyT, ValueT, OffsetT> cub::AgentRadixSortDownsweepT | 
Definition at line 187 of file dispatch_radix_sort.cuh.
| typedef AgentRadixSortUpsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltUpsweepPolicy, typename ChainedPolicyT::ActivePolicy::UpsweepPolicy>::Type, KeyT, OffsetT> cub::AgentRadixSortUpsweepT | 
Definition at line 92 of file dispatch_radix_sort.cuh.
| typedef AgentReduce< typename ChainedPolicyT::ActivePolicy::ReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT> cub::AgentReduceT | 
Definition at line 88 of file dispatch_reduce.cuh.
| typedef AgentRadixSortDownsweep<SegmentedPolicyT, IS_DESCENDING, KeyT, ValueT, OffsetT> cub::BlockDownsweepT | 
Definition at line 369 of file dispatch_radix_sort.cuh.
| typedef BlockLoad< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM> cub::BlockLoadKeys | 
Definition at line 245 of file dispatch_radix_sort.cuh.
| typedef BlockLoad< ValueT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM> cub::BlockLoadValues | 
Definition at line 252 of file dispatch_radix_sort.cuh.
| typedef BlockRadixSort< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ValueT, ChainedPolicyT::ActivePolicy::SingleTilePolicy::RADIX_BITS, (ChainedPolicyT::ActivePolicy::SingleTilePolicy::RANK_ALGORITHM == RADIX_RANK_MEMOIZE), ChainedPolicyT::ActivePolicy::SingleTilePolicy::SCAN_ALGORITHM> cub::BlockRadixSortT | 
Definition at line 238 of file dispatch_radix_sort.cuh.
| typedef AgentRadixSortUpsweep< AgentRadixSortUpsweepPolicy<BLOCK_THREADS, ITEMS_PER_THREAD, SegmentedPolicyT::LOAD_MODIFIER, RADIX_BITS>, KeyT, OffsetT> cub::BlockUpsweepT | 
Definition at line 363 of file dispatch_radix_sort.cuh.
| typedef BlockScan<OffsetT, BLOCK_THREADS> cub::DigitScanT | 
Definition at line 366 of file dispatch_radix_sort.cuh.
| typedef Traits<KeyT>::UnsignedBits cub::UnsignedBitsT | 
Definition at line 255 of file dispatch_radix_sort.cuh.
| anonymous enum | 
Definition at line 348 of file dispatch_radix_sort.cuh.
| anonymous enum | 
| Enumerator | |
|---|---|
| BINS_TRACKED_PER_THREAD | Number of bin-starting offsets tracked per thread.  | 
Definition at line 371 of file dispatch_radix_sort.cuh.
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.
Definition at line 56 of file block_histogram.cuh.
| enum cub::BlockHistogramMemoryPreference | 
Definition at line 58 of file agent_histogram.cuh.
BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.
Definition at line 60 of file block_reduce.cuh.
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.
| Enumerator | |
|---|---|
| BLOCK_SCAN_RAKING | 
 
 
  | 
| BLOCK_SCAN_RAKING_MEMOIZE | 
  | 
| BLOCK_SCAN_WARP_SCANS | 
 
 
  | 
Definition at line 57 of file block_scan.cuh.
Radix ranking algorithm
Definition at line 62 of file agent_radix_sort_downsweep.cuh.
| enum cub::ScanTileStatus | 
Enumerations of tile status
Definition at line 105 of file single_pass_scan_operators.cuh.
| cub::__launch_bounds__ | ( | int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS) | ) | const | 
< Signed integer type for global offsets
Downsweep pass kernel entry point (multi-block). Scatters keys (and values) into corresponding bins for the current digit place.
| cub::__launch_bounds__ | ( | int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS) | ) | const | 
< Signed integer type for global offsets
Segmented radix sorting pass (one block per segment)
| cub::__launch_bounds__ | ( | int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS) | ) | const | 
< Signed integer type for global offsets
Upsweep digit-counting kernel entry point (multi-block). Computes privatized digit histograms, one per block.
| cub::__launch_bounds__ | ( | int(AgentHistogramPolicyT::BLOCK_THREADS) | ) | 
< Signed integer type for global offsets
Histogram privatized sweep kernel entry point (multi-block). Computes privatized histograms, one per thread block.
| cub::__launch_bounds__ | ( | int(AgentReduceByKeyPolicyT::BLOCK_THREADS) | ) | 
< Signed integer type for global offsets
Multi-block reduce-by-key sweep kernel entry point
| cub::__launch_bounds__ | ( | int(AgentSegmentFixupPolicyT::BLOCK_THREADS) | ) | 
< Tile status interface type
Multi-block reduce-by-key sweep kernel entry point
| cub::__launch_bounds__ | ( | int(AgentSelectIfPolicyT::BLOCK_THREADS) | ) | 
< Whether or not we push rejected items to the back of the output
Select kernel entry point (multi-block)
Performs functor-based selection if SelectOpT functor type != NullType Otherwise performs flag-based selection if FlagsInputIterator's value type != NullType Otherwise performs discontinuity selection (keep unique)
| cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS) | ) | 
< Binary reduction functor type having member T operator()(const T &a, const T &b) 
Reduce region kernel entry point (multi-block). Computes privatized reductions, one per thread block.
| cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS) | ) | 
< Data element type that is convertible to the value type of OutputIteratorT 
Segmented reduction (one block per segment)
| cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS) | , | 
| 1 | |||
| ) | 
< Signed integer type for global offsets
Spine scan kernel entry point (single-block). Computes an exclusive prefix sum over the privatized digit histograms
| cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS) | , | 
| 1 | |||
| ) | const | 
< Signed integer type for global offsets
Single pass kernel entry point (single-block). Fully sorts a tile of input.
| cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS) | , | 
| 1 | |||
| ) | 
< Data element type that is convertible to the value type of OutputIteratorT 
Reduce a single tile kernel entry point (single-block). Can be used to aggregate privatized thread block reductions from a previous multi-block reduction pass.
| cub::__launch_bounds__ | ( | int(ScanPolicyT::BLOCK_THREADS) | ) | 
< Signed integer type for global offsets
Scan kernel entry point (multi-block)
| cub::__launch_bounds__ | ( | int(SpmvPolicyT::BLOCK_THREADS) | ) | 
< Whether the input parameter Beta is 0
Spmv agent entry point
| __global__ void cub::DeviceCompactInitKernel | ( | ScanTileStateT | tile_state, | 
| int | num_tiles, | ||
| NumSelectedIteratorT | d_num_selected_out | ||
| ) | 
< Output iterator type for recording the number of items selected
Initialization kernel for tile status initialization (multi-block)
| [in] | tile_state | Tile status interface | 
| [in] | num_tiles | Number of tiles | 
| [out] | d_num_selected_out | Pointer to the total number of items selected (i.e., length of d_selected_out)  | 
Definition at line 78 of file dispatch_scan.cuh.
| __global__ void cub::DeviceHistogramInitKernel | ( | ArrayWrapper< int, NUM_ACTIVE_CHANNELS > | num_output_bins_wrapper, | 
| ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > | d_output_histograms_wrapper, | ||
| GridQueue< int > | tile_queue | ||
| ) | 
< Signed integer type for global offsets
Histogram initialization kernel entry point
| num_output_bins_wrapper | Number of output histogram bins per channel | 
| d_output_histograms_wrapper | Histogram counter data having logical dimensions CounterT[NUM_ACTIVE_CHANNELS][num_bins.array[CHANNEL]]   | 
| tile_queue | Drain queue descriptor for dynamically mapping tile data onto thread blocks | 
Definition at line 67 of file dispatch_histogram.cuh.
| __global__ void cub::DeviceScanInitKernel | ( | ScanTileStateT | tile_state, | 
| int | num_tiles | ||
| ) | 
< Tile status interface type
Initialization kernel for tile status initialization (multi-block)
| [in] | tile_state | Tile status interface | 
| [in] | num_tiles | Number of tiles | 
Definition at line 64 of file dispatch_scan.cuh.
| __global__ void cub::DeviceSpmv1ColKernel | ( | SpmvParams< ValueT, OffsetT > | spmv_params | ) | 
< Signed integer type for sequence offsets
Spmv search kernel. Identifies merge path starting coordinates for each tile.
| [in] | spmv_params | SpMV input parameter bundle | 
Definition at line 68 of file dispatch_spmv_orig.cuh.
| __global__ void cub::DeviceSpmvSearchKernel | ( | int | num_merge_tiles, | 
| CoordinateT * | d_tile_coordinates, | ||
| SpmvParamsT | spmv_params | ||
| ) | 
< SpmvParams type
Spmv search kernel. Identifies merge path starting coordinates for each tile.
Constants
| [in] | num_merge_tiles | Number of SpMV merge tiles (spmv grid size) | 
| [out] | d_tile_coordinates | Pointer to the temporary array of tile starting coordinates | 
| [in] | spmv_params | SpMV input parameter bundle | 
Definition at line 104 of file dispatch_spmv_orig.cuh.
| cub::for | ( | ) | 
Definition at line 302 of file dispatch_radix_sort.cuh.
| cub::if | ( | ! | KEYS_ONLY | ) | 
Definition at line 280 of file dispatch_radix_sort.cuh.
| cub::if | ( | IS_DESCENDING | ) | 
Definition at line 415 of file dispatch_radix_sort.cuh.
| cub::if | ( | num_items |  = = 0 | ) | 
Definition at line 133 of file dispatch_reduce.cuh.
| cub::if | ( | segment_begin |  = = segment_end | ) | 
Definition at line 210 of file dispatch_reduce.cuh.
| __device__ __forceinline__ OffsetT cub::LowerBound | ( | InputIteratorT | input, | 
| OffsetT | num_items, | ||
| T | val | ||
| ) | 
Returns the offset of the first value within input which does not compare less than val. 
| [in] | input | Input sequence | 
| [in] | num_items | Input sequence length | 
| [in] | val | Search key | 
Definition at line 95 of file thread_search.cuh.
Compute a 32b mask of threads having the same least-significant LABEL_BITS of label as the calling thread. 
Definition at line 703 of file util_ptx.cuh.
| __host__ __device__ __forceinline__ void cub::MergePathSearch | ( | OffsetT | diagonal, | 
| AIteratorT | a, | ||
| BIteratorT | b, | ||
| OffsetT | a_len, | ||
| OffsetT | b_len, | ||
| CoordinateT & | path_coordinate | ||
| ) | 
Computes the begin offsets into A and B for the specific diagonal
The value type of the input iterator
Definition at line 53 of file thread_search.cuh.
| __device__ __forceinline__ void cub::NormalizeReductionOutput | ( | KeyValuePairT & | val, | 
| OffsetT | base_offset, | ||
| ArgIndexInputIterator< WrappedIteratorT, OffsetT, OutputValueT > | |||
| ) | 
Normalize input iterator to segment offset (specialized for arg-index)
Definition at line 164 of file dispatch_reduce.cuh.
| __device__ __forceinline__ void cub::NormalizeReductionOutput | ( | T & | , | 
| OffsetT | , | ||
| IteratorT | |||
| ) | 
Normalize input iterator to segment offset.
Definition at line 154 of file dispatch_reduce.cuh.
| __device__ __forceinline__ OffsetT cub::UpperBound | ( | InputIteratorT | input, | 
| OffsetT | num_items, | ||
| T | val | ||
| ) | 
Returns the offset of the first value within input which compares greater than val. 
| [in] | input | Input sequence | 
| [in] | num_items | Input sequence length | 
| [in] | val | Search key | 
Definition at line 126 of file thread_search.cuh.
| cub::while | ( | block_offset+AgentScanT::TILE_ITEMS<= | num_counts | ) | 
Definition at line 141 of file dispatch_radix_sort.cuh.
| OffsetT cub::bin_count[BINS_TRACKED_PER_THREAD] | 
Definition at line 410 of file dispatch_radix_sort.cuh.
| cub::bin_offset | 
Definition at line 440 of file dispatch_radix_sort.cuh.
| OutputT cub::block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share) | 
Definition at line 94 of file dispatch_reduce.cuh.
| even_share cub::block_end | 
Definition at line 198 of file dispatch_radix_sort.cuh.
| int cub::block_offset = 0 | 
Definition at line 139 of file dispatch_radix_sort.cuh.
| KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int cub::current_bit | 
[in] Bit position of current radix digit
Definition at line 76 of file dispatch_radix_sort.cuh.
| AggregatesOutputIteratorT cub::d_aggregates_out | 
Pointer to the output sequence of value aggregates (one aggregate per run)
< [in] Pointer to the array carry-out dot product row-ids, one per spmv block
[in,out] Output value aggregates
Definition at line 76 of file dispatch_reduce_by_key.cuh.
| OutputIteratorT OffsetIteratorT cub::d_begin_offsets | 
[in] Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_* 
Definition at line 334 of file dispatch_radix_sort.cuh.
| OutputIteratorT OffsetIteratorT OffsetIteratorT cub::d_end_offsets | 
[in] Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith is considered empty. 
Definition at line 335 of file dispatch_radix_sort.cuh.
| FlagsInputIteratorT cub::d_flags | 
< [in] Pointer to the input sequence of data items
[in] Pointer to the input sequence of selection flags (if applicable)
Definition at line 78 of file dispatch_select_if.cuh.
| KeyT * cub::d_keys_out | 
< [in] Input keys buffer
[in] Output keys buffer
Definition at line 164 of file dispatch_radix_sort.cuh.
| OffsetsOutputIteratorT LengthsOutputIteratorT cub::d_lengths_out | 
[out] Pointer to output sequence of run-lengths
Definition at line 78 of file dispatch_rle.cuh.
| OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT cub::d_num_runs_out | 
Pointer to total number of runs encountered (i.e., the length of d_unique_out)
[out] Pointer to total number of runs (i.e., length of d_offsets_out) 
Definition at line 77 of file dispatch_reduce_by_key.cuh.
| cub::d_num_selected_out | 
[out] Pointer to the total number of items selected (i.e., length of d_selected_out) 
Definition at line 80 of file dispatch_select_if.cuh.
| OffsetsOutputIteratorT cub::d_offsets_out | 
< [in] Pointer to input sequence of data items
[out] Pointer to output sequence of run-offsets
Definition at line 77 of file dispatch_rle.cuh.
| OutputIteratorT cub::d_out | 
< [in] Pointer to the input sequence of data items
< Input data
[out] Pointer to the output aggregate
Output data
Definition at line 71 of file dispatch_reduce.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> cub::d_output_histograms_wrapper | 
Reference to final output histograms.
Definition at line 104 of file dispatch_histogram.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> cub::d_privatized_histograms_wrapper | 
Reference to privatized histograms.
Definition at line 105 of file dispatch_histogram.cuh.
| FlagsInputIteratorT SelectedOutputIteratorT cub::d_selected_out | 
[out] Pointer to the output sequence of selected data items
Definition at line 79 of file dispatch_select_if.cuh.
| KeyT const ValueT ValueT OffsetT * cub::d_spine | 
< [in] Input keys buffer
[in] Scan of privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
[out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
Definition at line 74 of file dispatch_radix_sort.cuh.
| cub::d_tile_carry_pairs | 
[out] Pointer to the temporary array carry-out dot product row-ids, one per block
Definition at line 161 of file dispatch_spmv_orig.cuh.
| OffsetT CoordinateT* cub::d_tile_coordinates | 
[in] Pointer to the temporary array of tile starting coordinates
Definition at line 160 of file dispatch_spmv_orig.cuh.
| UniqueOutputIteratorT cub::d_unique_out | 
< Pointer to the input sequence of keys
Pointer to the output sequence of unique keys (one key per run)
Definition at line 74 of file dispatch_reduce_by_key.cuh.
| UniqueOutputIteratorT ValuesInputIteratorT cub::d_values_in | 
[in] Input values buffer
Pointer to the input sequence of corresponding values.
Definition at line 165 of file dispatch_radix_sort.cuh.
| KeyT const ValueT ValueT * cub::d_values_out | 
[in] Output values buffer
Definition at line 166 of file dispatch_radix_sort.cuh.
| cub::default_key = reinterpret_cast<KeyT&>(default_key_bits) | 
Definition at line 272 of file dispatch_radix_sort.cuh.
| UnsignedBitsT cub::default_key_bits = (IS_DESCENDING) ? Traits<KeyT>::LOWEST_KEY : Traits<KeyT>::MAX_KEY | 
Definition at line 271 of file dispatch_radix_sort.cuh.
| BlockDownsweepT::TempStorage cub::downsweep | 
Definition at line 385 of file dispatch_radix_sort.cuh.
| cub::end_bit | 
< [in] The past-the-end (most-significant) bit index needed for key comparison
Definition at line 219 of file dispatch_radix_sort.cuh.
| FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT EqualityOpT cub::equality_op | 
KeyT equality operator.
[in] Equality operator
[in] Equality operator for input items
Definition at line 80 of file dispatch_reduce_by_key.cuh.
| OutputIteratorT OffsetT GridEvenShare< OffsetT > cub::even_share | 
< [in] Even-share descriptor for mapan equal number of tiles onto each thread block
[in] Even-share descriptor for mapping an equal number of tiles onto each thread block
Definition at line 78 of file dispatch_radix_sort.cuh.
| OutputIteratorT OffsetIteratorT OffsetIteratorT ReductionOpT OutputT cub::init | 
< [in] The initial value of the reduction
Definition at line 118 of file dispatch_reduce.cuh.
| OutputIteratorT ScanTileStateT int ScanOpT InitValueT cub::init_value | 
Initial value to seed the exclusive scan.
Definition at line 110 of file dispatch_scan.cuh.
| OutputIteratorT OffsetIteratorT OffsetIteratorT cub::int | 
[in] The number of segments that comprise the sorting data
Definition at line 336 of file dispatch_radix_sort.cuh.
| cub::keys | 
Definition at line 267 of file dispatch_radix_sort.cuh.
[in] Number of bits of current radix digit
Definition at line 77 of file dispatch_radix_sort.cuh.
| int cub::num_counts | 
< [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
< [in] Total number of bin-counts
Definition at line 120 of file dispatch_radix_sort.cuh.
| AggregatesOutputIteratorT OffsetT cub::num_items = segment_end - segment_begin | 
[in] Total number of input data items
[in] Total number of items to select from
< Total number of scan items for the entire problem
[in] Total number of input items (i.e., length of d_in)
< Total number of items to select from
Definition at line 168 of file dispatch_radix_sort.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> cub::num_output_bins_wrapper | 
< Input data to reduce
The number bins per final output histogram
Definition at line 102 of file dispatch_histogram.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> cub::num_privatized_bins_wrapper | 
The number bins per privatized histogram.
Definition at line 103 of file dispatch_histogram.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT cub::num_row_pixels | 
The number of multi-channel pixels per row in the region of interest.
Definition at line 108 of file dispatch_histogram.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT cub::num_rows | 
The number of rows in the region of interest.
Definition at line 109 of file dispatch_histogram.cuh.
| OffsetT CoordinateT KeyValuePair<OffsetT,ValueT> int ScanTileStateT int cub::num_segment_fixup_tiles | 
< [in] Number of reduce-by-key tiles (fixup grid size)
Definition at line 164 of file dispatch_spmv_orig.cuh.
< [in] Total number of tiles for the entire problem
[in] Total number of tiles for the entire problem
[in] Number of merge tiles
Definition at line 83 of file dispatch_rle.cuh.
| OffsetT cub::OffsetT | 
[in] Total number of input data items
Definition at line 75 of file dispatch_radix_sort.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> cub::output_decode_op_wrapper | 
The transform operator for determining output bin-ids from privatized counter indices, one for each channel.
Definition at line 106 of file dispatch_histogram.cuh.
< [in] Number of bits of current radix digit
Definition at line 338 of file dispatch_radix_sort.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> cub::privatized_decode_op_wrapper | 
The transform operator for determining privatized counter indices from samples, one for each channel.
Definition at line 107 of file dispatch_histogram.cuh.
| UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int EqualityOpT ReductionOpT cub::reduction_op | 
< [in] Binary reduction functor
ValueT reduction operator.
[in] Binary reduction functor
Definition at line 74 of file dispatch_reduce.cuh.
| volatile OffsetT cub::reverse_counts_in[RADIX_DIGITS] | 
Definition at line 388 of file dispatch_radix_sort.cuh.
| volatile OffsetT cub::reverse_counts_out[RADIX_DIGITS] | 
Definition at line 389 of file dispatch_radix_sort.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT cub::row_stride_samples | 
The number of samples between starts of consecutive rows in the region of interest.
Definition at line 110 of file dispatch_histogram.cuh.
| DigitScanT::TempStorage cub::scan | 
Definition at line 390 of file dispatch_radix_sort.cuh.
| OutputIteratorT ScanTileStateT int ScanOpT cub::scan_op | 
Binary scan functor.
Definition at line 109 of file dispatch_scan.cuh.
| OffsetT cub::segment_begin = d_begin_offsets[blockIdx.x] | 
Definition at line 395 of file dispatch_radix_sort.cuh.
| OffsetT cub::segment_end = d_end_offsets[blockIdx.x] | 
Definition at line 396 of file dispatch_radix_sort.cuh.
| FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT cub::select_op | 
[in] Selection operator
Definition at line 82 of file dispatch_select_if.cuh.
| OffsetT cub::spmv_params | 
[in] SpMV input parameter bundle
Definition at line 159 of file dispatch_spmv_orig.cuh.
| OutputIteratorT ScanTileStateT int cub::start_tile | 
The starting tile for the current grid.
Definition at line 79 of file dispatch_reduce_by_key.cuh.
| __shared__ AgentSegmentFixupT::TempStorage cub::temp_storage | 
Definition at line 128 of file dispatch_histogram.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT int GridQueue<int> cub::tile_queue | 
< Drain queue descriptor for dynamically mapping tile data onto thread blocks
Definition at line 112 of file dispatch_histogram.cuh.
Tile status interface.
< [in] Tile status interface
[in] Tile status interface for fixup reduce-by-key kernel
Definition at line 78 of file dispatch_reduce_by_key.cuh.
| FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT cub::tile_status | 
[in] Tile status interface
Definition at line 80 of file dispatch_rle.cuh.
| ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT int cub::tiles_per_row | 
Number of image tiles per row.
Definition at line 111 of file dispatch_histogram.cuh.
| BlockUpsweepT::TempStorage cub::upsweep | ( | temp_storage | , | 
| d_keys | , | ||
| current_bit | , | ||
| num_bits | |||
| ) | 
Definition at line 384 of file dispatch_radix_sort.cuh.
| cub::values | 
Definition at line 268 of file dispatch_radix_sort.cuh.