Optional outer namespace(s) More...
Namespaces | |
namespace | internal |
Internal namespace (to prevent ADL mishaps between static functions when mixing different CUB installations) | |
Data Structures | |
struct | AgentHistogram |
AgentHistogram implements a stateful abstraction of CUDA thread blocks for participating in device-wide histogram. More... | |
struct | AgentHistogramPolicy |
(Template parameter) Whether to dequeue tiles from a global work queue More... | |
struct | AgentRadixSortDownsweep |
AgentRadixSortDownsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort downsweep. More... | |
struct | AgentRadixSortDownsweepPolicy |
(Template parameter) The number of radix bits, i.e., log2(bins) More... | |
struct | AgentRadixSortUpsweep |
AgentRadixSortUpsweep implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort upsweep. More... | |
struct | AgentRadixSortUpsweepPolicy |
(Template parameter) The number of radix bits, i.e., log2(bins) More... | |
struct | AgentReduce |
AgentReduce implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduction. More... | |
struct | AgentReduceByKey |
AgentReduceByKey implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key. More... | |
struct | AgentReduceByKeyPolicy |
(Template parameter) The BlockScan algorithm to use More... | |
struct | AgentReducePolicy |
(Template parameter) Cache load modifier for reading input elements More... | |
struct | AgentRle |
AgentRle implements a stateful abstraction of CUDA thread blocks for participating in device-wide run-length-encode. More... | |
struct | AgentRlePolicy |
(Template parameter) The BlockScan algorithm to use More... | |
struct | AgentScan |
AgentScan implements a stateful abstraction of CUDA thread blocks for participating in device-wide prefix scan. More... | |
struct | AgentScanPolicy |
(Template parameter) The BlockScan algorithm to use More... | |
struct | AgentSegmentFixup |
AgentSegmentFixup implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key. More... | |
struct | AgentSegmentFixupPolicy |
(Template parameter) The BlockScan algorithm to use More... | |
struct | AgentSelectIf |
AgentSelectIf implements a stateful abstraction of CUDA thread blocks for participating in device-wide selection. More... | |
struct | AgentSelectIfPolicy |
(Template parameter) The BlockScan algorithm to use More... | |
struct | AgentSpmv |
AgentSpmv implements a stateful abstraction of CUDA thread blocks for participating in device-wide SpMV. More... | |
struct | AgentSpmvPolicy |
(Template parameter) The BlockScan algorithm to use More... | |
struct | AlignBytes |
Structure alignment. More... | |
struct | AlignBytes< const T > |
struct | AlignBytes< const volatile T > |
struct | AlignBytes< volatile T > |
class | ArgIndexInputIterator |
A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples). More... | |
struct | ArgMax |
Arg max functor (keeps the value and offset of the first occurrence of the larger item) More... | |
struct | ArgMin |
Arg min functor (keeps the value and offset of the first occurrence of the smaller item) More... | |
struct | ArrayWrapper |
A wrapper for passing simple static arrays as kernel parameters. More... | |
struct | BaseTraits |
Basic type traits. More... | |
struct | BaseTraits< FLOATING_POINT, true, false, _UnsignedBits, T > |
struct | BaseTraits< SIGNED_INTEGER, true, false, _UnsignedBits, T > |
struct | BaseTraits< UNSIGNED_INTEGER, true, false, _UnsignedBits, T > |
struct | BinaryOpHasIdxParam |
Determine whether or not BinaryOp's functor is of the form bool operator()(const T& a, const T& b) or bool operator()(const T& a, const T& b, unsigned int idx) More... | |
class | BlockAdjacentDifference |
class | BlockDiscontinuity |
The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. More... | |
class | BlockExchange |
The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block. More... | |
class | BlockHistogram |
The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More... | |
struct | BlockHistogramAtomic |
The BlockHistogramAtomic class provides atomic-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More... | |
struct | BlockHistogramSort |
The BlockHistogramSort class provides sorting-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More... | |
class | BlockLoad |
The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block. More... | |
class | BlockRadixRank |
BlockRadixRank provides operations for ranking unsigned integer types within a CUDA thread block. More... | |
class | BlockRadixRankMatch |
class | BlockRadixSort |
The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method. More... | |
struct | BlockRakingLayout |
BlockRakingLayout provides a conflict-free shared memory layout abstraction for 1D raking across thread block data. More... | |
class | BlockReduce |
The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block. More... | |
struct | BlockReduceRaking |
BlockReduceRaking provides raking-based methods of parallel reduction across a CUDA thread block. Supports non-commutative reduction operators. More... | |
struct | BlockReduceRakingCommutativeOnly |
BlockReduceRakingCommutativeOnly provides raking-based methods of parallel reduction across a CUDA thread block. Does not support non-commutative reduction operators. Does not support block sizes that are not a multiple of the warp size. More... | |
struct | BlockReduceWarpReductions |
BlockReduceWarpReductions provides variants of warp-reduction-based parallel reduction across a CUDA thread block. Supports non-commutative reduction operators. More... | |
class | BlockScan |
The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. More... | |
struct | BlockScanRaking |
BlockScanRaking provides variants of raking-based parallel prefix scan across a CUDA thread block. More... | |
struct | BlockScanRunningPrefixOp |
(Template parameter) Wrapped scan operator type More... | |
struct | BlockScanWarpScans |
BlockScanWarpScans provides warpscan-based variants of parallel prefix scan across a CUDA thread block. More... | |
class | BlockShuffle |
The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block. More... | |
class | BlockStore |
The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory. More... | |
class | CacheModifiedInputIterator |
A random-access input wrapper for dereferencing array values using a PTX cache load modifier. More... | |
class | CacheModifiedOutputIterator |
A random-access output wrapper for storing array values using a PTX cache-modifier. More... | |
struct | CachingDeviceAllocator |
A simple caching allocator for device memory allocations. More... | |
struct | CastOp |
Default cast functor. More... | |
struct | ChainedPolicy |
Helper for dispatching into a policy chain. More... | |
struct | ChainedPolicy< PTX_VERSION, PolicyT, PolicyT > |
Helper for dispatching into a policy chain (end-of-chain specialization) More... | |
class | ConstantInputIterator |
A random-access input generator for dereferencing a sequence of homogeneous values. More... | |
class | CountingInputIterator |
A random-access input generator for dereferencing a sequence of incrementing integer values. More... | |
struct | CubVector |
Exposes a member typedef Type that names the corresponding CUDA vector type if one exists. Otherwise Type refers to the CubVector structure itself, which will wrap the corresponding x, y, etc. vector fields. More... | |
struct | CubVector< T, 1 > |
struct | CubVector< T, 2 > |
struct | CubVector< T, 3 > |
struct | CubVector< T, 4 > |
struct | DeviceHistogram |
DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of sample data residing within device-accessible memory. More... | |
struct | DevicePartition |
DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory. More... | |
struct | DeviceRadixSort |
DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory. More... | |
struct | DeviceRadixSortPolicy |
(Template parameter) Signed integer type for global offsets More... | |
struct | DeviceReduce |
DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. More... | |
struct | DeviceReducePolicy |
(Template parameter) Binary reduction functor type having member T operator()(const T &a, const T &b) More... | |
struct | DeviceRleDispatch |
(Template parameter) Signed integer type for global offsets More... | |
struct | DeviceRunLengthEncode |
DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory. More... | |
struct | DeviceScan |
DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. More... | |
struct | DeviceSegmentedRadixSort |
DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory. More... | |
struct | DeviceSegmentedReduce |
DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory. More... | |
struct | DeviceSelect |
DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory. More... | |
struct | DeviceSpmv |
DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV). More... | |
struct | DipatchHistogram |
(Template parameter) Signed integer type for global offsets More... | |
class | DiscardOutputIterator |
A discard iterator. More... | |
struct | DispatchRadixSort |
(Template parameter) Signed integer type for global offsets More... | |
struct | DispatchReduce |
(Template parameter) Binary reduction functor type having member T operator()(const T &a, const T &b) More... | |
struct | DispatchReduceByKey |
(Template parameter) Signed integer type for global offsets More... | |
struct | DispatchScan |
(Template parameter) Signed integer type for global offsets More... | |
struct | DispatchSegmentedRadixSort |
(Template parameter) Signed integer type for global offsets More... | |
struct | DispatchSegmentedReduce |
(Template parameter) Binary reduction functor type having member T operator()(const T &a, const T &b) More... | |
struct | DispatchSelectIf |
(Template parameter) Whether or not we push rejected items to the back of the output More... | |
struct | DispatchSpmv |
(Template parameter) Signed integer type for global offsets More... | |
struct | DoubleBuffer |
Double-buffer storage wrapper for multi-pass stream transformations that require more than one storage array for streaming intermediate results back and forth. More... | |
struct | EnableIf |
Simple enable-if (similar to Boost) More... | |
struct | EnableIf< false, T > |
struct | Equality |
Default equality functor. More... | |
struct | Equals |
Type equality test. More... | |
struct | Equals< A, A > |
struct | FpLimits |
struct | FpLimits< double > |
struct | FpLimits< float > |
class | GridBarrier |
GridBarrier implements a software global barrier among thread blocks within a CUDA grid. More... | |
class | GridBarrierLifetime |
GridBarrierLifetime extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation. More... | |
struct | GridEvenShare |
GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles. More... | |
class | GridQueue |
GridQueue is a descriptor utility for dynamic queue management. More... | |
struct | If |
Type selection (IF ? ThenType : ElseType) More... | |
struct | If< false, ThenType, ElseType > |
struct | Inequality |
Default inequality functor. More... | |
struct | InequalityWrapper |
Inequality functor (wraps equality functor) More... | |
struct | Int2Type |
Allows for the treatment of an integral constant as a type at compile-time (e.g., to achieve static call dispatch based on constant integral values) More... | |
struct | IsPointer |
Pointer vs. iterator. More... | |
struct | IsPointer< Tp * > |
struct | IsVolatile |
Volatile modifier test. More... | |
struct | IsVolatile< Tp volatile > |
struct | IterateThreadLoad |
Helper structure for templated load iteration (inductive case) More... | |
struct | IterateThreadLoad< MAX, MAX > |
Helper structure for templated load iteration (termination case) More... | |
struct | IterateThreadStore |
Helper structure for templated store iteration (inductive case) More... | |
struct | IterateThreadStore< MAX, MAX > |
Helper structure for templated store iteration (termination case) More... | |
struct | KernelConfig |
struct | KeyValuePair |
A key identifier paired with a corresponding value. More... | |
struct | Log2 |
Statically determine log2(N), rounded up. More... | |
struct | Log2< N, 0, COUNT > |
struct | Max |
Default max functor. More... | |
struct | Min |
Default min functor. More... | |
struct | Mutex |
struct | NullType |
A simple "NULL" marker type. More... | |
struct | NumericTraits |
Numeric type traits. More... | |
struct | NumericTraits< bool > |
struct | NumericTraits< char > |
struct | NumericTraits< double > |
struct | NumericTraits< float > |
struct | NumericTraits< int > |
struct | NumericTraits< long > |
struct | NumericTraits< long long > |
struct | NumericTraits< NullType > |
struct | NumericTraits< short > |
struct | NumericTraits< signed char > |
struct | NumericTraits< unsigned char > |
struct | NumericTraits< unsigned int > |
struct | NumericTraits< unsigned long > |
struct | NumericTraits< unsigned long long > |
struct | NumericTraits< unsigned short > |
struct | PowerOfTwo |
Statically determine if N is a power-of-two. More... | |
struct | ReduceByKeyOp |
(Template parameter) Binary reduction operator to apply to values More... | |
struct | ReduceByKeyScanTileState |
struct | ReduceByKeyScanTileState< ValueT, KeyT, false > |
struct | ReduceByKeyScanTileState< ValueT, KeyT, true > |
struct | ReduceBySegmentOp |
Reduce-by-segment functor. More... | |
struct | RemoveQualifiers |
Removes const and volatile qualifiers from type Tp. More... | |
struct | RemoveQualifiers< Tp, const Up > |
struct | RemoveQualifiers< Tp, const volatile Up > |
struct | RemoveQualifiers< Tp, volatile Up > |
struct | ScanTileState |
struct | ScanTileState< T, false > |
struct | ScanTileState< T, true > |
struct | SpmvParams |
(Template parameter) Signed integer type for sequence offsets More... | |
struct | Sum |
Default sum functor. More... | |
class | SwizzleScanOp |
Binary operator wrapper for switching non-commutative scan arguments. More... | |
union | TempStorage |
class | TexObjInputIterator |
A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects. More... | |
struct | TilePrefixCallbackOp |
struct | Traits |
Type traits. More... | |
class | TransformInputIterator |
A random-access input wrapper for transforming dereferenced values. More... | |
struct | Uninitialized |
A storage-backing wrapper that allows types with non-trivial constructors to be aliased in unions. More... | |
struct | UnitWord |
Unit-words of data movement. More... | |
struct | UnitWord< char2 > |
struct | UnitWord< const T > |
struct | UnitWord< const volatile T > |
struct | UnitWord< float2 > |
struct | UnitWord< float4 > |
struct | UnitWord< volatile T > |
class | WarpExchange |
class | WarpReduce |
The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. More... | |
struct | WarpReduceShfl |
WarpReduceShfl provides SHFL-based variants of parallel reduction of items partitioned across a CUDA thread warp. More... | |
struct | WarpReduceSmem |
WarpReduceSmem provides smem-based variants of parallel reduction of items partitioned across a CUDA thread warp. More... | |
class | WarpScan |
The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. More... | |
struct | WarpScanShfl |
WarpScanShfl provides SHFL-based variants of parallel prefix scan of items partitioned across a CUDA thread warp. More... | |
struct | WarpScanSmem |
WarpScanSmem provides smem-based variants of parallel prefix scan of items partitioned across a CUDA thread warp. More... | |
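Several of the utilities listed above (Log2, PowerOfTwo, Int2Type) are described as compile-time helpers. The host-only C++ sketch below illustrates the ideas behind those descriptions: a rounded-up log2, a power-of-two test, and static call dispatch on an integral constant. The names CeilLog2, IsPowerOfTwo, IntTag, Describe and DescribeSort are hypothetical stand-ins written for this example; they are not CUB's definitions and may differ from them in edge cases.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for "statically determine log2(N), rounded up".
// Returns the smallest bits such that 2^bits >= n (valid for n up to 2^63).
constexpr int CeilLog2(unsigned long long n, int bits = 0)
{
    return ((1ULL << bits) >= n) ? bits : CeilLog2(n, bits + 1);
}

// Hypothetical power-of-two test. Treating 0 as "not a power of two"
// is a choice made for this sketch only.
constexpr bool IsPowerOfTwo(unsigned long long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical stand-in for treating an integral constant as a type.
template <int N>
struct IntTag
{
    enum { VALUE = N };
};

// Overloads selected at compile time by the tag type.
inline std::string Describe(IntTag<0>) { return "keys and values"; }
inline std::string Describe(IntTag<1>) { return "keys only"; }

// Static dispatch: the overload is chosen from a compile-time constant.
template <bool KEYS_ONLY>
std::string DescribeSort()
{
    return Describe(IntTag<KEYS_ONLY ? 1 : 0>());
}

// Compile-time checks of the helpers above.
static_assert(CeilLog2(5) == 3, "log2(5) rounds up to 3");
static_assert(IsPowerOfTwo(64), "64 is a power of two");
```

Because the tag is a distinct type for each constant, the unused overload is never selected for a given instantiation, so no run-time branch is needed.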
Typedefs | |
typedef AgentRadixSortUpsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltUpsweepPolicy, typename ChainedPolicyT::ActivePolicy::UpsweepPolicy >::Type, KeyT, OffsetT > | AgentRadixSortUpsweepT |
typedef AgentRadixSortDownsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltDownsweepPolicy, typename ChainedPolicyT::ActivePolicy::DownsweepPolicy >::Type, IS_DESCENDING, KeyT, ValueT, OffsetT > | AgentRadixSortDownsweepT |
typedef BlockRadixSort< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ValueT, ChainedPolicyT::ActivePolicy::SingleTilePolicy::RADIX_BITS,(ChainedPolicyT::ActivePolicy::SingleTilePolicy::RANK_ALGORITHM==RADIX_RANK_MEMOIZE), ChainedPolicyT::ActivePolicy::SingleTilePolicy::SCAN_ALGORITHM > | BlockRadixSortT |
typedef BlockLoad< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > | BlockLoadKeys |
typedef BlockLoad< ValueT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM > | BlockLoadValues |
typedef Traits< KeyT >::UnsignedBits | UnsignedBitsT |
typedef AgentRadixSortUpsweep< AgentRadixSortUpsweepPolicy< BLOCK_THREADS, ITEMS_PER_THREAD, SegmentedPolicyT::LOAD_MODIFIER, RADIX_BITS >, KeyT, OffsetT > | BlockUpsweepT |
typedef BlockScan< OffsetT, BLOCK_THREADS > | DigitScanT |
typedef AgentRadixSortDownsweep< SegmentedPolicyT, IS_DESCENDING, KeyT, ValueT, OffsetT > | BlockDownsweepT |
typedef AgentReduce< typename ChainedPolicyT::ActivePolicy::ReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT > | AgentReduceT |
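The AgentRadixSortUpsweepT and AgentRadixSortDownsweepT typedefs above pick between an "Alt" policy and a primary policy with If<(ALT_DIGIT_BITS), ...>::Type. The following host-only sketch shows that kind of compile-time type selection in plain C++. SelectType, PrimaryPolicy, AltPolicy and PassConfig are hypothetical names, and the RADIX_BITS values are illustrative assumptions, not CUB tuning values.

```cpp
#include <cassert>

// Hypothetical type selector: Type is ThenType when COND is true...
template <bool COND, typename ThenType, typename ElseType>
struct SelectType
{
    typedef ThenType Type;
};

// ...and ElseType when COND is false (partial specialization).
template <typename ThenType, typename ElseType>
struct SelectType<false, ThenType, ElseType>
{
    typedef ElseType Type;
};

// Two hypothetical policies; the numbers are made up for illustration.
struct PrimaryPolicy { enum { RADIX_BITS = 6 }; };
struct AltPolicy     { enum { RADIX_BITS = 5 }; };

// Chooses the policy for a pass from a compile-time flag,
// mirroring the shape of the typedefs listed above.
template <bool ALT_DIGIT_BITS>
struct PassConfig
{
    typedef typename SelectType<ALT_DIGIT_BITS, AltPolicy, PrimaryPolicy>::Type PolicyT;
    enum { RADIX_BITS = PolicyT::RADIX_BITS };
};
```

With these definitions, PassConfig<true> resolves to the alternate policy and PassConfig<false> to the primary one, entirely at compile time.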
Enumerations | |
enum | BlockHistogramMemoryPreference { GMEM , SMEM , BLEND } |
enum | RadixRankAlgorithm { RADIX_RANK_BASIC , RADIX_RANK_MEMOIZE , RADIX_RANK_MATCH } |
enum | ScanTileStatus { SCAN_TILE_OOB , SCAN_TILE_INVALID = 99 , SCAN_TILE_PARTIAL , SCAN_TILE_INCLUSIVE } |
enum | BlockHistogramAlgorithm { BLOCK_HISTO_SORT , BLOCK_HISTO_ATOMIC } |
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms. More... | |
enum | BlockLoadAlgorithm { BLOCK_LOAD_DIRECT , BLOCK_LOAD_VECTORIZE , BLOCK_LOAD_TRANSPOSE , BLOCK_LOAD_WARP_TRANSPOSE , BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED } |
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. More... | |
enum | BlockReduceAlgorithm { BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY , BLOCK_REDUCE_RAKING , BLOCK_REDUCE_WARP_REDUCTIONS } |
enum | BlockScanAlgorithm { BLOCK_SCAN_RAKING , BLOCK_SCAN_RAKING_MEMOIZE , BLOCK_SCAN_WARP_SCANS } |
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block. More... | |
enum | BlockStoreAlgorithm { BLOCK_STORE_DIRECT , BLOCK_STORE_VECTORIZE , BLOCK_STORE_TRANSPOSE , BLOCK_STORE_WARP_TRANSPOSE , BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED } |
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory. More... | |
enum | { BLOCK_THREADS = SegmentedPolicyT::BLOCK_THREADS , ITEMS_PER_THREAD = SegmentedPolicyT::ITEMS_PER_THREAD , RADIX_BITS = SegmentedPolicyT::RADIX_BITS , TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD , RADIX_DIGITS = 1 << RADIX_BITS , KEYS_ONLY = Equals<ValueT, NullType>::VALUE } |
enum | { BINS_TRACKED_PER_THREAD = BlockDownsweepT::BINS_TRACKED_PER_THREAD } |
enum | GridMappingStrategy { GRID_MAPPING_RAKE , GRID_MAPPING_STRIP_MINE , GRID_MAPPING_DYNAMIC } |
cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks. More... | |
enum | CacheLoadModifier { LOAD_DEFAULT , LOAD_CA , LOAD_CG , LOAD_CS , LOAD_CV , LOAD_LDG , LOAD_VOLATILE } |
Enumeration of cache modifiers for memory load operations. More... | |
enum | CacheStoreModifier { STORE_DEFAULT , STORE_WB , STORE_CG , STORE_CS , STORE_WT , STORE_VOLATILE } |
Enumeration of cache modifiers for memory store operations. More... | |
enum | { MAX_VEC_ELEMENTS = 4 } |
enum | Category { NOT_A_NUMBER , SIGNED_INTEGER , UNSIGNED_INTEGER , FLOATING_POINT } |
Basic type traits categories. More... | |
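The anonymous enum above derives TILE_ITEMS and RADIX_DIGITS from policy constants (TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD, RADIX_DIGITS = 1 << RADIX_BITS). As a minimal host-only sketch of those relationships, assuming a hypothetical ExamplePolicy whose values are illustrative rather than actual CUB defaults:

```cpp
#include <cassert>

// Hypothetical tuning policy; the numbers are illustrative only.
struct ExamplePolicy
{
    enum
    {
        BLOCK_THREADS    = 128,
        ITEMS_PER_THREAD = 4,
        RADIX_BITS       = 5
    };
};

// Derived constants, following the relationships in the enum listing above.
// DerivedConstants is a hypothetical wrapper written for this example.
template <typename PolicyT>
struct DerivedConstants
{
    enum
    {
        BLOCK_THREADS    = PolicyT::BLOCK_THREADS,
        ITEMS_PER_THREAD = PolicyT::ITEMS_PER_THREAD,
        RADIX_BITS       = PolicyT::RADIX_BITS,
        TILE_ITEMS       = BLOCK_THREADS * ITEMS_PER_THREAD,  // items per tile
        RADIX_DIGITS     = 1 << RADIX_BITS                    // bins per pass
    };
};
```

For this example policy, a tile holds 128 * 4 = 512 items and each pass distinguishes 2^5 = 32 digit values.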
Functions | |
template<int NUM_ACTIVE_CHANNELS, typename CounterT , typename OffsetT > | |
__global__ void | DeviceHistogramInitKernel (ArrayWrapper< int, NUM_ACTIVE_CHANNELS > num_output_bins_wrapper, ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > d_output_histograms_wrapper, GridQueue< int > tile_queue) |
(Template parameter OffsetT) Signed integer type for global offsets | |
template<typename AgentHistogramPolicyT , int PRIVATIZED_SMEM_BINS, int NUM_CHANNELS, int NUM_ACTIVE_CHANNELS, typename SampleIteratorT , typename CounterT , typename PrivatizedDecodeOpT , typename OutputDecodeOpT , typename OffsetT > | |
__launch_bounds__ (int(AgentHistogramPolicyT::BLOCK_THREADS)) __global__ void DeviceHistogramSweepKernel(SampleIteratorT d_samples | |
(Template parameter OffsetT) Signed integer type for global offsets | |
AgentHistogramT | agent (temp_storage, d_samples, num_output_bins_wrapper.array, num_privatized_bins_wrapper.array, d_output_histograms_wrapper.array, d_privatized_histograms_wrapper.array, output_decode_op_wrapper.array, privatized_decode_op_wrapper.array) |
agent | InitBinCounters () |
agent | ConsumeTiles (num_row_pixels, num_rows, row_stride_samples, tiles_per_row, tile_queue) |
agent | StoreOutput () |
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename OffsetT > | |
__launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortUpsweepKernel(const KeyT *d_keys | |
(Template parameter OffsetT) Signed integer type for global offsets | |
even_share template | BlockInit< TILE_ITEMS, GRID_MAPPING_RAKE > () |
upsweep | ProcessRegion (even_share.block_offset, even_share.block_end) |
CTA_SYNC () | |
upsweep template | ExtractCounts< IS_DESCENDING > (d_spine, gridDim.x, blockIdx.x) |
template<typename ChainedPolicyT , typename OffsetT > | |
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS), 1) __global__ void RadixSortScanBinsKernel(OffsetT *d_spine | |
(Template parameter OffsetT) Signed integer type for global offsets | |
AgentScanT | block_scan (temp_storage, d_spine, d_spine, cub::Sum(), OffsetT(0)) |
while (block_offset+AgentScanT::TILE_ITEMS<=num_counts) | |
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT > | |
__launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS)) __global__ void DeviceRadixSortDownsweepKernel(const KeyT *d_keys_in | |
(Template parameter OffsetT) Signed integer type for global offsets | |
AgentRadixSortDownsweepT (temp_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, num_bits).ProcessRegion(even_share.block_offset | |
template<typename ChainedPolicyT , bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetT > | |
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceRadixSortSingleTileKernel(const KeyT *d_keys_in | |
(Template parameter OffsetT) Signed integer type for global offsets | |
BlockLoadKeys (temp_storage.load_keys).Load(d_keys_in | |
if (!KEYS_ONLY) | |
BlockRadixSortT (temp_storage.sort).SortBlockedToStriped(keys | |
Int2Type< IS_DESCENDING > () | |
Int2Type< KEYS_ONLY > ()) | |
for (int ITEM=0;ITEM< ITEMS_PER_THREAD;++ITEM) | |
template<typename ChainedPolicyT , bool ALT_DIGIT_BITS, bool IS_DESCENDING, typename KeyT , typename ValueT , typename OffsetIteratorT , typename OffsetT > | |
__launch_bounds__ (int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS :ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedRadixSortKernel(const KeyT *d_keys_in | |
(Template parameter OffsetT) Signed integer type for global offsets | |
if (num_items<=0) return | |
BlockUpsweepT | upsweep (temp_storage.upsweep, d_keys_in, current_bit, pass_bits) |
upsweep | ProcessRegion (segment_begin, segment_end) |
upsweep | ExtractCounts (bin_count) |
if (IS_DESCENDING) | |
DigitScanT (temp_storage.scan).ExclusiveSum(bin_count | |
BlockDownsweepT | downsweep (temp_storage.downsweep, bin_offset, num_items, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, pass_bits) |
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT > | |
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceReduceKernel(InputIteratorT d_in | |
(Template parameter ReductionOpT) Binary reduction functor type having member T operator()(const T &a, const T &b) | |
if (threadIdx.x==0) d_out[blockIdx.x] = reduction_op(init, block_aggregate) | |
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetT , typename ReductionOpT , typename OuputT > | |
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) __global__ void DeviceReduceSingleTileKernel(InputIteratorT d_in | |
(Template parameter OuputT) Data element type that is convertible to the value type of OutputIteratorT | |
if (num_items==0) | |
template<typename T , typename OffsetT , typename IteratorT > | |
__device__ __forceinline__ void | NormalizeReductionOutput (T &, OffsetT, IteratorT) |
Normalize input iterator to segment offset. | |
template<typename KeyValuePairT , typename OffsetT , typename WrappedIteratorT , typename OutputValueT > | |
__device__ __forceinline__ void | NormalizeReductionOutput (KeyValuePairT &val, OffsetT base_offset, ArgIndexInputIterator< WrappedIteratorT, OffsetT, OutputValueT >) |
Normalize input iterator to segment offset (specialized for arg-index) | |
template<typename ChainedPolicyT , typename InputIteratorT , typename OutputIteratorT , typename OffsetIteratorT , typename OffsetT , typename ReductionOpT , typename OutputT > | |
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS)) __global__ void DeviceSegmentedReduceKernel(InputIteratorT d_in | |
(Template parameter OutputT) Data element type that is convertible to the value type of OutputIteratorT | |
if (segment_begin==segment_end) | |
NormalizeReductionOutput (block_aggregate, segment_begin, d_in) | |
template<typename AgentReduceByKeyPolicyT , typename KeysInputIteratorT , typename UniqueOutputIteratorT , typename ValuesInputIteratorT , typename AggregatesOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename ReductionOpT , typename OffsetT > | |
__launch_bounds__ (int(AgentReduceByKeyPolicyT::BLOCK_THREADS)) __global__ void DeviceReduceByKeyKernel(KeysInputIteratorT d_keys_in | |
(Template parameter OffsetT) Signed integer type for global offsets | |
AgentReduceByKeyT (temp_storage, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, equality_op, reduction_op).ConsumeRange(num_items | |
template<typename AgentRlePolicyT , typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT , typename ScanTileStateT , typename EqualityOpT , typename OffsetT > | |
__launch_bounds__ (int(AgentRlePolicyT::BLOCK_THREADS)) __global__ void DeviceRleSweepKernel(InputIteratorT d_in | |
(Template parameter OffsetT) Signed integer type for global offsets | |
AgentRleT (temp_storage, d_in, d_offsets_out, d_lengths_out, equality_op, num_items).ConsumeRange(num_tiles | |
template<typename ScanTileStateT > | |
__global__ void | DeviceScanInitKernel (ScanTileStateT tile_state, int num_tiles) |
(Template parameter ScanTileStateT) Tile status interface type | |
template<typename ScanTileStateT , typename NumSelectedIteratorT > | |
__global__ void | DeviceCompactInitKernel (ScanTileStateT tile_state, int num_tiles, NumSelectedIteratorT d_num_selected_out) |
(Template parameter NumSelectedIteratorT) Output iterator type for recording the number of items selected | |
template<typename ScanPolicyT , typename InputIteratorT , typename OutputIteratorT , typename ScanTileStateT , typename ScanOpT , typename InitValueT , typename OffsetT > | |
__launch_bounds__ (int(ScanPolicyT::BLOCK_THREADS)) __global__ void DeviceScanKernel(InputIteratorT d_in | |
(Template parameter OffsetT) Signed integer type for global offsets | |
AgentScanT (temp_storage, d_in, d_out, scan_op, init_value).ConsumeRange(num_items | |
template<typename AgentSelectIfPolicyT , typename InputIteratorT , typename FlagsInputIteratorT , typename SelectedOutputIteratorT , typename NumSelectedIteratorT , typename ScanTileStateT , typename SelectOpT , typename EqualityOpT , typename OffsetT , bool KEEP_REJECTS> | |
__launch_bounds__ (int(AgentSelectIfPolicyT::BLOCK_THREADS)) __global__ void DeviceSelectSweepKernel(InputIteratorT d_in | |
(Template parameter KEEP_REJECTS) Whether or not we push rejected items to the back of the output | |
AgentSelectIfT (temp_storage, d_in, d_flags, d_selected_out, select_op, equality_op, num_items).ConsumeRange(num_tiles | |
template<typename AgentSpmvPolicyT , typename ValueT , typename OffsetT > | |
__global__ void | DeviceSpmv1ColKernel (SpmvParams< ValueT, OffsetT > spmv_params) |
(Template parameter OffsetT) Signed integer type for sequence offsets | |
template<typename SpmvPolicyT , typename OffsetT , typename CoordinateT , typename SpmvParamsT > | |
__global__ void | DeviceSpmvSearchKernel (int num_merge_tiles, CoordinateT *d_tile_coordinates, SpmvParamsT spmv_params) |
(Template parameter SpmvParamsT) SpmvParams type | |
template<typename SpmvPolicyT , typename ScanTileStateT , typename ValueT , typename OffsetT , typename CoordinateT , bool HAS_ALPHA, bool HAS_BETA> | |
__launch_bounds__ (int(SpmvPolicyT::BLOCK_THREADS)) __global__ void DeviceSpmvKernel(SpmvParams< ValueT | |
(Template parameter HAS_BETA) Whether the input parameter Beta is 0 | |
AgentSpmvT (temp_storage, spmv_params).ConsumeTile(d_tile_coordinates | |
tile_state | InitializeStatus (num_segment_fixup_tiles) |
template<typename AgentSegmentFixupPolicyT , typename PairsInputIteratorT , typename AggregatesOutputIteratorT , typename OffsetT , typename ScanTileStateT > | |
__launch_bounds__ (int(AgentSegmentFixupPolicyT::BLOCK_THREADS)) __global__ void DeviceSegmentFixupKernel(PairsInputIteratorT d_pairs_in | |
(Template parameter ScanTileStateT) Tile status interface type | |
AgentSegmentFixupT (temp_storage, d_pairs_in, d_aggregates_out, cub::Equality(), cub::Sum()).ConsumeRange(num_items | |
template<typename OffsetT > | |
__global__ void | FillAndResetDrainKernel (GridQueue< OffsetT > grid_queue, OffsetT num_items) |
template<typename AIteratorT , typename BIteratorT , typename OffsetT , typename CoordinateT > | |
__host__ __device__ __forceinline__ void | MergePathSearch (OffsetT diagonal, AIteratorT a, BIteratorT b, OffsetT a_len, OffsetT b_len, CoordinateT &path_coordinate) |
template<typename InputIteratorT , typename OffsetT , typename T > | |
__device__ __forceinline__ OffsetT | LowerBound (InputIteratorT input, OffsetT num_items, T val) |
Returns the offset of the first value within input which does not compare less than val. | |
template<typename InputIteratorT , typename OffsetT , typename T > | |
__device__ __forceinline__ OffsetT | UpperBound (InputIteratorT input, OffsetT num_items, T val) |
Returns the offset of the first value within input which compares greater than val. | |
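The two search semantics above can be illustrated with a host-side reference sketch. This is not the CUB device code: the names `LowerBoundRef` and `UpperBoundRef` are hypothetical, and the sketch assumes the input range is sorted in ascending order.

```cpp
// Host-side reference sketch (hypothetical names, not the CUB device functions).
// Assumes [input, input + num_items) is sorted in ascending order.

// Offset of the first element that does not compare less than val.
template <typename InputIteratorT, typename OffsetT, typename T>
OffsetT LowerBoundRef(InputIteratorT input, OffsetT num_items, T val)
{
    OffsetT retval = 0;
    while (num_items > 0)
    {
        OffsetT half = num_items >> 1;
        if (input[retval + half] < val)
        {
            // Everything up to and including the probe is less than val.
            retval    = retval + (half + 1);
            num_items = num_items - (half + 1);
        }
        else
        {
            num_items = half;
        }
    }
    return retval;
}

// Offset of the first element that compares greater than val.
template <typename InputIteratorT, typename OffsetT, typename T>
OffsetT UpperBoundRef(InputIteratorT input, OffsetT num_items, T val)
{
    OffsetT retval = 0;
    while (num_items > 0)
    {
        OffsetT half = num_items >> 1;
        if (val < input[retval + half])
        {
            num_items = half;
        }
        else
        {
            // The probe is less than or equal to val, so skip past it.
            retval    = retval + (half + 1);
            num_items = num_items - (half + 1);
        }
    }
    return retval;
}
```

For the sorted sequence 1, 3, 3, 5, 8 and val = 3, the lower bound is offset 1 and the upper bound is offset 3, so the difference gives the number of elements equal to val.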
__host__ __device__ __forceinline__ cudaError_t | Debug (cudaError_t error, const char *filename, int line) |
CUB error reporting macro (prints error messages to stderr) | |
template<int ALLOCATIONS> | |
__host__ __device__ __forceinline__ cudaError_t | AliasTemporaries (void *d_temp_storage, size_t &temp_storage_bytes, void *(&allocations)[ALLOCATIONS], size_t(&allocation_sizes)[ALLOCATIONS]) |
template<typename T > | |
__global__ void | EmptyKernel (void) |
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | PtxVersion (int &ptx_version) |
Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10) | |
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | SmVersion (int &sm_version, int device_ordinal) |
Retrieves the SM version (major * 100 + minor * 10) | |
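The version encoding stated above (major * 100 + minor * 10) can be sketched with a few small helpers. The helper names are hypothetical and are not part of CUB; they only illustrate the arithmetic.

```cpp
// Hypothetical helpers (not part of CUB) illustrating the encoding
// major * 100 + minor * 10 used for PTX and SM version numbers.
constexpr int EncodeVersion(int major, int minor)
{
    return major * 100 + minor * 10;
}

// Recover the major component from an encoded version.
constexpr int VersionMajor(int version)
{
    return version / 100;
}

// Recover the minor component from an encoded version.
constexpr int VersionMinor(int version)
{
    return (version % 100) / 10;
}
```

Under this encoding, compute capability 3.5 is represented as 350, so a check such as `sm_version >= 350` compares against that capability.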
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t | SyncStream (cudaStream_t stream) |
template<typename KernelPtr > | |
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | MaxSmOccupancy (int &max_sm_occupancy, KernelPtr kernel_ptr, int block_threads, int dynamic_smem_bytes=0) |
Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block. | |
__device__ __forceinline__ unsigned int | SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
Shift-right then add. Returns (x >> shift) + addend. | |
__device__ __forceinline__ unsigned int | SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
Shift-left then add. Returns (x << shift) + addend. | |
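As a plain C++ illustration of the two return expressions above, the following host-side sketch uses hypothetical names (`ShrAddRef`, `ShlAddRef`). It assumes `shift` is less than 32, because larger shifts are undefined in standard C++ even if the device instruction tolerates them.

```cpp
// Host-side sketch (hypothetical names). Assumes shift < 32.

// Returns (x >> shift) + addend.
inline unsigned int ShrAddRef(unsigned int x, unsigned int shift, unsigned int addend)
{
    return (x >> shift) + addend;
}

// Returns (x << shift) + addend.
inline unsigned int ShlAddRef(unsigned int x, unsigned int shift, unsigned int addend)
{
    return (x << shift) + addend;
}
```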
template<typename UnsignedBits , int BYTE_LEN> | |
__device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< BYTE_LEN >) |
template<typename UnsignedBits > | |
__device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits, Int2Type< 8 >) |
template<typename UnsignedBits > | |
__device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits) |
Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type. | |
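The extraction described above amounts to a shift followed by a mask. A minimal host-side sketch, with the hypothetical name `BfeRef` and the assumptions that `bit_start` is less than 64 and `num_bits` is at most 32, could look like this:

```cpp
#include <cstdint>

// Host-side sketch (hypothetical name, not the CUB device function).
// Assumes bit_start < 64 and num_bits <= 32. Narrower unsigned sources
// are simply widened to 64 bits by the caller.
inline unsigned int BfeRef(std::uint64_t source, unsigned int bit_start, unsigned int num_bits)
{
    // Mask with the num_bits least significant bits set.
    const std::uint64_t mask = (std::uint64_t(1) << num_bits) - 1;
    return static_cast<unsigned int>((source >> bit_start) & mask);
}
```

For example, extracting 8 bits at bit-offset 4 from 0xABCD yields 0xBC.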
__device__ __forceinline__ void | BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits) |
Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start. | |
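The insertion described above can be modelled on the host by clearing the target field in x and merging in the low bits of y. The name `BfiRef` is hypothetical, and the sketch assumes `bit_start + num_bits <= 32`.

```cpp
// Host-side sketch (hypothetical name, not the CUB device function).
// Assumes bit_start + num_bits <= 32.
inline void BfiRef(unsigned int &ret, unsigned int x, unsigned int y,
                   unsigned int bit_start, unsigned int num_bits)
{
    // Field of num_bits ones; guard the full-width case, where a 32-bit shift is undefined.
    const unsigned int field = (num_bits >= 32u) ? ~0u : ((1u << num_bits) - 1u);
    const unsigned int mask  = field << bit_start;

    // Keep x outside the field, take y's low bits inside it.
    ret = (x & ~mask) | ((y << bit_start) & mask);
}
```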
__device__ __forceinline__ unsigned int | IADD3 (unsigned int x, unsigned int y, unsigned int z) |
Three-operand add. Returns x + y + z. | |
__device__ __forceinline__ int | PRMT (unsigned int a, unsigned int b, unsigned int index) |
Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later. | |
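To make the byte-permute concrete, the sketch below emulates on the host what is understood to be the default (generic) mode of the PTX `prmt` instruction: the two registers form an 8-byte pool with a supplying bytes 0 to 3 and b supplying bytes 4 to 7, and each 4-bit nibble of index selects one pool byte for the corresponding destination byte. The name `PrmtRef` is hypothetical, and treating bit 3 of each nibble as a sign-replication flag is an assumption based on the PTX description rather than on this page.

```cpp
#include <cstdint>

// Host-side emulation sketch (hypothetical name). Assumes the generic
// prmt mode: selector bits 0..2 pick a byte from the pool {b, a}, and
// selector bit 3 requests replication of that byte's sign bit.
inline std::uint32_t PrmtRef(std::uint32_t a, std::uint32_t b, std::uint32_t index)
{
    const std::uint64_t pool = (static_cast<std::uint64_t>(b) << 32) | a;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i)
    {
        const std::uint32_t selector = (index >> (4 * i)) & 0xFu;
        std::uint32_t byte = static_cast<std::uint32_t>(pool >> (8 * (selector & 0x7u))) & 0xFFu;
        if (selector & 0x8u)
        {
            // Replicate the most significant bit of the selected byte.
            byte = (byte & 0x80u) ? 0xFFu : 0x00u;
        }
        result |= byte << (8 * i);
    }
    return result;
}
```

With this model, index 0x3210 returns a unchanged, 0x7654 returns b, and 0x0123 reverses the byte order of a.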
__device__ __forceinline__ void | BAR (int count) |
__device__ __forceinline__ int | CTA_SYNC_AND (int p) |
__device__ __forceinline__ void | WARP_SYNC (unsigned int member_mask) |
__device__ __forceinline__ int | WARP_ANY (int predicate, unsigned int member_mask) |
__device__ __forceinline__ int | WARP_ALL (int predicate, unsigned int member_mask) |
__device__ __forceinline__ int | WARP_BALLOT (int predicate, unsigned int member_mask) |
__device__ __forceinline__ unsigned int | SHFL_UP_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask) |
__device__ __forceinline__ unsigned int | SHFL_DOWN_SYNC (unsigned int word, int src_offset, int flags, unsigned int member_mask) |
__device__ __forceinline__ unsigned int | SHFL_IDX_SYNC (unsigned int word, int src_lane, int flags, unsigned int member_mask) |
__device__ __forceinline__ float | FMUL_RZ (float a, float b) |
__device__ __forceinline__ float | FFMA_RZ (float a, float b, float c) |
__device__ __forceinline__ void | ThreadExit () |
Terminates the calling thread. | |
__device__ __forceinline__ void | ThreadTrap () |
Abort execution and generate an interrupt to the host CPU. | |
__device__ __forceinline__ int | RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z) |
Returns the row-major linear thread identifier for a multidimensional thread block. | |
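Row-major linearization can be sketched on the host as follows. On the device the thread coordinates would come from `threadIdx`; here they are passed explicitly as the hypothetical parameters `tx`, `ty`, and `tz`, and the function name `RowMajorTidRef` is likewise an assumption.

```cpp
// Host-side sketch (hypothetical name and parameters). x varies fastest,
// then y, then z, which is the usual row-major convention for thread blocks.
inline int RowMajorTidRef(int tx, int ty, int tz, int block_dim_x, int block_dim_y)
{
    return (tz * block_dim_y + ty) * block_dim_x + tx;
}
```

In an 8 x 4 x 2 block, the thread at (x = 3, y = 2, z = 1) maps to linear identifier (1 * 4 + 2) * 8 + 3 = 51.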
__device__ __forceinline__ unsigned int | LaneId () |
Returns the warp lane ID of the calling thread. | |
__device__ __forceinline__ unsigned int | WarpId () |
Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block. | |
__device__ __forceinline__ unsigned int | LaneMaskLt () |
Returns the warp lane mask of all lanes less than the calling thread. | |
__device__ __forceinline__ unsigned int | LaneMaskLe () |
Returns the warp lane mask of all lanes less than or equal to the calling thread. | |
__device__ __forceinline__ unsigned int | LaneMaskGt () |
Returns the warp lane mask of all lanes greater than the calling thread. | |
__device__ __forceinline__ unsigned int | LaneMaskGe () |
Returns the warp lane mask of all lanes greater than or equal to the calling thread. | |
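The four lane masks above are simple functions of the lane index. The host-side sketch below computes them from an explicit `lane` argument instead of reading a special register. It assumes a 32-lane warp and a 32-bit `unsigned int`, and the function names are hypothetical.

```cpp
// Host-side sketch (hypothetical names). Assumes lane is in [0, 31],
// a 32-lane warp, and a 32-bit unsigned int.

// Bits set for all lanes strictly below lane.
inline unsigned int LaneMaskLtRef(unsigned int lane)
{
    return (1u << lane) - 1u;
}

// Bits set for lane and all lanes below it.
inline unsigned int LaneMaskLeRef(unsigned int lane)
{
    return 0xFFFFFFFFu >> (31u - lane);
}

// Bits set for all lanes strictly above lane.
inline unsigned int LaneMaskGtRef(unsigned int lane)
{
    return ~LaneMaskLeRef(lane);
}

// Bits set for lane and all lanes above it.
inline unsigned int LaneMaskGeRef(unsigned int lane)
{
    return ~LaneMaskLtRef(lane);
}
```

A typical use of such masks is ranking: combining a ballot result with the less-than mask and counting the set bits gives the number of lower lanes whose predicate was true.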
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask) |
Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane i-src_offset. For thread lanes i < src_offset, the thread's own input is returned to the thread. | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask) |
Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane i+src_offset. For thread lanes i where i+src_offset > last_thread, the thread's own input is returned to the thread. | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | ShuffleIndex (T input, int src_lane, unsigned int member_mask) |
Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. | |
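The data movement of the three shuffles can be emulated on the host by treating a vector as the per-lane inputs of one logical warp. This is only a sketch of the semantics described above: the names are hypothetical, `member_mask` is omitted, `src_offset` is assumed non-negative, and the boundary rules (lanes whose source falls before `first_thread` or after `last_thread` keep their own value) are an interpretation of the parameters rather than a statement of the device implementation.

```cpp
#include <vector>

// Host-side emulation sketch (hypothetical names). Element i of the vector
// plays the role of the input contributed by warp-lane i.

template <typename T>
std::vector<T> ShuffleUpRef(const std::vector<T> &input, int src_offset, int first_thread)
{
    std::vector<T> out(input);
    for (int i = 0; i < static_cast<int>(input.size()); ++i)
    {
        const int src = i - src_offset;
        if (src >= first_thread) out[i] = input[src];  // otherwise keep own input
    }
    return out;
}

template <typename T>
std::vector<T> ShuffleDownRef(const std::vector<T> &input, int src_offset, int last_thread)
{
    std::vector<T> out(input);
    for (int i = 0; i < static_cast<int>(input.size()); ++i)
    {
        const int src = i + src_offset;
        if (src <= last_thread) out[i] = input[src];   // otherwise keep own input
    }
    return out;
}

template <typename T>
std::vector<T> ShuffleIndexRef(const std::vector<T> &input, int src_lane)
{
    std::vector<T> out(input);
    if (src_lane >= 0 && src_lane < static_cast<int>(input.size()))
    {
        for (auto &value : out) value = input[src_lane];  // broadcast one lane
    }
    return out;
}
```

For a four-lane logical warp holding 10, 11, 12, 13, a shuffle-up by one produces 10, 10, 11, 12 and a shuffle-down by one produces 11, 12, 13, 13.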
template<int LABEL_BITS> | |
__device__ unsigned int | MatchAny (unsigned int label) |
Blocked arrangement I/O (direct) | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a blocked arrangement across the thread block. | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a blocked arrangement across the thread block, guarded by range. | |
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. | |
template<CacheLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | InternalLoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a blocked arrangement across the thread block. | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a blocked arrangement of items across a thread block into a linear segment of items. | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range. | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Store a blocked arrangement of items across a thread block into a linear segment of items. | |
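In a blocked arrangement each thread owns a contiguous run of ITEMS_PER_THREAD items. The host-side sketch below emulates the guarded load with a fall-back value for a single thread. The name `LoadBlockedRef` is hypothetical and a raw pointer stands in for the input iterator.

```cpp
// Host-side sketch (hypothetical name) of the blocked index mapping:
// thread linear_tid reads items [linear_tid * ITEMS_PER_THREAD,
// (linear_tid + 1) * ITEMS_PER_THREAD). Positions at or beyond valid_items
// receive oob_default instead of being read.
template <typename T, int ITEMS_PER_THREAD>
void LoadBlockedRef(int linear_tid, const T *block_ptr, T (&items)[ITEMS_PER_THREAD],
                    int valid_items, T oob_default)
{
    for (int item = 0; item < ITEMS_PER_THREAD; ++item)
    {
        const int idx = linear_tid * ITEMS_PER_THREAD + item;
        items[item] = (idx < valid_items) ? block_ptr[idx] : oob_default;
    }
}
```

With two items per thread, thread 1 receives elements 2 and 3 of the segment. If only seven items are valid, thread 3 receives element 6 followed by the fall-back value.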
Striped arrangement I/O (direct) | |
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a striped arrangement across the thread block. | |
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a striped arrangement across the thread block, guarded by range. | |
template<int BLOCK_THREADS, typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a striped arrangement of data across the thread block into a linear segment of items. | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range. | |
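In a striped arrangement consecutive items are dealt out to consecutive threads, so a thread's items are BLOCK_THREADS apart in memory. A host-side sketch of the unguarded load for one thread follows. The name `LoadStripedRef` is hypothetical and a raw pointer stands in for the input iterator.

```cpp
// Host-side sketch (hypothetical name) of the striped index mapping:
// item j of thread linear_tid is element linear_tid + j * BLOCK_THREADS.
template <int BLOCK_THREADS, typename T, int ITEMS_PER_THREAD>
void LoadStripedRef(int linear_tid, const T *block_ptr, T (&items)[ITEMS_PER_THREAD])
{
    for (int item = 0; item < ITEMS_PER_THREAD; ++item)
    {
        items[item] = block_ptr[linear_tid + item * BLOCK_THREADS];
    }
}
```

With four threads and two items per thread, thread 1 receives elements 1 and 5. Adjacent threads touch adjacent addresses on each step, which is why striped access tends to coalesce well.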
Warp-striped arrangement I/O (direct) | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a warp-striped arrangement across the thread block. | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range. | |
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a warp-striped arrangement of data across the thread block into a linear segment of items. | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range. | |
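A warp-striped arrangement applies the striped pattern within each warp's own contiguous sub-segment. The helper below computes the element index for one item of one thread. The name `WarpStripedIndexRef` and the explicit `warp_threads` parameter are assumptions made so the sketch can run on the host with small warps.

```cpp
// Host-side sketch (hypothetical name). Each warp owns a contiguous chunk of
// warp_threads * items_per_thread elements; within that chunk, items are
// striped across the warp's lanes.
inline int WarpStripedIndexRef(int linear_tid, int item, int items_per_thread, int warp_threads)
{
    const int lane        = linear_tid % warp_threads;   // position within the warp
    const int warp        = linear_tid / warp_threads;   // which warp in the block
    const int warp_offset = warp * warp_threads * items_per_thread;
    return warp_offset + lane + item * warp_threads;
}
```

With hypothetical four-thread warps and two items per thread, thread 5 (lane 1 of warp 1) accesses elements 9 and 13.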
Thread I/O (cache modified) | |
template<CacheLoadModifier MODIFIER, typename InputIteratorT > | |
__device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type | ThreadLoad (InputIteratorT itr) |
Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type. | |
template<typename InputIteratorT > | |
__device__ __forceinline__ std::iterator_traits< InputIteratorT >::value_type | ThreadLoad (InputIteratorT itr, Int2Type< LOAD_DEFAULT >, Int2Type< false >) |
template<typename T > | |
__device__ __forceinline__ T | ThreadLoad (T *ptr, Int2Type< LOAD_DEFAULT >, Int2Type< true >) |
template<typename T > | |
__device__ __forceinline__ T | ThreadLoadVolatilePointer (T *ptr, Int2Type< true >) |
template<typename T > | |
__device__ __forceinline__ T | ThreadLoadVolatilePointer (T *ptr, Int2Type< false >) |
template<typename T > | |
__device__ __forceinline__ T | ThreadLoad (T *ptr, Int2Type< LOAD_VOLATILE >, Int2Type< true >) |
template<typename T , int MODIFIER> | |
__device__ __forceinline__ T | ThreadLoad (T const *ptr, Int2Type< MODIFIER >, Int2Type< true >) |
template<CacheStoreModifier MODIFIER, typename OutputIteratorT , typename T > | |
__device__ __forceinline__ void | ThreadStore (OutputIteratorT itr, T val) |
Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. | |
template<typename OutputIteratorT , typename T > | |
__device__ __forceinline__ void | ThreadStore (OutputIteratorT itr, T val, Int2Type< STORE_DEFAULT >, Int2Type< false >) |
template<typename T > | |
__device__ __forceinline__ void | ThreadStore (T *ptr, T val, Int2Type< STORE_DEFAULT >, Int2Type< true >) |
template<typename T > | |
__device__ __forceinline__ void | ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< true >) |
template<typename T > | |
__device__ __forceinline__ void | ThreadStoreVolatilePtr (T *ptr, T val, Int2Type< false >) |
template<typename T > | |
__device__ __forceinline__ void | ThreadStore (T *ptr, T val, Int2Type< STORE_VOLATILE >, Int2Type< true >) |
template<typename T , int MODIFIER> | |
__device__ __forceinline__ void | ThreadStore (T *ptr, T val, Int2Type< MODIFIER >, Int2Type< true >) |
Variables | |
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > | num_output_bins_wrapper |
< Input data to reduce | |
ArrayWrapper< int, NUM_ACTIVE_CHANNELS > | num_privatized_bins_wrapper |
The number of bins per privatized histogram. | |
ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > | d_output_histograms_wrapper |
Reference to final output histograms. | |
ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > | d_privatized_histograms_wrapper |
Reference to privatized histograms. | |
ArrayWrapper< OutputDecodeOpT, NUM_ACTIVE_CHANNELS > | output_decode_op_wrapper |
The transform operator for determining output bin-ids from privatized counter indices, one for each channel. | |
ArrayWrapper< PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS > | privatized_decode_op_wrapper |
The transform operator for determining privatized counter indices from samples, one for each channel. | |
OffsetT | num_row_pixels |
The number of multi-channel pixels per row in the region of interest. | |
OffsetT | num_rows |
The number of rows in the region of interest. | |
OffsetT | row_stride_samples |
The number of samples between starts of consecutive rows in the region of interest. | |
int | tiles_per_row |
Number of image tiles per row. | |
GridQueue< int > | tile_queue |
< Drain queue descriptor for dynamically mapping tile data onto thread blocks | |
__shared__ AgentHistogramT::TempStorage | temp_storage |
OffsetT * | d_spine |
< [in] Input keys buffer | |
OffsetT | OffsetT |
[in] Total number of input data items | |
int | current_bit |
[in] Bit position of current radix digit | |
int | num_bits |
[in] Number of bits of current radix digit | |
GridEvenShare< OffsetT > | even_share |
< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block | |
AgentRadixSortUpsweepT | upsweep (temp_storage, d_keys, current_bit, num_bits) |
int | num_counts |
< [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.) | |
int | block_offset = 0 |
BlockScanRunningPrefixOp< OffsetT, Sum > | prefix_op (0, Sum()) |
KeyT * | d_keys_out |
< [in] Input keys buffer | |
const ValueT * | d_values_in |
[in] Input values buffer | |
ValueT * | d_values_out |
[out] Output values buffer | |
OffsetT | num_items = segment_end - segment_begin |
[in] Total number of input data items | |
even_share | block_end |
int | end_bit |
< [in] The past-the-end (most-significant) bit index needed for key comparison | |
KeyT | keys [ITEMS_PER_THREAD] |
ValueT | values [ITEMS_PER_THREAD] |
UnsignedBitsT | default_key_bits = (IS_DESCENDING) ? Traits<KeyT>::LOWEST_KEY : Traits<KeyT>::MAX_KEY |
KeyT | default_key = reinterpret_cast<KeyT&>(default_key_bits) |
OffsetIteratorT | d_begin_offsets |
[in] Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_* | |
OffsetIteratorT | d_end_offsets |
[in] Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith segment is considered empty. | |
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT | int |
[in] The number of segments that comprise the sorting data | |
int | pass_bits |
< [in] Number of bits of current radix digit | |
OffsetT | segment_begin = d_begin_offsets[blockIdx.x] |
OffsetT | segment_end = d_end_offsets[blockIdx.x] |
OffsetT | bin_count [BINS_TRACKED_PER_THREAD] |
OffsetT | bin_offset [BINS_TRACKED_PER_THREAD] |
OutputIteratorT | d_out |
< [in] Pointer to the input sequence of data items | |
ReductionOpT | reduction_op |
< [in] Binary reduction functor | |
OutputT | block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share) |
OuputT | init |
< [in] The initial value of the reduction | |
UniqueOutputIteratorT | d_unique_out |
< Pointer to the input sequence of keys | |
AggregatesOutputIteratorT | d_aggregates_out |
Pointer to the output sequence of value aggregates (one aggregate per run) | |
NumRunsOutputIteratorT | d_num_runs_out |
Pointer to total number of runs encountered (i.e., the length of d_unique_out) | |
ScanTileStateT | tile_state |
Tile status interface. | |
int | start_tile |
The starting tile for the current grid. | |
EqualityOpT | equality_op |
KeyT equality operator. | |
OffsetsOutputIteratorT | d_offsets_out |
< [in] Pointer to input sequence of data items | |
LengthsOutputIteratorT | d_lengths_out |
[out] Pointer to output sequence of run-lengths | |
ScanTileStateT | tile_status |
[in] Tile status interface | |
int | num_tiles |
< [in] Total number of tiles for the entire problem | |
ScanOpT | scan_op |
Binary scan functor. | |
InitValueT | init_value |
Initial value to seed the exclusive scan. | |
FlagsInputIteratorT | d_flags |
< [in] Pointer to the input sequence of data items | |
SelectedOutputIteratorT | d_selected_out |
[out] Pointer to the output sequence of selected data items | |
NumSelectedIteratorT | d_num_selected_out |
[out] Pointer to the total number of items selected (i.e., length of d_selected_out) | |
SelectOpT | select_op |
[in] Selection operator | |
OffsetT | spmv_params |
[in] SpMV input parameter bundle | |
CoordinateT * | d_tile_coordinates |
[in] Pointer to the temporary array of tile starting coordinates | |
KeyValuePair< OffsetT, ValueT > * | d_tile_carry_pairs |
[out] Pointer to the temporary array carry-out dot product row-ids, one per block | |
int | num_segment_fixup_tiles |
< [in] Number of reduce-by-key tiles (fixup grid size) | |
Optional outer namespace(s)
CUB namespace
typedef AgentRadixSortDownsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltDownsweepPolicy, typename ChainedPolicyT::ActivePolicy::DownsweepPolicy>::Type, IS_DESCENDING, KeyT, ValueT, OffsetT> cub::AgentRadixSortDownsweepT |
Definition at line 187 of file dispatch_radix_sort.cuh.
typedef AgentRadixSortUpsweep< typename If<(ALT_DIGIT_BITS), typename ChainedPolicyT::ActivePolicy::AltUpsweepPolicy, typename ChainedPolicyT::ActivePolicy::UpsweepPolicy>::Type, KeyT, OffsetT> cub::AgentRadixSortUpsweepT |
Definition at line 92 of file dispatch_radix_sort.cuh.
typedef AgentReduce< typename ChainedPolicyT::ActivePolicy::ReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT> cub::AgentReduceT |
Definition at line 88 of file dispatch_reduce.cuh.
typedef AgentRadixSortDownsweep<SegmentedPolicyT, IS_DESCENDING, KeyT, ValueT, OffsetT> cub::BlockDownsweepT |
Definition at line 369 of file dispatch_radix_sort.cuh.
typedef BlockLoad< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM> cub::BlockLoadKeys |
Definition at line 245 of file dispatch_radix_sort.cuh.
typedef BlockLoad< ValueT, BLOCK_THREADS, ITEMS_PER_THREAD, ChainedPolicyT::ActivePolicy::SingleTilePolicy::LOAD_ALGORITHM> cub::BlockLoadValues |
Definition at line 252 of file dispatch_radix_sort.cuh.
typedef BlockRadixSort< KeyT, BLOCK_THREADS, ITEMS_PER_THREAD, ValueT, ChainedPolicyT::ActivePolicy::SingleTilePolicy::RADIX_BITS, (ChainedPolicyT::ActivePolicy::SingleTilePolicy::RANK_ALGORITHM == RADIX_RANK_MEMOIZE), ChainedPolicyT::ActivePolicy::SingleTilePolicy::SCAN_ALGORITHM> cub::BlockRadixSortT |
Definition at line 238 of file dispatch_radix_sort.cuh.
typedef AgentRadixSortUpsweep< AgentRadixSortUpsweepPolicy<BLOCK_THREADS, ITEMS_PER_THREAD, SegmentedPolicyT::LOAD_MODIFIER, RADIX_BITS>, KeyT, OffsetT> cub::BlockUpsweepT |
Definition at line 363 of file dispatch_radix_sort.cuh.
typedef BlockScan<OffsetT, BLOCK_THREADS> cub::DigitScanT |
Definition at line 366 of file dispatch_radix_sort.cuh.
typedef Traits<KeyT>::UnsignedBits cub::UnsignedBitsT |
Definition at line 255 of file dispatch_radix_sort.cuh.
anonymous enum |
Definition at line 348 of file dispatch_radix_sort.cuh.
anonymous enum |
Enumerator | |
---|---|
BINS_TRACKED_PER_THREAD | Number of bin-starting offsets tracked per thread. |
Definition at line 371 of file dispatch_radix_sort.cuh.
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.
Definition at line 56 of file block_histogram.cuh.
enum cub::BlockHistogramMemoryPreference |
Definition at line 58 of file agent_histogram.cuh.
BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.
Definition at line 60 of file block_reduce.cuh.
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.
Enumerator | |
---|---|
BLOCK_SCAN_RAKING |
|
BLOCK_SCAN_RAKING_MEMOIZE |
|
BLOCK_SCAN_WARP_SCANS |
|
Definition at line 57 of file block_scan.cuh.
Radix ranking algorithm
Definition at line 62 of file agent_radix_sort_downsweep.cuh.
enum cub::ScanTileStatus |
Enumerations of tile status
Definition at line 105 of file single_pass_scan_operators.cuh.
cub::__launch_bounds__ | ( | int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltDownsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::DownsweepPolicy::BLOCK_THREADS) | ) | const |
< Signed integer type for global offsets
Downsweep pass kernel entry point (multi-block). Scatters keys (and values) into corresponding bins for the current digit place.
cub::__launch_bounds__ | ( | int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltSegmentedPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::SegmentedPolicy::BLOCK_THREADS) | ) | const |
< Signed integer type for global offsets
Segmented radix sorting pass (one block per segment)
cub::__launch_bounds__ | ( | int((ALT_DIGIT_BITS) ? ChainedPolicyT::ActivePolicy::AltUpsweepPolicy::BLOCK_THREADS : ChainedPolicyT::ActivePolicy::UpsweepPolicy::BLOCK_THREADS) | ) | const |
< Signed integer type for global offsets
Upsweep digit-counting kernel entry point (multi-block). Computes privatized digit histograms, one per block.
cub::__launch_bounds__ | ( | int(AgentHistogramPolicyT::BLOCK_THREADS) | ) |
< Signed integer type for global offsets
Histogram privatized sweep kernel entry point (multi-block). Computes privatized histograms, one per thread block.
cub::__launch_bounds__ | ( | int(AgentReduceByKeyPolicyT::BLOCK_THREADS) | ) |
< Signed integer type for global offsets
Multi-block reduce-by-key sweep kernel entry point
cub::__launch_bounds__ | ( | int(AgentSegmentFixupPolicyT::BLOCK_THREADS) | ) |
< Tile status interface type
Multi-block reduce-by-key sweep kernel entry point
cub::__launch_bounds__ | ( | int(AgentSelectIfPolicyT::BLOCK_THREADS) | ) |
< Whether or not we push rejected items to the back of the output
Select kernel entry point (multi-block)
Performs functor-based selection if SelectOpT functor type != NullType. Otherwise performs flag-based selection if FlagsInputIterator's value type != NullType. Otherwise performs discontinuity selection (keep unique).
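The three selection modes can be illustrated with sequential host-side equivalents built from the C++ standard library. This only sketches the input/output behaviour of each mode, not the tiled device kernel. The function names are hypothetical and the element type is fixed to `int` for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Functor-based selection: keep items for which select_op returns true.
template <typename SelectOp>
std::vector<int> SelectIfRef(const std::vector<int> &in, SelectOp select_op)
{
    std::vector<int> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), select_op);
    return out;
}

// Flag-based selection: keep items whose corresponding flag is non-zero.
std::vector<int> SelectFlaggedRef(const std::vector<int> &in, const std::vector<char> &flags)
{
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size() && i < flags.size(); ++i)
    {
        if (flags[i]) out.push_back(in[i]);
    }
    return out;
}

// Discontinuity selection: keep the first item of each run of equal items.
std::vector<int> SelectUniqueRef(const std::vector<int> &in)
{
    std::vector<int> out;
    std::unique_copy(in.begin(), in.end(), std::back_inserter(out));
    return out;
}
```

Note that discontinuity selection only collapses adjacent duplicates: for the input 1, 1, 2, 2, 2, 3, 1 the result is 1, 2, 3, 1.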
cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS) | ) |
< Binary reduction functor type having member T operator()(const T &a, const T &b)
Reduce region kernel entry point (multi-block). Computes privatized reductions, one per thread block.
cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS) | ) |
< Data element type that is convertible to the value type of OutputIteratorT
Segmented reduction (one block per segment)
cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::ScanPolicy::BLOCK_THREADS) | , |
1 | |||
) |
< Signed integer type for global offsets
Spine scan kernel entry point (single-block). Computes an exclusive prefix sum over the privatized digit histograms
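An exclusive prefix sum turns a sequence of counts into starting offsets: each output is the sum of all counts before it. A sequential host-side sketch follows. The name `ExclusiveSumRef` is hypothetical, and the real kernel performs the scan cooperatively within a single thread block.

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch (hypothetical name): out[i] = counts[0] + ... + counts[i - 1],
// with out[0] = 0.
std::vector<int> ExclusiveSumRef(const std::vector<int> &counts)
{
    std::vector<int> out(counts.size());
    int running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        out[i]   = running;     // offset where this bin's items begin
        running += counts[i];
    }
    return out;
}
```

For counts 3, 1, 4, 2 the scan yields 0, 3, 4, 8, which can be read as the positions at which each bin's items start in the output.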
cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS) | , |
1 | |||
) | const |
< Signed integer type for global offsets
Single pass kernel entry point (single-block). Fully sorts a tile of input.
cub::__launch_bounds__ | ( | int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS) | , |
1 | |||
) |
< Data element type that is convertible to the value type of OutputIteratorT
Reduce a single tile kernel entry point (single-block). Can be used to aggregate privatized thread block reductions from a previous multi-block reduction pass.
cub::__launch_bounds__ | ( | int(ScanPolicyT::BLOCK_THREADS) | ) |
< Signed integer type for global offsets
Scan kernel entry point (multi-block)
cub::__launch_bounds__ | ( | int(SpmvPolicyT::BLOCK_THREADS) | ) |
< Whether the input parameter Beta is 0
Spmv agent entry point
__global__ void cub::DeviceCompactInitKernel | ( | ScanTileStateT | tile_state, |
int | num_tiles, | ||
NumSelectedIteratorT | d_num_selected_out | ||
) |
< Output iterator type for recording the number of items selected
Initialization kernel for tile status initialization (multi-block)
[in] | tile_state | Tile status interface |
[in] | num_tiles | Number of tiles |
[out] | d_num_selected_out | Pointer to the total number of items selected (i.e., length of d_selected_out ) |
Definition at line 78 of file dispatch_scan.cuh.
__global__ void cub::DeviceHistogramInitKernel | ( | ArrayWrapper< int, NUM_ACTIVE_CHANNELS > | num_output_bins_wrapper, |
ArrayWrapper< CounterT *, NUM_ACTIVE_CHANNELS > | d_output_histograms_wrapper, | ||
GridQueue< int > | tile_queue | ||
) |
Signed integer type for global offsets
Histogram initialization kernel entry point
num_output_bins_wrapper | Number of output histogram bins per channel |
d_output_histograms_wrapper | Histogram counter data having logical dimensions CounterT[NUM_ACTIVE_CHANNELS][num_bins.array[CHANNEL]] |
tile_queue | Drain queue descriptor for dynamically mapping tile data onto thread blocks |
Definition at line 67 of file dispatch_histogram.cuh.
__global__ void cub::DeviceScanInitKernel | ( | ScanTileStateT | tile_state, |
int | num_tiles | ||
) |
Tile status interface type
Initialization kernel for tile status initialization (multi-block)
[in] | tile_state | Tile status interface |
[in] | num_tiles | Number of tiles |
Definition at line 64 of file dispatch_scan.cuh.
__global__ void cub::DeviceSpmv1ColKernel | ( | SpmvParams< ValueT, OffsetT > | spmv_params | ) |
Signed integer type for sequence offsets
Spmv search kernel. Identifies merge path starting coordinates for each tile.
[in] | spmv_params | SpMV input parameter bundle |
Definition at line 68 of file dispatch_spmv_orig.cuh.
__global__ void cub::DeviceSpmvSearchKernel | ( | int | num_merge_tiles, |
CoordinateT * | d_tile_coordinates, | ||
SpmvParamsT | spmv_params | ||
) |
SpmvParams type
Spmv search kernel. Identifies merge path starting coordinates for each tile.
Constants
[in] | num_merge_tiles | Number of SpMV merge tiles (spmv grid size) |
[out] | d_tile_coordinates | Pointer to the temporary array of tile starting coordinates |
[in] | spmv_params | SpMV input parameter bundle |
Definition at line 104 of file dispatch_spmv_orig.cuh.
cub::for | ( | ) |
Definition at line 302 of file dispatch_radix_sort.cuh.
cub::if | ( | ! | KEYS_ONLY | ) |
Definition at line 280 of file dispatch_radix_sort.cuh.
cub::if | ( | IS_DESCENDING | ) |
Definition at line 415 of file dispatch_radix_sort.cuh.
cub::if | ( | num_items | == 0 | ) |
Definition at line 133 of file dispatch_reduce.cuh.
cub::if | ( | segment_begin | == segment_end | ) |
Definition at line 210 of file dispatch_reduce.cuh.
__device__ __forceinline__ OffsetT cub::LowerBound | ( | InputIteratorT | input, |
OffsetT | num_items, | ||
T | val | ||
) |
Returns the offset of the first value within input that does not compare less than val.
[in] | input | Input sequence |
[in] | num_items | Input sequence length |
[in] | val | Search key |
Definition at line 95 of file thread_search.cuh.
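The lower-bound search described above can be sketched on the host as a binary search over a sorted sequence. This is an illustration under assumptions, not CUB's implementation: the name `lower_bound_offset` is hypothetical, and the input is assumed to be sorted in ascending order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of the search described above (not CUB's code).
// Returns the offset of the first element in [input, input + num_items)
// that does not compare less than val, or num_items if every element does.
template <typename InputIteratorT, typename OffsetT, typename T>
OffsetT lower_bound_offset(InputIteratorT input, OffsetT num_items, T val)
{
    assert(num_items >= 0);
    OffsetT begin = 0;
    OffsetT count = num_items;
    while (count > 0)
    {
        const OffsetT half = count / 2;
        if (input[begin + half] < val)
        {
            // The probed element is too small: discard it and everything before it.
            begin = begin + half + 1;
            count = count - half - 1;
        }
        else
        {
            // The probed element is a candidate: keep searching to its left.
            count = half;
        }
    }
    return begin;
}
```

For the sorted sequence `{1, 3, 3, 5, 7}`, searching for `3` returns offset 1 (the first of the two equal keys), and searching for `8` returns 5, the sequence length.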
Compute a 32-bit mask of threads having the same least-significant LABEL_BITS of label as the calling thread.
Definition at line 703 of file util_ptx.cuh.
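To make the mask semantics concrete, here is a host-side model of a 32-lane warp in which each lane holds one label. It is a sketch only: the function name `match_any_mask` and the array-of-labels representation are assumptions for illustration, and the real device routine works on per-thread registers rather than an array.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model (hypothetical, for illustration only): labels[i] is the label
// held by lane i of a 32-thread warp. Returns the 32-bit mask whose bit i is set
// when lane i's least-significant label_bits bits match those of calling_lane.
inline std::uint32_t match_any_mask(const unsigned labels[32], int calling_lane, int label_bits)
{
    assert(calling_lane >= 0 && calling_lane < 32);
    assert(label_bits > 0 && label_bits < 32);
    const unsigned low_bits = (1u << label_bits) - 1u;
    const unsigned mine = labels[calling_lane] & low_bits;

    std::uint32_t mask = 0;
    for (int lane = 0; lane < 32; ++lane)
    {
        if ((labels[lane] & low_bits) == mine)
            mask |= (1u << lane);
    }
    return mask;
}
```

The calling lane always appears in its own mask, so the result is never zero.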
__host__ __device__ __forceinline__ void cub::MergePathSearch | ( | OffsetT | diagonal, |
AIteratorT | a, | ||
BIteratorT | b, | ||
OffsetT | a_len, | ||
OffsetT | b_len, | ||
CoordinateT & | path_coordinate | ||
) |
Computes the begin offsets into A and B for the specified diagonal.
The value type of the input iterator
Definition at line 53 of file thread_search.cuh.
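The diagonal search can be illustrated with a small host-side sketch. It assumes both inputs are sorted in ascending order and that ties are resolved by taking from A first; the `Coordinate` struct and the name `merge_path_search` are hypothetical stand-ins, not the library's types.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical coordinate type: x is the begin offset into A, y into B.
struct Coordinate
{
    int x;
    int y;
};

// Host-side sketch of a merge-path diagonal search (illustrative, not CUB's code).
// For sorted sequences a and b, finds how many items of each are consumed by the
// first `diagonal` items of their merge. Equal keys are taken from a first.
inline Coordinate merge_path_search(int diagonal,
                                    const std::vector<int>& a,
                                    const std::vector<int>& b)
{
    const int a_len = static_cast<int>(a.size());
    const int b_len = static_cast<int>(b.size());
    assert(diagonal >= 0 && diagonal <= a_len + b_len);

    // The split point along the diagonal is bounded by both sequence lengths.
    int split_min = std::max(diagonal - b_len, 0);
    int split_max = std::min(diagonal, a_len);

    while (split_min < split_max)
    {
        const int pivot = (split_min + split_max) / 2;
        if (a[pivot] <= b[diagonal - pivot - 1])
            split_min = pivot + 1;  // a[pivot] lies before the diagonal
        else
            split_max = pivot;      // a[pivot] lies on or after the diagonal
    }

    Coordinate c;
    c.x = split_min;
    c.y = diagonal - split_min;
    return c;
}
```

With `a = {1, 3, 5, 7}` and `b = {2, 4, 6, 8}`, diagonal 4 yields `(2, 2)`: the first four merged items are 1, 2, 3, 4, two from each input. Evaluating one diagonal per tile gives each tile an independent starting coordinate.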
__device__ __forceinline__ void cub::NormalizeReductionOutput | ( | KeyValuePairT & | val, |
OffsetT | base_offset, | ||
ArgIndexInputIterator< WrappedIteratorT, OffsetT, OutputValueT > | |||
) |
Normalize input iterator to segment offset (specialized for arg-index)
Definition at line 164 of file dispatch_reduce.cuh.
__device__ __forceinline__ void cub::NormalizeReductionOutput | ( | T & | , |
OffsetT | , | ||
IteratorT | |||
) |
Normalize input iterator to segment offset.
Definition at line 154 of file dispatch_reduce.cuh.
__device__ __forceinline__ OffsetT cub::UpperBound | ( | InputIteratorT | input, |
OffsetT | num_items, | ||
T | val | ||
) |
Returns the offset of the first value within input that compares greater than val.
[in] | input | Input sequence |
[in] | num_items | Input sequence length |
[in] | val | Search key |
Definition at line 126 of file thread_search.cuh.
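For sequences containing duplicate keys, the upper-bound search lands one past the last key equal to val. A minimal host-side sketch follows, assuming ascending sorted input; the name `upper_bound_offset` is hypothetical and this is not CUB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of the search described above (not CUB's code).
// Returns the offset of the first element in [input, input + num_items)
// that compares greater than val, or num_items if no element does.
template <typename InputIteratorT, typename OffsetT, typename T>
OffsetT upper_bound_offset(InputIteratorT input, OffsetT num_items, T val)
{
    assert(num_items >= 0);
    OffsetT begin = 0;
    OffsetT count = num_items;
    while (count > 0)
    {
        const OffsetT half = count / 2;
        if (val < input[begin + half])
        {
            // The probed element is already greater: keep searching to its left.
            count = half;
        }
        else
        {
            // The probed element is <= val: discard it and everything before it.
            begin = begin + half + 1;
            count = count - half - 1;
        }
    }
    return begin;
}
```

For `{1, 3, 3, 5, 7}` and `val = 3` the result is 3, the offset just past the run of equal keys.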
cub::while | ( | block_offset+AgentScanT::TILE_ITEMS<= | num_counts | ) |
Definition at line 141 of file dispatch_radix_sort.cuh.
OffsetT cub::bin_count[BINS_TRACKED_PER_THREAD] |
Definition at line 410 of file dispatch_radix_sort.cuh.
cub::bin_offset |
Definition at line 440 of file dispatch_radix_sort.cuh.
OutputT cub::block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share) |
Definition at line 94 of file dispatch_reduce.cuh.
even_share cub::block_end |
Definition at line 198 of file dispatch_radix_sort.cuh.
int cub::block_offset = 0 |
Definition at line 139 of file dispatch_radix_sort.cuh.
KeyT const ValueT ValueT OffsetIteratorT OffsetIteratorT int cub::current_bit |
[in] Bit position of current radix digit
Definition at line 76 of file dispatch_radix_sort.cuh.
AggregatesOutputIteratorT cub::d_aggregates_out |
Pointer to the output sequence of value aggregates (one aggregate per run)
[in] Pointer to the array of carry-out dot product row-ids, one per spmv block
[in,out] Output value aggregates
Definition at line 76 of file dispatch_reduce_by_key.cuh.
OutputIteratorT OffsetIteratorT cub::d_begin_offsets |
[in] Pointer to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*
Definition at line 334 of file dispatch_radix_sort.cuh.
OutputIteratorT OffsetIteratorT OffsetIteratorT cub::d_end_offsets |
[in] Pointer to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i]-1 <= d_begin_offsets[i], the ith segment is considered empty.
Definition at line 335 of file dispatch_radix_sort.cuh.
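The way the two offset arrays delimit segments can be shown with a host-side segmented sum. This is a sketch under assumptions, not a CUB kernel: the function name `segmented_sum` is hypothetical, and the loop simply visits whatever lies in the half-open range from the begin offset to the end offset, so a segment with nothing in that range yields `init`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative host-side segmented sum (not CUB's code).
// Segment i covers data[begin_offsets[i]] .. data[end_offsets[i] - 1].
// A segment with no items in that range produces init.
inline std::vector<int> segmented_sum(const std::vector<int>& data,
                                      const std::vector<int>& begin_offsets,
                                      const std::vector<int>& end_offsets,
                                      int init)
{
    assert(begin_offsets.size() == end_offsets.size());
    const std::size_t num_segments = begin_offsets.size();

    std::vector<int> out(num_segments, init);
    for (std::size_t seg = 0; seg < num_segments; ++seg)
    {
        for (int i = begin_offsets[seg]; i < end_offsets[seg]; ++i)
        {
            assert(i >= 0 && static_cast<std::size_t>(i) < data.size());
            out[seg] += data[i];
        }
    }
    return out;
}
```

Segments need not be contiguous or cover the whole input; each output depends only on its own pair of offsets.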
FlagsInputIteratorT cub::d_flags |
[in] Pointer to the input sequence of data items
[in] Pointer to the input sequence of selection flags (if applicable)
Definition at line 78 of file dispatch_select_if.cuh.
KeyT * cub::d_keys_out |
[in] Input keys buffer
[in] Output keys buffer
Definition at line 164 of file dispatch_radix_sort.cuh.
OffsetsOutputIteratorT LengthsOutputIteratorT cub::d_lengths_out |
[out] Pointer to output sequence of run-lengths
Definition at line 78 of file dispatch_rle.cuh.
OffsetsOutputIteratorT LengthsOutputIteratorT NumRunsOutputIteratorT cub::d_num_runs_out |
Pointer to total number of runs encountered (i.e., the length of d_unique_out)
[out] Pointer to total number of runs (i.e., length of d_offsets_out)
Definition at line 77 of file dispatch_reduce_by_key.cuh.
cub::d_num_selected_out |
[out] Pointer to the total number of items selected (i.e., length of d_selected_out)
Definition at line 80 of file dispatch_select_if.cuh.
OffsetsOutputIteratorT cub::d_offsets_out |
[in] Pointer to input sequence of data items
[out] Pointer to output sequence of run-offsets
Definition at line 77 of file dispatch_rle.cuh.
OutputIteratorT cub::d_out |
[in] Pointer to the input sequence of data items
Input data
[out] Pointer to the output aggregate
Output data
Definition at line 71 of file dispatch_reduce.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> cub::d_output_histograms_wrapper |
Reference to final output histograms.
Definition at line 104 of file dispatch_histogram.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> cub::d_privatized_histograms_wrapper |
Reference to privatized histograms.
Definition at line 105 of file dispatch_histogram.cuh.
FlagsInputIteratorT SelectedOutputIteratorT cub::d_selected_out |
[out] Pointer to the output sequence of selected data items
Definition at line 79 of file dispatch_select_if.cuh.
KeyT const ValueT ValueT OffsetT * cub::d_spine |
[in] Input keys buffer
[in] Scan of privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
[out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
Definition at line 74 of file dispatch_radix_sort.cuh.
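The striped spine layout and its scan can be modelled on the host. In this sketch (an illustration, not CUB's code) the helper names `spine_index` and `exclusive_scan_spine` are hypothetical; the layout follows the description above, with all blocks' counts for digit 0 first, then all blocks' counts for digit 1, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of (digit, block) in a striped spine: all blocks' counts for digit 0,
// then all blocks' counts for digit 1, and so on. (Hypothetical helper.)
inline std::size_t spine_index(std::size_t digit, std::size_t block, std::size_t num_blocks)
{
    assert(block < num_blocks);
    return digit * num_blocks + block;
}

// In-place exclusive prefix sum over the spine (illustrative, not CUB's code).
// Afterwards, each entry holds the number of keys that precede that
// (digit, block) pair in digit-major order.
inline void exclusive_scan_spine(std::vector<int>& spine)
{
    int running = 0;
    for (std::size_t i = 0; i < spine.size(); ++i)
    {
        const int count = spine[i];
        spine[i] = running;
        running += count;
    }
}
```

Because of the striping, a single linear scan over the spine orders the output by digit first and by block second, which keeps keys with equal digits in their original block order.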
cub::d_tile_carry_pairs |
[out] Pointer to the temporary array of carry-out dot product row-ids, one per block
Definition at line 161 of file dispatch_spmv_orig.cuh.
OffsetT CoordinateT* cub::d_tile_coordinates |
[in] Pointer to the temporary array of tile starting coordinates
Definition at line 160 of file dispatch_spmv_orig.cuh.
UniqueOutputIteratorT cub::d_unique_out |
Pointer to the input sequence of keys
Pointer to the output sequence of unique keys (one key per run)
Definition at line 74 of file dispatch_reduce_by_key.cuh.
UniqueOutputIteratorT ValuesInputIteratorT cub::d_values_in |
[in] Input values buffer
Pointer to the input sequence of corresponding values.
Definition at line 165 of file dispatch_radix_sort.cuh.
KeyT const ValueT ValueT * cub::d_values_out |
[in] Output values buffer
Definition at line 166 of file dispatch_radix_sort.cuh.
cub::default_key = reinterpret_cast<KeyT&>(default_key_bits) |
Definition at line 272 of file dispatch_radix_sort.cuh.
UnsignedBitsT cub::default_key_bits = (IS_DESCENDING) ? Traits<KeyT>::LOWEST_KEY : Traits<KeyT>::MAX_KEY |
Definition at line 271 of file dispatch_radix_sort.cuh.
BlockDownsweepT::TempStorage cub::downsweep |
Definition at line 385 of file dispatch_radix_sort.cuh.
cub::end_bit |
[in] The past-the-end (most-significant) bit index needed for key comparison
Definition at line 219 of file dispatch_radix_sort.cuh.
FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT EqualityOpT cub::equality_op |
KeyT equality operator.
[in] Equality operator
[in] Equality operator for input items
Definition at line 80 of file dispatch_reduce_by_key.cuh.
OutputIteratorT OffsetT GridEvenShare< OffsetT > cub::even_share |
[in] Even-share descriptor for mapping an equal number of tiles onto each thread block
Definition at line 78 of file dispatch_radix_sort.cuh.
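One simple way to map an equal number of tiles onto each thread block is sketched here. It is an assumption-laden illustration and not necessarily the scheme the library's even-share descriptor uses: the struct `BlockRange` and the function `even_share_range` are hypothetical, and the field names merely echo the `block_offset` and `block_end` members listed on this page.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type: the half-open item range [block_offset, block_end)
// assigned to one thread block.
struct BlockRange
{
    int block_offset;
    int block_end;
};

// Illustrative even-share mapping (not CUB's code). Tiles are split as evenly as
// possible; when the tile count is not divisible by num_blocks, the first
// (num_tiles % num_blocks) blocks each take one extra tile.
inline BlockRange even_share_range(int num_items, int tile_items, int num_blocks, int block_id)
{
    assert(num_items >= 0 && tile_items > 0 && num_blocks > 0);
    assert(block_id >= 0 && block_id < num_blocks);

    const int num_tiles = (num_items + tile_items - 1) / tile_items;
    const int base = num_tiles / num_blocks;
    const int extra = num_tiles % num_blocks;

    const int first_tile = block_id * base + std::min(block_id, extra);
    const int tile_count = base + (block_id < extra ? 1 : 0);

    BlockRange r;
    r.block_offset = std::min(first_tile * tile_items, num_items);
    r.block_end = std::min((first_tile + tile_count) * tile_items, num_items);
    return r;
}
```

With 1000 items, 128-item tiles and 3 blocks, the 8 tiles are split 3, 3, 2, and the last block's range is clamped to the input length.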
OutputIteratorT OffsetIteratorT OffsetIteratorT ReductionOpT OutputT cub::init |
[in] The initial value of the reduction
Definition at line 118 of file dispatch_reduce.cuh.
OutputIteratorT ScanTileStateT int ScanOpT InitValueT cub::init_value |
Initial value to seed the exclusive scan.
Definition at line 110 of file dispatch_scan.cuh.
OutputIteratorT OffsetIteratorT OffsetIteratorT cub::int |
[in] The number of segments that comprise the sorting data
Definition at line 336 of file dispatch_radix_sort.cuh.
cub::keys |
Definition at line 267 of file dispatch_radix_sort.cuh.
[in] Number of bits of current radix digit
Definition at line 77 of file dispatch_radix_sort.cuh.
int cub::num_counts |
[in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
[in] Total number of bin-counts
Definition at line 120 of file dispatch_radix_sort.cuh.
AggregatesOutputIteratorT OffsetT cub::num_items = segment_end - segment_begin |
[in] Total number of input data items
[in] Total number of items to select from
Total number of scan items for the entire problem
[in] Total number of input items (i.e., length of d_in)
Definition at line 168 of file dispatch_radix_sort.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> cub::num_output_bins_wrapper |
Input data to reduce
The number of bins per final output histogram.
Definition at line 102 of file dispatch_histogram.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> cub::num_privatized_bins_wrapper |
The number of bins per privatized histogram.
Definition at line 103 of file dispatch_histogram.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT cub::num_row_pixels |
The number of multi-channel pixels per row in the region of interest.
Definition at line 108 of file dispatch_histogram.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT cub::num_rows |
The number of rows in the region of interest.
Definition at line 109 of file dispatch_histogram.cuh.
OffsetT CoordinateT KeyValuePair<OffsetT,ValueT> int ScanTileStateT int cub::num_segment_fixup_tiles |
[in] Number of reduce-by-key tiles (fixup grid size)
Definition at line 164 of file dispatch_spmv_orig.cuh.
[in] Total number of tiles for the entire problem
[in] Number of merge tiles
Definition at line 83 of file dispatch_rle.cuh.
OffsetT cub::OffsetT |
[in] Total number of input data items
Definition at line 75 of file dispatch_radix_sort.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> cub::output_decode_op_wrapper |
The transform operator for determining output bin-ids from privatized counter indices, one for each channel.
Definition at line 106 of file dispatch_histogram.cuh.
[in] Number of bits of current radix digit
Definition at line 338 of file dispatch_radix_sort.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> cub::privatized_decode_op_wrapper |
The transform operator for determining privatized counter indices from samples, one for each channel.
Definition at line 107 of file dispatch_histogram.cuh.
UniqueOutputIteratorT ValuesInputIteratorT AggregatesOutputIteratorT NumRunsOutputIteratorT ScanTileStateT int EqualityOpT ReductionOpT cub::reduction_op |
[in] Binary reduction functor
ValueT reduction operator.
Definition at line 74 of file dispatch_reduce.cuh.
volatile OffsetT cub::reverse_counts_in[RADIX_DIGITS] |
Definition at line 388 of file dispatch_radix_sort.cuh.
volatile OffsetT cub::reverse_counts_out[RADIX_DIGITS] |
Definition at line 389 of file dispatch_radix_sort.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT cub::row_stride_samples |
The number of samples between starts of consecutive rows in the region of interest.
Definition at line 110 of file dispatch_histogram.cuh.
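The region-of-interest parameters (pixels per row, number of rows, and the sample stride between row starts) combine into a sample index as sketched here. This host-side example is an illustration only: the function `channel_histogram`, the 8-bit sample type, and the fixed 256-bin output are assumptions, not the library's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative host-side histogram of one channel over a region of interest
// (not CUB's code). Samples are interleaved by channel within a pixel, and
// consecutive rows start row_stride_samples samples apart, so rows may carry
// trailing padding that is never visited.
template <int NUM_CHANNELS>
std::vector<int> channel_histogram(const std::vector<unsigned char>& samples,
                                   int num_row_pixels,
                                   int num_rows,
                                   int row_stride_samples,
                                   int channel)
{
    assert(channel >= 0 && channel < NUM_CHANNELS);
    assert(row_stride_samples >= num_row_pixels * NUM_CHANNELS);

    std::vector<int> bins(256, 0);  // one bin per possible 8-bit sample value
    for (int row = 0; row < num_rows; ++row)
    {
        const std::size_t row_start = static_cast<std::size_t>(row) * row_stride_samples;
        for (int pixel = 0; pixel < num_row_pixels; ++pixel)
        {
            const std::size_t idx =
                row_start + static_cast<std::size_t>(pixel) * NUM_CHANNELS + channel;
            assert(idx < samples.size());
            ++bins[samples[idx]];
        }
    }
    return bins;
}
```

In a two-channel image with two pixels per row and a stride of six samples, the last two samples of every row are padding and do not contribute to any bin.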
DigitScanT::TempStorage cub::scan |
Definition at line 390 of file dispatch_radix_sort.cuh.
OutputIteratorT ScanTileStateT int ScanOpT cub::scan_op |
Binary scan functor.
Definition at line 109 of file dispatch_scan.cuh.
OffsetT cub::segment_begin = d_begin_offsets[blockIdx.x] |
Definition at line 395 of file dispatch_radix_sort.cuh.
OffsetT cub::segment_end = d_end_offsets[blockIdx.x] |
Definition at line 396 of file dispatch_radix_sort.cuh.
FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT SelectOpT cub::select_op |
[in] Selection operator
Definition at line 82 of file dispatch_select_if.cuh.
OffsetT cub::spmv_params |
[in] SpMV input parameter bundle
Definition at line 159 of file dispatch_spmv_orig.cuh.
OutputIteratorT ScanTileStateT int cub::start_tile |
The starting tile for the current grid.
Definition at line 79 of file dispatch_reduce_by_key.cuh.
__shared__ AgentSegmentFixupT::TempStorage cub::temp_storage |
Definition at line 128 of file dispatch_histogram.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT int GridQueue<int> cub::tile_queue |
Drain queue descriptor for dynamically mapping tile data onto thread blocks
Definition at line 112 of file dispatch_histogram.cuh.
Tile status interface.
[in] Tile status interface
[in] Tile status interface for fixup reduce-by-key kernel
Definition at line 78 of file dispatch_reduce_by_key.cuh.
FlagsInputIteratorT SelectedOutputIteratorT NumSelectedIteratorT ScanTileStateT cub::tile_status |
[in] Tile status interface
Definition at line 80 of file dispatch_rle.cuh.
ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<int, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<CounterT*, NUM_ACTIVE_CHANNELS> ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS> ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS> OffsetT OffsetT OffsetT int cub::tiles_per_row |
Number of image tiles per row.
Definition at line 111 of file dispatch_histogram.cuh.
BlockUpsweepT::TempStorage cub::upsweep | ( | temp_storage | , |
d_keys | , | ||
current_bit | , | ||
num_bits | |||
) |
Definition at line 384 of file dispatch_radix_sort.cuh.
cub::values |
Definition at line 268 of file dispatch_radix_sort.cuh.