QTrk
|
CUDA implementation of the QueuedTracker interface. More...
#include <QueuedCUDATracker.h>
Classes | |
struct | Device |
Structure to maintain data for each GPU. More... | |
struct | KernelProfileTime |
Structure used to hold profiling data. More... | |
struct | Stream |
Structure to maintain data for each stream. Shell around CUDA native streams. More... | |
Public Types | |
typedef Image4DMemory< float > | ImageLUT |
Datatype to hold image lookup tables. More... | |
Public Types inherited from QueuedTracker | |
typedef std::map< std::string, std::string > | ConfigValueMap |
Datastructure used to return additional settings in a string-string key-value pair mapping. More... | |
Public Member Functions | |
QueuedCUDATracker (const QTrkComputedConfig &cc, int batchSize=-1) | |
Initialize a QueuedCUDATracker instance. More... | |
~QueuedCUDATracker () | |
Delete an instance of a CUDA Tracker. More... | |
void | EnableTextureCache (bool useTextureCache) |
Enable or disable the use of textures. More... | |
void | SetLocalizationMode (LocMode_t locType) override |
Select which algorithm is to be used. More... | |
void | ScheduleLocalization (void *data, int pitch, QTRK_PixelDataType pdt, const LocalizationJob *jobInfo) override |
Add a job to the queue to be processed. A job entails running the required algorithms on a single region of interest. More... | |
void | ClearResults () override |
Clear results. More... | |
void | SetRadialZLUT (float *data, int numLUTs, int planes) override |
Set the radial lookup tables to be used for z tracking. More... | |
void | SetRadialWeights (float *zcmp) override |
Set radial weights used for comparing LUT profiles. More... | |
void | GetRadialZLUT (float *data) override |
Get the radial lookup tables used for z tracking. More... | |
void | GetRadialZLUTSize (int &count, int &planes, int &radialSteps) override |
Get the dimensions of the radial lookup table data. More... | |
int | FetchResults (LocalizationResult *results, int maxResults) override |
Fetch available results. More... | |
void | BeginLUT (uint flags) override |
Setup to begin building a lookup table. More... | |
void | BuildLUT (void *data, int pitch, QTRK_PixelDataType pdt, int plane, vector2f *known_pos=0) override |
Add a new lookup table plane. More... | |
void | FinalizeLUT () override |
Finalize the lookup tables in memory. More... | |
void | EnableRadialZLUTCompareProfile (bool enabled) |
Set a flag to enable saving of error curves. More... | |
void | GetRadialZLUTCompareProfile (float *dst) |
Get saved error curve. More... | |
std::string | GetProfileReport () override |
Get the output of performance profiling. More... | |
void | Flush () override |
Stop waiting for more jobs to do, and just process the current batch. More... | |
int | GetQueueLength (int *maxQueueLen) override |
Get the lengths of the queue of jobs to be handled. More... | |
bool | IsIdle () override |
Test to see if the tracker is idle. More... | |
int | GetResultCount () override |
Get the number of finished localization jobs (=results) available in memory. More... | |
void | SetPixelCalibrationImages (float *offset, float *gain) override |
Set the pixel calibration images. More... | |
void | SetPixelCalibrationFactors (float offsetFactor, float gainFactor) override |
Set the pixel calibration factors. More... | |
ConfigValueMap | GetConfigValues () override |
Get the used additional configurations. More... | |
void | SetConfigValue (std::string name, std::string value) override |
Set an additional setting. More... | |
Public Member Functions inherited from QueuedTracker | |
QueuedTracker () | |
virtual | ~QueuedTracker () |
void | ScheduleImageData (ImageData *data, const LocalizationJob *jobInfo) |
Quick function to schedule a single ROI from an ImageData object. More... | |
virtual int | ScheduleFrame (void *imgptr, int pitch, int width, int height, ROIPosition *positions, int numROI, QTRK_PixelDataType pdt, const LocalizationJob *jobInfo) |
Schedule an entire frame at once, allowing for further optimizations. More... | |
virtual void | GetImageZLUTSize (int *dims) |
Get the dimensions of the image lookup table data. More... | |
virtual void | GetImageZLUT (float *dst) |
Get the image lookup tables used. More... | |
virtual bool | SetImageZLUT (float *src, float *radial_zlut, int *dims) |
Set the image lookup tables to be used for z tracking. More... | |
virtual std::string | GetWarnings () |
Get a report of encountered errors. More... | |
virtual bool | GetDebugImage (int ID, int *w, int *h, float **pData) |
Get the debug image for a specific thread. More... | |
ImageData | DebugImage (int ID) |
Get the debug image as an ImageData object. More... | |
void | ScheduleLocalization (uchar *data, int pitch, QTRK_PixelDataType pdt, uint frame, uint timestamp, vector3f *initial, uint zlutIndex) |
Add an image to the queue to be processed. Creates a job. More... | |
void | ComputeZBiasCorrection (int bias_planes, CImageData *result, int smpPerPixel, bool useSplineInterp) |
float | ZLUTBiasCorrection (float z, int zlut_planes, int bead) |
void | SetZLUTBiasCorrection (const CImageData &data) |
CImageData * | GetZLUTBiasCorrection () |
Public Attributes | |
KernelProfileTime | time |
KernelProfileTime | cpu_time |
int | batchesDone |
Number of fully completed batches. More... | |
std::string | deviceReport |
String holding a human-readable description of used GPUs. Filled during InitializeDeviceList. More... | |
Public Attributes inherited from QueuedTracker | |
QTrkComputedConfig | cfg |
The settings used by this instance of QueuedTracker. More... | |
Protected Member Functions | |
dim3 | blocks () |
Calculate the number of thread blocks of size numThreads needed to have 1 thread per job. More... | |
dim3 | blocks (int workItems) |
Calculate the number of thread blocks of size numThreads needed to have 1 thread per workItems. More... | |
dim3 | threads () |
Get the CUDA native datatype with the threadblock dimensions to use. More... | |
void | SchedulingThreadMain () |
Loop executed by the scheduling thread which executes threads when needed. More... | |
template<typename TImageSampler > | |
void | ExecuteBatch (Stream *s) |
Execute the queued batch on a stream. More... | |
Stream * | GetReadyStream () |
Get a stream that is not currently executing, and still has room for images. More... | |
void | InitializeDeviceList () |
Build the list of devices to be used based on the QTrkSettings::cuda_device flag. More... | |
Stream * | CreateStream (Device *device, int streamIndex) |
Initialize a stream instance. More... | |
void | CopyStreamResults (Stream *s) |
Copy localization results from device to host memory. Also updates profiling times. More... | |
void | StreamUpdateZLUTSize (Stream *s) |
Update zlut vector dimensions. Use when settings change. More... | |
void | CPU_ApplyOffsetGain (CPUTracker *trk, int beadIndex) |
Use the CPU-based tracker to apply the pixel calibrations. More... | |
Static Protected Member Functions | |
static void | SchedulingThreadEntryPoint (void *param) |
Entry point for thread creation. More... | |
Protected Attributes | |
int | batchSize |
Amount of images to be sent at once per stream. More... | |
int | numThreads |
Number of threads to use in a general thread block. More... | |
std::vector< Stream * > | streams |
Vector of usable streams. More... | |
std::list< LocalizationResult > | results |
List of completed results. More... | |
int | resultCount |
Number of results available. More... | |
LocMode_t | localizeMode |
Flags for localization choices. See LocalizeModeEnum. More... | |
Threads::Mutex | resultMutex |
Mutex for result memory accesses. More... | |
Threads::Mutex | jobQueueMutex |
Mutex for job queue accesses. More... | |
std::vector< Device * > | devices |
Vector of device instances used. More... | |
bool | useTextureCache |
Flag to use texture cache. Default is true. Disable using EnableTextureCache. More... | |
float | gc_offsetFactor |
Factor by which to scale the pixel calibration offset. More... | |
float | gc_gainFactor |
Factor by which to scale the gain. More... | |
std::vector< float > | gc_offset |
Vector with offsets used for pixel correction. More... | |
std::vector< float > | gc_gain |
Vector with gains used for pixel correction. More... | |
Threads::Mutex | gc_mutex |
Mutex for pixel calibration operations. More... | |
uint | zlut_build_flags |
Flags for ZLUT building. Not actually used yet. More... | |
QI | qi |
The QI instance used to perform the 2D localization using the quadrant interpolation algorithm. More... | |
QI | qalign |
Instance of QI used specifically for quadrant alignment. More... | |
cudaDeviceProp | deviceProp |
Variable used to save device properties obtained from the CUDA API. More... | |
Threads::Handle * | schedulingThread |
Handle to the scheduling thread for later reference. More... | |
Atomic< bool > | quitScheduler |
Thread shutdown flag with built-in thread safety (atomic). More... | |
Protected Attributes inherited from QueuedTracker | |
CImageData * | zlut_bias_correction |
CUDA implementation of the QueuedTracker interface.
Exploits the inherent parallel computing power of GPU hardware to increase the speed at which tracking can be performed. Optimizations are listed on a per-function basis. A throughput of almost 40,000 ROIs per second (100x100 pixels each) with 5 QI iterations and Z localization has been achieved on a setup with a GTX 1080 card.
The currently implemented tracking algorithms are COM, Quadrant Interpolation and 2D Gaussian with Max-Likelihood estimation.
It automatically uses all available CUDA devices when QTrkSettings::cuda_device is set to QTrkCUDA_UseAll.
Method:
Thread-Safety:
We assume 2 threads concurrently accessing the tracker functions.
Issues:
Definition at line 137 of file QueuedCUDATracker.h.
typedef Image4DMemory<float> QueuedCUDATracker::ImageLUT |
Datatype to hold image lookup tables.
Definition at line 198 of file QueuedCUDATracker.h.
QueuedCUDATracker::QueuedCUDATracker (const QTrkComputedConfig &cc, int batchSize = -1) |
Initialize a QueuedCUDATracker instance.
[in] | cc | The settings to be used. |
[in] | batchSize | The number of ROIs to be handled in one batch. Default (-1) calculates optimum. See batchSize for optimization info. |
runtime_error | Init error: GPU does not support CUDA capability 2.0 or higher. |
runtime_error | Init error: Failed to create GPU streams. |
We use QTrkComputedConfig::numThreads for the number of CUDA streams. The default (numThreads < 1) is 4 streams per CUDA device.
The number of threads per thread block is a single warp, so each block can be executed concurrently on a single SM. This assumes no code branching, which should hold for kernels that do nothing but the same calculations on different locations in memory.
Calculate the maximum batchSize based on available texture memory. The factor 0.99 is empirical; using 100% of the maximum does not work for some reason.
Initialize the QI module to handle the Quadrant Interpolation algorithm.
Pre-calculate the radial sampling grid based on radial and angular step settings from the passed QTrkComputedConfig.
Initialize the available CUDA devices. Copy the sampling table to device memory.
Initialize streams.
Texture cache is enabled by default. Use EnableTextureCache to disable if required.
Start the scheduling thread running SchedulingThreadMain.
Run an empty kernel to start up the driver and get rid of first-run overhead.
Definition at line 103 of file QueuedCUDATracker.cu.
QueuedCUDATracker::~QueuedCUDATracker () |
Delete an instance of a CUDA Tracker.
Definition at line 195 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::BeginLUT (uint flags) |
override, virtual |
Setup to begin building a lookup table.
Sets the flags by which the lookup table is built. Flags are defined in a uint bitmask format as:
Name | Value | Description |
---|---|---|
(none) | 0 | Default, no special flags. |
BUILDLUT_IMAGELUT | 1 (2^0) | Build an image LUT. An image LUT seems to be a work in progress to save ROIs rather than profiles in the LUT. |
BUILDLUT_FOURIER | 2 (2^1) | Build a Fourier LUT. |
BUILDLUT_NORMALIZE | 4 (2^2) | Normalize radial profiles. Irrelevant, since FinalizeLUT always normalizes. |
BUILDLUT_BIASCORRECT | 8 (2^3) | Enable bias correction. Currently only partly implemented, don't use. |
[in] | flags | UINT interpreted as a series of bits to set settings. |
Implements QueuedTracker.
Definition at line 442 of file QueuedCUDATracker.cu.
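The flag table for BeginLUT can be illustrated with a small bitmask sketch. The constants below copy the values from the table but live in a hypothetical `sketch` namespace, and `hasFlag` is an illustrative helper, not part of the QTrk API.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Values taken from the BeginLUT flag table; the namespace is hypothetical.
constexpr std::uint32_t BUILDLUT_IMAGELUT    = 1u << 0; // 1
constexpr std::uint32_t BUILDLUT_FOURIER     = 1u << 1; // 2
constexpr std::uint32_t BUILDLUT_NORMALIZE   = 1u << 2; // 4
constexpr std::uint32_t BUILDLUT_BIASCORRECT = 1u << 3; // 8

// Illustrative helper: true if every bit of `flag` is set in `flags`.
inline bool hasFlag(std::uint32_t flags, std::uint32_t flag) {
    assert(flag != 0 && "test a specific flag bit, not the default value 0");
    return (flags & flag) == flag;
}

} // namespace sketch
```

Flags are combined with bitwise OR, so `sketch::BUILDLUT_IMAGELUT | sketch::BUILDLUT_FOURIER` would be passed as the single `flags` argument.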
dim3 QueuedCUDATracker::blocks () |
inline, protected |
Calculate the number of thread blocks of size numThreads needed to have 1 thread per job.
Use in conjunction with threads as:
Definition at line 376 of file QueuedCUDATracker.h.
dim3 QueuedCUDATracker::blocks (int workItems) |
inline, protected |
Calculate the number of thread blocks of size numThreads needed to have 1 thread per workItems.
Use in conjunction with threads as:
[in] | workItems | Number of total threads needed. |
Definition at line 388 of file QueuedCUDATracker.h.
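The block count described for blocks() is a rounded-up division of the work items by the block size. The following is a minimal sketch of that arithmetic, not the library's code: `Dim3Sketch` stands in for CUDA's `dim3` so the example compiles without the CUDA toolkit, and `blocksFor` and `threadsFor` are hypothetical names.

```cpp
#include <cassert>

// Stand-in for CUDA's dim3 (assumption: only the x dimension is used).
struct Dim3Sketch {
    unsigned x, y, z;
};

// Number of blocks of `numThreads` threads needed to cover `workItems`
// with one thread per item (ceiling division).
inline Dim3Sketch blocksFor(int workItems, int numThreads) {
    assert(workItems >= 0 && numThreads > 0);
    const unsigned n =
        static_cast<unsigned>((workItems + numThreads - 1) / numThreads);
    return Dim3Sketch{n, 1u, 1u};
}

// Thread block dimensions: a one-dimensional block of `numThreads` threads.
inline Dim3Sketch threadsFor(int numThreads) {
    assert(numThreads > 0);
    return Dim3Sketch{static_cast<unsigned>(numThreads), 1u, 1u};
}
```

In a CUDA launch the two values would typically fill the first two slots of the execution configuration. Because the last block can contain more threads than remaining work items, a kernel launched this way usually needs a bounds check on its global index.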
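void QueuedCUDATracker::BuildLUT (void *data, int pitch, QTRK_PixelDataType pdt, int plane, vector2f *known_pos=0) |
override, virtual |
Add a new lookup table plane.
Takes a stack of ROI images through data. Determines and adds the profile for each ROI to its respective LUT.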
[in] | data | Pointer to the start of an image stack. |
[in] | pitch | Width of the data in memory so offsets can be calculated. |
[in] | pdt | Pixel data type for the data. See QTRK_PixelDataType. |
[in] | plane | The plane number the ROIs are taken for. |
[in] | known_pos | Center position from which to start making the radial profile. A standard QI localization with applied settings is performed if 0 (default). |
Implements QueuedTracker.
Definition at line 447 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::ClearResults () |
override, virtual |
Clear results.
Implements QueuedTracker.
Definition at line 861 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::CopyStreamResults (Stream *s) |
protected |
Copy localization results from device to host memory. Also updates profiling times.
[in] | s | The stream from which to copy results. |
Definition at line 674 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::CPU_ApplyOffsetGain (CPUTracker *trk, int beadIndex) |
protected |
Use the CPU-based tracker to apply the pixel calibrations.
This is only used for LUT building, which still runs entirely on the CPU because speed is not a limiting factor there.
[in] | trk | Instance of CPUTracker to use |
[in] | beadIndex | Number of the ROI to scale |
Definition at line 739 of file QueuedCUDATracker.cu.
Stream * QueuedCUDATracker::CreateStream (Device *device, int streamIndex) |
protected |
Initialize a stream instance.
[in] | device | Pointer to the device instance this stream runs on. |
[in] | streamIndex | The number of this stream. |
Definition at line 309 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::EnableRadialZLUTCompareProfile (bool enabled) |
inline, virtual |
Set a flag to enable saving of error curves.
Errors obtained by comparing a radial profile to a ZLUT will be kept in memory rather than destroyed. Only saves for one localization. The error curve can be retrieved with GetRadialZLUTCompareProfile.
[in] | enabled | Flag (boolean) to save error curves. Default is false. |
Implements QueuedTracker.
Definition at line 179 of file QueuedCUDATracker.h.
void QueuedCUDATracker::EnableTextureCache (bool useTextureCache) |
inline |
Enable or disable the use of textures.
Texture cache provides a significant speedup due to 2D L1 caching as compared to normal linear memory caching (List of CUDA references).
[in] | useTextureCache | Boolean to enable texture cache usage. |
Definition at line 158 of file QueuedCUDATracker.h.
template<typename TImageSampler > void QueuedCUDATracker::ExecuteBatch (Stream *s) |
protected |
Execute the queued batch on a stream.
This entails the following steps:
[in] | s | Pointer to the stream to execute. |
Definition at line 560 of file QueuedCUDATracker.cu.
int QueuedCUDATracker::FetchResults (LocalizationResult *results, int maxResults) |
override, virtual |
Fetch available results.
[in] | results | Array of pre-allocated LocalizationResult to which to add the results. |
[in] | maxResults | Maximum number of results to fetch. Corresponds to the maximum size of results. |
Implements QueuedTracker.
Definition at line 705 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::FinalizeLUT () |
override, virtual |
Finalize the lookup tables in memory.
Normalizes the profiles for radial lookup tables, and calculates derivatives and adds boundary conditions for image LUTs.
Implements QueuedTracker.
Definition at line 494 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::Flush () |
override, virtual |
Stop waiting for more jobs to do, and just process the current batch.
Implements QueuedTracker.
Definition at line 519 of file QueuedCUDATracker.cu.
ConfigValueMap QueuedCUDATracker::GetConfigValues () |
override, virtual |
Get the used additional configurations.
Implements QueuedTracker.
Definition at line 881 of file QueuedCUDATracker.cu.
std::string QueuedCUDATracker::GetProfileReport () |
override, virtual |
Get the output of performance profiling.
Reimplemented from QueuedTracker.
Definition at line 869 of file QueuedCUDATracker.cu.
int QueuedCUDATracker::GetQueueLength (int *maxQueueLen) |
override, virtual |
Get the lengths of the queue of jobs to be handled.
[out] | maxQueueLen | Pre-allocated integer that returns the maximum size of the queue if nonzero. |
Implements QueuedTracker.
Definition at line 379 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::GetRadialZLUT (float *data) |
override, virtual |
Get the radial lookup tables used for z tracking.
[out] | data | Pointer to the pre-allocated memory in which to save the data. |
Implements QueuedTracker.
Definition at line 834 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::GetRadialZLUTCompareProfile (float *dst) |
inline, virtual |
Get saved error curve.
See EnableRadialZLUTCompareProfile.
[out] | dst | Pointer to the pre-allocated memory in which to save the error curve. Size is count * planes. |
Implements QueuedTracker.
Definition at line 180 of file QueuedCUDATracker.h.
void QueuedCUDATracker::GetRadialZLUTSize (int &count, int &planes, int &radialSteps) |
override, virtual |
Get the dimensions of the radial lookup table data.
[out] | count | Reference to pre-allocated int. Returns number of lookup tables. |
[out] | planes | Reference to pre-allocated int. Returns number of planes per lookup table. |
[out] | radialSteps | Reference to pre-allocated int. Returns number of steps per plane. |
Implements QueuedTracker.
Definition at line 846 of file QueuedCUDATracker.cu.
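The three dimensions returned by GetRadialZLUTSize determine how large a buffer passed to GetRadialZLUT has to be. The sketch below shows that size calculation and one possible flat indexing scheme. The memory order (LUT, then plane, then radial step) is an assumption made for illustration and is not stated in this documentation; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Total number of floats needed to hold all radial lookup tables.
inline std::size_t radialZLUTElementCount(int count, int planes, int radialSteps) {
    assert(count >= 0 && planes >= 0 && radialSteps >= 0);
    return static_cast<std::size_t>(count) * planes * radialSteps;
}

// Flat index of one sample, ASSUMING LUT-major, then plane, then radial step.
inline std::size_t radialZLUTIndex(int lut, int plane, int step,
                                   int planes, int radialSteps) {
    assert(lut >= 0 && plane >= 0 && plane < planes);
    assert(step >= 0 && step < radialSteps);
    return (static_cast<std::size_t>(lut) * planes + plane) * radialSteps + step;
}
```

A caller would allocate `radialZLUTElementCount(count, planes, radialSteps)` floats before requesting the LUT data.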
Stream * QueuedCUDATracker::GetReadyStream () |
protected |
Get a stream that is not currently executing, and still has room for images.
Definition at line 349 of file QueuedCUDATracker.cu.
int QueuedCUDATracker::GetResultCount () |
override, virtual |
Get the number of finished localization jobs (=results) available in memory.
Implements QueuedTracker.
Definition at line 853 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::InitializeDeviceList () |
protected |
Build the list of devices to be used based on the QTrkSettings::cuda_device flag.
Definition at line 76 of file QueuedCUDATracker.cu.
bool QueuedCUDATracker::IsIdle () |
override, virtual |
Test to see if the tracker is idle.
That is, GetQueueLength == 0.
Implements QueuedTracker.
Definition at line 373 of file QueuedCUDATracker.cu.
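void QueuedCUDATracker::ScheduleLocalization (void *data, int pitch, QTRK_PixelDataType pdt, const LocalizationJob *jobInfo) |
override, virtual |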
Add a job to the queue to be processed. A job entails running the required algorithms on a single region of interest.
If a localization cannot be added to the queue because it is full, the calling thread is put to sleep and retries periodically.
[in] | data | Pointer to the data. Type specified by pdt. |
[in] | pitch | Distance in bytes between two successive rows of pixels (e.g. address of (0,1) - address of (0,0)). |
[in] | pdt | Type of data, specified by QTRK_PixelDataType. |
[in] | jobInfo | Structure with metadata for the ROI to be handled. See LocalizationJob. |
Implements QueuedTracker.
Definition at line 405 of file QueuedCUDATracker.cu.
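Because pitch is given in bytes rather than pixels, row offsets have to be computed on a byte pointer before the pixel type is applied. The sketch below illustrates this addressing; `rowPointer` and `pixelAt` are hypothetical helpers and not part of the QTrk API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Start of row `y`, where `pitch` is the distance in bytes between rows.
inline const std::uint8_t* rowPointer(const void* data, int pitch, int y) {
    assert(data != nullptr && pitch > 0 && y >= 0);
    return static_cast<const std::uint8_t*>(data) +
           static_cast<std::size_t>(y) * static_cast<std::size_t>(pitch);
}

// Read pixel (x, y) of type TPixel from a pitched image buffer.
// memcpy avoids alignment problems when rows are padded to odd sizes.
template <typename TPixel>
inline TPixel pixelAt(const void* data, int pitch, int x, int y) {
    assert(x >= 0);
    TPixel value;
    std::memcpy(&value,
                rowPointer(data, pitch, y) + static_cast<std::size_t>(x) * sizeof(TPixel),
                sizeof(TPixel));
    return value;
}
```

For a tightly packed image the pitch equals the width times the pixel size in bytes; for a region of interest cut from a larger frame it is the pitch of the full frame.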
void QueuedCUDATracker::SchedulingThreadEntryPoint (void *param) |
static, protected |
Entry point for thread creation.
Definition at line 212 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::SchedulingThreadMain () |
protected |
Loop executed by the scheduling thread which executes threads when needed.
Definition at line 217 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::SetConfigValue (std::string name, std::string value) |
override, virtual |
Set an additional setting.
[in] | name | Name of the setting. |
[in] | value | Value of the setting. |
Implements QueuedTracker.
Definition at line 888 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::SetLocalizationMode (LocMode_t locType) |
override, virtual |
Select which algorithm is to be used.
[in] | locType | An integer used as a bitmask for settings based on LocalizeModeEnum. |
Implements QueuedTracker.
Definition at line 395 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::SetPixelCalibrationFactors (float offsetFactor, float gainFactor) |
override, virtual |
Set the pixel calibration factors.
The factors can be used to increase or decrease the effects of the images supplied through SetPixelCalibrationImages for further fine-tuning. These only have an effect when an image is actually set through that function.
[in] | offsetFactor | Factor by which to scale the offset values. |
[in] | gainFactor | Factor by which to scale the gain values. |
Implements QueuedTracker.
Definition at line 753 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::SetPixelCalibrationImages (float *offset, float *gain) |
override, virtual |
Set the pixel calibration images.
These images are used to scale the input image to get rid of background influences in the image. The values are per-pixel-per-ROI. Result = gain*(pixel+offset).
[in] | offset | Array with the offset values to use per pixel. Size and order is [width*height*numbeads]. |
[in] | gain | Array with gain values to use per pixel. Size and order is [width*height*numbeads]. |
Implements QueuedTracker.
Definition at line 718 of file QueuedCUDATracker.cu.
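The per-pixel correction Result = gain*(pixel+offset) can be sketched on the CPU as follows. This is an illustration only: `applyOffsetGain` is a hypothetical name, the calibration arrays are assumed to be ordered bead by bead with row-major pixels inside each bead, and the scaling factors from SetPixelCalibrationFactors are left out because this documentation does not say how they enter the formula.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Apply result = gain * (pixel + offset) in place to one ROI.
// `offset` and `gain` hold width*height values per bead (assumed bead-major).
inline void applyOffsetGain(std::vector<float>& roi,
                            const std::vector<float>& offset,
                            const std::vector<float>& gain,
                            int width, int height, int beadIndex) {
    assert(width > 0 && height > 0 && beadIndex >= 0);
    const std::size_t n = static_cast<std::size_t>(width) * height;
    const std::size_t base = static_cast<std::size_t>(beadIndex) * n;
    assert(roi.size() == n);
    assert(offset.size() >= base + n && gain.size() >= base + n);

    for (std::size_t i = 0; i < n; ++i) {
        roi[i] = gain[base + i] * (roi[i] + offset[base + i]);
    }
}
```

With an offset of 1 and a gain of 2 for every pixel of a bead, a pixel value of 3 becomes 2 * (3 + 1) = 8.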
void QueuedCUDATracker::SetRadialWeights (float *zcmp) |
override, virtual |
Set radial weights used for comparing LUT profiles.
[in] | zcmp | Array of radial weights to use. zcmp has to have zlut_radialsteps elements. |
Implements QueuedTracker.
Definition at line 798 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::SetRadialZLUT (float *data, int numLUTs, int planes) |
override, virtual |
Set the radial lookup tables to be used for z tracking.
data can be zero to allocate ZLUT data. LUTs should have been created before by BuildLUT, but not necessarily by the current instance, as long as the settings match.
[in] | data | Pointer to the start of the ZLUT data. |
[in] | numLUTs | Number of ZLUTs in the dataset. |
[in] | planes | Number of planes per ZLUT. |
Implements QueuedTracker.
Definition at line 787 of file QueuedCUDATracker.cu.
void QueuedCUDATracker::StreamUpdateZLUTSize (Stream *s) |
protected |
Update zlut vector dimensions. Use when settings change.
[in] | s | The stream for which to update. |
Definition at line 805 of file QueuedCUDATracker.cu.
dim3 QueuedCUDATracker::threads () |
inline, protected |
Get the CUDA native datatype with the threadblock dimensions to use.
Use in conjunction with blocks as:
Definition at line 399 of file QueuedCUDATracker.h.
int QueuedCUDATracker::batchesDone |
Number of fully completed batches.
Definition at line 487 of file QueuedCUDATracker.h.
int QueuedCUDATracker::batchSize |
protected |
Amount of images to be sent at once per stream.
Higher batchsize = higher speeds. Reason why it's faster:
Definition at line 359 of file QueuedCUDATracker.h.
KernelProfileTime QueuedCUDATracker::cpu_time |
Definition at line 486 of file QueuedCUDATracker.h.
cudaDeviceProp QueuedCUDATracker::deviceProp |
protected |
Variable used to save device properties obtained from the CUDA API.
Definition at line 418 of file QueuedCUDATracker.h.
std::string QueuedCUDATracker::deviceReport |
String holding a human-readable description of used GPUs. Filled during InitializeDeviceList.
Definition at line 489 of file QueuedCUDATracker.h.
std::vector< Device * > QueuedCUDATracker::devices |
protected |
Vector of device instances used.
Definition at line 407 of file QueuedCUDATracker.h.
std::vector< float > QueuedCUDATracker::gc_gain |
protected |
Vector with gains used for pixel correction.
Definition at line 412 of file QueuedCUDATracker.h.
float QueuedCUDATracker::gc_gainFactor |
protected |
Factor by which to scale the gain.
Definition at line 410 of file QueuedCUDATracker.h.
Threads::Mutex QueuedCUDATracker::gc_mutex |
protected |
Mutex for pixel calibration operations.
Definition at line 413 of file QueuedCUDATracker.h.
std::vector< float > QueuedCUDATracker::gc_offset |
protected |
Vector with offsets used for pixel correction.
Definition at line 411 of file QueuedCUDATracker.h.
float QueuedCUDATracker::gc_offsetFactor |
protected |
Factor by which to scale the pixel calibration offset.
Definition at line 409 of file QueuedCUDATracker.h.
Threads::Mutex QueuedCUDATracker::jobQueueMutex |
protected |
Mutex for job queue accesses.
Definition at line 406 of file QueuedCUDATracker.h.
LocMode_t QueuedCUDATracker::localizeMode |
protected |
Flags for localization choices. See LocalizeModeEnum.
Definition at line 404 of file QueuedCUDATracker.h.
int QueuedCUDATracker::numThreads |
protected |
Number of threads to use in a general thread block.
Used by the blocks and threads functions to quickly calculate kernel execution dimensions that cover the parameter space.
Definition at line 365 of file QueuedCUDATracker.h.
QI QueuedCUDATracker::qalign |
protected |
Instance of QI used specifically for quadrant alignment.
Definition at line 417 of file QueuedCUDATracker.h.
QI QueuedCUDATracker::qi |
protected |
The QI instance used to perform the 2D localization using the quadrant interpolation algorithm.
Definition at line 416 of file QueuedCUDATracker.h.
Atomic< bool > QueuedCUDATracker::quitScheduler |
protected |
Thread shutdown flag with built-in thread safety (atomic).
Definition at line 421 of file QueuedCUDATracker.h.
int QueuedCUDATracker::resultCount |
protected |
Number of results available.
Definition at line 403 of file QueuedCUDATracker.h.
Threads::Mutex QueuedCUDATracker::resultMutex |
protected |
Mutex for result memory accesses.
Definition at line 405 of file QueuedCUDATracker.h.
std::list< LocalizationResult > QueuedCUDATracker::results |
protected |
List of completed results.
Definition at line 402 of file QueuedCUDATracker.h.
Threads::Handle * QueuedCUDATracker::schedulingThread |
protected |
Handle to the scheduling thread for later reference.
Definition at line 420 of file QueuedCUDATracker.h.
std::vector< Stream * > QueuedCUDATracker::streams |
protected |
Vector of usable streams.
Definition at line 401 of file QueuedCUDATracker.h.
KernelProfileTime QueuedCUDATracker::time |
Definition at line 486 of file QueuedCUDATracker.h.
bool QueuedCUDATracker::useTextureCache |
protected |
Flag to use texture cache. Default is true. Disable using EnableTextureCache.
Definition at line 408 of file QueuedCUDATracker.h.
uint QueuedCUDATracker::zlut_build_flags |
protected |
Flags for ZLUT building. Not actually used yet.
Definition at line 414 of file QueuedCUDATracker.h.