CUDA implementation of the QueuedTracker interface. More...

#include <QueuedCUDATracker.h>

Inheritance diagram for QueuedCUDATracker:

Classes
struct	Device
	Structure to maintain data for each GPU. More...

struct	KernelProfileTime
	Structure used to hold profiling data. More...

struct	Stream
	Structure to maintain data for each stream. Shell around CUDA native streams. More...

Public Types
typedef Image4DMemory< float >	ImageLUT
	Datatype to hold image lookup tables. More...

Public Types inherited from QueuedTracker
typedef std::map< std::string, std::string >	ConfigValueMap
	Datastructure used to return additional settings in a string-string key-value pair mapping. More...

Public Member Functions
	QueuedCUDATracker (const QTrkComputedConfig &cc, int batchSize=-1)
	Initialize a QueuedCUDATracker instance. More...

	~QueuedCUDATracker ()
	Delete an instance of a CUDA Tracker. More...

void	EnableTextureCache (bool useTextureCache)
	Enable or disable the use of textures. More...

void	SetLocalizationMode (LocMode_t locType) override
	Select which algorithm is to be used. More...

void	ScheduleLocalization (void *data, int pitch, QTRK_PixelDataType pdt, const LocalizationJob *jobInfo) override
	Add a job to the queue to be processed. A job entails running the required algorithms on a single region of interest. More...

void	ClearResults () override
	Clear results. More...

void	SetRadialZLUT (float *data, int numLUTs, int planes) override
	Set the radial lookup tables to be used for z tracking. More...

void	SetRadialWeights (float *zcmp) override
	Set radial weights used for comparing LUT profiles. More...

void	GetRadialZLUT (float *data) override
	Get the radial lookup tables used for z tracking. More...

void	GetRadialZLUTSize (int &count, int &planes, int &radialSteps) override
	Get the dimensions of the radial lookup table data. More...

int	FetchResults (LocalizationResult *results, int maxResults) override
	Fetch available results. More...

void	BeginLUT (uint flags) override
	Setup to begin building a lookup table. More...

void	BuildLUT (void *data, int pitch, QTRK_PixelDataType pdt, int plane, vector2f *known_pos=0) override
	Add a new lookup table plane. More...

void	FinalizeLUT () override
	Finalize the lookup tables in memory. More...

void	EnableRadialZLUTCompareProfile (bool enabled)
	Set a flag to enable saving of error curves. More...

void	GetRadialZLUTCompareProfile (float *dst)
	Get saved error curve. More...

std::string	GetProfileReport () override
	Get the output of performance profiling. More...

void	Flush () override
	Stop waiting for more jobs to do, and just process the current batch. More...

int	GetQueueLength (int *maxQueueLen) override
	Get the lengths of the queue of jobs to be handled. More...

bool	IsIdle () override
	Test to see if the tracker is idle. More...

int	GetResultCount () override
	Get the number of finished localization jobs (=results) available in memory. More...

void	SetPixelCalibrationImages (float *offset, float *gain) override
	Set the pixel calibration images. More...

void	SetPixelCalibrationFactors (float offsetFactor, float gainFactor) override
	Set the pixel calibration factors. More...

ConfigValueMap	GetConfigValues () override
	Get the used additional configurations. More...

void	SetConfigValue (std::string name, std::string value) override
	Set an additional setting. More...

Public Member Functions inherited from QueuedTracker
	QueuedTracker ()

virtual	~QueuedTracker ()

void	ScheduleImageData (ImageData *data, const LocalizationJob *jobInfo)
	Quick function to schedule a single ROI from an ImageData object. More...

virtual int	ScheduleFrame (void *imgptr, int pitch, int width, int height, ROIPosition *positions, int numROI, QTRK_PixelDataType pdt, const LocalizationJob *jobInfo)
	Schedule an entire frame at once, allowing for further optimizations. More...

virtual void	GetImageZLUTSize (int *dims)
	Get the dimensions of the image lookup table data. More...

virtual void	GetImageZLUT (float *dst)
	Get the image lookup tables used. More...

virtual bool	SetImageZLUT (float *src, float *radial_zlut, int *dims)
	Set the image lookup tables to be used for z tracking. More...

virtual std::string	GetWarnings ()
	Get a report of encountered errors. More...

virtual bool	GetDebugImage (int ID, int w, int h, float **pData)
	Get the debug image for a specific thread. More...

ImageData	DebugImage (int ID)
	Get the debug image as an ImageData object. More...

void	ScheduleLocalization (uchar *data, int pitch, QTRK_PixelDataType pdt, uint frame, uint timestamp, vector3f *initial, uint zlutIndex)
	Add an image to the queue to be processed. Creates a job. More...

void	ComputeZBiasCorrection (int bias_planes, CImageData *result, int smpPerPixel, bool useSplineInterp)

float	ZLUTBiasCorrection (float z, int zlut_planes, int bead)

void	SetZLUTBiasCorrection (const CImageData &data)

CImageData *	GetZLUTBiasCorrection ()

Public Attributes
KernelProfileTime	time

KernelProfileTime	cpu_time

int	batchesDone
	Number of fully completed batches. More...

std::string	deviceReport
	String holding a human-readable description of used GPUs. Filled during InitializeDeviceList. More...

Public Attributes inherited from QueuedTracker
QTrkComputedConfig	cfg
	The settings used by this instance of QueuedTracker. More...

Protected Member Functions
dim3	blocks ()
	Calculate the number of thread blocks of size numThreads needed to have 1 thread per job. More...

dim3	blocks (int workItems)
	Calculate the number of thread blocks of size numThreads needed to have 1 thread per `workItems`. More...

dim3	threads ()
	Get the CUDA native datatype with the threadblock dimensions to use. More...

void	SchedulingThreadMain ()
	Loop executed by the scheduling thread, which executes batches when needed. More...

template<typename TImageSampler >
void	ExecuteBatch (Stream *s)
	Execute the queued batch on a stream. More...

Stream *	GetReadyStream ()
	Get a stream that is not currently executing, and still has room for images. More...

void	InitializeDeviceList ()
	Build the list of devices to be used based on the QTrkSettings::cuda_device flag. More...

Stream *	CreateStream (Device *device, int streamIndex)
	Initialize a stream instance. More...

void	CopyStreamResults (Stream *s)
	Copy localization results from device to host memory. Also updates profiling times. More...

void	StreamUpdateZLUTSize (Stream *s)
	Update zlut vector dimensions. Use when settings change. More...

void	CPU_ApplyOffsetGain (CPUTracker *trk, int beadIndex)
	Use the CPU-based tracker to apply the pixel calibrations. More...

Static Protected Member Functions
static void	SchedulingThreadEntryPoint (void *param)
	Entry point for thread creation. More...

Protected Attributes
int	batchSize
	Amount of images to be sent at once per stream. More...

int	numThreads
	Number of threads to use in a general thread block. More...

std::vector< Stream * >	streams
	Vector of usable streams. More...

std::list< LocalizationResult >	results
	List of completed results. More...

int	resultCount
	Number of results available. More...

LocMode_t	localizeMode
	Flags for localization choices. See LocalizeModeEnum. More...

Threads::Mutex	resultMutex
	Mutex for result memory accesses. More...

Threads::Mutex	jobQueueMutex
	Mutex for job queue accesses. More...

std::vector< Device * >	devices
	Vector of device instances used. More...

bool	useTextureCache
	Flag to use texture cache. Default is true. Disable using EnableTextureCache. More...

float	gc_offsetFactor
	Factor by which to scale the pixel calibration offset. More...

float	gc_gainFactor
	Factor by which to scale the gain. More...

std::vector< float >	gc_offset
	Vector with offsets used for pixel correction. More...

std::vector< float >	gc_gain
	Vector with gains used for pixel correction. More...

Threads::Mutex	gc_mutex
	Mutex for pixel calibration operations. More...

uint	zlut_build_flags
	Flags for ZLUT building. Not actually used yet. More...

QI	qi
	The QI instance used to perform the 2D localization using the quadrant interpolation algorithm. More...

QI	qalign
	Instance of QI used specifically for quadrant alignment. More...

cudaDeviceProp	deviceProp
	Variable used to save device properties obtained from the CUDA API. More...

Threads::Handle *	schedulingThread
	Handle to the scheduling thread for later reference. More...

Atomic< bool >	quitScheduler
	Thread shutdown flag with built-in thread safety (atomic). More...

Protected Attributes inherited from QueuedTracker
CImageData *	zlut_bias_correction

Detailed Description

CUDA implementation of the QueuedTracker interface.

Exploits the inherent parallel computing power of GPU hardware to increase the speed at which tracking can be performed. Optimizations will be listed on a per-function basis. Speeds of almost 40000 100x100 ROI/s with 5 QI iterations and Z localization have been achieved on a setup with a GTX1080 card.

The currently implemented tracking algorithms are COM, Quadrant Interpolation and 2D Gaussian with Max-Likelihood estimation.

It will automatically use all available CUDA devices if using the QTrkCUDA_UseAll value for QTrkSettings::cuda_device.

Method:

ScheduleLocalization - Load images into host-side image buffer
SchedulingThreadMain - Scheduling thread executes any batch that is filled
ExecuteBatch - Running batch
1. Async copy host-side buffer to device
2. Bind image to texture memory
3. Run COM kernel
4. QI loop:
  1. Run QI kernels: Sample from texture into quadrant profiles
  2. Run CUFFT. Each iteration per axis does 2x forward FFT, and 1x backward FFT.
  3. Run QI kernel: Compute positions
5. Compute ZLUT profiles
6. Depending on localize flags:
  1. copy ZLUT profiles (for ComputeBuildZLUT flag)
  2. generate compare profile kernel + compute Z kernel (for ComputeZ flag)
7. Unbind image from texture memory
8. Async copy results to host

Thread-Safety:

We assume 2 threads concurrently accessing the tracker functions.

Queueing thread: ScheduleLocalization, SetRadialZLUT, GetRadialZLUT, Flush, IsIdle
Fetching thread: FetchResults, GetResultCount, ClearResults
Mutexes:
1. jobQueueMutex - controls access to state and jobs. Used by ScheduleLocalization, the scheduler thread, and GetQueueLength
2. resultMutex - controls access to the results list. Locked by the scheduler whenever results become available, and by calling threads when they run FetchResults or GetResultCount
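
The locking discipline described above can be sketched in plain C++. This is an illustration only: `MiniTracker`, its methods and the `int` payloads are hypothetical stand-ins, and `std::mutex` is used in place of the library's `Threads::Mutex`.

```cpp
#include <list>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the tracker; not the library's class.
class MiniTracker {
public:
    // Queueing thread: only touches the job queue, guarded by jobQueueMutex.
    void Schedule(int job) {
        std::lock_guard<std::mutex> lock(jobQueueMutex);
        jobs.push_back(job);
    }

    // Scheduler thread: takes the queued jobs under jobQueueMutex, then
    // publishes results under resultMutex. The two locks are never held together.
    void ProcessBatch() {
        std::vector<int> batch;
        {
            std::lock_guard<std::mutex> lock(jobQueueMutex);
            batch.swap(jobs);
        }
        std::lock_guard<std::mutex> lock(resultMutex);
        for (int job : batch)
            results.push_back(job * 2); // placeholder for a localization
    }

    // Fetching thread: only touches the result list, guarded by resultMutex.
    int Fetch(int* dst, int maxResults) {
        std::lock_guard<std::mutex> lock(resultMutex);
        int n = 0;
        while (n < maxResults && !results.empty()) {
            dst[n++] = results.front();
            results.pop_front();
        }
        return n;
    }

private:
    std::mutex jobQueueMutex, resultMutex;
    std::vector<int> jobs;
    std::list<int> results;
};
```

Because each caller thread needs only one of the two mutexes, the queueing and fetching threads do not block each other in this sketch; only the scheduler touches both, one after the other.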

Issues:

Due to FPU operations on texture coordinates, there are small numerical differences between localizations of the same image at different positions in the batch.

Definition at line 137 of file QueuedCUDATracker.h.

Member Typedef Documentation

§ ImageLUT

typedef Image4DMemory<float> QueuedCUDATracker::ImageLUT

Datatype to hold image lookup tables.

Definition at line 198 of file QueuedCUDATracker.h.

Constructor & Destructor Documentation

§ QueuedCUDATracker()

QueuedCUDATracker::QueuedCUDATracker	(	const QTrkComputedConfig &	cc,
		int	batchSize = `-1`
	)

Initialize a QueuedCUDATracker instance.

Parameters

[in]	cc	The settings to be used.
[in]	batchSize	The number of ROIs to be handled in one batch. Default (-1) calculates optimum. See batchSize for optimization info.

Exceptions

runtime_error	Init error: GPU does not support CUDA capability 2.0 or higher.
runtime_error	Init error: Failed to create GPU streams.

We use QTrkComputedConfig::numThreads for the number of CUDA streams. Default (numThreads < 1) is 4 streams per CUDA device.
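
A minimal sketch of this default, combined with the round-robin assignment of streams to devices seen in the constructor listing (`devices[i%devices.size()]`). The function name `assignStreams` is hypothetical and the sketch works on plain indices instead of `Device` and `Stream` objects.

```cpp
#include <cstddef>
#include <vector>

// Returns, for each stream index, the index of the device it is assigned to.
// A requested count below 1 becomes 4 streams per device, and streams are
// dealt out to devices round-robin.
std::vector<std::size_t> assignStreams(int requestedStreams, std::size_t numDevices) {
    std::vector<std::size_t> deviceOfStream;
    if (numDevices == 0) return deviceOfStream; // nothing to assign to
    std::size_t numStreams = requestedStreams < 1
        ? numDevices * 4
        : static_cast<std::size_t>(requestedStreams);
    deviceOfStream.reserve(numStreams);
    for (std::size_t i = 0; i < numStreams; i++)
        deviceOfStream.push_back(i % numDevices);
    return deviceOfStream;
}
```

With two devices and no explicit stream count, this gives eight streams alternating between device 0 and device 1.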

The number of threads per thread block is a single warp, so each block can be executed concurrently on a single SM. This assumes no code branching, which should hold for kernels that do nothing but the same calculations on different locations in memory.

Calculate the maximum batchSize from the maximum 2D texture height reported by the device (deviceProp.maxTexture2D[1]) divided by the ROI height. The factor 0.99 is empirical; using 100% of the maximum does not work for some reason.
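
As a worked example of the default batch size rule in the constructor listing, the sketch below reproduces the arithmetic in isolation. The function name `defaultBatchSize` is hypothetical, and the numbers are illustrative rather than taken from a particular GPU: a texture height limit of 65536 rows and 100-pixel-high ROIs give 65536/100 = 655 (integer division), and 655 * 0.99 = 648.45, which truncates to 648.

```cpp
// Default batch size rule: integer-divide the maximum 2D texture height by the
// ROI height, scale by the empirical 0.99 factor, and truncate to int.
int defaultBatchSize(int maxTextureHeight, int roiHeight) {
    return static_cast<int>(maxTextureHeight / roiHeight * 0.99f);
}
```

Note that the integer division happens before the scaling, so the result is never larger than the number of whole ROIs that fit in the texture height.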

Initialize the QI module to handle the Quadrant Interpolation algorithm.

Pre-calculate the radial sampling grid based on radial and angular step settings from the passed QTrkComputedConfig.
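
The constructor listing fills this table with one cosine/sine pair per angular step. A standalone sketch of that loop, with a hypothetical `Vec2` struct standing in for CUDA's `float2` and a hypothetical function name `makeAngularGrid`:

```cpp
#include <cmath>
#include <vector>

struct Vec2 { float x, y; }; // hypothetical stand-in for CUDA's float2

// One unit direction vector per angular step, evenly spaced over a full circle.
std::vector<Vec2> makeAngularGrid(int angularSteps) {
    if (angularSteps <= 0) return {};
    std::vector<Vec2> grid(angularSteps);
    for (int i = 0; i < angularSteps; i++) {
        float ang = 2.0f * 3.141593f * i / static_cast<float>(angularSteps);
        grid[i] = Vec2{ std::cos(ang), std::sin(ang) };
    }
    return grid;
}
```

Precomputing the table once means kernels that sample along these directions only need a lookup, not a trigonometric call per sample.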

Initialize the available CUDA devices. Copy the sampling table to device memory.

Todo: Experiment with global values (settings) in global and/or shared device memory.

Initialize streams.

Texture cache is enabled by default. Use EnableTextureCache to disable if required.

Start the scheduling thread running SchedulingThreadMain.

Run an empty kernel to start up the driver and get rid of first-run overhead.

Definition at line 103 of file QueuedCUDATracker.cu.

     : resultMutex("result"), jobQueueMutex("jobqueue")
 {
     cfg = cc;
 
     InitializeDeviceList();
 
     if (cfg.numThreads < 1) {
         cfg.numThreads = devices.size()*4;
     }
     int numStreams = cfg.numThreads;
 
     cudaGetDeviceProperties(&deviceProp, devices[0]->index);
 
     // Commented out because disabling timeout protection causes a bug that disables stream concurrency.
     // Only useful for debugging anyway; no single kernel should take longer than 2 seconds unless breakpointed.
     //if (deviceProp.kernelExecTimeoutEnabled) 
     //  throw std::runtime_error(SPrintf("CUDA Tracker init error: CUDA Kernel execution timeout is enabled for %s. Disable WDDM Time-out Detection and Recovery (TDR) in the nVidia NSight Monitor before running this code", deviceProp.name));
     if (deviceProp.major < 2)
         throw std::runtime_error("CUDA Tracker init error: GPU not supported, capability < 2.0");
 
     numThreads = deviceProp.warpSize;
     
     if(batchSize<0) 
         batchSize = (int)(deviceProp.maxTexture2D[1]/cfg.height * 0.99f); 
     this->batchSize = batchSize;
 
     dbgprintf("CUDA Hardware: %s. \n# of CUDA processors: %d. Using %d streams\n", deviceProp.name, deviceProp.multiProcessorCount, numStreams);
     dbgprintf("Warp size: %d. Max threads: %d, Batch size: %d\n", deviceProp.warpSize, deviceProp.maxThreadsPerBlock, batchSize);
     // dbgprintf("Mem: %u MB. Per block: %u B\n", deviceProp.totalGlobalMem/1024/1024, deviceProp.sharedMemPerBlock); // Total memory available in Mbytes
     
     qi.Init(cfg, batchSize);
 
     std::vector<float2> zlut_radialgrid(cfg.zlut_angularsteps);
     for (int i=0;i<cfg.zlut_angularsteps;i++) {
         float ang = 2*3.141593f*i/(float)cfg.zlut_angularsteps;
         zlut_radialgrid[i]=make_float2(cos(ang),sin(ang));
     }
 
     for (uint i=0;i<devices.size();i++) {
         Device* d = devices[i];
         cudaSetDevice(d->index);
         qi.InitDevice(&d->qi_instance, cfg);
         d->zlut_trigtable = zlut_radialgrid;
     }
     
     //dbgprintf("\n");
     //outputTotalGPUMemUse("pre-streams");
 
     streams.reserve(numStreams);
     try {
         for (int i=0;i<numStreams;i++)
             streams.push_back( CreateStream( devices[i%devices.size()], i ) );
     }
     catch(...) {
         DeleteAllElems(streams);
         throw std::runtime_error("CUDA Tracker init error: Failed to create GPU streams.");
     }
 
     //dbgprintf("\n");
     //streams[0]->OutputMemoryUse();
     //dbgprintf("\n");
     //outputTotalGPUMemUse("post-streams");
     //dbgprintf("\n");
 
     batchesDone = 0;
     useTextureCache = true;
     resultCount = 0;
     zlut_build_flags=0;
 
     quitScheduler = false;
     schedulingThread = Threads::Create(SchedulingThreadEntryPoint, this);
 
     gc_offsetFactor = gc_gainFactor = 1.0f;
     localizeMode = LT_OnlyCOM;
 
     ForceCUDAKernelsToLoad <<< dim3(),dim3() >>> ();
 }

§ ~QueuedCUDATracker()

QueuedCUDATracker::~QueuedCUDATracker ( )

Delete an instance of a CUDA Tracker.

Definition at line 195 of file QueuedCUDATracker.cu.

 {
     quitScheduler = true;
     Threads::WaitAndClose(schedulingThread);
 
     DeleteAllElems(streams);
     DeleteAllElems(devices);
 }

Member Function Documentation

§ BeginLUT()

void QueuedCUDATracker::BeginLUT ( uint flags )

override virtual

Setup to begin building a lookup table.

Sets the flags by which the lookup table is built. Flags are defined in a uint bitmask format as:

Name	Value	Description
	0	Default, no special flags.
BUILDLUT_IMAGELUT	1 (2^0)	Build an image LUT. An image LUT seems to be a work in progress to save ROIs rather than profiles in the LUT.
BUILDLUT_FOURIER	2 (2^1)	Build a Fourier LUT.
BUILDLUT_NORMALIZE	4 (2^2)	Normalize radial profiles. Irrelevant, since FinalizeLUT always normalizes.
BUILDLUT_BIASCORRECT	8 (2^3)	Enable bias correction. Currently only partly implemented, don't use.

Parameters

[in] flags UINT interpreted as a series of bits to set settings.
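
Since the flags form a bitmask, they can be combined with bitwise OR and tested with bitwise AND. The sketch below uses the names and values from the table above, but the definitions are local to this example and are not copied from the library header; `hasFlag` is a hypothetical helper.

```cpp
// Values taken from the table above; local definitions for illustration only.
constexpr unsigned BUILDLUT_IMAGELUT    = 1u << 0; // 1
constexpr unsigned BUILDLUT_FOURIER     = 1u << 1; // 2
constexpr unsigned BUILDLUT_NORMALIZE   = 1u << 2; // 4
constexpr unsigned BUILDLUT_BIASCORRECT = 1u << 3; // 8

// True if every bit in `flag` is set in `flags`.
constexpr bool hasFlag(unsigned flags, unsigned flag) {
    return (flags & flag) == flag;
}
```

For example, `BUILDLUT_FOURIER | BUILDLUT_NORMALIZE` evaluates to 6, and `hasFlag(6u, BUILDLUT_IMAGELUT)` is false.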

Implements QueuedTracker.

Definition at line 442 of file QueuedCUDATracker.cu.

 {
     zlut_build_flags = flags;
 }

§ blocks() [1/2]

dim3 QueuedCUDATracker::blocks ( )

inline protected

Calculate the number of thread blocks of size numThreads needed to have 1 thread per job.

Use in conjunction with threads as:

kernel <<< blocks(), threads() >>> (a, b);

Returns: CUDA 3D data for use in kernel calls.

Definition at line 376 of file QueuedCUDATracker.h.

 { return dim3((batchSize+numThreads-1)/numThreads); }
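
The expression above is an integer ceiling division. A host-only sketch of the same arithmetic, using a hypothetical free function `blocksFor` and a plain `int` instead of `dim3`:

```cpp
// Number of blocks of `threadsPerBlock` threads needed to cover `workItems`
// work items with one thread each (integer ceiling division).
constexpr int blocksFor(int workItems, int threadsPerBlock) {
    return (workItems + threadsPerBlock - 1) / threadsPerBlock;
}
```

With 648 work items and 32 threads per block this gives 21 blocks, i.e. 672 threads, of which the last 24 have no work item. That is presumably why the kernels in the listings on this page receive the job count as an argument.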

QueuedCUDATracker::batchSize

int batchSize

Amount of images to be sent at once per stream.

Definition: QueuedCUDATracker.h:359

QueuedCUDATracker::numThreads

int numThreads

Number of threads to use in a general thread block.

Definition: QueuedCUDATracker.h:365

§ blocks() [2/2]

dim3 QueuedCUDATracker::blocks ( int workItems )

inline protected

Calculate the number of thread blocks of size numThreads needed to have 1 thread per workItems.

Use in conjunction with threads as:

kernel <<< blocks(items), threads() >>> (a, b);

Parameters

[in] workItems Number of total threads needed.

Returns: CUDA 3D data for use in kernel calls.

Definition at line 388 of file QueuedCUDATracker.h.

 { return dim3((workItems+numThreads-1)/numThreads); }

§ BuildLUT()

void QueuedCUDATracker::BuildLUT	(	void *	data,
		int	pitch,
		QTRK_PixelDataType	pdt,
		int	plane,
		vector2f *	known_pos = `0`
	)

override virtual

Add a new lookup table plane.

Takes a stack of ROI images through data. Determines and adds the profile for each ROI to its respective LUT.

Parameters

[in]	data	Pointer to the start of an image stack.
[in]	pitch	Width of one image row in memory, in bytes, so offsets can be calculated.
[in]	pdt	Pixel data type for the data. See QTRK_PixelDataType.
[in]	plane	The plane number the ROIs are taken for.
[in]	known_pos	Center position from which to start making the radial profile. A standard QI localization with applied settings is performed if 0 (default).

Note: Uses the CPU tracker for LUT building!
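
The listing for this method locates ROI number i in the stack as `(uchar*)data + pitch * cfg.height * i`, which treats `pitch` as a byte count per image row. A small sketch of that address computation, with a hypothetical helper name `roiStart`:

```cpp
#include <cstddef>

// Start of ROI `i` in a vertically stacked image buffer.
// `pitch` is the size of one image row in bytes; `roiHeight` is rows per ROI.
inline const void* roiStart(const void* data, int pitch, int roiHeight, int i) {
    return static_cast<const unsigned char*>(data)
         + static_cast<std::size_t>(pitch) * roiHeight * i;
}
```

Working in bytes keeps the computation independent of the pixel data type, so the same offset logic serves 8-bit, 16-bit and float images.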

Implements QueuedTracker.

Definition at line 447 of file QueuedCUDATracker.cu.

 {
     // Copy to image 
     Device* d = streams[0]->device;
     cudaSetDevice(d->index);
 
     int nbeads = d->radial_zlut.count; // jobcount
     std::vector<vector2f> positions(nbeads);
 
     float *profiles = new float [nbeads * cfg.zlut_radialsteps];
     memset(profiles, 0, sizeof(float) * nbeads * cfg.zlut_radialsteps);
 
     // Should be fast enough for zlut building
     parallel_for(nbeads, [&] (int i) {
         CPUTracker trk (cfg.width,cfg.height);
         void *img_data = (uchar*)data + pitch * cfg.height * i;
 
         if (pdt == QTrkFloat)
             trk.SetImage((float*)img_data, pitch);
         else if (pdt == QTrkU8)
             trk.SetImage8Bit((uchar*)img_data, pitch);
         else
             trk.SetImage16Bit((ushort*)img_data,pitch);
 
         CPU_ApplyOffsetGain(&trk, i);
 
         if(known_pos) {
             positions[i] = known_pos[i];
         } else {
             vector2f com = trk.ComputeMeanAndCOM();
             bool bhit;
             positions[i] = trk.ComputeQI(com, cfg.qi_iterations, cfg.qi_radialsteps, cfg.qi_angstepspq, cfg.qi_angstep_factor, cfg.qi_minradius, cfg.qi_maxradius, bhit);
             trk.ComputeRadialProfile(&profiles[i * cfg.zlut_radialsteps], cfg.zlut_radialsteps, cfg.zlut_angularsteps, cfg.zlut_minradius, cfg.zlut_maxradius, positions[i], false, 0, true);
         }
     });
     
     // add to device 0 LUT
     device_vec<float> d_profiles (nbeads * cfg.zlut_radialsteps);
     d_profiles.copyToDevice(profiles, nbeads * cfg.zlut_radialsteps);
     delete[] profiles;
 
     dim3 numThreads(4, 64);
     dim3 numBlocks( (nbeads + numThreads.x - 1) / numThreads.x, (cfg.zlut_radialsteps + numThreads.y - 1) / numThreads.y);
     AddProfilesToZLUT<<< numBlocks, numThreads, 0, streams[0]->stream >>> (d_profiles.data, nbeads, cfg.zlut_radialsteps, plane, d->radial_zlut);
     cudaStreamSynchronize(streams[0]->stream);
 }

§ ClearResults()

void QueuedCUDATracker::ClearResults ( )

override virtual

Clear results.

Implements QueuedTracker.

Definition at line 861 of file QueuedCUDATracker.cu.

 {
     resultMutex.lock();
     results.clear();
     resultCount=0;
     resultMutex.unlock();
 }

§ CopyStreamResults()

void QueuedCUDATracker::CopyStreamResults ( Stream * s )

protected

Copy localization results from device to host memory. Also updates profiling times.

Parameters

[in] s The stream from which to copy results.

Definition at line 674 of file QueuedCUDATracker.cu.

 {
     resultMutex.lock();
     for (int a=0;a<s->JobCount();a++) {
         LocalizationJob& j = s->jobs[a];
         LocalizationResult r;
         r.job = j;
         r.firstGuess =  vector2f( s->com[a].x, s->com[a].y );
         r.pos = vector3f( s->results[a].x , s->results[a].y, s->results[a].z);
         r.imageMean = s->imgMeans[a];
         r.pos.z = ZLUTBiasCorrection(s->results[a].z, devices[0]->radial_zlut.h, j.zlutIndex);
         results.push_back(r);
     }
     resultCount+=s->JobCount();
     resultMutex.unlock();
 
     // Update times
     float qi, com, imagecopy, zcomp, getResults;
     cudaEventElapsedTime(&imagecopy, s->batchStart, s->imageCopyDone);
     cudaEventElapsedTime(&com, s->imageCopyDone, s->comDone);
     cudaEventElapsedTime(&qi, s->comDone, s->qiDone);
     cudaEventElapsedTime(&zcomp, s->qiDone, s->zcomputeDone);
     cudaEventElapsedTime(&getResults, s->zcomputeDone, s->localizationDone);
     time.com += com;
     time.qi += qi;
     time.imageCopy += imagecopy;
     time.zcompute += zcomp;
     time.getResults += getResults;
     batchesDone ++;
 }

§ CPU_ApplyOffsetGain()

void QueuedCUDATracker::CPU_ApplyOffsetGain	(	CPUTracker *	trk,
		int	beadIndex
	)

protected

Use the CPU-based tracker to apply the pixel calibrations.

This is only used for LUT building, which still runs entirely on the CPU because speed is not a limiting factor there.

Parameters

[in]	trk	Instance of CPUTracker to use
[in]	beadIndex	Number of the ROI to scale
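
In the listing for this method, each ROI's calibration data starts at element `cfg.width*cfg.height*beadIndex` of the offset or gain vector, and an empty vector is passed on as a null pointer. A sketch of that lookup, using a hypothetical helper name `calibrationPlane`:

```cpp
#include <cstddef>
#include <vector>

// Pointer to the calibration values for one ROI, or nullptr when no
// calibration image of that kind was set (empty vector).
inline const float* calibrationPlane(const std::vector<float>& calib,
                                     int width, int height, int beadIndex) {
    if (calib.empty()) return nullptr;
    std::size_t index = static_cast<std::size_t>(width) * height * beadIndex;
    return &calib[index];
}
```

The sketch assumes the vector holds one full width x height plane per ROI, as the index expression in the listing implies.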

Definition at line 739 of file QueuedCUDATracker.cu.

 {
     if (!gc_offset.empty() || !gc_gain.empty()) {
         int index = cfg.width*cfg.height*beadIndex;
 
         gc_mutex.lock();
         float gf = gc_gainFactor, of = gc_offsetFactor;
         gc_mutex.unlock();
 
         trk->ApplyOffsetGain(gc_offset.empty() ? 0:  &gc_offset[index] , gc_gain.empty() ? 0: &gc_gain[index], of, gf);
 //      if (j->job.frame%100==0)
     }
 }

§ CreateStream()

QueuedCUDATracker::Stream * QueuedCUDATracker::CreateStream	(	Device *	device,
		int	streamIndex
	)

protected

Initialize a stream instance.

Parameters

[in]	device	Pointer to the device instance this stream runs on.
[in]	streamIndex	The number of this stream.

Definition at line 309 of file QueuedCUDATracker.cu.

 {
     Stream* s = new Stream(streamIndex);
 
     try {
         s->device = device;
         cudaSetDevice(device->index);
         cudaStreamCreate(&s->stream);
 
         s->images = cudaImageListf::alloc(cfg.width, cfg.height, batchSize);
         s->images.allocateHostImageBuffer(s->hostImageBuf);
 
         s->jobs.reserve(batchSize);
         s->results.init(batchSize);
         s->com.init(batchSize);
         s->d_com.init(batchSize);
         s->d_resultpos.init(batchSize);
         s->locParams.init(batchSize);
         s->imgMeans.init(batchSize);
         s->d_imgmeans.init(batchSize);
         s->d_locParams.init(batchSize);
         s->d_radialprofiles.init(cfg.zlut_radialsteps*batchSize);
 
         qi.InitStream(&s->qi_instance, cfg, s->stream, batchSize);
         qalign.InitStream(&s->qalign_instance, cfg, s->stream, batchSize);
 
         cudaEventCreate(&s->localizationDone);
         cudaEventCreate(&s->comDone);
         cudaEventCreate(&s->imageCopyDone);
         cudaEventCreate(&s->zcomputeDone);
         cudaEventCreate(&s->qiDone);
         cudaEventCreate(&s->batchStart);
     } catch (...) {
         delete s;
         throw;
     }
     return s;
 }

§ EnableRadialZLUTCompareProfile()

void QueuedCUDATracker::EnableRadialZLUTCompareProfile ( bool enabled )

inline virtual

Set a flag to enable saving of error curves.

Errors obtained by comparing a radial profile to a ZLUT will be kept in memory rather than destroyed. Only the error curve of one localization is saved. The error curve can be retrieved with GetRadialZLUTCompareProfile.

Note: Not implemented for CUDA.

Parameters

[in] enabled Flag (boolean) to save error curves. Default is false.

Implements QueuedTracker.

Definition at line 179 of file QueuedCUDATracker.h.

 {}

§ EnableTextureCache()

void QueuedCUDATracker::EnableTextureCache ( bool useTextureCache )

inline

Enable or disable the use of textures.

Texture cache provides a significant speedup due to 2D L1 caching as compared to normal linear memory caching (List of CUDA references).

Parameters

[in] useTextureCache Boolean to enable texture cache usage.

Definition at line 158 of file QueuedCUDATracker.h.

 { this->useTextureCache=useTextureCache; }

QueuedCUDATracker::useTextureCache

bool useTextureCache

Flag to use texture cache. Default is true. Disable using EnableTextureCache.

Definition: QueuedCUDATracker.h:408

§ ExecuteBatch()

template<typename TImageSampler >

void QueuedCUDATracker::ExecuteBatch ( Stream * s )

protected

Execute the queued batch on a stream.

This entails the following steps:

1. Async copy host-side buffer to device
2. Bind image to texture memory
3. Run COM kernel
4. QI loop:
  1. Run QI kernels: Sample from texture into quadrant profiles
  2. Run CUFFT. Each iteration per axis does 2x forward FFT, and 1x backward FFT.
  3. Run QI kernel: Compute positions
5. Compute ZLUT profiles
6. Depending on localize flags:
  1. copy ZLUT profiles (for ComputeBuildZLUT flag)
  2. generate compare profile kernel + compute Z kernel (for ComputeZ flag)
7. Unbind image from texture memory
8. Async copy results to host

Parameters

[in] s Pointer to the stream to execute.

Definition at line 560 of file QueuedCUDATracker.cu.

 {
     if (s->JobCount()==0)
         return;
     //dbgprintf("Sending %d images to GPU stream %p...\n", s->jobCount, s->stream);
 
     Device *d = s->device;
     cudaSetDevice(d->index);
 
     BaseKernelParams kp;
     kp.imgmeans = s->d_imgmeans.data;
     kp.images = s->images;
     kp.njobs = s->JobCount();
     kp.locParams = s->d_locParams.data;
 
     cudaEventRecord(s->batchStart, s->stream);
 //  dbgprintf("copying %d jobs to gpu\n", s->JobCount());
 
     s->d_locParams.copyToDevice(s->locParams.data(), s->JobCount(), true, s->stream);
 
     {ScopedCPUProfiler p(&cpu_time.imageCopy);
         s->images.copyToDevice(s->hostImageBuf.data(), true, s->stream); 
     }
 
     if (!d->calib_gain.isEmpty() || !d->calib_offset.isEmpty()) {
         dim3 numThreads(16, 16, 2);
         dim3 numBlocks((cfg.width + numThreads.x - 1 ) / numThreads.x,
                 (cfg.height + numThreads.y - 1) / numThreads.y,
                 (s->JobCount() + numThreads.z - 1) / numThreads.z);
 
         gc_mutex.lock();
         float of = gc_offsetFactor, gf = gc_gainFactor;
         gc_mutex.unlock();
 
         ApplyOffsetGain <<< numBlocks, numThreads, 0, s->stream >>> 
             (kp, s->device->calib_gain, s->device->calib_offset, gf, of);
     }
 
     cudaEventRecord(s->imageCopyDone, s->stream);
 
 
     TImageSampler::BindTexture(s->images);
     { ScopedCPUProfiler p(&cpu_time.com);
         BgCorrectedCOM<TImageSampler> <<< blocks(s->JobCount()), threads(), 0, s->stream >>> 
             (s->JobCount(), s->images, s->d_com.data, cfg.com_bgcorrection, s->d_imgmeans.data);
         checksum(s->d_com.data, 1, s->JobCount(), "com");
     }
     cudaEventRecord(s->comDone, s->stream);
 
     // DbgOutputVectorToFile("D:\\TestImages\\imgmeans.csv", s->d_imgmeans, false);
 
     device_vec<float3> *curpos = &s->d_com;
     if (s->localizeFlags & LT_QI) {
         ScopedCPUProfiler p(&cpu_time.qi);
         qi.Execute <TImageSampler> (kp, cfg, &s->qi_instance, &s->device->qi_instance, &s->d_com, &s->d_resultpos);
         curpos = &s->d_resultpos;
     }
 
     if (s->localizeFlags & LT_Gaussian2D) {
         G2MLE_Compute<TImageSampler> <<< blocks(s->JobCount()), threads(), 0, s->stream >>>
             (kp, cfg.gauss2D_sigma, cfg.gauss2D_iterations, s->d_com.data, s->d_resultpos.data, 0, 0);
         curpos = &s->d_resultpos;
     }
 
     cudaEventRecord(s->qiDone, s->stream);
 
     int numZIterations = (s->localizeFlags & LT_ZLUTAlign) ? 3 : 1;
     for (int i=0;i<numZIterations;i++) {
         {ScopedCPUProfiler p(&cpu_time.zcompute);
         ZLUTParams zp;
         zp.planes = d->radial_zlut.h;
         zp.angularSteps = cfg.zlut_angularsteps;
         zp.maxRadius = cfg.zlut_maxradius;
         zp.minRadius = cfg.zlut_minradius;
         zp.img = d->radial_zlut;
         zp.trigtable = d->zlut_trigtable.data;
         zp.zcmpwindow = d->zcompareWindow.data;
 
         // Compute radial profiles
         if (s->localizeFlags & LT_LocalizeZ) {
             dim3 numThreads(16, 16);
             dim3 numBlocks( (s->JobCount() + numThreads.x - 1) / numThreads.x, 
                     (cfg.zlut_radialsteps + numThreads.y - 1) / numThreads.y);
             ZLUT_RadialProfileKernel<TImageSampler> <<< numBlocks , numThreads, 0, s->stream >>>
                 (s->JobCount(), s->images, zp, curpos->data, s->d_radialprofiles.data, s->d_imgmeans.data);
             ZLUT_NormalizeProfiles<<< blocks(s->JobCount()), threads(), 0, s->stream >>> (s->JobCount(), zp, s->d_radialprofiles.data);
         }
         // Compute Z 
         if (s->localizeFlags & LT_LocalizeZ) {
             dim3 numThreads(8, 16);
             ZLUT_ComputeProfileMatchScores <<< dim3( (s->JobCount() + numThreads.x - 1) / numThreads.x, (zp.planes  + numThreads.y - 1) / numThreads.y), numThreads, 0, s->stream >>> 
                 (s->JobCount(), zp, s->d_radialprofiles.data, s->d_zlutcmpscores.data, s->d_locParams.data);
             ZLUT_ComputeZ <<< blocks(s->JobCount()), threads(), 0, s->stream >>> (s->JobCount(), zp, curpos->data, s->d_zlutcmpscores.data);
         }}
 
         if (i>0) {
             ScopedCPUProfiler p(&cpu_time.zlutAlign);
             qalign.Execute<TImageSampler> (kp, cfg, &s->qalign_instance, &s->device->qalign_instance, &s->d_resultpos, &s->d_resultpos);
         }
     }
 
     TImageSampler::UnbindTexture(s->images);
     cudaEventRecord(s->zcomputeDone, s->stream);
 
     { ScopedCPUProfiler p(&cpu_time.getResults);
         s->d_com.copyToHost(s->com.data(), true, s->stream);
         curpos->copyToHost(s->results.data(), true, s->stream);
         s->d_imgmeans.copyToHost(s->imgMeans.data(), true, s->stream);
     }
 
     // Make sure we can query the all done signal
     cudaEventRecord(s->localizationDone, s->stream);
 }

§ FetchResults()

int QueuedCUDATracker::FetchResults	(	LocalizationResult *	results,
		int	maxResults
	)

override virtual

Fetch available results.

Note: Removes results from internal QueuedTracker memory.

Parameters

[in]	results	Array of pre-allocated LocalizationResult to which to add the results.
[in]	maxResults	Maximum number of results to fetch. Corresponds to the maximum size of the `results` array (named `dstResults` in the definition).

Returns: Number of fetched results.

Implements QueuedTracker.

Definition at line 705 of file QueuedCUDATracker.cu.

 {
     resultMutex.lock();
     int numResults = 0;
     while (numResults < maxResults && !results.empty()) {
         dstResults[numResults++] = results.front();
         results.pop_front();
         resultCount--;
     }
     resultMutex.unlock();
     return numResults;
 }
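
The listing above locks and unlocks `resultMutex` by hand. As an illustration only, the same drain pattern can be sketched with `std::lock_guard`, which releases the mutex on every return path. `Result` and `ResultQueue` are hypothetical stand-ins and not part of the library; the real tracker uses `Threads::Mutex` and `LocalizationResult`.

```cpp
#include <cassert>
#include <list>
#include <mutex>

// Hypothetical stand-in for LocalizationResult; the real struct has other fields.
struct Result { int job; };

// Hypothetical miniature of the tracker's result storage.
struct ResultQueue {
    std::mutex mutex;
    std::list<Result> pending;

    void Push(const Result& r) {
        std::lock_guard<std::mutex> lock(mutex);
        pending.push_back(r);
    }

    // Copies up to maxResults entries into dst and removes them from the queue.
    int Fetch(Result* dst, int maxResults) {
        assert(dst != nullptr || maxResults == 0);
        std::lock_guard<std::mutex> lock(mutex);  // unlocked automatically on return
        int n = 0;
        while (n < maxResults && !pending.empty()) {
            dst[n++] = pending.front();
            pending.pop_front();
        }
        return n;
    }
};
```

A caller would typically size its buffer first and then call `Fetch` repeatedly until it returns 0.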

§ FinalizeLUT()

void QueuedCUDATracker::FinalizeLUT ( )

override virtual

Finalize the lookup tables in memory.

Normalizes the profiles for radial lookup tables, and calculates derivatives and adds boundary conditions for image LUTs.

Implements QueuedTracker.

Definition at line 494 of file QueuedCUDATracker.cu.

 {
     Device* srcd = devices[0];
     cudaSetDevice(srcd->index);
 
     cudaImageListf& src = srcd->radial_zlut;
     float *tmp = new float [src.w*src.h*src.count];
     srcd->radial_zlut.copyToHost(tmp);
 
     for (int i=0;i<src.h*src.count;i++){
         NormalizeRadialProfile(&tmp[src.w*i],src.w);
     }
 
     SetRadialZLUT(tmp, src.count, src.h);
     
     for(int bead = 0; bead < src.count; bead++){
         for(int step = 0; step < cfg.zlut_radialsteps; step++){
             dbgprintf("%f ",tmp[(bead * cfg.zlut_radialsteps)+step]);
         }
     }
     dbgprintf("\n"); //*/
 
     delete[] tmp;
 }

§ Flush()

void QueuedCUDATracker::Flush ( )

override virtual

Stop waiting for more jobs to do, and just process the current batch.

Implements QueuedTracker.

Definition at line 519 of file QueuedCUDATracker.cu.

 {
     jobQueueMutex.lock();
     for (uint i=0;i<streams.size();i++) {
         if(streams[i]->JobCount()>0 && streams[i]->state != Stream::StreamExecuting)
             streams[i]->state = Stream::StreamPendingExec;
     }
     jobQueueMutex.unlock();
 }

§ GetConfigValues()

QueuedCUDATracker::ConfigValueMap QueuedCUDATracker::GetConfigValues ( )

override virtual

Get the used additional configurations.

Implements QueuedTracker.

Definition at line 881 of file QueuedCUDATracker.cu.

 {
     ConfigValueMap cvm;
     cvm["use_texturecache"] = useTextureCache ? "1" : "0";
     return cvm;
 }

§ GetProfileReport()

std::string QueuedCUDATracker::GetProfileReport ( )

override virtual

Get the output of performance profiling.

Note: Profiling is only implemented in CUDA at the moment.

Returns: String with the parsed profiling output.

Reimplemented from QueuedTracker.

Definition at line 869 of file QueuedCUDATracker.cu.

 {
     float f = 1.0f/batchesDone;
 
     return deviceReport + "Time profiling: [GPU], [CPU] \n" +
         SPrintf("%d batches done of size %d, on %d streams", batchesDone, batchSize, streams.size()) + "\n" +
         SPrintf("Image copying: %.2f,\t%.2f ms\n", time.imageCopy*f, cpu_time.imageCopy*f) +
         SPrintf("QI:            %.2f,\t%.2f ms\n", time.qi*f, cpu_time.qi*f) +
         SPrintf("COM:           %.2f,\t%.2f ms\n", time.com*f, cpu_time.com*f) +
         SPrintf("Z Computing:   %.2f,\t%.2f ms\n", time.zcompute*f, cpu_time.zcompute*f);
 }
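
The listing above scales every timing by `1.0f/batchesDone`. If the report is requested before any batch has completed, that factor is a division by zero. A minimal sketch of a guarded average follows; `PerBatch` is a hypothetical helper, not a function of the library.

```cpp
#include <cassert>

// Average time per batch in milliseconds. Returns 0 when no batch has
// completed yet, which avoids dividing by zero.
inline float PerBatch(float totalMs, int batchesDone) {
    assert(batchesDone >= 0);
    return batchesDone > 0 ? totalMs / batchesDone : 0.0f;
}
```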

§ GetQueueLength()

int QueuedCUDATracker::GetQueueLength ( int * maxQueueLen )

override virtual

Get the length of the queue of jobs to be handled.

Parameters

[out] maxQueueLen Optional pointer to a pre-allocated integer. If non-null, it receives the maximum queue size (number of streams times batchSize).

Returns: Number of jobs currently being handled and in the queue.

Implements QueuedTracker.

Definition at line 379 of file QueuedCUDATracker.cu.

 {
     jobQueueMutex.lock();
     int qlen = 0;
     for (uint a=0;a<streams.size();a++){
         qlen += streams[a]->JobCount();
     }
     jobQueueMutex.unlock();
 
     if (maxQueueLen) {
         *maxQueueLen = streams.size()*batchSize;
     }
 
     return qlen;
 }

§ GetRadialZLUT()

void QueuedCUDATracker::GetRadialZLUT ( float * dst )

override virtual

Get the radial lookup tables used for z tracking.

Parameters

[out] dst Pointer to the pre-allocated memory in which to save the data.

Implements QueuedTracker.

Definition at line 834 of file QueuedCUDATracker.cu.

 {
     cudaImageListf* zlut = &devices[0]->radial_zlut;
 
     if (zlut->data) {
         for (int i=0;i<zlut->count;i++) {
             float* img = &data[i*cfg.zlut_radialsteps*zlut->h];
             zlut->copyImageToHost(i, img);
         }
     }
 }

§ GetRadialZLUTCompareProfile()

void QueuedCUDATracker::GetRadialZLUTCompareProfile ( float * dst )

inline virtual

Get saved error curve.

See EnableRadialZLUTCompareProfile.

Note: Not implemented for CUDA.

Parameters

[out] dst Pointer to the pre-allocated memory in which to save the error curve. Size is count * planes.

Implements QueuedTracker.

Definition at line 180 of file QueuedCUDATracker.h.

 {} // dst = [count * planes]

§ GetRadialZLUTSize()

void QueuedCUDATracker::GetRadialZLUTSize	(	int &	count,
		int &	planes,
		int &	radialsteps
	)

override virtual

Get the dimensions of the radial lookup table data.

Parameters

[out]	count	Reference to pre-allocated int. Returns number of lookup tables.
[out]	planes	Reference to pre-allocated int. Returns number of planes per lookup table.
[out]	radialsteps	Reference to pre-allocated int. Returns number of steps per plane.

Implements QueuedTracker.

Definition at line 846 of file QueuedCUDATracker.cu.

 {
     count = devices[0]->radial_zlut.count;
     planes = devices[0]->radial_zlut.h;
     rsteps = cfg.zlut_radialsteps;
 }
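
The three sizes returned here determine how the buffer passed to GetRadialZLUT must be sized and indexed. The sketch below assumes the tightly packed host layout [lut][plane][radialStep] that the GetRadialZLUT listing implies (each LUT starts at `i * zlut_radialsteps * planes`). `ZLUTIndex` and `ZLUTBufferSize` are hypothetical helpers, not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Offset of one radial bin in the host-side ZLUT buffer, assuming the
// layout [lut][plane][radialStep] with no padding between rows.
inline std::size_t ZLUTIndex(int lut, int plane, int step, int planes, int radialSteps) {
    assert(lut >= 0);
    assert(plane >= 0 && plane < planes);
    assert(step >= 0 && step < radialSteps);
    return (static_cast<std::size_t>(lut) * planes + plane) * radialSteps + step;
}

// Number of floats the caller has to allocate for `count` lookup tables.
inline std::size_t ZLUTBufferSize(int count, int planes, int radialSteps) {
    return static_cast<std::size_t>(count) * planes * radialSteps;
}
```

A caller would query the sizes with GetRadialZLUTSize, allocate `ZLUTBufferSize(count, planes, radialsteps)` floats, and then call GetRadialZLUT.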

§ GetReadyStream()

QueuedCUDATracker::Stream * QueuedCUDATracker::GetReadyStream ( )

protected

Get a stream that is not currently executing, and still has room for images.

Definition at line 349 of file QueuedCUDATracker.cu.

 {
     while (true) {
         jobQueueMutex.lock();
 
         Stream *best = 0;
         for (uint i=0;i<streams.size();i++) 
         {
             Stream*s = streams[i];          
             if (s->state == Stream::StreamIdle) {
                 if (!best || (s->JobCount() > best->JobCount()))
                     best = s;
             }
         }
 
         jobQueueMutex.unlock();
 
         if (best) 
             return best;
 
         Threads::Sleep(1);
     }
 }

§ GetResultCount()

int QueuedCUDATracker::GetResultCount ( )

override virtual

Get the number of finished localization jobs (=results) available in memory.

Returns: The number of available results.

Implements QueuedTracker.

Definition at line 853 of file QueuedCUDATracker.cu.

 {
     resultMutex.lock();
     int r = resultCount;
     resultMutex.unlock();
     return r;
 }

§ InitializeDeviceList()

void QueuedCUDATracker::InitializeDeviceList ( )

protected

Build the list of devices to be used based on the QTrkSettings::cuda_device flag.

Definition at line 76 of file QueuedCUDATracker.cu.

 {
     int numDevices;
     cudaGetDeviceCount(&numDevices);
 
     // Select the most powerful one
     if (cfg.cuda_device == QTrkCUDA_UseBest) {
         cfg.cuda_device = GetBestCUDADevice();
         devices.push_back(new Device(cfg.cuda_device));
     } else if(cfg.cuda_device == QTrkCUDA_UseAll) {
         // Use all devices
         for (int i=0;i<numDevices;i++)
             devices.push_back(new Device(i));
     } else if (cfg.cuda_device == QTrkCUDA_UseList) {
         for (uint i=0;i<cudaDeviceList.size();i++)
             devices.push_back(new Device(cudaDeviceList[i]));
     } else {
         devices.push_back (new Device(cfg.cuda_device));
     }
     deviceReport = "Using devices: ";
     for (uint i=0;i<devices.size();i++) {
         cudaDeviceProp p; 
         cudaGetDeviceProperties(&p, devices[i]->index);
         deviceReport += SPrintf("%s%s", p.name, i<devices.size()-1?", ":"\n");
     }
 }

§ IsIdle()

bool QueuedCUDATracker::IsIdle ( )

override virtual

Test to see if the tracker is idle.

That is, GetQueueLength == 0.

Returns: Boolean indicating if the tracker is idle.

Implements QueuedTracker.

Definition at line 373 of file QueuedCUDATracker.cu.

 {
     int ql = GetQueueLength(0);
     return ql == 0;
 }

§ ScheduleLocalization()

void QueuedCUDATracker::ScheduleLocalization	(	void *	data,
		int	pitch,
		QTRK_PixelDataType	pdt,
		const LocalizationJob *	jobInfo
	)

override virtual

Add a job to the queue to be processed. A job entails running the required algorithms on a single region of interest.

If a localization cannot be added to the queue because it is full, the calling thread sleeps and periodically tries again.

Parameters

[in]	data	Pointer to the data. Type specified by `pdt`.
[in]	pitch	Distance in bytes between the starts of two successive rows of pixels (e.g. address of (0,1) - address of (0,0)).
[in]	pdt	Type of `data`, specified by QTRK_PixelDataType.
[in]	jobInfo	Structure with metadata for the ROI to be handled. See LocalizationJob.

Implements QueuedTracker.

Definition at line 405 of file QueuedCUDATracker.cu.

 {
     Stream* s = GetReadyStream();
 
     jobQueueMutex.lock();
     int jobIndex = s->jobs.size();
     LocalizationJob job = *jobInfo;
     s->jobs.push_back(job);
     s->localizeFlags = localizeMode; // which kernels to run
     s->locParams[jobIndex].zlutIndex = jobInfo->zlutIndex;
 
     if (s->jobs.size() == batchSize)
         s->state = Stream::StreamPendingExec;
     jobQueueMutex.unlock();
 
     s->imageBufMutex.lock();
     // Copy the image to the batch image buffer (CPU side)
     float* hostbuf = &s->hostImageBuf[cfg.height*cfg.width*jobIndex];
     CopyImageToFloat( (uchar*)data, cfg.width, cfg.height, pitch, pdt, hostbuf);
     if (localizeMode & LT_ClearFirstFourPixels) {
         for(int i=0;i<4;i++) hostbuf[i]=0;
     }
     s->imageBufMutex.unlock();
 
     //dbgprintf("Job: %d\n", jobIndex);
 }
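
The `pitch` parameter lets a caller pass a region of interest that lives inside a larger frame without copying it first: `data` points at the first ROI pixel and `pitch` is the row stride of the full frame. The sketch below illustrates that addressing for 8-bit pixels only. `CopyRoiToFloat` is a hypothetical stand-in for CopyImageToFloat, whose actual implementation is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies a width x height ROI of 8-bit pixels into a tightly packed float
// buffer. pitch is the byte distance between the starts of consecutive rows
// in the source, so it may be larger than width.
std::vector<float> CopyRoiToFloat(const std::uint8_t* data, int width, int height, int pitch) {
    assert(data != nullptr && pitch >= width);
    std::vector<float> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; y++) {
        const std::uint8_t* row = data + static_cast<std::size_t>(y) * pitch;
        for (int x = 0; x < width; x++)
            out[static_cast<std::size_t>(y) * width + x] = static_cast<float>(row[x]);
    }
    return out;
}
```

For a 2x2 ROI at position (1,1) in a frame that is 4 bytes wide, the call would be `CopyRoiToFloat(frame + 1*4 + 1, 2, 2, 4)`.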

§ SchedulingThreadEntryPoint()

void QueuedCUDATracker::SchedulingThreadEntryPoint ( void * param )

static protected

Entry point for thread creation.

Definition at line 212 of file QueuedCUDATracker.cu.

 {
     ((QueuedCUDATracker*)param)->SchedulingThreadMain();
 }

§ SchedulingThreadMain()

void QueuedCUDATracker::SchedulingThreadMain ( )

protected

Loop executed by the scheduling thread, which launches pending batches and collects the results of streams that have finished executing.

Definition at line 217 of file QueuedCUDATracker.cu.

 {
     std::vector<Stream*> activeStreams;
 
     while (!quitScheduler) {
         jobQueueMutex.lock();
         Stream* s = 0;
         for (uint i=0;i<streams.size();i++) 
             if (streams[i]->state == Stream::StreamPendingExec) {
                 s=streams[i];
                 s->state = Stream::StreamExecuting;
         //      dbgprintf("Executing stream %p [%d]. %d jobs\n", s, i, s->JobCount());
                 break;
             }
         jobQueueMutex.unlock();
 
         if (s) {
             s->imageBufMutex.lock();
 
             // Launch filled batches, or if flushing launch every batch with nonzero jobs
             if (useTextureCache)
                 ExecuteBatch<ImageSampler_Tex> (s);
             else
                 ExecuteBatch<ImageSampler_MemCopy> (s);
             s->imageBufMutex.unlock();
             activeStreams.push_back(s);
         }
         
         // Fetch results
         for (uint a=0;a<activeStreams.size();a++) {
             Stream* s = activeStreams[a];
             if (s->IsExecutionDone()) {
         //      dbgprintf("Stream %p done.\n", s);
                 CopyStreamResults(s);
                 s->localizeFlags = 0; // reset this for the next batch
                 jobQueueMutex.lock();
                 s->jobs.clear();
                 s->state = Stream::StreamIdle;
                 jobQueueMutex.unlock();
                 activeStreams.erase(std::find(activeStreams.begin(),activeStreams.end(),s));
                 break;
             }
         }
 
         Threads::Sleep(1);
     }
 }
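
Taken together, the ScheduleLocalization, Flush and SchedulingThreadMain listings move each stream through three states: idle, pending execution, and executing. The following single-threaded model sketches those transitions. All names in it (`StreamState`, `MiniStream`, `AddJob`, `FlushStream`, `LaunchNext`, `Complete`) are hypothetical, and the locking on `jobQueueMutex` and the actual GPU work are deliberately left out.

```cpp
#include <cassert>
#include <vector>

enum class StreamState { Idle, PendingExec, Executing };

struct MiniStream {
    StreamState state = StreamState::Idle;
    int jobCount = 0;
};

// ScheduleLocalization side: add a job; a full batch becomes pending.
inline void AddJob(MiniStream& s, int batchSize) {
    assert(s.state == StreamState::Idle);
    if (++s.jobCount == batchSize) s.state = StreamState::PendingExec;
}

// Flush side: a partially filled stream that is not running becomes pending.
inline void FlushStream(MiniStream& s) {
    if (s.jobCount > 0 && s.state != StreamState::Executing)
        s.state = StreamState::PendingExec;
}

// Scheduler side: pick the first pending stream and mark it as executing.
inline MiniStream* LaunchNext(std::vector<MiniStream>& streams) {
    for (MiniStream& s : streams) {
        if (s.state == StreamState::PendingExec) {
            s.state = StreamState::Executing;
            return &s;
        }
    }
    return nullptr;
}

// Scheduler side: after the results were copied, the stream is reusable.
inline void Complete(MiniStream& s) {
    assert(s.state == StreamState::Executing);
    s.jobCount = 0;
    s.state = StreamState::Idle;
}
```

The model makes one property of the design visible: a stream only accepts jobs while idle, so a batch that is pending or executing is never modified.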

§ SetConfigValue()

void QueuedCUDATracker::SetConfigValue	(	std::string	name,
		std::string	value
	)

override virtual

Set an additional setting.

Parameters

[in]	name	Name of the setting.
[in]	value	Value of the setting.

Implements QueuedTracker.

Definition at line 888 of file QueuedCUDATracker.cu.

 {
     if (name == "use_texturecache")
         useTextureCache = atoi(value.c_str()) != 0;
 }

§ SetLocalizationMode()

void QueuedCUDATracker::SetLocalizationMode ( LocMode_t locType )

override virtual

Select which algorithm is to be used.

Parameters

[in] locType An integer used as a bitmask for settings based on LocalizeModeEnum.

Implements QueuedTracker.

Definition at line 395 of file QueuedCUDATracker.cu.

 {
     Flush();
     while (!IsIdle());
 
     jobQueueMutex.lock();
     localizeMode = mode;
     jobQueueMutex.unlock();
 }

§ SetPixelCalibrationFactors()

void QueuedCUDATracker::SetPixelCalibrationFactors	(	float	offsetFactor,
		float	gainFactor
	)

override virtual

Set the pixel calibration factors.

The factors can be used to increase or decrease the effect of the images supplied through SetPixelCalibrationImages, for further fine-tuning. They only have an effect when an image has actually been set through that function.

Parameters

[in]	offsetFactor	Factor by which to scale the offset values.
[in]	gainFactor	Factor by which to scale the gain values.

Implements QueuedTracker.

Definition at line 753 of file QueuedCUDATracker.cu.

 {
     gc_mutex.lock();
     gc_gainFactor = gainFactor;
     gc_offsetFactor = offsetFactor;
     gc_mutex.unlock();
 }

§ SetPixelCalibrationImages()

void QueuedCUDATracker::SetPixelCalibrationImages	(	float *	offset,
		float *	gain
	)

override virtual

Set the pixel calibration images.

These images are used to scale the input image to get rid of background influences in the image. The values are per-pixel-per-ROI. Result = gain*(pixel+offset).

Parameters

[in]	offset	Array with the offset values to use per pixel. Size and order is [width*height*numbeads].
[in]	gain	Array with gain values to use per pixel. Size and order is [width*height*numbeads].

Implements QueuedTracker.

Definition at line 718 of file QueuedCUDATracker.cu.

 {
     for (uint i=0;i<devices.size();i++) {
         devices[i]->SetPixelCalibrationImages(offset, gain, cfg.width, cfg.height);
     }
 
     // Copy to CPU side buffers for BuildLUT
     int nelem = devices[0]->radial_zlut.count * cfg.width * cfg.height;
     if (offset && gc_offset.size()!=nelem) { 
         gc_offset.resize(nelem);
         gc_offset.assign(offset,offset+nelem);
     }
     if (!offset) gc_offset.clear();
 
     if (gain && gc_gain.size()!=nelem) {
         gc_gain.reserve(nelem);
         gc_gain.assign(gain, gain+nelem);
     }
     if (!gain) gc_gain.clear();
 }
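
As an illustration of the formula Result = gain*(pixel+offset) and of the [width*height*numbeads] ordering, the sketch below corrects the ROI of a single bead on the CPU. `ApplyOffsetGain` is a hypothetical helper. The calibration factors from SetPixelCalibrationFactors are left out because this page does not state how they enter the formula, and treating a null pointer as "no image set" is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies result = gain * (pixel + offset) to the ROI of one bead, in place.
// offset and gain hold width*height values per bead, stored bead after bead.
// A null offset or gain pointer skips that part of the correction.
void ApplyOffsetGain(std::vector<float>& roi, int width, int height, int bead,
                     const float* offset, const float* gain) {
    const std::size_t n = static_cast<std::size_t>(width) * height;
    assert(roi.size() == n && bead >= 0);
    const std::size_t base = static_cast<std::size_t>(bead) * n;
    for (std::size_t i = 0; i < n; i++) {
        float v = roi[i];
        if (offset) v += offset[base + i];
        if (gain)   v *= gain[base + i];
        roi[i] = v;
    }
}
```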

§ SetRadialWeights()

void QueuedCUDATracker::SetRadialWeights ( float * zcmp )

override virtual

Set radial weights used for comparing LUT profiles.

Parameters

[in] zcmp Array of radial weights to use. zcmp has to have zlut_radialsteps elements.

Implements QueuedTracker.

Definition at line 798 of file QueuedCUDATracker.cu.

 {
     for (uint i=0;i<devices.size();i++) {
         devices[i]->SetRadialWeights(zcmp);
     }
 }

§ SetRadialZLUT()

void QueuedCUDATracker::SetRadialZLUT	(	float *	data,
		int	count,
		int	planes
	)

override virtual

Set the radial lookup tables to be used for z tracking.

`data` may be null to only allocate the ZLUT memory. LUTs should have been created beforehand by BuildLUT, though not necessarily by the current instance, as long as the settings match.

Parameters

[in]	data	Pointer to the start of the ZLUT data.
[in]	count	Number of ZLUTs in the dataset.
[in]	planes	Number of planes per ZLUT.

Implements QueuedTracker.

Definition at line 787 of file QueuedCUDATracker.cu.

 {
     for (uint i=0;i<devices.size();i++) {
         devices[i]->SetRadialZLUT(data, cfg.zlut_radialsteps, planes, numLUTs);
     }
 
     for (uint i=0;i<streams.size();i++) {
         StreamUpdateZLUTSize(streams[i]);
     }
 }

§ StreamUpdateZLUTSize()

void QueuedCUDATracker::StreamUpdateZLUTSize ( Stream * s )

protected

Update zlut vector dimensions. Use when settings change.

Parameters

[in] s The stream for which to update.

Definition at line 805 of file QueuedCUDATracker.cu.

 {       
     cudaSetDevice(s->device->index);
     s->d_zlutcmpscores.init(s->device->radial_zlut.h * batchSize);
     // radialsteps never changes
 }

§ threads()

dim3 QueuedCUDATracker::threads ( )

inline protected

Get the CUDA native datatype with the threadblock dimensions to use.

Use in conjunction with blocks as:

kernel <<< blocks(), threads() >>> (a, b);

Returns: CUDA 3D data for use in kernel calls.

Definition at line 399 of file QueuedCUDATracker.h.

 { return dim3(numThreads); }
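
The body of blocks() is not listed in this section, but the ZLUT_ComputeProfileMatchScores launch in ExecuteBatch computes its grid with the usual rounding-up division, `(count + threads - 1) / threads`. A minimal sketch of that calculation follows; `BlocksFor` is a hypothetical helper and not necessarily how blocks() is implemented.

```cpp
#include <cassert>

// Number of thread blocks needed so that blocks * threadsPerBlock >= workItems.
inline int BlocksFor(int workItems, int threadsPerBlock) {
    assert(workItems >= 0 && threadsPerBlock > 0);
    return (workItems + threadsPerBlock - 1) / threadsPerBlock;
}
```

Because the last block may be only partially used, a kernel launched this way has to check its global index against the job count before touching memory.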

Member Data Documentation

§ batchesDone

int QueuedCUDATracker::batchesDone

Number of fully completed batches.

Definition at line 487 of file QueuedCUDATracker.h.

§ batchSize

int QueuedCUDATracker::batchSize

protected

Amount of images to be sent at once per stream.

A larger batch size gives higher throughput, for two reasons:

More threads and blocks are created, which allows memory latency to be hidden more efficiently by warp switching. In other words, higher occupancy is achieved.
Bigger batches are copied at a time, which achieves a higher effective PCIe bus bandwidth.

Definition at line 359 of file QueuedCUDATracker.h.

§ cpu_time

KernelProfileTime QueuedCUDATracker::cpu_time

Definition at line 486 of file QueuedCUDATracker.h.

§ deviceProp

cudaDeviceProp QueuedCUDATracker::deviceProp

protected

Variable used to save device properties obtained from the CUDA API.

Definition at line 418 of file QueuedCUDATracker.h.

§ deviceReport

std::string QueuedCUDATracker::deviceReport

String holding a human-readable description of used GPUs. Filled during InitializeDeviceList.

Definition at line 489 of file QueuedCUDATracker.h.

§ devices

std::vector<Device*> QueuedCUDATracker::devices

protected

Vector of device instances used.

Definition at line 407 of file QueuedCUDATracker.h.

§ gc_gain

std::vector<float> QueuedCUDATracker::gc_gain

protected

Vector with gains used for pixel correction.

Definition at line 412 of file QueuedCUDATracker.h.

§ gc_gainFactor

float QueuedCUDATracker::gc_gainFactor

protected

Factor by which to scale the gain.

Definition at line 410 of file QueuedCUDATracker.h.

§ gc_mutex

Threads::Mutex QueuedCUDATracker::gc_mutex

protected

Mutex for pixel calibration operations.

Definition at line 413 of file QueuedCUDATracker.h.

§ gc_offset

std::vector<float> QueuedCUDATracker::gc_offset

protected

Vector with offsets used for pixel correction.

Definition at line 411 of file QueuedCUDATracker.h.

§ gc_offsetFactor

float QueuedCUDATracker::gc_offsetFactor

protected

Factor by which to scale the pixel calibration offset.

Definition at line 409 of file QueuedCUDATracker.h.

§ jobQueueMutex

Threads::Mutex QueuedCUDATracker::jobQueueMutex

protected

Mutex for job queue accesses.

Definition at line 406 of file QueuedCUDATracker.h.

§ localizeMode

LocMode_t QueuedCUDATracker::localizeMode

protected

Flags for localization choices. See LocalizeModeEnum.

Definition at line 404 of file QueuedCUDATracker.h.

§ numThreads

int QueuedCUDATracker::numThreads

protected

Number of threads to use in a general thread block.

Used by the blocks and threads functions to quickly calculate kernel launch dimensions that cover the parameter space.

Definition at line 365 of file QueuedCUDATracker.h.

§ qalign

QI QueuedCUDATracker::qalign

protected

Instance of QI used specifically for quadrant alignment.

Definition at line 417 of file QueuedCUDATracker.h.

§ qi

QI QueuedCUDATracker::qi

protected

The QI instance used to perform the 2D localization using the quadrant interpolation algorithm.

Definition at line 416 of file QueuedCUDATracker.h.

§ quitScheduler

Atomic<bool> QueuedCUDATracker::quitScheduler

protected

Thread shutdown flag with built-in thread safety (atomic).

Definition at line 421 of file QueuedCUDATracker.h.

§ resultCount

int QueuedCUDATracker::resultCount

protected

Number of results available.

Definition at line 403 of file QueuedCUDATracker.h.

§ resultMutex

Threads::Mutex QueuedCUDATracker::resultMutex

protected

Mutex for result memory accesses.

Definition at line 405 of file QueuedCUDATracker.h.

§ results

std::list<LocalizationResult> QueuedCUDATracker::results

protected

List of completed results.

Definition at line 402 of file QueuedCUDATracker.h.

§ schedulingThread

Threads::Handle* QueuedCUDATracker::schedulingThread

protected

Handle to the scheduling thread for later reference.

Definition at line 420 of file QueuedCUDATracker.h.

§ streams

std::vector<Stream*> QueuedCUDATracker::streams

protected

Vector of usable streams.

Definition at line 401 of file QueuedCUDATracker.h.

§ time

KernelProfileTime QueuedCUDATracker::time

Definition at line 486 of file QueuedCUDATracker.h.

§ useTextureCache

bool QueuedCUDATracker::useTextureCache

protected

Flag to use texture cache. Default is true. Disable using EnableTextureCache.

Definition at line 408 of file QueuedCUDATracker.h.

§ zlut_build_flags

uint QueuedCUDATracker::zlut_build_flags

protected

Flags for ZLUT building. Not actually used yet.

Definition at line 414 of file QueuedCUDATracker.h.

The documentation for this class was generated from the following files:

cudatrack/QueuedCUDATracker.h
cudatrack/QueuedCUDATracker.cu
