QTrk
Classes | Public Member Functions | Private Member Functions | Private Attributes | List of all members
QI Class Reference

Class to specifically perform quadrant interpolation calculations on the GPU. More...

#include <QI.h>

Classes

struct  DeviceInstance
 Contains all QI memory space that is allocated per device (shared between streams). More...
 
struct  StreamInstance
 Contains all QI memory space that is allocated per stream. More...
 

Public Member Functions

template<typename TImageSampler >
void Execute (BaseKernelParams &p, const QTrkComputedConfig &cfg, StreamInstance *s, DeviceInstance *d, device_vec< float3 > *initial, device_vec< float3 > *output)
 Execute a batch of QI calculations. Runs Iterate the correct number of times and recalculates the number of radial spokes for every iteration, based on QTrkComputedConfig::qi_angstep_factor. More...
 
void InitDevice (DeviceInstance *d, QTrkComputedConfig &cc)
 Ready a device for QI calculations. More...
 
void InitStream (StreamInstance *s, QTrkComputedConfig &cc, cudaStream_t stream, int batchSize)
 Ready a stream for QI calculations. More...
 
void Init (QTrkComputedConfig &cfg, int batchSize)
 Initialize this QI instance. More...
 

Private Member Functions

template<typename TImageSampler >
void Iterate (BaseKernelParams &p, device_vec< float3 > *initial, device_vec< float3 > *output, StreamInstance *s, DeviceInstance *d, int angularSteps)
 Perform one QI iteration. More...
 
dim3 blocks (int njobs)
 Same function as QueuedCUDATracker::blocks. More...
 
dim3 threads ()
 Same function as QueuedCUDATracker::threads. More...
 

Private Attributes

QIParams params
 Structure with settings relevant to quadrant interpolation. More...
 
int qi_FFT_length
 Parameter for length required for arrays going into FFT. 2 * radial steps. More...
 
int batchSize
 See QueuedCUDATracker::batchSize. Local copy. More...
 
int numThreads
 See QueuedCUDATracker::numThreads. More...
 

Detailed Description

Class to specifically perform quadrant interpolation calculations on the GPU.

Maintains all memory and functions related to QI to keep QueuedCUDATracker more clean.

Definition at line 20 of file QI.h.

Member Function Documentation

§ blocks()

dim3 QI::blocks ( int  njobs)
inlineprivate

Same function as QueuedCUDATracker::blocks.

Definition at line 153 of file QI.h.

153 { return dim3((njobs+numThreads-1)/numThreads); }
int numThreads
See QueuedCUDATracker::numThreads.
Definition: QI.h:150

§ Execute()

template<typename TImageSampler >
void QI::Execute ( BaseKernelParams p,
const QTrkComputedConfig cfg,
QI::StreamInstance s,
QI::DeviceInstance d,
device_vec< float3 > *  initial,
device_vec< float3 > *  output 
)

Execute a batch of QI calculations. Runs Iterate the correct number of times and recalculates the number of radial spokes for every iteration, based on QTrkComputedConfig::qi_angstep_factor.

Parameters
[in]pReference to BaseKernelParams with parameters for this call.
[in]cfgReference to the settings to use.
[in]sThe QI::StreamInstance to use.
[in]dThe QI::DeviceInstance to use.
[in]initialVector with the initial positions from which to start the radial samplings on the first iteration.
[out]outputVector that will be filled with the final results.

Definition at line 268 of file QI_impl.h.

269 {
270  float angsteps = cfg.qi_angstepspq / powf(cfg.qi_angstep_factor, cfg.qi_iterations);
271 
272  for (int a=0;a<cfg.qi_iterations;a++) {
273  device_vec<float3>* dst = a==0 ? initial : output;
274  Iterate< TImageSampler > (p, dst, output, s, d, std::max(MIN_RADPROFILE_SMP_COUNT, (int)angsteps) );
275  angsteps *= cfg.qi_angstep_factor;
276  }
277 }
#define MIN_RADPROFILE_SMP_COUNT
Minimum number of samples for a profile radial bin. Below this the image mean will be used...
Definition: QueuedTracker.h:74
int qi_angstepspq
Number of angular steps to sample on per quadrant.
Definition: qtrk_c_api.h:203
int qi_iterations
Number of times to run the QI algorithm, sampling around the last found position. ...
Definition: qtrk_c_api.h:140
float qi_angstep_factor
Factor to reduce angular steps on lower iterations. Default 1.0 (no effect).
Definition: qtrk_c_api.h:157

§ Init()

void QI::Init ( QTrkComputedConfig cfg,
int  batchSize 
)

Initialize this QI instance.

Copy relevant settings to local datastructures.

Parameters
[in]cfgThe configuration used for the algorithm.
[in]batchSizeThe calculation batch size. See batchSize.

Definition at line 387 of file QI_impl.h.

388 {
389  QIParams& qi = params;
391  qi.iterations = cfg.qi_iterations;
392  qi.maxRadius = cfg.qi_maxradius;
393  qi.minRadius = cfg.qi_minradius;
394  qi.radialSteps = cfg.qi_radialsteps;
395 
396  cudaDeviceProp prop;
397  cudaGetDeviceProperties(&prop, 0);
398  numThreads = prop.warpSize;
399 
400  this->batchSize = batchSize;
401 }
int iterations
Definition: QI.h:11
int maxAngularSteps
Maximum amount of angular steps. Used to enable usage of QTrkSettings::qi_angstep_factor.
Definition: QI.h:12
float qi_minradius
Distance in pixels from the bead center from which to start sampling profiles. Default 1...
Definition: qtrk_c_api.h:141
float maxRadius
Definition: QI.h:9
QIParams params
Structure with settings relevant to quadrant interpolation.
Definition: QI.h:142
int radialSteps
Definition: QI.h:10
Structure to hold QI settings. See QTrkComputedConfig for more info on most settings.
Definition: QI.h:7
float qi_maxradius
Max radius in pixels of the sampling circle.
Definition: qtrk_c_api.h:204
int batchSize
See QueuedCUDATracker::batchSize. Local copy.
Definition: QI.h:149
int qi_radialsteps
Number of radial steps to sample on.
Definition: qtrk_c_api.h:202
int qi_angstepspq
Number of angular steps to sample on per quadrant.
Definition: qtrk_c_api.h:203
int qi_iterations
Number of times to run the QI algorithm, sampling around the last found position. ...
Definition: qtrk_c_api.h:140
float minRadius
Definition: QI.h:8
int numThreads
See QueuedCUDATracker::numThreads.
Definition: QI.h:150

§ InitDevice()

void QI::InitDevice ( DeviceInstance d,
QTrkComputedConfig cc 
)

Ready a device for QI calculations.

Calculate required parameters and fill a DeviceInstance.

Parameters
[in,out]dThe instance to be initialized.
[in]ccThe configuration used for the algorithm.

Definition at line 344 of file QI_impl.h.

345 {
346  std::vector<float2> qi_radialgrid(cc.qi_angstepspq);
347  for (int i=0;i<cc.qi_angstepspq;i++) {
348  float ang = 0.5f*3.141593f*(i+0.5f)/(float)cc.qi_angstepspq;
349  qi_radialgrid[i]=make_float2(cos(ang), sin(ang));
350  }
351  d->qi_trigtable = qi_radialgrid;
352 
353  std::vector<float> rweights = ComputeRadialBinWindow(cc.qi_radialsteps);
354  d->d_radialweights = rweights;
355 
356  QIParams dp = params;
357  dp.cos_sin_table = d->qi_trigtable.data;
358 
359  cudaMalloc(&d->d_qiparams, sizeof(QIParams));
360  cudaMemcpy(d->d_qiparams, &dp, sizeof(QIParams), cudaMemcpyHostToDevice);
361 }
QIParams params
Structure with settings relevant to quadrant interpolation.
Definition: QI.h:142
std::vector< float > ComputeRadialBinWindow(int rsteps)
Calculate the radial weights for ZLUT profile comparisons.
Definition: utils.cpp:706
Structure to hold QI settings. See QTrkComputedConfig for more info on most settings.
Definition: QI.h:7
int qi_radialsteps
Number of radial steps to sample on.
Definition: qtrk_c_api.h:202
int qi_angstepspq
Number of angular steps to sample on per quadrant.
Definition: qtrk_c_api.h:203
float2 * cos_sin_table
Reference to QI::DeviceInstance::qi_trigtable for use in kernels. Radial sampling points...
Definition: QI.h:13

§ InitStream()

void QI::InitStream ( StreamInstance s,
QTrkComputedConfig cc,
cudaStream_t  stream,
int  batchSize 
)

Ready a stream for QI calculations.

Calculate required parameters and fill a StreamInstance.

Parameters
[in,out]sThe instance to be initialized.
[in]ccThe configuration used for the algorithm.
[in]streamThe normal CUDA stream this QI Stream will relate to.
[in]batchSizeThe calculation batch size. See batchSize.

Definition at line 363 of file QI_impl.h.

364 {
365  int fftlen = cc.qi_radialsteps*2;
366  s->stream = stream;
367  s->d_quadrants.init(fftlen*batchSize*2); // 4 quadrants * radialSteps * batchSize
368  s->d_QIprofiles.init(batchSize*2*fftlen); // (2 axis) * (2 radialsteps) = 4 * nr = 2 * fftlen
369  s->d_QIprofiles_reverse.init(batchSize*2*fftlen);
370  s->d_shiftbuffer.init(fftlen * batchSize);
371 
372  // 2* batchSize, since X & Y both need an FFT transform
373  // cufftPlanMany and 1d with batch argument are equivalent in memory usage and speed
374  // Using Many because the batch argument for 1d is strictly speaking deprecated (see cufft.h)
375  cufftResult_t r = cufftPlanMany(&s->fftPlan, 1, &fftlen, 0, 1, fftlen, 0, 1, fftlen, CUFFT_C2C, batchSize*2);
376  //cufftResult_t r = cufftPlan1d(&s->fftPlan, fftlen, CUFFT_C2C, batchSize*2);
377 
378  if(r != CUFFT_SUCCESS) {
379  throw std::runtime_error( SPrintf("CUFFT plan creation failed. FFT len: %d. Batchsize: %d\n", fftlen, batchSize*2));
380  }
381  CheckCUDAError(cufftSetCompatibilityMode(s->fftPlan, CUFFT_COMPATIBILITY_NATIVE));
382  CheckCUDAError(cufftSetStream(s->fftPlan, stream));
383 
384  this->qi_FFT_length = fftlen;
385 }
int qi_FFT_length
Parameter for length required for arrays going into FFT. 2 * radial steps.
Definition: QI.h:148
void CheckCUDAError(cufftResult_t err)
Definition: gpu_utils.h:36
int batchSize
See QueuedCUDATracker::batchSize. Local copy.
Definition: QI.h:149
int qi_radialsteps
Number of radial steps to sample on.
Definition: qtrk_c_api.h:202
std::string SPrintf(const char *fmt,...)
Definition: utils.cpp:132

§ Iterate()

template<typename TImageSampler >
void QI::Iterate ( BaseKernelParams p,
device_vec< float3 > *  initial,
device_vec< float3 > *  output,
StreamInstance s,
DeviceInstance d,
int  angularSteps 
)
private

Perform one QI iteration.

This is where the actual algorithm is executed. See [3] for details.

The kernels are called in the following order:

  • QI_ComputeQuadrants - Calculate the 4 quadrant profiles
  • QI_QuadrantsToProfiles - Convert the quadrant profiles into the concatenated and their respective reverse profiles
  • Forward FFT (CUFFT) - Calculate the fourier transforms of the profiles and reverse profiles
  • QI_MultiplyWithConjugate - Multiply the transforms to calculate their autocorrelation
  • Reverse FFT (CUFFT) - Transform the multiplied transforms back to the time-domain autocorrelation
  • QI_OffsetPositions - Calculate and apply the X and Y shifts from the autocorrelations
Parameters
[in]pKernelParamaters to use.
[in]initialThe initial positions to use as sampling centers.
[out]outputVector with the resulting positions.
[in]sThe stream to execute.
[in]dThe device on which to execute.
[in]angularStepsThe number of angular steps to use.

Huge speedup achieved by calculating each quadrant's radial bin in a seperate thread. This leads to more, smaller kernels running concurrently so more memory latency hiding can be achieved through higher occupancy. For a visual display of this, try running the NVidia Visual Profiler with both versions of quadrant profile kernels.

Definition at line 280 of file QI_impl.h.

281 {
282  /* Old method of calculating quadrant profiles.
283  QIParams dp = params;
284  dp.cos_sin_table = d->qi_trigtable.data;
285 
286  QI_ComputeProfile <TImageSampler> <<< blocks(p.njobs), threads(), 0, s->stream >>> (p, initial->data,
287  s->d_quadrants.data, s->d_QIprofiles.data, s->d_QIprofiles_reverse.data, dp, d->d_radialweights.data, angularSteps);
288  // DbgOutputVectorToFile("D:\\TestImages\\gpu_qi_prof_old.csv", s->d_quadrants, true);
289  */
290 
294  dim3 qdrThreads(16, 8, 4);
295  dim3 qdrDim( (p.njobs + qdrThreads.x - 1) / qdrThreads.x, (params.radialSteps + qdrThreads.y - 1) / qdrThreads.y);
296 
297  QI_ComputeQuadrants<TImageSampler> <<< qdrDim , qdrThreads, 0, s->stream >>>
298  (p, initial->data, s->d_quadrants.data, d->d_qiparams, angularSteps);
299 
300  QI_QuadrantsToProfiles <<< blocks(p.njobs), threads(), 0, s->stream >>>
301  (p, s->d_quadrants.data, s->d_QIprofiles.data, s->d_QIprofiles_reverse.data, d->d_radialweights.data, d->d_qiparams);
302  // DbgOutputVectorToFile("D:\\TestImages\\gpu_qi_prof_new.csv", s->d_quadrants, true);
303 
304 #ifdef QI_DEBUG
305  DbgCopyResult(s->d_QIprofiles, cmp_gpu_qi_prof);
306 #endif
307 
308  cufftComplex* prof = (cufftComplex*)s->d_QIprofiles.data;
309  cufftComplex* revprof = (cufftComplex*)s->d_QIprofiles_reverse.data;
310  CheckCUDAError(cufftExecC2C(s->fftPlan, prof, prof, CUFFT_FORWARD));
311  CheckCUDAError(cufftExecC2C(s->fftPlan, revprof, revprof, CUFFT_FORWARD));
312 
313  int nval = qi_FFT_length * 2 * batchSize;
314  int nthread = 256;
315  QI_MultiplyWithConjugate<<< dim3( (nval + nthread - 1)/nthread ), dim3(nthread), 0, s->stream >>>(nval, prof, revprof);
316  /*
317  Experiment to see the effect of different block sizes.
318  Seems that if the whole parameter space is covered, block sizes don't matter for execution speed.
319 
320  dim3 threadsDim = dim3(16, 16);
321  int numBlocks = nval / (threadsDim.x * threadsDim.y);
322  dim3 blocksDim = dim3( sqrt((double)numBlocks)+1, sqrt((double)numBlocks)+1 );
323 
324  if(nval > threadsDim.x * threadsDim.y * blocksDim.x * blocksDim.y){
325  printf("\nNot whole space covered: nval > total threads\n");
326  printf("nval: %d, nthread: %d, block (%d,%d), threads (%d,%d)\n", nval, nthread, blocksDim.x, blocksDim.y, threadsDim.x, threadsDim.y);
327  }
328  QI_MultiplyWithConjugate<<< blocksDim, threadsDim, 0, s->stream >>> (nval, prof, revprof);
329  */
330 
331  CheckCUDAError(cufftExecC2C(s->fftPlan, prof, prof, CUFFT_INVERSE));
332 
333 #ifdef QI_DEBUG
334  DbgCopyResult(s->d_QIprofiles, cmp_gpu_qi_fft_out);
335 #endif
336 
337  float2* d_offsets=0;
338  float pixelsPerProfLen = (params.maxRadius-params.minRadius)/params.radialSteps;
339  dim3 nBlocks=blocks(p.njobs), nThreads=threads();
340  QI_OffsetPositions<<<nBlocks, nThreads, 0, s->stream>>>
341  (p.njobs, initial->data, newpos->data, prof, qi_FFT_length, d_offsets, pixelsPerProfLen, s->d_shiftbuffer.data);
342 }
std::vector< std::complex< float > > cmp_gpu_qi_fft_out
Definition: test.cu:672
int qi_FFT_length
Parameter for length required for arrays going into FFT. 2 * radial steps.
Definition: QI.h:148
float maxRadius
Definition: QI.h:9
__global__ void QI_QuadrantsToProfiles(BaseKernelParams kp, float *quadrants, float2 *profiles, float2 *reverseProfiles, float *d_radialweights, const QIParams *params)
Definition: QI_impl.h:215
QIParams params
Structure with settings relevant to quadrant interpolation.
Definition: QI.h:142
int radialSteps
Definition: QI.h:10
void CheckCUDAError(cufftResult_t err)
Definition: gpu_utils.h:36
__global__ void QI_MultiplyWithConjugate(int n, cufftComplex *a, cufftComplex *b)
Definition: QI_impl.h:102
dim3 threads()
Same function as QueuedCUDATracker::threads.
Definition: QI.h:155
__global__ void QI_OffsetPositions(int njobs, float3 *current, float3 *dst, cufftComplex *autoconv, int fftLength, float2 *offsets, float pixelsPerProfLen, float *shiftbuf)
Definition: QI_impl.h:130
int batchSize
See QueuedCUDATracker::batchSize. Local copy.
Definition: QI.h:149
void DbgCopyResult(device_vec< float2 > src, std::vector< std::complex< float > > &dst)
Definition: gpu_utils.h:268
std::vector< float > cmp_gpu_qi_prof
Definition: test.cu:669
int njobs
Number of jobs in the batch.
float minRadius
Definition: QI.h:8
dim3 blocks(int njobs)
Same function as QueuedCUDATracker::blocks.
Definition: QI.h:153

§ threads()

dim3 QI::threads ( )
inlineprivate

Same function as QueuedCUDATracker::threads.

Definition at line 155 of file QI.h.

155 { return dim3(numThreads); }
int numThreads
See QueuedCUDATracker::numThreads.
Definition: QI.h:150

Member Data Documentation

§ batchSize

int QI::batchSize
private

See QueuedCUDATracker::batchSize. Local copy.

Definition at line 149 of file QI.h.

§ numThreads

int QI::numThreads
private

See QueuedCUDATracker::numThreads.

Definition at line 150 of file QI.h.

§ params

QIParams QI::params
private

Structure with settings relevant to quadrant interpolation.

Definition at line 142 of file QI.h.

§ qi_FFT_length

int QI::qi_FFT_length
private

Parameter for length required for arrays going into FFT. 2 * radial steps.

Used to be nearest power of two, but since switching to cuFFT, this is not needed for speed optimization anymore.

Definition at line 148 of file QI.h.


The documentation for this class was generated from the following files: