Double-checking my understanding of memory coalescing in CUDA
Suppose I define the following arrays, visible to the GPU:

    double* doubleArr = createCudaDouble(fieldLen);
    float* floatArr = createCudaFloat(fieldLen);
    char* charArr = createCudaChar(fieldLen);
Now, I have the following CUDA kernel:

    __global__ void thread() {
        int o = getOffset(); // same for all threads in the launch
        double d = doubleArr[threadIdx.x + o];
        float f = floatArr[threadIdx.x + o];
        char c = charArr[threadIdx.x + o];
    }
I'm not quite sure whether I am interpreting the documentation correctly, and it is critical for my design: will the memory accesses for double, float and char be nicely coalesced? (My guess: yes, they will fit into sizeof(type) * blockSize.x / (transaction size) transactions, plus maybe one extra transaction at the upper and lower boundary.)
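To make that guess concrete, here is the arithmetic for one example configuration; the block size and transaction size below are numbers I picked for illustration, not part of the question:

    // Assuming blockDim.x = 256 and 128-byte transactions:
    //   double: 8 B * 256 / 128 B = 16 transactions
    //   float : 4 B * 256 / 128 B =  8 transactions
    //   char  : 1 B * 256 / 128 B =  2 transactions
    // plus possibly one extra transaction at each end if the accessed
    // range is not aligned to the transaction size.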
Yes, for all the cases you have shown, and assuming createCudaXXXXX translates into some kind of ordinary cudaMalloc-type operation, everything should nicely coalesce.
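For reference, here is a minimal sketch of what I am assuming a createCudaXXXXX helper looks like; this wrapper is my assumption, not code from the question:

    #include <cuda_runtime.h>

    // Assumed helper: a thin wrapper around an ordinary cudaMalloc of a
    // contiguous 1D device array (cudaMalloc returns suitably aligned memory).
    double* createCudaDouble(size_t fieldLen) {
        double* p = nullptr;
        cudaMalloc(&p, fieldLen * sizeof(double));
        return p;
    }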
If we have ordinary 1D device arrays allocated via cudaMalloc, then in general we should have good coalescing behavior across threads if our load pattern includes an array index of the form:

    data_array[some_constant + threadIdx.x];

It does not matter what the data type of the array is - it will coalesce nicely.
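As an illustration (this is my own minimal example, not code from the question), here is a complete kernel plus launch that uses exactly that index pattern on a 1D array allocated with cudaMalloc:

    #include <cuda_runtime.h>

    // Each thread loads the element at (block offset + threadIdx.x), so the 32
    // loads of a warp hit consecutive addresses and coalesce into a minimal
    // number of transactions, regardless of the element type.
    __global__ void copyKernel(const float* in, float* out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // some_constant + threadIdx.x
        if (idx < n) out[idx] = in[idx];
    }

    int main() {
        const int n = 1 << 20;
        float *d_in = nullptr, *d_out = nullptr;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }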
However, from a performance perspective, global loads (assuming an L1 miss) will occur in a minimum 128-byte granularity. Therefore loading larger sizes per thread (say, int, float, double, float4, etc.) may give better performance. The caches tend to mitigate any difference, if the loads are spread across a large enough number of warps.
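A sketch of what "loading larger sizes per thread" can look like in practice; this comparison is my own example, and the vectorized variant assumes the buffer length is a multiple of 4 and the pointer is 16-byte aligned (cudaMalloc allocations are):

    // Scalar version: each thread loads 4 bytes, so a warp covers 128 bytes per load.
    __global__ void scaleScalar(float* data, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    // Vectorized version: each thread loads a 16-byte float4, so a warp covers
    // 512 bytes per load instruction, issuing fewer, wider requests.
    __global__ void scaleVec4(float4* data, float s, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = data[i];
            v.x *= s; v.y *= s; v.z *= s; v.w *= s;
            data[i] = v;
        }
    }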
It's also pretty easy to verify this on any particular piece of code with a profiler. There are many ways to do it depending on which profiler you choose, but for example with nvprof you can do:

    nvprof --metrics gld_efficiency ./my_exe

and it will return an average percentage number that more or less reflects the percentage of optimal coalescing occurring on the global loads.
This is the presentation I cite for additional background info on memory optimization.
I suppose someone will come along and notice that the pattern:

    data_array[some_constant + threadIdx.x];

roughly corresponds to the access type shown on slides 40-41 of the above presentation, and think "aha!! the efficiency drops to 50%-80%." That is true, if only a single warp-load is being considered. However, referring to slide 40, we see that the "first" load will require two cachelines to be loaded. After that, though, additional loads (moving to the right, for simplicity) will only require one additional/new cacheline per warp-load (assuming the existence of an L1 or L2 cache, and reasonable locality, i.e. a lack of thrashing). Therefore, over a reasonably large array (much more than 128 bytes), the average requirement will be one new cacheline per warp, which corresponds to 100% efficiency.
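A quick back-of-the-envelope check of that argument, using numbers I have assumed for illustration (32-thread warps loading 4-byte elements, 128-byte cachelines, and an offset that is not 128-byte aligned):

    // The first warp-load touches 2 cachelines; each subsequent warp-load
    // (marching to the right) adds only 1 new cacheline, since its other
    // cacheline was already fetched by the previous warp-load.
    // For W consecutive warp-loads:
    //   cachelines fetched = W + 1
    //   efficiency = (W * 128 useful bytes) / ((W + 1) * 128 fetched bytes)
    //              = W / (W + 1)
    //   W = 1: 50%   W = 4: 80%   W = 32: ~97%   W = 1024: ~99.9%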