c++ - Double-checking understanding of memory coalescing in CUDA


Suppose I define the following arrays, visible to the GPU:

double* doubleArr = createCudaDouble(fieldLen);
float* floatArr = createCudaFloat(fieldLen);
char* charArr = createCudaChar(fieldLen);

Now, I have the following CUDA thread:

void thread() {
    int o = getOffset(); // the same for all threads in the launch
    double d = doubleArr[threadIdx.x + o];
    float f = floatArr[threadIdx.x + o];
    char c = charArr[threadIdx.x + o];
}

I'm not quite sure whether I am interpreting the documentation correctly, and it is critical for my design: will the memory accesses for double, float, and char be nicely coalesced? (My guess: yes, they will fit into sizeof(type) * blockDim.x / (transaction size) transactions, plus maybe one extra transaction at the upper and lower boundary.)

Yes, for the cases you have shown, and assuming createCudaXXXXX translates into some kind of ordinary cudaMalloc-type operation, everything should nicely coalesce.
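For reference, here is a minimal sketch of what such a helper might look like if it really is just a cudaMalloc wrapper (the signature and error handling are my assumptions, not your actual code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: an ordinary linear allocation of fieldLen doubles in device memory.
double *createCudaDouble(size_t fieldLen)
{
    double *ptr = 0;
    cudaError_t err = cudaMalloc((void **)&ptr, fieldLen * sizeof(double));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    return ptr;
}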

If we have ordinary 1D device arrays allocated via cudaMalloc, then in general we should have good coalescing behavior across threads as long as our load pattern includes an array index of the form:

data_array[some_constant + threadIdx.x];

It really does not matter which data type the array holds - it will coalesce nicely.
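To put rough numbers on the guess in the question: assuming a 128-byte transaction size and, say, a 256-thread block, a fully aligned load works out to 256 * 8 = 2048 bytes = 16 transactions for double, 256 * 4 = 1024 bytes = 8 transactions for float, and 256 * 1 = 256 bytes = 2 transactions for char, plus possibly one extra transaction at either end if the starting offset is not aligned to a transaction boundary.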

However, from a performance perspective, global loads (assuming an L1 miss) will occur in a minimum 128-byte granularity. Therefore loading larger sizes per thread (say, int, float, double, float4, etc.) may give better performance. The caches tend to mitigate any difference, if the loads are spread across a large enough number of warps.
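To illustrate that last point (this is my own sketch, not anything from the question): if the char array is the limiting factor, one option is to have each thread load several chars at once, e.g. via the built-in char4 vector type:

// Each thread pulls 4 chars at once via the built-in char4 vector type, so a
// 32-thread warp requests 32 * 4 = 128 bytes per load instead of 32 bytes.
__global__ void sumChar4(const char4 *in, int *out, int n4)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n4) {
        char4 v = in[i];          // one 4-byte load per thread
        atomicAdd(out, v.x + v.y + v.z + v.w);
    }
}

Here in would be reinterpret_cast<const char4 *>(charArr) and n4 = fieldLen / 4; cudaMalloc allocations are aligned well beyond 4 bytes, so the cast is safe as long as fieldLen is a multiple of 4.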

It's also pretty easy to verify this on a particular piece of code with the profiler. There are many ways to do it depending on which profiler you choose, but for example with nvprof you can do:

nvprof --metrics gld_efficiency ./my_exe

and it will return an average percentage number that more or less reflects the percentage of optimal coalescing occurring on the global loads.
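As a concrete (and entirely hypothetical) example of something you could build with nvcc and feed to that command:

#include <cuda_runtime.h>

// The some_constant + threadIdx.x pattern, with a deliberately non-aligned
// constant so the profiler has something interesting to measure.
__global__ void copyWithOffset(const float *in, float *out, int offset, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i + offset < n)
        out[i] = in[i + offset];
}

int main()
{
    const int n = 1 << 20;
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    copyWithOffset<<<(n + 255) / 256, 256>>>(d_in, d_out, 3, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Compile it with nvcc and run the resulting executable under the nvprof command above to see the reported gld_efficiency.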

This is the presentation I cite for additional background info on memory optimization.

I suppose someone will come along and notice that this pattern:

data_array[some_constant + threadIdx.x];

roughly corresponds to the access type shown on slides 40-41 of the above presentation, and aha!! efficiency drops to 50%-80%. That is true, if only a single warp-load is being considered. However, referring to slide 40, we see that the "first" load will require two cachelines to be loaded. After that, however, additional loads (moving to the right, for simplicity) will only require one additional/new cacheline per warp-load (assuming the existence of an L1 or L2 cache, and reasonable locality, i.e. lack of thrashing). Therefore, over a reasonably large array (more than just 128 bytes), the average requirement will be one new cacheline per warp, which corresponds to 100% efficiency.
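To spell out the arithmetic: if each warp requests 128 bytes but the first warp-load touches 2 cachelines and every subsequent one touches only 1 new cacheline, then N warp-loads over a contiguous region touch N + 1 cachelines in total, so the average efficiency is roughly N / (N + 1), which approaches 100% for any reasonably large N.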

