C++ - Same operations taking different time


I am in the process of optimizing the code for my n-body simulator, and when profiling the code, I have seen this:

[Image: profiler output showing the disassembly inline with the code]

These 2 lines,

float diffx = (pnode->centerofmassx - pbody->posx);
float diffy = (pnode->centerofmassy - pbody->posy);

where pnode is a pointer to an object of type node that I have defined, which contains (among other things) 2 floats, centerofmassx and centerofmassy,

and where pbody is a pointer to an object of type body that I have defined, which contains (among other things) 2 floats, posx and posy,


should take the same amount of time, but they do not. In fact, the first line accounts for 0.46% of the function samples, while the second accounts for 5.20%.

Now I can see that the second line has 3 instructions, and the first has just one.

My question is: why do these two lines, which seemingly do the same thing, behave differently in practice?

As stated, the profiler is listing 1 assembly instruction with the first line and 3 with the second line. However, because the optimizer can move code around a lot, this isn't very meaningful. It looks like the code has been optimized to load all of the values into registers first, and then perform the subtractions. So it performs an action from the first line, then an action from the second line (the loads), followed by an action from the first line and an action from the second line (the subtractions). Since this is difficult to represent, the profiler just makes a best approximation of which line corresponds to which assembly code when displaying the disassembly inline with the code.

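Take note that the first load is executed and may still be in the CPU pipeline when the next load instruction is executing. The second load has no dependence on the registers used in the first load. However, the first subtraction does. This instruction requires the previous load instruction to be far enough along in the pipeline that its result can be used as one of the operands of the subtraction. This will likely cause a stall in the CPU while the pipeline lets the load finish. The time spent waiting is then attributed to the subtraction, and so to whichever source line the profiler has matched it with.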

All of this really reinforces the concept of memory optimizations being more important on modern CPUs than CPU optimizations. If, for example, you had loaded the required values into registers 15 instructions earlier, the subtractions might have occurred much quicker.
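To make the "load early, use later" idea concrete, here is a minimal sketch. The node and body structs are hypothetical stand-ins containing only the four floats from the question, and diff and center_offset are names made up for illustration. Keep in mind that the optimizer is free to reorder these statements, so the source order is only a hint; check the disassembly to see what was actually generated.

```cpp
#include <cassert>

// Hypothetical stand-ins for the question's types (the real ones hold more).
struct node { float centerofmassx; float centerofmassy; };
struct body { float posx; float posy; };

// Hypothetical result type, used only for this illustration.
struct diff { float x; float y; };

diff center_offset(const node* pnode, const body* pbody) {
    assert(pnode != nullptr && pbody != nullptr);

    // The four loads do not depend on each other, so they can overlap
    // in the pipeline.
    const float cx = pnode->centerofmassx;
    const float cy = pnode->centerofmassy;
    const float px = pbody->posx;
    const float py = pbody->posy;

    // Work that does not need cx, cy, px or py could go here, giving the
    // loads time to complete.

    // Each subtraction has to wait until both of its operands have arrived.
    return diff{cx - px, cy - py};
}
```

Called as center_offset(pnode, pbody), this computes the same two values as the original lines; only the point at which the loads are issued relative to their use is different.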

Generally, the best thing you can do for optimization is to keep the cache fresh with the memory you are going to be using, making sure it gets loaded as early as possible, and not right before the memory is needed. Beyond that, optimizations get complicated.
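One way to ask for memory ahead of time is a software prefetch hint. The sketch below is only an illustration under several assumptions: prefetch_hint and sum_x_offsets are made-up names, body is a reduced stand-in for the question's type, the lookahead distance of 8 is a guess that would need tuning with a profiler, and __builtin_prefetch is a GCC/Clang builtin (the wrapper compiles to a no-op elsewhere).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the question's body type.
struct body { float posx; float posy; };

// Uses the GCC/Clang builtin where available, otherwise does nothing.
inline void prefetch_hint(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);
#else
    (void)p;
#endif
}

// Sums (centerx - posx) over all bodies, hinting at future elements.
float sum_x_offsets(const body* bodies, std::size_t count, float centerx) {
    assert(bodies != nullptr || count == 0);
    constexpr std::size_t lookahead = 8;  // assumed distance, tune by profiling
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        // Ask for an element we will need several iterations from now.
        if (i + lookahead < count) {
            prefetch_hint(&bodies[i + lookahead]);
        }
        total += centerx - bodies[i].posx;
    }
    return total;
}
```

For a plain linear array like this one, the hardware prefetcher usually does the job already, so the hint may gain nothing. It tends to matter more for pointer-based structures such as a tree of nodes, where the next address is not predictable. Measure before and after.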

Of course, all of this is further complicated by modern CPUs that might look ahead 40-60 instructions for out-of-order execution.

To optimize this further, you might consider using a library that does vector and matrix operations in an optimized manner. Using one of these libraries, it might be possible to use 2 vector instructions instead of 4 scalar instructions.
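As a rough sketch of what that could look like, the code below stores each x/y pair as one small vector value and subtracts both components in a single expression. The vec2 type is a hypothetical stand-in for the vector type a real math library would provide, and the node and body layouts are assumptions, not the question's actual definitions. Whether the compiler emits packed SIMD instructions for this depends on the compiler, flags and target; a dedicated library typically uses intrinsics to make sure it does.

```cpp
#include <cassert>

// Hypothetical stand-in for a 2-component vector type from a math library.
struct vec2 {
    float x;
    float y;
};

// Component-wise subtraction of both lanes in one expression.
inline vec2 operator-(vec2 a, vec2 b) {
    return vec2{a.x - b.x, a.y - b.y};
}

// Assumed layouts: the x/y pairs are stored together as one vec2.
struct node { vec2 centerofmass; };
struct body { vec2 pos; };

inline vec2 center_offset(const node* pnode, const body* pbody) {
    assert(pnode != nullptr && pbody != nullptr);
    return pnode->centerofmass - pbody->pos;
}
```

With this layout, the result's x and y members take the place of diffx and diffy from the question.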

