java - Why does crawling a folder in Linux get faster with each iteration?


I have a program that needs to crawl a specific folder and delete its contents. I am using the Files.walkFileTree method to achieve this. On Ubuntu 14.04 (64-bit, 4 GB RAM) the program runs fine, but the first time it crawls a folder it takes a very long time. On subsequent crawls of the same folder the time decreases drastically and then settles down. I measured the phases with simple System.currentTimeMillis() calls; the timings are below.
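My deletion pass follows the standard Files.walkFileTree pattern, roughly like this (a simplified sketch, not my exact code; the DeleteTree class name and the "output" path are illustrative only):

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;

    public class DeleteTree {
        // Delete files on the way down, and the emptied directories on the way back up.
        static void deleteContents(Path root) throws IOException {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    Files.delete(file);
                    return FileVisitResult.CONTINUE;
                }
                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                    if (exc != null) throw exc;
                    if (!dir.equals(root)) Files.delete(dir); // keep the top-level folder itself
                    return FileVisitResult.CONTINUE;
                }
            });
        }

        public static void main(String[] args) throws IOException {
            deleteContents(Paths.get("output")); // "output" is the folder described below
        }
    }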

Key:

  • Deletion is the time taken to crawl the folder called 'output' and delete its contents (around ~5000 files).
  • Copy is the time taken to copy the contents of a small folder over into its place.
  • Crawl is the time taken to crawl the directory called 'content' and parse the ~5000 files there into corresponding objects, plus the deletion and copy above.
  • Parse is the time taken to write the contents of those ~5000 objects to separate files.

All times are in milliseconds.

Folder A:

    Run      Deletion   Copy    Crawl   Parse
    First     100100      53   143244    4307
    Second       486       3     1424    4581
    Third        567      16     1999    4027

Folder B:

    Run      Deletion   Copy    Crawl   Parse
    First      88971      47   137623    4125
    Second       443      31     1631    3986
    Third        434       4     1648    4048

Note that the crawl time includes the time taken by the deletion and the copy. All three runs were performed on the same folder, with the same contents, minutes apart from each other, and the same pattern appears with each new folder I try the program on. Is the issue with the file system, or with my code?

The code I use for the benchmark is here:

public void publish(boolean fastParse) throws IOException, InterruptedException {
    InfoHandler info = new InfoHandler();

    long start = System.currentTimeMillis();
    if (fastParse) {
        crawler.fastReadIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    else {
        crawler.readIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    long read = System.currentTimeMillis();   // "crawl" (incl. deletion + copy) ends here

    info.findLatestPosts(fileQueue);
    //info.findNavigationPages(fileQueue);

    long ps = System.currentTimeMillis();
    parser = new Parser();
    parser.parse(fileQueue);
    info.writeInfoFile();
    long pe = System.currentTimeMillis();     // "parse" ends here

    System.out.println("Crawl: " + (read - start));
    System.out.println("Parse: " + (pe - ps));
}
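(Aside: I know System.currentTimeMillis() tracks the wall clock and can jump if the clock is adjusted; a System.nanoTime() variant of the same timing would look like this, reusing the crawler and DirectoryCrawler names from the snippet above:)

    // TimeUnit is java.util.concurrent.TimeUnit
    long begin = System.nanoTime();
    crawler.readIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    long crawlMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
    System.out.println("Crawl: " + crawlMs + " ms");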

And the class that defines the readIntoQueue method can be found here: https://github.com/pawandubey/griffin/blob/master/src/main/java/com/pawandubey/griffin/DirectoryCrawler.java

It's because Linux caches file data pretty aggressively (the page cache, plus the dentry and inode caches). It follows the principle that free RAM is wasted RAM. Don't worry: once your app needs more RAM, Linux drops the cache.

You can see this by running free on your system, like:

$ free -h
             total       used       free     shared    buffers     cached
Mem:          2,9G       2,7G       113M        55M        92K       1,0G
-/+ buffers/cache:       1,8G       1,1G
Swap:         1,9G       100M       1,8G

In the first row, the «used» column (2,7G) is the amount of memory in use by the whole system, including the cache. In the second row, the same column is the amount of memory used by applications alone. You can think of the difference between the two rows as effectively free memory, because, as mentioned, the system drops the cache as soon as an application needs the space.
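If you want to verify that the cache is responsible, you can drop it between runs (this needs root) and the first-run timings should reappear:

$ sync                                          # flush dirty pages to disk first
$ echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes

After that, the next crawl of the same folder should again be about as slow as the first one.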

You may read more about this here.

