hadoop - Downloading a list of files in parallel in Apache Pig -


I have a simple text file containing a list of folders on an FTP server, one folder per line. Each folder contains a couple of thousand images. I want to connect to each folder, store the files inside the folder in a SequenceFile, and then remove the folder from the FTP server. I have written a simple Pig UDF for this. Here it is:

dirs = load '/var/location.txt' using PigStorage();
results = foreach dirs generate download_whole_folder_into_single_sequence_file($0);
/* we don't need the results bag; it's just a dummy */
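
The UDF itself does roughly the following (a sketch rather than the real code: the FTP host, credentials, and output path are placeholders, and I'm assuming Apache Commons Net for FTP and the classic Hadoop SequenceFile writer API):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class download_whole_folder_into_single_sequence_file extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String folder = (String) input.get(0);

        // Placeholders: the real host and credentials would come from configuration.
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");
        ftp.login("user", "password");
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/sequence_files/" + folder.replace('/', '_') + ".seq");
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (FTPFile file : ftp.listFiles(folder)) {
                if (!file.isFile()) {
                    continue;
                }
                String remote = folder + "/" + file.getName();
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                ftp.retrieveFile(remote, buffer);
                // Key: file name; value: raw image bytes.
                writer.append(new Text(file.getName()),
                              new BytesWritable(buffer.toByteArray()));
                ftp.deleteFile(remote);
            }
            ftp.removeDirectory(folder); // succeeds once the folder is empty
        } finally {
            writer.close();
            ftp.logout();
            ftp.disconnect();
        }
        return out.toString();
    }
}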

The problem is I'm not sure whether each line of the input is processed in a separate mapper. The input file is not huge, just a couple of hundred lines. In pure map/reduce I would use NLineInputFormat and process each line in a separate mapper. How can I achieve the same thing in Pig?

Pig lets you write your own load functions, which let you specify the InputFormat you'll be using. So you could write your own.
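
A minimal sketch of such a loader, assuming the new-API NLineInputFormat (org.apache.hadoop.mapreduce.lib.input) is available in your Hadoop version; the class name NLineLoader is made up:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class NLineLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
        // One line per split, so each folder gets its own mapper.
        NLineInputFormat.setNumLinesPerSplit(job, 1);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new NLineInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // no more lines in this split
            }
            // NLineInputFormat yields (LongWritable offset, Text line) pairs;
            // expose each line as a single-field tuple.
            Text line = (Text) reader.getCurrentValue();
            return tupleFactory.newTuple(line.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Your script would then become:

dirs = load '/var/location.txt' using NLineLoader();

and each folder line would be handed to its own map task.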

That said, the job you describe sounds like it would only involve a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest doing it in vanilla map-reduce. If the total file size is gigabytes or less, I'd do it directly on a single host. It's simpler not to use map-reduce if you don't have to.
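
Sketched out, the vanilla map-only version could look like this (class names are made up, I'm assuming the new mapreduce API, and the mapper body would reuse the same FTP/SequenceFile logic as the UDF above):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FolderDownloadJob {

    // Each map task receives exactly one line (one FTP folder), because
    // NLineInputFormat is configured with one line per split below.
    public static class DownloadMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String folder = line.toString().trim();
            // The FTP download / SequenceFile write / folder delete logic
            // from the UDF sketch above would go here.
            context.write(new Text(folder), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "download ftp folders");
        job.setJarByClass(FolderDownloadJob.class);
        job.setMapperClass(DownloadMapper.class);
        job.setNumReduceTasks(0); // map-only: no shuffle or reduce needed

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        NLineInputFormat.addInputPath(job, new Path(args[0]));

        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}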

I typically use map-reduce to first load the data into HDFS, and then Pig for all the data processing. Pig doesn't add any benefits over vanilla Hadoop for loading data, IMO; it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus, it's technically possible for a Pig loader to be called multiple times. That's a gotcha you don't need to worry about when using Hadoop map-reduce directly.

