hadoop - Downloading a list of files in parallel in Apache Pig
I have a simple text file containing a list of folders on FTP servers, one folder per line. Each folder contains a couple of thousand images. I want to connect to each folder, store the files inside that folder in a SequenceFile, and then remove the folder from the FTP server. I have written a simple Pig UDF for this. Here it is:
dirs = LOAD '/var/location.txt' USING PigStorage();
results = FOREACH dirs GENERATE download_whole_folder_into_single_sequence_file($0);
/* I don't need the results bag. It is just a dummy bag. */
The problem is that I'm not sure whether each line of the input is processed in a separate mapper. The input file is not huge, just a couple of hundred lines. If this were pure map/reduce, I would use NLineInputFormat and process each line in a separate mapper. How can I achieve the same thing in Pig?
Pig lets you write your own load functions, which let you specify the InputFormat you'll be using. So you could write your own.
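For illustration, here is a minimal sketch of such a loader, assuming Pig's LoadFunc API and Hadoop's NLineInputFormat; the class name NLineLoader is hypothetical:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical loader: feeds each input line to its own mapper
// by delegating splitting to NLineInputFormat.
public class NLineLoader extends LoadFunc {
    private RecordReader<LongWritable, Text> reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() {
        // NLineInputFormat builds one split per N lines of input.
        return new NLineInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
        // One line per split, so one folder per mapper.
        NLineInputFormat.setNumLinesPerSplit(job, 1);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of this split
            }
            // Emit the line (a folder path) as a single-field tuple.
            return tupleFactory.newTuple(reader.getCurrentValue().toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

You would then REGISTER the jar containing this class and load with it instead of PigStorage:

dirs = LOAD '/var/location.txt' USING NLineLoader();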
That said, the job you describe sounds like it involves only a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest doing it in vanilla map-reduce. If the total file size is gigabytes or less, I'd do it directly on a single host. It's simpler not to use map-reduce if you don't have to.
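If you do go the vanilla map-reduce route, a minimal sketch of a map-only driver follows, assuming the Hadoop 2.x mapreduce API; the names FolderDownloadJob, DownloadMapper, downloadFolderToSequenceFile, and the output path are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderDownloadJob {

    // Each map() call receives exactly one input line, i.e. one FTP folder.
    public static class DownloadMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text folderUrl, Context ctx) {
            // downloadFolderToSequenceFile(folderUrl.toString()); // your download logic
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ftp-folder-download");
        job.setJarByClass(FolderDownloadJob.class);
        job.setMapperClass(DownloadMapper.class);
        job.setNumReduceTasks(0);                     // map-only job
        job.setInputFormatClass(NLineInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/var/location.txt"));
        NLineInputFormat.setNumLinesPerSplit(job, 1); // one folder per mapper
        FileOutputFormat.setOutputPath(job, new Path("/tmp/download-out")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}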
I typically use map-reduce to first load the data into HDFS, and then Pig for all the data processing. Pig doesn't add any benefits over vanilla Hadoop for loading data, IMO; it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus, it's technically possible for a Pig loader to be called multiple times. That's a gotcha you don't need to worry about when using Hadoop map-reduce directly.