Bioinformatics: populating a matrix from a list, where each vector in the list may have 1 to 7 elements [R]
Say I have ';'-separated information in a vector that I want to split apart using strsplit. The data contains information that looks like this:
[1] "k__fungi; p__ascomycota; c__eurotiomycetes; o__unidentified; f__unidentified; g__unidentified; s__eurotiomycetes sp"
[2] "k__fungi; p__basidiomycota; c__agaricomycetes; o__agaricales; f__mycenaceae; g__unidentified; s__mycenaceae sp"
[3] "k__fungi; p__ascomycota"
[4] "none"
[5] "k__fungi; p__glomeromycota; c__glomeromycetes; o__glomerales; f__glomeraceae; g__glomus; s__glomus macrocarpum"
[6] "k__fungi; p__basidiomycota; c__agaricomycetes; o__agaricales; f__inocybaceae; g__inocybe"
I use strsplit to separate out the information like this:
list<- strsplit(data,split=";")
The output of this is:
[[1]]
[1] "k__fungi" " p__ascomycota" " c__eurotiomycetes" " o__unidentified" " f__unidentified" " g__unidentified" " s__eurotiomycetes sp"

[[2]]
[1] "k__fungi" " p__basidiomycota" " c__agaricomycetes" " o__agaricales" " f__mycenaceae" " g__unidentified" " s__mycenaceae sp"

[[3]]
[1] "k__fungi" " p__ascomycota"

[[4]]
[1] "none"

[[5]]
[1] "k__fungi" " p__glomeromycota" " c__glomeromycetes" " o__glomerales" " f__glomeraceae" " g__glomus" " s__glomus macrocarpum"

[[6]]
[1] "k__fungi" " p__basidiomycota" " c__agaricomycetes" " o__agaricales" " f__inocybaceae" " g__inocybe"
I want to push this information into a matrix with as many rows as the original data object has elements, and with 7 named columns. I generate the empty matrix like this:
out <- matrix(nrow=length(data), ncol=7)
colnames(out) <- c("kingdom","phylum","class","order","family","genus","species")
The empty matrix ends up looking like this:
     kingdom phylum class order family genus species
[1,] NA      NA     NA    NA    NA     NA    NA
[2,] NA      NA     NA    NA    NA     NA    NA
[3,] NA      NA     NA    NA    NA     NA    NA
[4,] NA      NA     NA    NA    NA     NA    NA
[5,] NA      NA     NA    NA    NA     NA    NA
[6,] NA      NA     NA    NA    NA     NA    NA
I want to insert the information from the list into the matrix, such that if the first vector in the list has 7 elements, all 7 columns in row 1 have entries. However, if a vector in the list has only 2 elements, the first 2 columns in that matrix row have entries, and the rest remain NA values.
Note: I am intentionally avoiding for loops. I had a loop solution, but it fails when I scale the data set up to 100,000 lines.
You may try stri_list2matrix from the stringi package:
library(stringi)
m1 <- stri_list2matrix(list, byrow=TRUE)
colnames(m1) <- c("kingdom","phylum","class","order","family","genus","species")
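If stringi is not available, the same padding can be sketched in base R. This is a minimal sketch and not part of the original answer: it relies on the fact that indexing a vector past its end returns NA, and it assumes no entry has more than 7 fields. The names `ranks`, `parts`, and `padded` are illustrative, and the sample input is a shortened copy of the data in the question. It also trims the leading space that splitting on ";" leaves behind.

```r
# Sample input copied from the question (shortened to three entries).
data <- c(
  "k__fungi; p__ascomycota; c__eurotiomycetes; o__unidentified; f__unidentified; g__unidentified; s__eurotiomycetes sp",
  "k__fungi; p__ascomycota",
  "none"
)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Split on ";" and strip the leading space that the split leaves on each piece.
parts <- lapply(strsplit(data, split = ";", fixed = TRUE), trimws)

# Indexing past the end of a vector yields NA, so v[1:7] pads short vectors.
padded <- vapply(parts, function(v) v[seq_along(ranks)], character(length(ranks)))

# vapply returns one column per input element; transpose to get one row each.
out <- t(padded)
colnames(out) <- ranks
print(out)
```

There is no explicit for loop here, although vapply still iterates over the list internally, so how well it scales to 100,000 lines would need to be measured rather than assumed.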
Or, instead of using strsplit, you can read the data directly with read.table:
read.table(text=data, sep=";", fill=TRUE, stringsAsFactors=FALSE, na.strings='')
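A slightly fuller version of the read.table call might look like the sketch below. The extra arguments are my additions, not part of the original answer: `col.names` fixes the result at 7 named columns, `strip.white` drops the space after each ";", and `quote = ""` and `comment.char = ""` are a precaution against stray quote or "#" characters in taxon names. The sample input is a shortened copy of the data in the question.

```r
# Sample input copied from the question (shortened to three entries).
data <- c(
  "k__fungi; p__ascomycota; c__eurotiomycetes; o__unidentified; f__unidentified; g__unidentified; s__eurotiomycetes sp",
  "k__fungi; p__ascomycota",
  "none"
)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")

# fill = TRUE pads short rows with blank fields; na.strings = "" turns those
# blanks into NA. col.names sets the number of columns to length(ranks).
df <- read.table(text = data, sep = ";", header = FALSE, fill = TRUE,
                 col.names = ranks, strip.white = TRUE,
                 stringsAsFactors = FALSE, na.strings = "",
                 quote = "", comment.char = "")

# The result is a data.frame; convert it if a character matrix is required.
out <- as.matrix(df)
print(out)
```

Note that a column with no values at all in any row would be read as logical NA rather than character, which matters only if the column types are checked afterwards.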
Or, using the devel version of data.table:
library(data.table) # v1.9.5+
setDT(list(data))[, tstrsplit(V1, '; ')]