How do I find duplicates within two columns in a CSV file, then combine them in Python?


I am working with a large dataset of protein-protein interactions, which I have in a .csv file. The first two columns are the interacting proteins, and the order does not matter (i.e. A/B is the same as B/A, so they are duplicates). There is a third column with the source where these interactions were published. Duplicate pairs can come from the same source or from different sources.

For duplicates from two or more sources, how can I combine them so that the third column lists all of the sources for one interaction? (For interaction A/B, the duplicates would be A/B and B/A.)

Here is an example of the columns:

    Interactor A         Interactor B              Source
    A                    B                         Mary (2005)
    C                    D                         John (2004)
    B                    A                         Mary (2005)
    A                    B                         Steve (1993)
    D                    C                         Steve (1993)

In this case, I need:

    Interactor A         Interactor B              Source
    A                    B                         Mary (2005), Steve (1993)
    C                    D                         John (2004), Steve (1993)

Thanks!

You can aggregate them using a sorted tuple as the dictionary key. Sorting makes A, B and B, A equivalent, and a tuple can be used as a dictionary key because it is immutable and hashable, whereas a list is not. Use a set to store the aggregated values and avoid duplicates.
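To illustrate the key normalization described above, here is a minimal sketch. The helper name `pair_key` is hypothetical and only used for illustration; the sample values come from the question.

```python
def pair_key(a, b):
    """Return an order-independent, hashable key for a pair of names."""
    return tuple(sorted((a, b)))

# Both orderings collapse to the same key.
assert pair_key("A", "B") == pair_key("B", "A") == ("A", "B")

# A set ignores a source that is added twice for the same pair.
sources = {}
sources.setdefault(pair_key("A", "B"), set()).add("Mary (2005)")
sources.setdefault(pair_key("B", "A"), set()).add("Mary (2005)")
sources.setdefault(pair_key("A", "B"), set()).add("Steve (1993)")

print(sorted(sources[("A", "B")]))  # ['Mary (2005)', 'Steve (1993)']
```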

I'd also throw in a defaultdict to make it nicer to aggregate the values:

    from collections import defaultdict
    import csv

    # ... read values using csv reader (assuming the name csv_reader)

    result = defaultdict(set)
    for row in csv_reader:
        # create the same key for `a, b` and `b, a`
        key = tuple(sorted([row[0], row[1]]))
        # a set keeps each source only once per pair
        result[key].add(row[2])

    # result should now contain the aggregated values
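For completeness, here is one way the whole round trip might look, from reading the rows to writing the combined output. This is a sketch under a few assumptions that are not in the original post: the file is comma-separated, its first row is a header, and the function name `combine_sources` is hypothetical. The sample data from the question is fed in through `io.StringIO` so the example runs without a file on disk.

```python
import csv
import io
from collections import defaultdict

# Sample rows from the question, assumed to be comma-separated with a header.
SAMPLE = """Interactor A,Interactor B,Source
A,B,Mary (2005)
C,D,John (2004)
B,A,Mary (2005)
A,B,Steve (1993)
D,C,Steve (1993)
"""

def combine_sources(lines):
    """Group rows by unordered protein pair and collect the distinct sources."""
    reader = csv.reader(lines)
    header = next(reader)  # assumption: the first row is a header
    result = defaultdict(set)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        key = tuple(sorted((row[0].strip(), row[1].strip())))
        result[key].add(row[2].strip())
    return header, result

header, result = combine_sources(io.StringIO(SAMPLE))

# Write one row per unique pair, with the sources joined into one field.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
for (a, b), sources in sorted(result.items()):
    writer.writerow([a, b, ", ".join(sorted(sources))])

print(out.getvalue())
```

With a real file, the same function can be given an open file object instead, for example `open(path, newline="")`, where `path` stands for your own file name. Because the joined sources contain commas, `csv.writer` quotes that field in the output.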
