How do I find duplicates within two columns in a csv file, then combine them in Python?
I am working with a large dataset of protein-protein interactions, which I have in a .csv file. The first two columns are the interacting proteins, and their order does not matter (i.e. a/b is the same as b/a, so those are duplicates). There is a third column, the source where the interaction was published. Duplicate pairs can come from the same source or from different sources.
For duplicates with two or more sources, how can I combine them, so that the third column lists all of the sources for one interaction? (I.e. for interaction a/b, the duplicates are a/b and b/a.)
here example of columns:
interactor a    interactor b    source
a               b               mary (2005)
c               d               john (2004)
b               a               mary (2005)
a               b               steve (1993)
d               c               steve (1993)
In this case, I need:
interactor a    interactor b    source
a               b               mary (2005), steve (1993)
c               d               john (2004), steve (1993)
thanks!
You can aggregate them using a sorted tuple as the dictionary key (to make (a, b) and (b, a) equivalent; tuples can be used as dictionary keys, since they are immutable and hashable - lists are not). Use a set to store the aggregated values and avoid duplicates. I'd throw in a defaultdict to make it nicer to aggregate the values:
from collections import defaultdict
import csv

# ... read values using a csv reader (assuming it is named csv_reader)
result = defaultdict(set)
for row in csv_reader:
    # create the same key for (a, b) and (b, a)
    key = tuple(sorted([row[0], row[1]]))
    result[key].add(row[2])
# result should now contain the aggregated values
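To make this concrete, here is a self-contained sketch using the sample rows from the question (the inline CSV data and the printing step are my additions for illustration; in practice you would pass a file object to csv.reader instead of io.StringIO):

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for the .csv file, using the question's sample rows
data = """a,b,mary (2005)
c,d,john (2004)
b,a,mary (2005)
a,b,steve (1993)
d,c,steve (1993)
"""

result = defaultdict(set)
for row in csv.reader(io.StringIO(data)):
    # sorting the pair makes a/b and b/a map to the same key
    key = tuple(sorted(row[:2]))
    result[key].add(row[2])

# print one combined row per interaction
for (p1, p2), sources in sorted(result.items()):
    print(p1, p2, ", ".join(sorted(sources)))
# a b mary (2005), steve (1993)
# c d john (2004), steve (1993)
```

Since sets are unordered, the sources are sorted here just to get stable output; you could also keep a list per key if the original source order matters.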