Comparing a gold standard CSV file and extracted-values CSV files in Python
This is a data mining task for automating the scoring of extraction quality. The gold standard CSV might consist of fields like this:
golden_standard.csv
| id | description             | amount  | date       |
|----|-------------------------|---------|------------|
| 1  | some description.       | $150.54 | 12/12/2012 |
| 2  | some other description. | $200    | 10/10/2015 |
| 3  | other description.      | $25     | 11/11/2014 |
| 4  | my description          | $11.35  | 01/01/2015 |
| 5  | your description.       | $20     | 03/03/2013 |
Now, say there are a few possible extraction result files:
extract1.csv
| id | description             | date       |
|----|-------------------------|------------|
| 1  | some description.       | 12/12/2012 |
| 2  | some other description. | 10/10/2015 |
| 3  | other description.      | 11/11/2014 |
| 4  | 122333222233332221      | 11/11/2014 |
| 5  | your description.       | 03/03/2013 |
extract2.csv
| id | description             | amount  | date       |
|----|-------------------------|---------|------------|
| 1  | some description.       | $150.54 | 12/12/2012 |
| 2  | some other description. | $200    | 10/10/2015 |
| -  | ----------------------- | ------  | ---------- |
| 5  | your description.       | $20     | 03/03/2013 |
extract3.csv
| garbage | more garbage |
| garbage | more garbage |
I want the program to report that extract 1 is missing a column and that the values in column 2 do not match.
For the second case, it should report the missing entries and the mismatched rows.
In the last case the resulting CSV is completely screwed up, but I still want the program to detect the meaningful aberration.
Is there a quick, clever way to do this kind of comparison in Python?
I have a regular, longish row-by-row and column-by-column iterative way that I could post here, but I'm thinking there might be a quicker, more elegant, more Pythonic way to do this.
Any help is appreciated.
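For reference, here is a stripped-down sketch of the kind of row-by-row, column-by-column comparison I mean (the function name and the reporting format are just illustrative):

```python
import csv

def naive_compare(extract_path, gold_path='golden_standard.csv'):
    with open(gold_path, newline='') as f:
        gold_rows = list(csv.DictReader(f))
    with open(extract_path, newline='') as f:
        ext_rows = list(csv.DictReader(f))

    gold_cols = list(gold_rows[0].keys()) if gold_rows else []
    ext_cols = list(ext_rows[0].keys()) if ext_rows else []

    # Report columns present in the gold standard but missing from the extract.
    for col in gold_cols:
        if col not in ext_cols:
            print("missing column:", col)

    # Index extract rows by id, then compare the shared columns cell by cell.
    ext_by_id = {row.get('id'): row for row in ext_rows}
    for gold_row in gold_rows:
        ext_row = ext_by_id.get(gold_row['id'])
        if ext_row is None:
            print("missing row, id =", gold_row['id'])
            continue
        for col in gold_cols:
            if col in ext_cols and gold_row[col] != ext_row[col]:
                print("id {}: column '{}' does not match ({!r} vs {!r})".format(
                    gold_row['id'], col, gold_row[col], ext_row[col]))
```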
Disclaimer: this approach uses the pandas library.
First, the data set-up.
gold_std.csv
```
id,description,amount,date
1,some description.,$150.54,12/12/2012
2,some other description.,$200,10/10/2015
3,other description.,$25,11/11/2014
4,my description,$11.35,01/01/2015
5,your description.,$20,03/03/2013
```
extract1.csv
```
id,description,date
1,some description.,12/12/2012
2,some other description.,10/10/2015
3,other description.,11/11/2014
4,122333222233332221,11/11/2014
5,your description.,03/03/2013
```
extract2.csv
```
id,description,amount,date
1,some description.,$150.54,12/12/2012
2,some other description.,$200,10/10/2015
3,other description.,$25,11/11/2014
5,your description.,$20,03/03/2013
```
Second, the code.
```python
import pandas as pd

def compare_extract(extract_name, reference='gold_std.csv'):
    gold = pd.read_csv(reference)
    ext = pd.read_csv(extract_name)
    gc = set(gold.columns)
    header = ext.columns
    extc = set(header)
    if gc != extc:
        missing = ", ".join(list(gc - extc))
        print("The extract has the following missing columns: {}".format(missing))
    else:
        print("The extract has the same columns as the standard. Checking for aberrant rows...")
        gold_list = gold.values.tolist()
        ext_list = ext.values.tolist()
        # Non-pandaic approach: since there may be no matching ids, we rely
        # on set operations instead. A bit hackish, actually.
        diff = list(set(map(tuple, gold_list)) - set(map(tuple, ext_list)))
        df = pd.DataFrame(diff, columns=header)
        print("The following rows are not in the extract:")
        print(df)
```
Third, the test runs.
```python
e1 = 'extract1.csv'
compare_extract(e1)
# The extract has the following missing columns: amount

e2 = 'extract2.csv'
compare_extract(e2)
# The extract has the same columns as the standard. Checking for aberrant rows...
# The following rows are not in the extract:
#   id     description  amount        date
# 0  4  my description  $11.35  01/01/2015
```
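If the ids can be trusted to line up, a merge on `id` would give cell-level reporting (the "values not matched in column 2" requirement) instead of only whole missing rows. A rough sketch, assuming ids are unique and the function name is made up:

```python
import pandas as pd

def compare_cells(extract_name, reference='gold_std.csv'):
    gold = pd.read_csv(reference)
    ext = pd.read_csv(extract_name)
    shared = [c for c in gold.columns if c in ext.columns and c != 'id']
    # Align on id; rows present only in the gold standard come back as NaN
    # on the extract side and therefore also show up as mismatches below.
    merged = gold.merge(ext, on='id', how='left', suffixes=('_gold', '_ext'))
    for col in shared:
        bad = merged[merged[col + '_gold'] != merged[col + '_ext']]
        for _, row in bad.iterrows():
            print("id {}: '{}' differs: {!r} vs {!r}".format(
                row['id'], col, row[col + '_gold'], row[col + '_ext']))
```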
Finally, the last extract is a bit arbitrary. For that one I think you're better off writing a non-pandas algorithm.
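A minimal sketch of what that might look like: just a header-overlap sanity check with the csv module before attempting any row comparison (function name and threshold are arbitrary choices):

```python
import csv

def looks_like_garbage(extract_name, reference='gold_std.csv'):
    # Read only the header rows and measure the column overlap. If the
    # extract shares no columns with the gold standard, treat the file as
    # garbage and skip the row-level comparison entirely.
    with open(reference, newline='') as f:
        gold_header = set(next(csv.reader(f)))
    with open(extract_name, newline='') as f:
        ext_header = set(next(csv.reader(f)))
    return not (gold_header & ext_header)

if looks_like_garbage('extract3.csv'):
    print("extract3.csv shares no columns with the gold standard")
```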