regex - Comparing gold standard csv file and extracted values csv files in Python


This is a data mining task in which I am automating the scoring of extraction quality. There is a gold standard CSV that might consist of fields like:

golden_standard.csv

| id | description        | amount  | date       |
|----|--------------------|---------|------------|
| 1  | description.       | $150.54 | 12/12/2012 |
| 2  | other description. | $200    | 10/10/2015 |
| 3  | other description. | $25     | 11/11/2014 |
| 4  | description        | $11.35  | 01/01/2015 |
| 5  | description.       | $20     | 03/03/2013 |

And, say, there are the following possible extraction result files:

extract1.csv

| id | description        | date       |
|----|--------------------|------------|
| 1  | description.       | 12/12/2012 |
| 2  | other description. | 10/10/2015 |
| 3  | other description. | 11/11/2014 |
| 4  | 122333222233332221 | 11/11/2014 |
| 5  | description.       | 03/03/2013 |

extract2.csv

| id | description             | amount  | date       |
|----|-------------------------|---------|------------|
| 1  | description.            | $150.54 | 12/12/2012 |
| 2  | other description.      | $200    | 10/10/2015 |
| -  | ----------------------- | -----   | ---------- |
| 5  | description.            | $20     | 03/03/2013 |

extract3.csv

| garbage  | more garbage       |
| garbage  | more garbage       |

I would like to have a program report that extract 1 is missing a column and that some values are not matched in column 2.

For the second case, some entries are missing and some rows are mismatched.

In the last case, the resulting CSV is completely screwed up, but I still want the program to detect some meaningful aberration.

Does anyone have a quick and clever way in Python to do this kind of comparison?

I have a regular, longish row-by-row and column-by-column iterative way that I could post here, but I am thinking there might be a quicker, more elegant, Pythonic way to do this.

Any help is appreciated.

Disclaimer: my approach uses the pandas library.

First, the data set-up.

gold_std.csv

    id,description,amount,date
    1,some description.,$150.54,12/12/2012
    2,some other description.,$200,10/10/2015
    3,other description.,$25,11/11/2014
    4,my description,$11.35,01/01/2015
    5,your description.,$20,03/03/2013

extract1.csv

    id,description,date
    1,some description.,12/12/2012
    2,some other description.,10/10/2015
    3,other description.,11/11/2014
    4,122333222233332221,11/11/2014
    5,your description.,03/03/2013

extract2.csv

    id,description,amount,date
    1,some description.,$150.54,12/12/2012
    2,some other description.,$200,10/10/2015
    3,other description.,$25,11/11/2014
    5,your description.,$20,03/03/2013

Second, the code.

    import pandas as pd

    def compare_extract(extract_name, reference='gold_std.csv'):

        gold = pd.read_csv(reference)
        ext = pd.read_csv(extract_name)

        gc = set(gold.columns)
        header = ext.columns
        extc = set(header)

        if gc != extc:
            # Columns present in the gold standard but absent from the extract.
            missing = ", ".join(sorted(gc - extc))
            print("Extract has the following missing columns: {}".format(missing))
        else:
            print("Extract has the same columns as the standard. Checking for aberrant rows...")
            gold_list = gold.values.tolist()
            ext_list = ext.values.tolist()
            # A somewhat non-pandaic approach: because the ids may not line up,
            # we rely on set operations instead. A bit hackish, actually.
            diff = list(set(map(tuple, gold_list)) - set(map(tuple, ext_list)))
            df = pd.DataFrame(diff, columns=header)
            print("The following rows are not in the extract:")
            print(df)

Third, the test runs.

    e1 = 'extract1.csv'
    compare_extract(e1)
    # Extract has the following missing columns: amount

    e2 = 'extract2.csv'
    compare_extract(e2)
    # Extract has the same columns as the standard. Checking for aberrant rows...
    # The following rows are not in the extract:
    #    id     description  amount        date
    # 0   4  my description  $11.35  01/01/2015
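Note that the function above stops at the missing-column report, so for extract 1 it never says which values disagree. One possible extension, sketched here and not part of the approach above, is to align rows on `id` with `DataFrame.merge` and compare only the columns both files share. The helper name `compare_by_id`, the inline CSV strings, the choice to read everything as text (`dtype=str`), and the assumption that ids are unique are all mine.

```python
import io
import pandas as pd

# Inline copies of the sample files above, so the sketch runs on its own.
GOLD_CSV = """id,description,amount,date
1,some description.,$150.54,12/12/2012
2,some other description.,$200,10/10/2015
3,other description.,$25,11/11/2014
4,my description,$11.35,01/01/2015
5,your description.,$20,03/03/2013
"""
EXTRACT1_CSV = """id,description,date
1,some description.,12/12/2012
2,some other description.,10/10/2015
3,other description.,11/11/2014
4,122333222233332221,11/11/2014
5,your description.,03/03/2013
"""


def compare_by_id(gold, ext, key="id"):
    """Hypothetical helper: report missing columns, missing ids and,
    for each shared column, the ids whose values differ.
    Assumes the ids are unique within each frame."""
    missing_columns = [c for c in gold.columns if c not in ext.columns]
    if key not in ext.columns:
        # Without the key column there is nothing to align rows on.
        return {"missing_columns": missing_columns,
                "missing_ids": None, "mismatches": None}

    ext_ids = set(ext[key])
    missing_ids = [k for k in gold[key] if k not in ext_ids]

    shared = [c for c in gold.columns if c in ext.columns and c != key]
    # Inner merge keeps only ids present in both frames. Both sides carry the
    # same column names, so every non-key column receives a suffix.
    merged = gold[[key] + shared].merge(
        ext[[key] + shared], on=key, suffixes=("_gold", "_ext"))

    mismatches = {}
    for col in shared:
        differs = merged[col + "_gold"] != merged[col + "_ext"]
        if differs.any():
            mismatches[col] = merged.loc[differs, key].tolist()

    return {"missing_columns": missing_columns,
            "missing_ids": missing_ids, "mismatches": mismatches}


# Read every field as text so values are compared exactly as written.
gold = pd.read_csv(io.StringIO(GOLD_CSV), dtype=str)
extract1 = pd.read_csv(io.StringIO(EXTRACT1_CSV), dtype=str)
report = compare_by_id(gold, extract1)
print(report)
```

For extract 1 this flags the `amount` column as missing and reports id 4 under both `description` and `date`, since both values differ from the gold standard in the sample data. Empty cells would need extra care, because pandas treats missing values as unequal to each other.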

Finally, the last extract is a bit arbitrary. I think for that one you're better off writing a non-pandas algorithm.
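As a rough illustration of what such a non-pandas check might look like, here is a minimal sketch that uses only the standard `csv` module. It looks at the header and the field counts and returns a list of problems. The function name `sanity_check`, the expected header list, and the way the garbage file is rendered as comma-separated text are assumptions made for the example; a real broken extract may fail in other ways.

```python
import csv
import io

# Assumed header of the gold standard file.
EXPECTED_HEADER = ["id", "description", "amount", "date"]


def sanity_check(lines, expected_header=EXPECTED_HEADER):
    """Hypothetical helper: return a list of human-readable problems
    found in an iterable of CSV lines (an open file works too)."""
    rows = list(csv.reader(lines))
    if not rows:
        return ["file is empty"]

    problems = []
    header = [cell.strip() for cell in rows[0]]

    # Header cells that the gold standard does not know about.
    unknown = [c for c in header if c not in expected_header]
    if unknown:
        problems.append("unexpected header cells: {}".format(unknown))

    # No overlap at all with the expected header: likely not an extract.
    if not set(header) & set(expected_header):
        problems.append("no expected column found; probably not a valid extract")

    # Every data row should have as many fields as the header.
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append("line {}: expected {} fields, found {}".format(
                lineno, len(header), len(row)))
    return problems


# Assumed comma-separated rendering of the garbage extract.
GARBAGE = "garbage,more garbage\ngarbage,more garbage\n"
GOOD = "id,description,date\n1,some description.,12/12/2012\n"

for problem in sanity_check(io.StringIO(GARBAGE)):
    print(problem)
```

A file that passes this check can then be handed to the pandas comparison, while one that fails can be reported as malformed up front.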

