r - Setting weightages for Jarowinkler in compare.linkage


I'm using the compare.linkage method in the RecordLinkage package in R to compare the similarity of two sets of strings. The default string comparison method is jarowinkler, with the three default weights set at 1/3, 1/3 and 1/3.

I want to overwrite the default weights with 4/9, 4/9 and 1/9. How do I do that? Thanks in advance.

The default script is:

rpairs <- compare.linkage(stringset1, stringset2, strcmp = TRUE, strcmpfun = jarowinkler)

You have to create your own comparison function that compares two strings. In that function you can call jarowinkler. The easiest way is to create a closure:

jw <- function(w_1, w_2, w_3) {
  function(str1, str2) {
    jarowinkler(str1, str2, w_1, w_2, w_3)
  }
}

To this function you pass the weight parameters you want to use. The function returns a comparison function that you can use in the compare.linkage call:

rpairs <- compare.linkage(stringset1, stringset2,
  strcmp = TRUE, strcmpfun = jw(4/9, 4/9, 1/9))
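Before running the full linkage, it can be worth spot-checking the closure on a single pair. The following is only a sketch and assumes the RecordLinkage package is installed; the example strings are arbitrary and not taken from the question's data.

```r
library(RecordLinkage)  # assumption: the package is installed

# Closure from the answer: fixes the three weights and returns a
# two-argument comparison function.
jw <- function(w_1, w_2, w_3) {
  function(str1, str2) {
    jarowinkler(str1, str2, w_1, w_2, w_3)
  }
}

cmp <- jw(4/9, 4/9, 1/9)

# The closure should give the same score as calling jarowinkler
# directly with the same weights.
score_closure <- cmp("john", "jonah")
score_direct  <- jarowinkler("john", "jonah", 4/9, 4/9, 1/9)
print(c(closure = score_closure, direct = score_direct))
```

If the two printed values agree, the weights are being passed through as intended.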

The Jaro-Winkler algorithm counts the number of characters that match (within a bandwidth), m. For the two strings john and jonah there are 4 characters that match (j, o, h and n). Taking the selected characters:

john jonh 

It also counts the number of transpositions, t. In this case there is 1 transposition (the h and n are switched).

The Jaro similarity is given by:

w1 * m/l1 + w2 * m/l2 + w3 * (m - t)/m

with l1 and l2 the lengths of the two strings. With weights that sum to 1 (such as the default of 1/3 each), this results in a score between 0 and 1 (1 = perfect match).
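To see what changing the weights does, the Jaro part of the score can be computed by hand. The sketch below is not the package's implementation: jaro_score is a hypothetical helper, it assumes the plain weighted sum w1 * m/l1 + w2 * m/l2 + w3 * (m - t)/m (so that equal weights of 1/3 give the usual Jaro average), and it takes the counts m = 4 and t = 1 from the example above at face value, with string lengths 4 and 5. The Winkler prefix bonus is left out, so the values returned by jarowinkler itself can differ.

```r
# Hypothetical helper: weighted Jaro score from the match count m,
# the transposition count t and the string lengths l1 and l2.
jaro_score <- function(m, t, l1, l2, w = c(1/3, 1/3, 1/3)) {
  if (m == 0) return(0)  # no matching characters: similarity is 0
  w[1] * m / l1 + w[2] * m / l2 + w[3] * (m - t) / m
}

# Default weights: (1 + 0.8 + 0.75) / 3
default_score <- jaro_score(4, 1, 4, 5)
# Weights from the question: 4/9 * 1 + 4/9 * 0.8 + 1/9 * 0.75
custom_score <- jaro_score(4, 1, 4, 5, w = c(4/9, 4/9, 1/9))

print(default_score)  # 0.85
print(custom_score)   # 0.8833333
```

With 4/9, 4/9 and 1/9 the transposition term counts for less, so a pair that differs mainly by swapped characters scores somewhat higher than under the default weights.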

The Jaro-Winkler measure adds a 'bonus' for characters that match at the beginning of the string, as there are usually fewer errors at the beginning (the measure was created for names). For more information see, for example, M.P.J. van der Loo (2014), The stringdist package for approximate string matching.

