R - Setting weights for Jaro-Winkler in compare.linkage
I'm using the compare.linkage method in the RecordLinkage package in R to compare the similarity of two sets of strings. The default string comparison method is jarowinkler, and its three default weights are set at 1/3, 1/3 and 1/3.
I want to overwrite the default weights with 4/9, 4/9 and 1/9. How do I do that? Thanks in advance.
The default script is:
rpairs <- compare.linkage(stringset1, stringset2, strcmp = TRUE, strcmpfun = jarowinkler)
You have to create your own comparison function that compares two strings. Inside that function you can call jarowinkler. The easiest way to do this is to create a closure:
jw <- function(w_1, w_2, w_3) { function(str1, str2) { jarowinkler(str1, str2, w_1, w_2, w_3) } }
To this function you pass the weight parameters you want to use. It returns a comparison function that you can use in your compare.linkage call:
rpairs <- compare.linkage(stringset1, stringset2, strcmp = TRUE, strcmpfun = jw(4/9, 4/9, 1/9))
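Before plugging the closure into compare.linkage, it can help to check that it really forwards the weights. The following is a minimal sketch, assuming the RecordLinkage package is installed; the name `cmp` and the two test strings are chosen only for illustration.

```r
library(RecordLinkage)

# Closure factory from the answer: fixes the three weights and
# returns a two-argument comparison function.
jw <- function(w_1, w_2, w_3) {
  function(str1, str2) {
    jarowinkler(str1, str2, w_1, w_2, w_3)
  }
}

# 'cmp' is an illustrative name for the generated comparison function.
cmp <- jw(4/9, 4/9, 1/9)

# The closure should give the same result as calling jarowinkler
# directly with the same weights ...
custom_score <- cmp("john", "jonh")
direct_score <- jarowinkler("john", "jonh", 4/9, 4/9, 1/9)

# ... and can be compared with the score under the default weights.
default_score <- jarowinkler("john", "jonh")
```

Because `cmp` takes exactly two string arguments, it has the same calling shape as jarowinkler with its defaults, which is what strcmpfun expects.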
The Jaro-Winkler algorithm counts the number of characters that match (within a bandwidth), m. For the two strings john and johan there are 4 characters that match (j, o, h, n). Taking only the selected characters:
john jonh
It then counts the number of transpositions, t. In this case there is 1 transposition (the h and n are switched).
The Jaro similarity is then given by:
w1 * m/l1 + w2 * m/l2 + w3 * (m-t)/m
with l1 and l2 the lengths of the two strings. With all weights equal to 1/3 this results in a score between 0 and 1 (1 = perfect match).
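To see how the weights change the score, the Jaro part of the formula can be evaluated by hand. This is a small base R sketch, not the package's implementation: `jaro_from_counts` is a hypothetical helper, it leaves out the Winkler prefix bonus, and the counts (m = 4, t = 1, lengths 4 and 5) are the ones quoted in the example above.

```r
# Hypothetical helper: weighted Jaro similarity computed directly from
# the match count m, the transposition count t and the string lengths.
# It does not include the Winkler prefix bonus.
jaro_from_counts <- function(m, t, l1, l2, w1 = 1/3, w2 = 1/3, w3 = 1/3) {
  if (m == 0) return(0)  # no matching characters: similarity taken as 0
  w1 * m / l1 + w2 * m / l2 + w3 * (m - t) / m
}

# Counts from the example in the text: 4 matches, 1 transposition,
# string lengths 4 and 5.
default_score <- jaro_from_counts(m = 4, t = 1, l1 = 4, l2 = 5)
custom_score  <- jaro_from_counts(m = 4, t = 1, l1 = 4, l2 = 5,
                                  w1 = 4/9, w2 = 4/9, w3 = 1/9)

print(default_score)  # 0.85
print(custom_score)   # about 0.883
```

With weights 4/9, 4/9 and 1/9 the transposition term counts for less, so a pair that differs mainly by swapped characters scores higher than under the default weights.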
The Jaro-Winkler measure adds a 'bonus' for characters that match at the beginning of the string, since there are usually fewer errors at the beginning (the measure was created for names). For more information see, for example, M.P.J. van der Loo (2014), The stringdist package for approximate string matching.