fuzzy_rbind — fuzzy_rbind • messy.cats

fuzzy_rbind() row-binds two dataframes whose column names differ slightly, pairing columns by the string distance between their names.

fuzzy_rbind(
  df1,
  df2,
  threshold,
  method = "jw",
  q = 1,
  p = 0,
  bt = 0,
  useBytes = FALSE,
  weight = c(d = 1, i = 1, t = 1)
)

Arguments

df1	The first dataframe to be bound.
df2	The second dataframe to be bound.
threshold	The maximum string distance allowed between column names. If the distance between two column names is greater than this threshold, the columns will not be bound.
method	The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'
q	Size of the q-gram used in string distance calculation. Default: 1
p	Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
bt	Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0
useBytes	Whether or not to perform byte-wise comparison. Default: FALSE
weight	Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
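
To get a feel for what a given threshold admits, it can help to look at the distances directly. The following is a minimal sketch that calls the stringdist package itself; the column names are made up for illustration, and it assumes (not confirmed by this page) that fuzzy_rbind() compares threshold against distances of this kind.

```r
library(stringdist)

# Illustrative column names: a clean set and a suffixed, "messy" set
clean <- c("mpg", "cyl", "disp")
messy <- c("mpg_17", "cyl_17", "disp_17")

# Pairwise distances with method = "jw" and p = 0 (plain Jaro distance);
# rows correspond to clean names, columns to messy names
d <- stringdistmatrix(clean, messy, method = "jw", p = 0)
dimnames(d) <- list(clean, messy)
print(round(d, 3))

# For each clean name, the messy name with the smallest distance
closest <- messy[apply(d, 1, which.min)]
print(closest)

# Which of the intended pairs (the diagonal) would pass a threshold of 0.2
print(diag(d) <= 0.2)
```

Distances for method "jw" lie between 0 (identical) and 1 (nothing in common), so a smaller threshold is stricter.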

Value

fuzzy_rbind() returns a dataframe in which the two input dataframes are bound on their closest-matching columns. Column names from dataframe 1 are preserved.

Details

Column names often differ slightly between datasets that otherwise share the same structure. fuzzy_rbind() helps bind such dataframes by fuzzy matching their column names.
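
The idea can be sketched in a few lines of base R plus stringdist. This is an illustration only, not the messy.cats implementation: sketch_fuzzy_rbind() is a hypothetical helper, and how it treats unmatched columns and ties is an assumption made for this sketch.

```r
library(stringdist)

# Hypothetical helper: NOT the messy.cats implementation, only an
# illustration of name matching by string distance followed by rbind().
sketch_fuzzy_rbind <- function(df1, df2, threshold, method = "jw") {
  # Distance matrix: rows are df1 names, columns are df2 names
  d <- stringdistmatrix(names(df1), names(df2), method = method)

  # For every df2 column, the index of and distance to the nearest df1 name
  nearest <- apply(d, 2, which.min)
  nearest_dist <- apply(d, 2, min)

  # Keep only df2 columns whose best match is within the threshold,
  # and rename them to the df1 names they matched
  keep <- nearest_dist <= threshold
  matched <- df2[, keep, drop = FALSE]
  names(matched) <- names(df1)[nearest[keep]]

  # Assumption for this sketch: df1 columns without a partner are dropped
  shared <- intersect(names(df1), names(matched))
  rbind(df1[, shared, drop = FALSE], matched[, shared, drop = FALSE])
}

a <- data.frame(id = 1:2, value = c(10, 20))
b <- data.frame(id_2017 = 3:4, value_2017 = c(30, 40))
out <- sketch_fuzzy_rbind(a, b, threshold = 0.3)
print(out)
```

The sketch does not handle two df2 columns mapping to the same df1 column; a real implementation would need a rule for that case.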

Examples

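if (FALSE) {
if(interactive()){
 library(messy.cats)
 # Copy mtcars and add suffixes so the column names no longer match exactly
 mtcars_colnames_messy <- mtcars
 colnames(mtcars_colnames_messy)[1:5] <- paste0(colnames(mtcars)[1:5], "_17")
 colnames(mtcars_colnames_messy)[6:11] <- paste0(colnames(mtcars)[6:11], "_2017")
 # Permissive threshold: column names up to a distance of 0.5 apart may be bound
 x <- fuzzy_rbind(mtcars, mtcars_colnames_messy, .5)
 # Stricter threshold: column names more than 0.2 apart are not bound
 x <- fuzzy_rbind(mtcars, mtcars_colnames_messy, .2)
 }
}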