Wilcar

Reputation: 2513

Reshaping and summarizing a data.frame based on partial text match (package stringdist)

I am working on an old list of names. The same people appear in it under differently written names. I used the stringdist package to compute the distance between strings, to find which names probably refer to the same person.

A small example of my data:

 data <- data.frame(column1 = c("Lalande, Pierre","Lalande, P","Tertre, Girard ","Tertre Girard du"),
                    column2 = c(4, 5, 10, 1))

What it gives:

            column1 column2
    Lalande, Pierre       4
         Lalande, P       5
    Tertre, Girard       10
   Tertre Girard du       1

What I tried, using the stringdist package:

 library(stringdist)
 distance <- stringdistmatrix(data$column1,
                              useNames="strings",
                              method="lv")
 distance2 = as.matrix(distance)

Distance < 5: nearly identical strings

                 Lalande, Pierre Lalande, P Tertre, Girard
Lalande, P                     5
Tertre, Girard                11         13
Tertre Girard du              14         15              3

Reshaping

library(reshape2)
out <- unique(melt(distance2))

What it gives:

           Var1             Var2     value
1   Lalande, Pierre  Lalande, Pierre     0
2        Lalande, P  Lalande, Pierre     5
3   Tertre, Girard   Lalande, Pierre    11
4  Tertre Girard du  Lalande, Pierre    14
5   Lalande, Pierre       Lalande, P     5
6        Lalande, P       Lalande, P     0
7   Tertre, Girard        Lalande, P    13
8  Tertre Girard du       Lalande, P    15
9   Lalande, Pierre  Tertre, Girard     11
10       Lalande, P  Tertre, Girard     13
11  Tertre, Girard   Tertre, Girard      0
12 Tertre Girard du  Tertre, Girard      3
13  Lalande, Pierre Tertre Girard du    14
14       Lalande, P Tertre Girard du    15
15  Tertre, Girard  Tertre Girard du     3
16 Tertre Girard du Tertre Girard du     0

Keeping only the rows for close matches:

library(dplyr)  # provides %>% and filter()
out2 <- out %>%
   filter(value > 0 & value < 5)
out2

The final result, but without my column 3:

          Var1             Var2     value
1 Tertre Girard du  Tertre, Girard      3
2  Tertre, Girard  Tertre Girard du     3

How can I get this (summing the column2 values from my original data.frame)?

Var1               Var2                Column3 (sum)
Lalande, Pierre    Lalande, P          9
Tertre, Girard     Tertre Girard du    11

Upvotes: 1

Views: 168

Answers (1)

Wyldsoul

Reputation: 1553

I'm sure there is a cleaner way of doing this, but this works in base R.

 data <- data.frame(column1 = c("Lalande, Pierre","Lalande, P","Tertre, Girard ","Tertre Girard du"),
               column2 = c(4, 5, 10, 1))

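Create a grouping column based on a pattern match (everything from the first comma or the first space onward is removed, which leaves the surname):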

 data$column3 <- gsub(",.*| .*",  "", data$column1) 
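To make the pattern concrete, here is a minimal sketch of what that gsub call produces on the sample data. It assumes, as the sample does, that the surname is always the first token; a name written with the particle first (for example "du Tertre") would not be grouped correctly.

```r
# Same sample data as in the question.
data <- data.frame(
  column1 = c("Lalande, Pierre", "Lalande, P", "Tertre, Girard ", "Tertre Girard du"),
  column2 = c(4, 5, 10, 1)
)

# Remove everything from the first comma or the first space onward,
# leaving the leading surname token as a grouping key.
key <- gsub(",.*| .*", "", data$column1)
print(key)
# [1] "Lalande" "Lalande" "Tertre"  "Tertre"
```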

For the x part of the merge, columns 1 and 3 are unstacked and transposed, which gives one row per surname with the name variants as columns.

For the y part of the merge, column2 is aggregated (summed) by the match column 3.

x and y are then merged on their respective match columns:

  x <- t(unstack(data[c(1, 3)]))
  y <- aggregate(data$column2, by = list(data$column3), FUN = sum)
  merge(x, y, by.x = "row.names", by.y = "Group.1")
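If you would rather keep the distance-based matching from the question instead of grouping on the surname, a rough sketch along these lines might work. It is not the approach above. Two assumptions are mine: base R's adist() is swapped in for stringdistmatrix(method = "lv") (both compute Levenshtein distance), and the threshold is taken as 5 inclusive, because the Lalande pair in the expected output is at distance exactly 5.

```r
# Same sample data as in the question.
data <- data.frame(
  column1 = c("Lalande, Pierre", "Lalande, P", "Tertre, Girard ", "Tertre Girard du"),
  column2 = c(4, 5, 10, 1),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances as a full 4 x 4 matrix
# (base R adist(), used here in place of stringdistmatrix()).
d <- adist(data$column1)

# Assumed threshold: 5, inclusive, so that the Lalande pair is kept.
threshold <- 5

# Keep each pair once (upper triangle only) when it is close enough.
pairs <- which(upper.tri(d) & d <= threshold, arr.ind = TRUE)

# Look up both names and add their column2 values.
result <- data.frame(
  Var1    = data$column1[pairs[, 1]],
  Var2    = data$column1[pairs[, 2]],
  Column3 = data$column2[pairs[, 1]] + data$column2[pairs[, 2]]
)
print(result)
```

On the sample data this gives one row per matched pair, with sums 9 and 11. Note that it only handles pairs: if three or more spellings of a name fall within the threshold of each other, each pair gets its own row rather than a single group total.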

Upvotes: 1
