A. Koirala
A. Koirala

Reputation: 1

R - Cleaning numbers that arent correctly entered

EDIT: I have modified the question a little. I apolozise for not understanding the full nature of the problem in the beginning.

Im working on cleaning data that is gathered from several people. While most people enter data in the correct format, i have found that some of them are 'lazy entries'

Below is my input dataset:

|  ID                       | Name    | Age  |
|--------------------------------------------|
|  123456/7/56617181/82     | ABC     | 23   |
|  12134/135                | CDE     | 24   |

And here is what I want:

|  ID                               | Name    | Age  |
|----------------------------------------------------|
|  123456/123457/56617181/56617182  | ABC     | 23   |
|  12134/12135                      | CDE     | 24   |

Once I have this, all i need to do is to generate an individual row for every ID. I've already figured this bit out. Nonetheless, this is what I will finally have:

|  ID         | Name    | Age  |
|------------------------------|
|  123456     | ABC     | 23   |
|  123457     | ABC     | 23   |
|  56617181   | ABC     | 23   |
|  56617182   | ABC     | 23   |
|  12134      | CDE     | 24   |
|  12135      | CDE     | 24   |

Any idea on how one could go about doing this in R? I know this should be do able, I don't know where to look for an answer.

Upvotes: 0

Views: 68

Answers (1)

akrun
akrun

Reputation: 887108

We could use cSplit from library(splitstackshape) to split the 'ID' column at /, by specifying the direction='long', we reshape the data from 'wide' to 'long' format. The output is a data.table, so we can use standard data.table techniques. We group by 'Name', and create another grouping variable ('ind') so that adjacent 2 elements fall into one group. In the next step, we group by 'Name' and 'ind', get the nchar of 'ID', substr the 1st element based on the max and min of nchar, paste it with the 2nd element (This is based on the example provided. If the values with less nchar appears first, we may have to use the which.max to select the element to be used in the substr ), concatenate (c) the first element, convert to numeric to create the 'ID' column.

library(splitstackshape)
library(data.table)
cSplit(df1, 'ID', '/', 'long')[, ind:=gl(.N, 2, .N) , by = Name
       ][, {i1 <- nchar(ID)
            list(ID=as.numeric(c(ID[1],
              paste0(substr(ID[1],1, max(i1)-min(i1)), ID[2]))),
             Age=Age)} , by = .(Name, ind)][, ind:= NULL][]
#  Name       ID Age
#1:  ABC   123456  23
#2:  ABC   123457  23
#3:  ABC 56617181  23
#4:  ABC 56617182  23
#5:  CDE    12134  24
#6:  CDE    12135  24

As I mentioned above, if the ID values with shorter substring occurs first, we may have to use which.max/which.min to select the 'ID'

cSplit(df1, 'ID', '/', 'long')[, ind:=gl(.N, 2, .N) ,Name
       ][,{i1 <- nchar(ID)
          i2 <- which.max(i1)
          i3 <- which.min(i1)
          list(ID=c(ID[i2],paste0(substr(ID[i2], 1, max(i1)-min(i1)), 
                  ID[i3])), Age=Age)} ,
                               by = .(Name, ind)]

data

df1 <- structure(list(ID = c("123456/7/56617181/82", "12134/135"),
 Name = c("ABC", 
"CDE"), Age = 23:24), .Names = c("ID", "Name", "Age"), 
class = "data.frame", row.names = c(NA, -2L))

Upvotes: 2

Related Questions