Reputation: 1
EDIT: I have modified the question a little. I apolozise for not understanding the full nature of the problem in the beginning.
Im working on cleaning data that is gathered from several people. While most people enter data in the correct format, i have found that some of them are 'lazy entries'
Below is my input dataset:
| ID | Name | Age |
|--------------------------------------------|
| 123456/7/56617181/82 | ABC | 23 |
| 12134/135 | CDE | 24 |
And here is what I want:
| ID | Name | Age |
|----------------------------------------------------|
| 123456/123457/56617181/56617182 | ABC | 23 |
| 12134/12135 | CDE | 24 |
Once I have this, all i need to do is to generate an individual row for every ID. I've already figured this bit out. Nonetheless, this is what I will finally have:
| ID | Name | Age |
|------------------------------|
| 123456 | ABC | 23 |
| 123457 | ABC | 23 |
| 56617181 | ABC | 23 |
| 56617182 | ABC | 23 |
| 12134 | CDE | 24 |
| 12135 | CDE | 24 |
Any idea on how one could go about doing this in R? I know this should be do able, I don't know where to look for an answer.
Upvotes: 0
Views: 68
Reputation: 887108
We could use cSplit
from library(splitstackshape)
to split the 'ID' column at /
, by specifying the direction='long'
, we reshape the data from 'wide' to 'long' format. The output is a data.table
, so we can use standard data.table techniques. We group by 'Name', and create another grouping variable ('ind') so that adjacent 2 elements fall into one group. In the next step, we group by 'Name' and 'ind', get the nchar
of 'ID', substr
the 1st element based on the max
and min
of nchar
, paste
it with the 2nd element (This is based on the example provided. If the values with less nchar
appears first, we may have to use the which.max
to select the element to be used in the substr
), concatenate (c
) the first element, convert to numeric
to create the 'ID' column.
library(splitstackshape)
library(data.table)
cSplit(df1, 'ID', '/', 'long')[, ind:=gl(.N, 2, .N) , by = Name
][, {i1 <- nchar(ID)
list(ID=as.numeric(c(ID[1],
paste0(substr(ID[1],1, max(i1)-min(i1)), ID[2]))),
Age=Age)} , by = .(Name, ind)][, ind:= NULL][]
# Name ID Age
#1: ABC 123456 23
#2: ABC 123457 23
#3: ABC 56617181 23
#4: ABC 56617182 23
#5: CDE 12134 24
#6: CDE 12135 24
As I mentioned above, if the ID values with shorter substring occurs first, we may have to use which.max/which.min
to select the 'ID'
cSplit(df1, 'ID', '/', 'long')[, ind:=gl(.N, 2, .N) ,Name
][,{i1 <- nchar(ID)
i2 <- which.max(i1)
i3 <- which.min(i1)
list(ID=c(ID[i2],paste0(substr(ID[i2], 1, max(i1)-min(i1)),
ID[i3])), Age=Age)} ,
by = .(Name, ind)]
df1 <- structure(list(ID = c("123456/7/56617181/82", "12134/135"),
Name = c("ABC",
"CDE"), Age = 23:24), .Names = c("ID", "Name", "Age"),
class = "data.frame", row.names = c(NA, -2L))
Upvotes: 2