Reputation: 421
I have a df with dimensions 58000 x 900 whose row values contain replicates. I want to traverse every row and remove them. An example will make it clearer.
df
IDs Name col1 col2 col3
123 AB.C 1.3,1.3,1.3,1.3,1.3 0,0,0,0,0 5,5,5,5,5
234 CD-E 2,2,2,2,2 0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
568 GHJ 123456 123456 123456
345 FGH 9,9,9,9,9 54,54,54,54,54 0,0,0,0,0
Apparently every value is replicated 5 times, and in some cases there is a problem: there is no . or , separating the values.
What I want is to drop the lines that do not contain either a . or a , and, for the rest, remove the duplicate values. So the output will be:
IDs Name col1 col2 col3
123 AB.C 1.3 0 5
234 CD-E 2 0.3 1
345 FGH 9 54 0
dput(df)
structure(list(IDs = c(123L, 234L, 568L, 345L), Name = structure(c(1L,
2L, 4L, 3L), .Label = c("ABC", "CDE", "FGH", "GHJ"), class = "factor"),
col1 = structure(c(2L, 3L, 1L, 4L), .Label = c("123456",
"1.3,1.3,1.3,1.3,1.3", "2,2,2,2,2", "9,9,9,9,9"), class = "factor"),
col2 = structure(1:4, .Label = c("0,0,0,0,0", "0.3,0.3,0.3,0.3,0.3",
"123456", "54,54,54,54,54"), class = "factor"), col3 = structure(c(4L,
2L, 3L, 1L), .Label = c("0,0,0,0,0", "1,1,1,1,1", "123456",
"5,5,5,5,5"), class = "factor")), .Names = c("IDs", "Name",
"col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-4L))
Upvotes: 3
Views: 681
Reputation: 38500
Here is a base R method that works for your sample data:
df <- read.table(header=T, text="IDs Name col1 col2 col3
123 ABC 1.3,1.3,1.3,1.3,1.3 0,0,0,0,0 5,5,5,5,5
234 CDE 2,2,2,2,2 0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
568 GHJ 123456 123456 123456
345 FGH 9,9,9,9,9 54,54,54,54,54 0,0,0,0,0")
# keep only rows whose col1 contains a comma or a dot
df <- df[grep("[,.]", df$col1), ]
# in each col* column, keep only the part before the first comma
df[, grep("^col", names(df))] <- sapply(df[, grep("^col", names(df))],
                                        function(i) gsub("^([0-9.]+),.*", "\\1", i))
This returns
IDs Name col1 col2 col3
1 123 ABC 1.3 0 5
2 234 CDE 2 0.3 1
3 568 GHJ 123456 123456 123456
4 345 FGH 9 54 0
We use the regular expression functions grep and gsub to select the correct columns and to remove the part of each string that comes after the first comma.
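To see what the gsub() pattern does on a single value, here is a minimal check (the sample strings are made up for illustration):

```r
# "^([0-9.]+),.*" captures the leading number (digits and dots) before the
# first comma and replaces the whole string with that capture group
gsub("^([0-9.]+),.*", "\\1", "1.3,1.3,1.3,1.3,1.3")  # "1.3"
gsub("^([0-9.]+),.*", "\\1", "123456")               # no comma, so unchanged
```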
Upvotes: 1
Reputation: 21621
First, we restructure your data in a long format using gather(), then we filter() with grepl() to keep only the rows whose value contains a ,. We then split each string in value into a list using strsplit() and make each element of the list its own row using unnest(). We remove duplicated rows using distinct() and spread() the key/value pairs back into columns.
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -(IDs:Name)) %>%
filter(grepl(",", value)) %>%
mutate(value = strsplit(value, ",")) %>%
unnest(value) %>%
distinct %>%
spread(key, value)
Which gives:
#Source: local data frame [3 x 5]
#
# IDs Name col1 col2 col3
# (int) (fctr) (chr) (chr) (chr)
#1 123 AB.C 1.3 0 5
#2 234 CD-E 2 0.3 1
#3 345 FGH 9 54 0
Another idea would be to use cSplit from splitstackshape:
library(splitstackshape)

df %>%
cSplit(., c("col1", "col2", "col3"), direction = "long", sep = ",") %>%
group_by(Name) %>%
filter(!any(is.na(.))) %>%
distinct
Which gives:
#Source: local data table [3 x 5]
#Groups: Name
#
# IDs Name col1 col2 col3
# (int) (fctr) (dbl) (dbl) (int)
#1 123 AB.C 1.3 0.0 5
#2 234 CD-E 2.0 0.3 1
#3 345 FGH 9.0 54.0 0
Upvotes: 4
Reputation: 15784
The long apply way in base R:
as.data.frame( apply( df, c(1,2), gsub, pattern="(\\d*[.]*\\d*),.*", replacement="\\1") )
Which gives:
IDs Name col1 col2 col3
1 123 ABC 1.3 0 5
2 234 CDE 2 0.3 1
3 568 GHJ 123456 123456 123456
4 345 FGH 9 54 0
The idea is to keep only the first element before the first comma for each item.
Drawback (?): it keeps the rows whose values contain no comma (the 123456 row).
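If those rows should be dropped as well, as in the desired output above, one follow-up (a sketch, not part of the original answer) is to filter on the original col1 afterwards:

```r
# strip everything after the first comma in every cell
res <- as.data.frame(apply(df, c(1, 2), gsub,
                           pattern = "(\\d*[.]*\\d*),.*", replacement = "\\1"))
# drop rows whose original col1 had no comma or dot (e.g. the 123456 row)
res <- res[grepl("[,.]", df$col1), ]
```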
Upvotes: 1