Newbie
Newbie

Reputation: 421

remove duplicates from values in the rows

I have a df with dimension 58000*900 which contains replicates in row values, I want to traverse through every row and remove them. An example will make it more clear.

df
IDs Name    col1    col2    col3
123 AB.C    1.3,1.3,1.3,1.3,1.3 0,0,0,0,0   5,5,5,5,5
234 CD-E    2,2,2,2,2   0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
568 GHJ 123456      123456              123456
345 FGH 9,9,9,9,9   54,54,54,54,54  0,0,0,0,0

Apparently every value is replicated 5 times and in some cases their is a problem that there is no . or , separating the values. What I want is drop those lines which does not contain either . or , and for the rest remove the duplicate values. So, the output will be:

IDs Name    col1    col2    col3
123 AB.C    1.3 0   5
234 CD-E    2   0.3 1
345 FGH 9   54  0

dput(df)
structure(list(IDs = c(123L, 234L, 568L, 345L), Name = structure(c(1L, 
2L, 4L, 3L), .Label = c("ABC", "CDE", "FGH", "GHJ"), class = "factor"), 
    col1 = structure(c(2L, 3L, 1L, 4L), .Label = c("123456", 
    "1.3,1.3,1.3,1.3,1.3", "2,2,2,2,2", "9,9,9,9,9"), class = "factor"), 
    col2 = structure(1:4, .Label = c("0,0,0,0,0", "0.3,0.3,0.3,0.3,0.3", 
    "123456", "54,54,54,54,54"), class = "factor"), col3 = structure(c(4L, 
    2L, 3L, 1L), .Label = c("0,0,0,0,0", "1,1,1,1,1", "123456", 
    "5,5,5,5,5"), class = "factor")), .Names = c("IDs", "Name", 
"col1", "col2", "col3"), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 3

Views: 681

Answers (3)

lmo
lmo

Reputation: 38500

Here is a base R method that works for your sample data:

df <- read.table(header=T, text="IDs Name    col1    col2    col3
 123 ABC 1.3,1.3,1.3,1.3,1.3 0,0,0,0,0   5,5,5,5,5
                  234 CDE 2,2,2,2,2   0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
                  568 GHJ 123456      123456              123456
                  345 FGH 9,9,9,9,9   54,54,54,54,54  0,0,0,0,0")

# drop rows with no comma or dot
df <- df[-grep("[,.]", df$col1, invert=T),]

df[,grep("^col", names(df))] <- sapply(df[,grep("^col", names(df))], 
                                       function(i) gsub("^([0-9.]+),.*", "\\1", i))

This returns

  IDs Name   col1   col2   col3
1 123  ABC    1.3      0      5
2 234  CDE      2    0.3      1
3 568  GHJ 123456 123456 123456
4 345  FGH      9     54      0

We use regular expression functions grep and gsub to select the correct columns and to remove the part of each string that comes after a comma.

Upvotes: 1

Steven Beaupr&#233;
Steven Beaupr&#233;

Reputation: 21621

First, we restructure your data in a long format using gather(), then we filter() for value with no , using grepl(). We then split the string in value into a list using strsplit() and make each element of the list it's own row using unnest(). We remove duplicated rows using distinct() and spread() back the key and values to columns.

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, -(IDs:Name)) %>%
  filter(grepl(",", value)) %>%
  mutate(value = strsplit(value, ",")) %>%
  unnest(value) %>%
  distinct %>%
  spread(key, value)

Which gives:

#Source: local data frame [3 x 5]
#
#    IDs   Name  col1  col2  col3
#  (int) (fctr) (chr) (chr) (chr)
#1   123   AB.C   1.3     0     5
#2   234   CD-E     2   0.3     1
#3   345    FGH     9    54     0

Another idea would be to use cSplit from splitstackshape:

df %>%
  cSplit(., c("col1", "col2", "col3"), direction = "long", sep = ",") %>%
  group_by(Name) %>%
  filter(!any(is.na(.))) %>%
  distinct

Which gives:

#Source: local data table [3 x 5]
#Groups: Name
#
#    IDs   Name  col1  col2  col3
#  (int) (fctr) (dbl) (dbl) (int)
#1   123   AB.C   1.3   0.0     5
#2   234   CD-E   2.0   0.3     1
#3   345    FGH   9.0  54.0     0

Upvotes: 4

Tensibai
Tensibai

Reputation: 15784

The long apply way in base R:

as.data.frame( apply( df, c(1,2), gsub, pattern="(\\d*[.]*\\d*),.*", replacement="\\1") )

Which gives:

  IDs Name   col1   col2   col3
1 123  ABC    1.3      0      5
2 234  CDE      2    0.3      1
3 568  GHJ 123456 123456 123456
4 345  FGH      9     54      0

The idea is to keep only the first element before the first comma for each item

Drawback (?) it keeps the rows with no decimal values.

Upvotes: 1

Related Questions