Reputation: 13

How to remove rows in R based on specific criteria

I have a dynamic data frame in R - it is a hierarchy table generated based on user input. I am trying to remove rows where there are duplicates based on a certain criteria. So the table is not always x amount of columns or y amount of rows, it will always change based on user input.

The data frame (for one example) looks like this:
Class 1---Class 2---Class 3---Class 4
Cash------NA--------NA--------NA
Cash------Cash------NA--------NA
Cash------Cash------Cash------NA
Government-NA-------NA--------NA
Government-Security-NA--------NA
Government-Security-Security--NA
Government-Security-Security--Code

This specific example goes on for another 1100 rows approximately.

Where I would like to remove the rows where there are two similar column values, not including NA's. For the example above, "Cash NA NA NA" would stay in the data frame, but the two other cash rows would be excluded. "Governemnt NA NA NA" and "Government Security NA NA" would stay, but "Government Security Security NA" would be removed because it is basically the same as the row above.

The code I am using to do this now is this

sub.df2 = sub.df2[apply(sub.df2, MARGIN = 1, FUN = function(x) !any(duplicated(x, incomparables = NA))), , drop = FALSE]

However, this removes the last line - "Government Security Security Code" as well and I need that line to stay in and around 100-200 more in a similar style.

Upvotes: 0

Answers (2)

NicolaiNi

Reputation: 21

Maybe you can do it like proposed and after that use the function na.omit(df2) to get rid of all 'incomplete' rows; than you will have your last row again (only valid if your criteria to stay in is a row without NAs)

# your function
df_red <- df2[apply(df2, MARGIN =  1, FUN = function(x) !any(duplicated(x, incomparables = NA))), , drop = FALSE]

# new dataframe with na.omit() 
df_naOmit <- na.omit(df2)

# combine two dataframes
df_new <- rbind(df_red,df_naOmit)

The df_new dataframe includes only columns without duplicates but also the last completely filled row.

Upvotes: 1

r2evans

Reputation: 160952

Since you're really just looking for unique rows, removing identical elements within the row, try this:

df2a <- apply(df2, 1, unique)
df2a
# [[1]]
# [1] "Cash" NA    
# [[2]]
# [1] "Cash" NA    
# [[3]]
# [1] "Cash" NA    
# [[4]]
# [1] "Government" NA          
# [[5]]
# [1] "Government" "Security"   NA          
# [[6]]
# [1] "Government" "Security"   NA          
# [[7]]
# [1] "Government" "Security"   "Code"

Now we can use duplicated on the list:

df2b <- df2a[!duplicated(df2a)]
df2b
# [[1]]
# [1] "Cash" NA    
# [[2]]
# [1] "Government" NA          
# [[3]]
# [1] "Government" "Security"   NA          
# [[4]]
# [1] "Government" "Security"   "Code"

We could try to combine these back into a frame, but they need to be all the same length (padded with NA), so:

df2c <- lapply(df2b, `length<-`, max(lengths(df2b)))
df2c
# [[1]]
# [1] "Cash" NA     NA    
# [[2]]
# [1] "Government" NA           NA          
# [[3]]
# [1] "Government" "Security"   NA          
# [[4]]
# [1] "Government" "Security"   "Code"

Lastly, recombine into a frame:

as.data.frame(do.call(rbind, df2c))
#           V1       V2   V3
# 1       Cash     <NA> <NA>
# 2 Government     <NA> <NA>
# 3 Government Security <NA>
# 4 Government Security Code

Data:

df2 <- structure(list(Class.1 = c("Cash", "Cash", "Cash", "Government", 
"Government", "Government", "Government"), Class.2 = c(NA, "Cash", 
"Cash", NA, "Security", "Security", "Security"), Class.3 = c(NA, 
NA, "Cash", NA, NA, "Security", "Security"), Class.4 = c(NA, 
NA, NA, NA, NA, NA, "Code")), class = "data.frame", row.names = c(NA, 
-7L))

Upvotes: 1

How to remove rows in R based on specific criteria

Answers (2)

Related Questions