Nneka
Nneka

Reputation: 1870

removing rows based on two conditions from a data.table in R

I have a data.table with a country code and the corresponding country name. In some cases country is not known and "OTHER" is used to indicate the unknown code. My data is inconsistent, in that for some code I have the country name but also an line with "OTHER", for example IRLAND or LUXEMBURG.

code <- c(104, 105, 105, 106, 109, 112, 115, 115)
country <- c("GERMANY", "IRLAND", "OTHER", "FRANCE", "FRANCE", "ITALY",  "OTHER", "LUXEMBURG")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)

What I want to do: I want to get unique code - for every code, there will be only one line, ideally with a country name, if not available then with "OTHER". I am looking for the simplest solution possible.

For now i was think of first checking if my data.table has some of those inconsistencies. If yes, then remove all those line that have both - country name and "OTHER" in the column country . I have tried the following, but none of the duplicates get removed

if (length(unique(id_country$code)) != length(unique(id_country))){

  # replace "OTHER" with the corresponding country name 
  duplicates <- id_country[duplicated(code),]
  id_country <- id_country[!(id_country$code %in% duplicates & id_country$country == "OTHER"),]

}

desired output:

code <- c(104, 105, 106, 109, 112, 115)
country <- c("GERMANY", "IRLAND", "FRANCE", "FRANCE", "ITALY", "LUXEMBURG")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)

Upvotes: 0

Views: 342

Answers (4)

Len Greski
Len Greski

Reputation: 10875

If the desire is to eliminate the items that have OTHER and all items marked OTHER are duplicates of another country name, we can simply select rows where country is not equal to OTHER.

library(data.table)
code <- c(104, 105, 105, 106, 109, 112, 115, 115)
country <- c("GERMANY", "IRLAND", "OTHER", "FRANCE", "FRANCE", "ITALY",  "OTHER", "LUXEMBURG")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)

id_country[country != "OTHER",]

If there are "valid" unknown countries in the list (i.e. a non duplicated country code with a name of OTHER, the solution is a bit more complicated.

First, we'll find the duplicated countries after modifying the input data with a valid OTHER, country 117.

library(data.table)
code <- c(104, 105, 105, 106, 109, 112, 115, 115,117)
country <- c("GERMANY", "IRLAND", "OTHER", "FRANCE", "FRANCE", "ITALY",  "OTHER", 
             "LUXEMBURG","OTHER")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)
dupCodes <- id_country[, 'count' := .N, by = code][count > 1,.SD[1],by = code][[1]]

Then, we'll only delete rows where country is OTHER and there are duplicates of code.

id_country[country != "OTHER" | !(code %in% dupCodes),]

...and the output:

> id_country[country != "OTHER" | !(code %in% dupCodes),]
   code   country
1:  104   GERMANY
2:  105    IRLAND
3:  106    FRANCE
4:  109    FRANCE
5:  112     ITALY
6:  115 LUXEMBURG
7:  117     OTHER
> 

Correcting the original coding error

As originally written, the code in the question post includes a subtle error that causes the final subsetting operation to always fail.

 duplicates <- id_country[duplicated(code),]

Since duplicates is a data.table, not a numeric vector, the following code fragment always evaluates to FALSE.

id_country$code %in% duplicates

The fact that this code is inside an if() block makes it difficult to see that duplicates is a data table, since we can't see it in the RStudio environment viewer. If I run the code block outside the if() block and inspect the object, I see the following.

enter image description here

Clearly duplicates is not a vector.

This problem causes the rest of the subsetting operation to return every row from the input data table.

id_country <- id_country[!(id_country$code %in% duplicates & id_country$country == "OTHER"),]

Why?

id_country$code %in% duplicates
> id_country$code %in% duplicates
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>

We can correct the defect by returning a vector instead of a data table from the line of code that makes the duplicates assignment as follows.

if (length(unique(id_country$code)) != length(unique(id_country))){

     # extract first column of resulting data.table as a vector
     duplicates <- id_country[duplicated(code),][[1]]
     # subset out duplicate rows named OTHER
     id_country <- id_country[!(id_country$code %in% duplicates & id_country$country == "OTHER"),]

}
id_country

...and the output:

> id_country
   code   country
1:  104   GERMANY
2:  105    IRLAND
3:  106    FRANCE
4:  109    FRANCE
5:  112     ITALY
6:  115 LUXEMBURG
> 

Upvotes: 0

sachin2014
sachin2014

Reputation: 462

Here is another option if you are willing to use dplyr:

code_n counts the number of replicates in code and the ifelse conditionally populates the country variable while replacing any "OTHER" entries with NaN for duplicate code. The NaN entries are later filtered out. The code also works if there is a single entry for a unique code with "OTHER" as country filtering != "OTHER" will not work in that case.

id_country %>% group_by (code) %>% mutate(code_n=n()) %>% mutate(country = ifelse(code_n == 1,country,ifelse(country!="OTHER",country,NaN))) %>% filter(country!=NaN) %>% select(-code_n)

Output

A tibble: 6 x 2

# Groups:   code [6]
  code  country  
  <chr> <chr>    
1 104   GERMANY  
2 105   IRLAND   
3 106   FRANCE   
4 109   FRANCE   
5 112   ITALY    
6 115   LUXEMBURG

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389295

We can check condition with if :

library(data.table)

id_country[, .(country = if(any(country != 'OTHER')) 
                        country[country != 'OTHER'][1L] else 'OTHER'), code]

#   code   country
#1:  104   GERMANY
#2:  105    IRLAND
#3:  106    FRANCE
#4:  109    FRANCE
#5:  112     ITALY
#6:  115 LUXEMBURG

Upvotes: 0

linog
linog

Reputation: 6226

I have two solutions in mind. Both of them will produce the same result. The second one is maybe more appropriate if you have voluminous data because it avoids making a merge.

In your example there was no observation where only one code is associated with OTHER. This kind of observations won't be modified by the both methods.

Solution 1

The first one is based on merge. The idea is to clean on a data.table that only contains the code and then merge with the initial data

# METHODE 1: MERGE
id_country2 <- id_country[,.('clean_code' = unique(country)), by = code]
id_country2[, 'number_codes' := .N, by = code]
id_country2 <- id_country2[!(number_codes == 2 & clean_code == "OTHER")]

merge(id_country, id_country2)

   code number_codes   country clean_code
1:  104            1   GERMANY    GERMANY
2:  105            2    IRLAND     IRLAND
3:  105            2     OTHER     IRLAND
4:  106            1    FRANCE     FRANCE
5:  109            1    FRANCE     FRANCE
6:  112            1     ITALY      ITALY
7:  115            2     OTHER  LUXEMBURG
8:  115            2 LUXEMBURG  LUXEMBURG

Solution 2

The second solution uses conditional replacement directly in the initial dataframe. The idea is to create a function before that performs the substitution and then apply it only on some codes.

The function is:

replace_country <- function(x){
  val <- unique(x)
  return(
     gsub(pattern = "OTHER", replacement = val[val != "OTHER"][1],
          x)
) 
}

There's probably more elegant way to define it but it will do the job. By the way, I put val[val != "OTHER"][1] to ensure you only put one value in replacement. This might be extra carefulness but just in case.

This function will be called using lapply+SD verbs

id_country[, 'number_codes' := uniqueN(country), by = "code"]
id_country[number_codes > 1,  country := lapply(.SD, replace_country), .SDcols = "country",
           by = "code"]


  code   country number_codes
1:  104   GERMANY            1
2:  105    IRLAND            2
3:  105    IRLAND            2
4:  106    FRANCE            1
5:  109    FRANCE            1
6:  112     ITALY            1
7:  115 LUXEMBURG            2
8:  115 LUXEMBURG            2

You only apply the replace_country function on the observations with number_codes>1 with this syntax. Your dataframe is directly updated by reference

Upvotes: 0

Related Questions