Reputation: 1870
I have a data.table
with a country code and the corresponding country name. In some cases country is not known and "OTHER"
is used to indicate the unknown code. My data is inconsistent, in that for some code
I have the country
name but also an line with "OTHER",
for example IRLAND
or LUXEMBURG
.
code <- c(104, 105, 105, 106, 109, 112, 115, 115)
country <- c("GERMANY", "IRLAND", "OTHER", "FRANCE", "FRANCE", "ITALY", "OTHER", "LUXEMBURG")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)
What I want to do: I want to get unique code
- for every code, there will be only one line, ideally with a country name, if not available then with "OTHER". I am looking for the simplest solution possible.
For now i was think of first checking if my data.table
has some of those inconsistencies. If yes, then remove all those line that have both - country name and "OTHER" in the column country
. I have tried the following, but none of the duplicates
get removed
if (length(unique(id_country$code)) != length(unique(id_country))){
# replace "OTHER" with the corresponding country name
duplicates <- id_country[duplicated(code),]
id_country <- id_country[!(id_country$code %in% duplicates & id_country$country == "OTHER"),]
}
desired output:
code <- c(104, 105, 106, 109, 112, 115)
country <- c("GERMANY", "IRLAND", "FRANCE", "FRANCE", "ITALY", "LUXEMBURG")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)
Upvotes: 0
Views: 342
Reputation: 10875
If the desire is to eliminate the items that have OTHER
and all items marked OTHER
are duplicates of another country name, we can simply select rows where country is not equal to OTHER
.
library(data.table)
code <- c(104, 105, 105, 106, 109, 112, 115, 115)
country <- c("GERMANY", "IRLAND", "OTHER", "FRANCE", "FRANCE", "ITALY", "OTHER", "LUXEMBURG")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)
id_country[country != "OTHER",]
If there are "valid" unknown countries in the list (i.e. a non duplicated country code with a name of OTHER
, the solution is a bit more complicated.
First, we'll find the duplicated countries after modifying the input data with a valid OTHER
, country 117.
library(data.table)
code <- c(104, 105, 105, 106, 109, 112, 115, 115,117)
country <- c("GERMANY", "IRLAND", "OTHER", "FRANCE", "FRANCE", "ITALY", "OTHER",
"LUXEMBURG","OTHER")
id_country <- cbind(code, country)
id_country <- as.data.table(id_country)
dupCodes <- id_country[, 'count' := .N, by = code][count > 1,.SD[1],by = code][[1]]
Then, we'll only delete rows where country
is OTHER
and there are duplicates of code
.
id_country[country != "OTHER" | !(code %in% dupCodes),]
...and the output:
> id_country[country != "OTHER" | !(code %in% dupCodes),]
code country
1: 104 GERMANY
2: 105 IRLAND
3: 106 FRANCE
4: 109 FRANCE
5: 112 ITALY
6: 115 LUXEMBURG
7: 117 OTHER
>
As originally written, the code in the question post includes a subtle error that causes the final subsetting operation to always fail.
duplicates <- id_country[duplicated(code),]
Since duplicates
is a data.table
, not a numeric vector, the following code fragment always evaluates to FALSE.
id_country$code %in% duplicates
The fact that this code is inside an if()
block makes it difficult to see that duplicates
is a data table, since we can't see it in the RStudio environment viewer. If I run the code block outside the if()
block and inspect the object, I see the following.
Clearly duplicates
is not a vector.
This problem causes the rest of the subsetting operation to return every row from the input data table.
id_country <- id_country[!(id_country$code %in% duplicates & id_country$country == "OTHER"),]
Why?
id_country$code %in% duplicates
> id_country$code %in% duplicates
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
We can correct the defect by returning a vector instead of a data table from the line of code that makes the duplicates
assignment as follows.
if (length(unique(id_country$code)) != length(unique(id_country))){
# extract first column of resulting data.table as a vector
duplicates <- id_country[duplicated(code),][[1]]
# subset out duplicate rows named OTHER
id_country <- id_country[!(id_country$code %in% duplicates & id_country$country == "OTHER"),]
}
id_country
...and the output:
> id_country
code country
1: 104 GERMANY
2: 105 IRLAND
3: 106 FRANCE
4: 109 FRANCE
5: 112 ITALY
6: 115 LUXEMBURG
>
Upvotes: 0
Reputation: 462
Here is another option if you are willing to use dplyr
:
code_n
counts the number of replicates incode
and theifelse
conditionally populates thecountry
variable while replacing any "OTHER" entries withNaN
for duplicatecode
. TheNaN
entries are later filtered out. The code also works if there is a single entry for a unique code with "OTHER" as country filtering != "OTHER" will not work in that case.
id_country %>% group_by (code) %>% mutate(code_n=n()) %>% mutate(country = ifelse(code_n == 1,country,ifelse(country!="OTHER",country,NaN))) %>% filter(country!=NaN) %>% select(-code_n)
Output
# Groups: code [6]
code country
<chr> <chr>
1 104 GERMANY
2 105 IRLAND
3 106 FRANCE
4 109 FRANCE
5 112 ITALY
6 115 LUXEMBURG
Upvotes: 1
Reputation: 389295
We can check condition with if
:
library(data.table)
id_country[, .(country = if(any(country != 'OTHER'))
country[country != 'OTHER'][1L] else 'OTHER'), code]
# code country
#1: 104 GERMANY
#2: 105 IRLAND
#3: 106 FRANCE
#4: 109 FRANCE
#5: 112 ITALY
#6: 115 LUXEMBURG
Upvotes: 0
Reputation: 6226
I have two solutions in mind. Both of them will produce the same result. The second one is maybe more appropriate if you have voluminous data because it avoids making a merge.
In your example there was no observation where only one code is associated with OTHER
. This kind of observations won't be modified by the both methods.
The first one is based on merge
. The idea is to clean on a data.table
that only contains the code and then merge with the initial data
# METHODE 1: MERGE
id_country2 <- id_country[,.('clean_code' = unique(country)), by = code]
id_country2[, 'number_codes' := .N, by = code]
id_country2 <- id_country2[!(number_codes == 2 & clean_code == "OTHER")]
merge(id_country, id_country2)
code number_codes country clean_code
1: 104 1 GERMANY GERMANY
2: 105 2 IRLAND IRLAND
3: 105 2 OTHER IRLAND
4: 106 1 FRANCE FRANCE
5: 109 1 FRANCE FRANCE
6: 112 1 ITALY ITALY
7: 115 2 OTHER LUXEMBURG
8: 115 2 LUXEMBURG LUXEMBURG
The second solution uses conditional replacement directly in the initial dataframe. The idea is to create a function before that performs the substitution and then apply it only on some codes.
The function is:
replace_country <- function(x){
val <- unique(x)
return(
gsub(pattern = "OTHER", replacement = val[val != "OTHER"][1],
x)
)
}
There's probably more elegant way to define it but it will do the job. By the way, I put val[val != "OTHER"][1]
to ensure you only put one value in replacement. This might be extra carefulness but just in case.
This function will be called using lapply
+SD
verbs
id_country[, 'number_codes' := uniqueN(country), by = "code"]
id_country[number_codes > 1, country := lapply(.SD, replace_country), .SDcols = "country",
by = "code"]
code country number_codes
1: 104 GERMANY 1
2: 105 IRLAND 2
3: 105 IRLAND 2
4: 106 FRANCE 1
5: 109 FRANCE 1
6: 112 ITALY 1
7: 115 LUXEMBURG 2
8: 115 LUXEMBURG 2
You only apply the replace_country
function on the observations with number_codes
>1 with this syntax. Your dataframe is directly updated by reference
Upvotes: 0