l2archer

Reputation: 13

Remove duplicate variables from data.table

I have a data set with over 6,000 columns. Each result comes with columns whose data are always the same.

   XCODE Age Sex ResultA Sex ResultB
1   X001  12   2       2       2       4
2   X002  23   2       4       2      66
3   X003  NA  NA      NA      NA      NA
4   X004  32   1       1       1       3
5   X005  NA  NA      NA      NA      NA
6   X001  NA  NA      NA      NA      NA
7   X002  NA  NA      NA      NA      NA
8   X003  33   1       8       1       6
9   X004  NA  NA      NA      NA      NA
10  X005  55   2       8       2       8

I would like to remove the duplicated variables, e.g. the Sex variable. Is it possible to do that with data.table?

Upvotes: 1

Views: 44

Answers (3)

IceCreamToucan

Reputation: 28675

You can use match if you need to check for equality of all values.

df2 <- df[, unique(match(df, df)), with = FALSE]

df2
#    XCODE Age Sex ResultA ResultB
# 1   X001  12   2       2       4
# 2   X002  23   2       4      66
# 3   X003  NA  NA      NA      NA
# 4   X004  32   1       1       3
# 5   X005  NA  NA      NA      NA
# 6   X001  NA  NA      NA      NA
# 7   X002  NA  NA      NA      NA
# 8   X003  33   1       8       6
# 9   X004  NA  NA      NA      NA
# 10  X005  55   2       8       8

Data used:

df <- fread('
   XCODE Age Sex ResultA Sex ResultB
1   X001  12   2       2       2       4
2   X002  23   2       4       2      66
3   X003  NA  NA      NA      NA      NA
4   X004  32   1       1       1       3
5   X005  NA  NA      NA      NA      NA
6   X001  NA  NA      NA      NA      NA
7   X002  NA  NA      NA      NA      NA
8   X003  33   1       8       1       6
9   X004  NA  NA      NA      NA      NA
10  X005  55   2       8       2       8
')[, -'V1']
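
As a possible alternative to match, duplicated() applied to the list of columns flags every column whose content repeats an earlier one. This is only a sketch on a hypothetical toy table (dt, keep and dt_unique are names made up for the illustration), not the approach used above. Note that duplicated() on a list compares type as well as values, so an integer column and a double column holding the same numbers would not count as duplicates.

```r
library(data.table)

# Hypothetical toy table: the second Sex column repeats the first
dt <- data.table(
  XCODE   = c("X001", "X002", "X003"),
  Age     = c(12L, 23L, NA),
  Sex     = c(2L, 2L, NA),
  ResultA = c(2L, 4L, NA),
  Sex     = c(2L, 2L, NA)
)

# TRUE for each column that is not a repeat of an earlier column
keep <- !duplicated(as.list(dt))

# Select the kept columns by position; with = FALSE makes j a column selector
dt_unique <- dt[, which(keep), with = FALSE]

names(dt_unique)
```

Selecting by position rather than by name avoids any ambiguity from the repeated Sex name.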

Upvotes: 2

Luis

Reputation: 639

If you have duplicated columns with different names, you can transpose your data frame, which lets you use the unique function to drop the duplicates. You then transpose it back and convert it to a data frame again (because it became a matrix when you transposed it).

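df <- data.frame(c = 1:5, a = c("A", "B", "C", "D", "E"), b = 1:5)

df <- t(df)           # transpose: columns become rows (result is a matrix)
df <- unique(df)      # drop duplicated rows, i.e. the duplicated columns
df <- t(df)           # transpose back
df <- data.frame(df)  # convert the matrix back to a data frame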

Edit: as markus points out, this is probably not a good option if you have columns of multiple types, because when t() coerces your data frame to a matrix it also coerces all your variables to the same type.
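
To make the coercion concrete, here is a small sketch using the same toy data frame. The names m and out are made up for the illustration, and the type.convert() step is just one possible way to recover the column types afterwards, not part of the approach above. Because numeric columns pass through format() on the way to a character matrix, the round trip may lose precision for non-integer values.

```r
df <- data.frame(c = 1:5, a = c("A", "B", "C", "D", "E"), b = 1:5)

# t() goes through as.matrix(), so the mixed-type columns all become character
m <- t(df)
is.character(m)

# Drop duplicated rows (the original columns), then transpose back
out <- data.frame(t(unique(m)))

# Assumption: re-guess each column's type from its text representation
out[] <- lapply(out, type.convert, as.is = TRUE)

str(out)
```

Without the type.convert() step, the integer column c would come back as text rather than as numbers.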

Upvotes: 1

LocoGris

Reputation: 4480

Try this:

 df[, unique(colnames(df)), with = FALSE]

One caveat: this drops every column that repeats an earlier column's name, keeping only the first one. In your case, it will drop the second Sex column even if the two columns have the same name but different content.
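
A small sketch of that caveat, on a hypothetical table where the two Sex columns differ (dt, by_name and same_content are made-up names). Selecting by unique name keeps only the first Sex column and drops the second without any warning, so it may be worth checking the content first:

```r
library(data.table)

# Hypothetical table: two columns named Sex with different content
dt <- data.table(Age = c(12, 23), Sex = c(2, 2), Sex = c(1, 2))

# Selecting by unique name keeps the first column for each name
by_name <- dt[, unique(names(dt)), with = FALSE]
names(by_name)

# Compare the two Sex columns by position before trusting the name-based drop
same_content <- identical(dt[[2]], dt[[3]])
same_content
```

Here same_content is FALSE, so the name-based selection would discard data that is not actually duplicated.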

Upvotes: 1
