Reputation: 13

Remove duplicate variables from data.table

I have data with over 6k columns. Each result comes with columns whose data are always the same (for example, Sex appears twice below).

   XCODE Age Sex ResultA Sex ResultB
1   X001  12   2       2   2       4
2   X002  23   2       4   2      66
3   X003  NA  NA      NA  NA      NA
4   X004  32   1       1   1       3
5   X005  NA  NA      NA  NA      NA
6   X001  NA  NA      NA  NA      NA
7   X002  NA  NA      NA  NA      NA
8   X003  33   1       8   1       6
9   X004  NA  NA      NA  NA      NA
10  X005  55   2       8   2       8

I would like to remove the duplicated variables, e.g. the second Sex column. Is it possible to do that with data.table?

Upvotes: 1

Answers (3)

IceCreamToucan

Reputation: 28705

You can use match if you need to check for equality of all values.

df2 <- df[, unique(match(df, df)), with = FALSE]

df2
#    XCODE Age Sex ResultA ResultB
# 1   X001  12   2       2       4
# 2   X002  23   2       4      66
# 3   X003  NA  NA      NA      NA
# 4   X004  32   1       1       3
# 5   X005  NA  NA      NA      NA
# 6   X001  NA  NA      NA      NA
# 7   X002  NA  NA      NA      NA
# 8   X003  33   1       8       6
# 9   X004  NA  NA      NA      NA
# 10  X005  55   2       8       8

Data used:

df <- fread('
   XCODE Age Sex ResultA Sex ResultB
1   X001  12   2       2       2       4
2   X002  23   2       4       2      66
3   X003  NA  NA      NA      NA      NA
4   X004  32   1       1       1       3
5   X005  NA  NA      NA      NA      NA
6   X001  NA  NA      NA      NA      NA
7   X002  NA  NA      NA      NA      NA
8   X003  33   1       8       1       6
9   X004  NA  NA      NA      NA      NA
10  X005  55   2       8       2       8
')[, -'V1']
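
To see what match is doing here, the following is a minimal sketch on a made-up table (the names dt, first_pos and dedup are hypothetical, not from the question). Applied to a list of columns, match returns for each column the position of the first column with the same contents, so unique() of that result keeps one column per distinct content, regardless of the column names.

```r
library(data.table)

# Hypothetical toy table: column b repeats the contents of a under another name.
dt <- data.table(a = c(1L, 2L, NA), x = c("p", "q", "r"), b = c(1L, 2L, NA))

# For each column, the position of the first column with identical contents.
first_pos <- match(dt, dt)

# Keep one column per distinct content (integer positions, hence with = FALSE).
dedup <- dt[, unique(first_pos), with = FALSE]
print(first_pos)
print(dedup)
```

Because the comparison is on contents only, two columns that happen to hold the same values are merged even when they mean different things, so it is worth checking first_pos before dropping anything.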

Upvotes: 2

Luis

Reputation: 639

If you have duplicated columns with different names, you can transpose your data frame, which lets you apply the unique function to what are now rows. Then transpose it back and convert it back to a data frame (because it became a matrix when you transposed it).

df = data.frame(c = 1:5, a = c("A", "B","C","D","E"), b = 1:5)

df = t(df)
df = unique(df)
df = t(df)
df = data.frame(df)

Edit: as markus points out, this is probably not a good option if you have columns of multiple types, because when t() coerces your data frame to a matrix it also coerces all your variables to the same type.
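
If the type coercion is a problem, one possible alternative (a different technique from the transpose approach, sketched here on the same example data) is to call duplicated() on the data frame viewed as a list of columns. No matrix is created, so each kept column retains its type. The name res is only for illustration.

```r
# Same example data as above: b duplicates c under a different name.
df <- data.frame(c = 1:5, a = c("A", "B", "C", "D", "E"), b = 1:5)

# duplicated() on a list flags every column whose contents already appeared.
dup_cols <- duplicated(as.list(df))

# List-style indexing keeps the non-duplicated columns with their original types.
res <- df[!dup_cols]
str(res)
```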

Upvotes: 1

LocoGris

Reputation: 4480

Try this:

 df[, unique(colnames(df)), with = FALSE]

One caveat: this deduplicates by name only, keeping the first column for each name. In your case, it will drop the second Sex column even if the two columns share a name but have different content.
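
The caveat can be illustrated with a small sketch on made-up data (dt, keep and by_name are hypothetical names). Here the two Sex columns differ in content, and selecting by position of the first occurrence of each name silently discards the second one.

```r
library(data.table)

# Hypothetical table with two columns named Sex that hold different values.
dt <- data.table(id = 1:3, Sex = c(1L, 2L, 1L), Sex = c(9L, 9L, 9L))

# Positions of the first occurrence of each column name.
keep <- which(!duplicated(names(dt)))

# Name-based deduplication: the second Sex column is dropped without comparison.
by_name <- dt[, keep, with = FALSE]
print(by_name)
```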

Upvotes: 1
