hearse
hearse

Reputation: 389

Using spread correctly in R: Error

I have a dataframe inputData with columns acc_id, attrName, attrValue and timeStamp. The acc_id column will have duplicate entries because an example head of the data will be like this:

acc_id, attr_name, attr_value, timeStamp
123      abcVar1    138         Nov 1 2017
123      abcVar2    10          Nov 1 2017
123      abcVar3    2124        Nov 1 2017
468      abcVar1    13         Nov 9 2017
468      abcVar2    13458         Nov 9 2017
468      abcVar3    1238         Nov 9 2017

The situation can be generalized where attr_name will have a lot more than just 3 unique variables as well.

It needs to be converted into the format

acc_id abcVar1 abcVar2 abcVar3 timeStamp
123     138      10      2124  Nov 1 2017
468     13       13458   1238  Nov 9 2017

I tried spread(inputData,attr_name, attr_val) to get error

Upvotes: 2

Views: 2949

Answers (1)

AntoniosK
AntoniosK

Reputation: 16121

Based on your example above I get this:

library(tidyr)

df = read.table(text = "
acc_id attr_name attr_value timeStamp
123      abcVar1    138         Nov_1_2017
123      abcVar2    10          Nov_1_2017
123      abcVar3    2124        Nov_1_2017
468      abcVar1    13          Nov_9_2017
468      abcVar2    13458       Nov_9_2017
468      abcVar3    1238        Nov_9_2017
", header=T, stringsAsFactors=F)

spread(df, attr_name, attr_value)

#   acc_id  timeStamp abcVar1 abcVar2 abcVar3
# 1    123 Nov_1_2017     138      10    2124
# 2    468 Nov_9_2017      13   13458    1238

Maybe your example dataframe is not representative of your actual dataset. I cannot replicate your problem.

However, it seems to break when there are duplicate rows like this:

library(tidyr)
library(dplyr)

df = read.table(text = "
acc_id attr_name attr_value timeStamp
123      abcVar1    138         Nov_1_2017
123      abcVar2    10          Nov_1_2017
123      abcVar3    2124        Nov_1_2017
468      abcVar1    13          Nov_9_2017
468      abcVar1    13          Nov_9_2017
468      abcVar2    13458       Nov_9_2017
468      abcVar3    1238        Nov_9_2017
", header=T, stringsAsFactors=F)

spread(df, attr_name, attr_value)

# Error: Duplicate identifiers for rows (4, 5)

# deduplicate and reshape
df %>%
  distinct() %>%
  spread(attr_name, attr_value)

#   acc_id  timeStamp abcVar1 abcVar2 abcVar3
# 1    123 Nov_1_2017     138      10    2124
# 2    468 Nov_9_2017      13   13458    1238

Upvotes: 2

Related Questions