Reputation: 24583
This is related to "Are there more elegant ways to transform ragged data into a tidy dataframe".
Why is the following code not working?
events = structure(list(date = structure(c(-714974, -714579, -717835), class = "Date"),
days = c(1, 6, 0.5), name = c("Intro to stats", "Stats Winter school",
"TidyR tools"), topics = c("probability|R", "R|regression|ggplot",
"tidyR|dplyr")), .Names = c("date", "days", "name", "topics"
), row.names = c(NA, -3L), class = "data.frame")
> newdf <- data.frame(topic=character(), days=character())
> for(i in 1:length(events$topics)){
+ xx = unlist(strsplit(events$topics[i],'\\|'))
+ for(j in 1:length(xx)){
+ yy = c(xx[j], events$days[i]/length(xx))
+ print(yy)
+ newdf=rbind(newdf, yy)
+ }
+ }
[1] "probability" "0.5"
[1] "R" "0.5"
[1] "R" "2"
[1] "regression" "2"
[1] "ggplot" "2"
[1] "tidyR" "0.25"
[1] "dplyr" "0.25"
There were 11 warnings (use warnings() to see them)
> newdf
X.probability. X.0.5.
1 probability 0.5
2 <NA> 0.5
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
>
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA ... :
invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
3: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
4: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
5: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
6: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
7: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
8: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
9: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
10: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
11: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
>
yy is okay but rbind is not working. Where is the error and how can it be corrected? Thanks for your help.
Upvotes: 5
Views: 10619
Reputation: 887471
You may try:
newdf <- data.frame(topic=character(), daysPerTopic=numeric(), stringsAsFactors=FALSE)
for(i in 1:length(events$topics)){
  # split the i-th topics string on the literal "|"
  xx = unlist(strsplit(events$topics[i],'\\|'))
  for(j in 1:length(xx)){
    # build a one-row data frame so that column names and types match newdf
    yy = data.frame(topic=xx[j], daysPerTopic=events$days[i]/length(xx), stringsAsFactors=FALSE)
    newdf <- rbind(newdf, yy)
  }
}
newdf
# topic daysPerTopic
# 1 probability 0.50
# 2 R 0.50
# 3 R 2.00
# 4 regression 2.00
# 5 ggplot 2.00
# 6 tidyR 0.25
# 7 dplyr 0.25
Or
op <- options(stringsAsFactors=FALSE) # set to FALSE; the previous setting is saved in op
#Your code
newdf <- data.frame(topic=character(), days=character())
for(i in 1:length(events$topics)){
xx = unlist(strsplit(events$topics[i],'\\|'))
for(j in 1:length(xx)){
yy = c(xx[j], events$days[i]/length(xx))
print(yy)
newdf=rbind(newdf, yy)
}
}
newdf
# X.probability. X.0.5.
# 1 probability 0.5
# 2 R 0.5
# 3 R 2
# 4 regression 2
# 5 ggplot 2
# 6 tidyR 0.25
# 7 dplyr 0.25
options(op) # set back to the previous setting
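If changing a global option feels too invasive, the loop can also collect one-row data frames in a list and bind them once at the end. The following is only a sketch under that assumption; `rows` is a name introduced here for illustration, and the sample data is cut down to the two columns the loop uses.

```r
# Cut-down copy of the question's data (only the columns used below)
events <- data.frame(
  days   = c(1, 6, 0.5),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

rows <- list()  # hypothetical helper: one small data frame per topic
for (i in seq_along(events$topics)) {
  # split the i-th topics string on the literal "|"
  xx <- strsplit(events$topics[i], "|", fixed = TRUE)[[1]]
  for (topic in xx) {
    rows[[length(rows) + 1]] <- data.frame(
      topic        = topic,
      daysPerTopic = events$days[i] / length(xx),
      stringsAsFactors = FALSE
    )
  }
}

# Bind all rows in a single call instead of growing newdf step by step
newdf <- do.call(rbind, rows)
newdf
```

Because every piece is a data frame with the same column names and `stringsAsFactors = FALSE`, no factor conversion can take place, whatever the session default is.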
Upvotes: 5
Reputation: 92302
Did you even try to debug your for loop? For example, by adding print(class(yy)) and print(str(newdf)) inside the loop, you would see that after the first iteration both newdf columns become factors.
# [1] "probability" "0.5"
# [1] "character"
# 'data.frame': 0 obs. of 2 variables:
# $ topic: Factor w/ 0 levels:
# $ days : Factor w/ 0 levels:
# NULL
# [1] "R" "0.5"
# [1] "character"
# 'data.frame': 1 obs. of 2 variables:
# $ X.probability.: Factor w/ 1 level "probability": 1
# $ X.0.5. : Factor w/ 1 level "0.5": 1
# NULL
# [1] "R" "2"
# [1] "character"
# 'data.frame': 2 obs. of 2 variables:
# $ X.probability.: Factor w/ 1 level "probability": 1 NA
# $ X.0.5. : Factor w/ 1 level "0.5": 1 1
...
You would say "but I defined them as character". True, but if you read the rbind documentation, you will see that
For cbind (rbind), vectors of zero length (including NULL) are ignored unless the result would have zero rows (columns), for S compatibility. (Zero-extent matrices do not occur in S3 and are not ignored in R.)
Another property of rbind is that it inherits its behaviour from data.frame, and part of that behaviour is stringsAsFactors = TRUE.
What happened here can easily be illustrated with a dummy example. Consider:
temp <- data.frame(A = letters[1:3])
str(temp)
## 'data.frame': 3 obs. of 1 variable:
## $ A: Factor w/ 3 levels "a","b","c": 1 2 3
temp$A[3] <- "d"
## Warning message:
## In `[<-.factor`(`*tmp*`, 3, value = c(1L, 2L, NA)) :
## invalid factor level, NA generated
temp$A
## [1] a b <NA>
## Levels: a b c
You can see two things here:
1. data.frame automatically converted the character class to factor
2. If you assign a value that is not among the existing levels of a factor vector, it converts it into NA and throws the exact warning you were receiving
As mentioned by @akrun, setting options(stringsAsFactors=FALSE) will solve your problem.
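The dummy example above can also be repaired locally, without touching any option. The sketch below shows two ways to do that; `fixed1` and `fixed2` are names made up for illustration. `stringsAsFactors = TRUE` is written out explicitly because, as far as I know, the default changed to FALSE in R 4.0.0, so on newer versions the factor conversion no longer happens on its own.

```r
# Spell out stringsAsFactors = TRUE so the demo behaves the same
# regardless of the session default
temp <- data.frame(A = letters[1:3], stringsAsFactors = TRUE)

# Option 1: convert the column to character before assigning
fixed1 <- temp
fixed1$A <- as.character(fixed1$A)
fixed1$A[3] <- "d"

# Option 2: keep the factor, but register the new level first
fixed2 <- temp
levels(fixed2$A) <- c(levels(fixed2$A), "d")
fixed2$A[3] <- "d"

str(fixed1)
str(fixed2)
```

Option 1 is usually the simpler choice for free-text columns such as topic names; option 2 only makes sense when the column really is categorical.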
Upvotes: 5
Reputation: 3296
Set options(stringsAsFactors=FALSE) and your code should work as expected. The warnings and NAs in the result come from the implicit conversion to factors and the resulting type mismatch between the newdf columns and yy; see https://stackoverflow.com/a/1640729/1541036.
For a cleaner way of achieving the same result, here's a group-by solution using data.table:
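library(data.table)
events <- as.data.table(events)
# one row per (event, topic): split topics on the literal "|" within each event
events2 <- events[, list(topic=unlist(strsplit(topics, '|', fixed=TRUE))), by=c("date", "days", "name")]
# share each event's days equally among its topics (.N = number of rows in the group)
events2[, daysPerTopic := days / .N, by=name]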
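For comparison, the same reshaping can be sketched in base R without a loop or an extra package. This is only an illustration, not part of the answer above; `parts` and `n` are names introduced here, and the sample data is cut down to the two columns that are needed.

```r
# Cut-down copy of the question's data (only the columns used below)
events <- data.frame(
  days   = c(1, 6, 0.5),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

# Split every topics string on the literal "|" in one call
parts <- strsplit(events$topics, "|", fixed = TRUE)
n     <- lengths(parts)  # number of topics per event

# Build the result in one step: each event's share of days is
# repeated once per topic of that event
newdf <- data.frame(
  topic        = unlist(parts),
  daysPerTopic = rep(events$days / n, times = n),
  stringsAsFactors = FALSE
)
newdf
```

Since the data frame is created once with `stringsAsFactors = FALSE`, the factor-level warnings from the question cannot occur here.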
Upvotes: 3