Reputation: 23
I have a data frame with three columns: Author, Date, and word. For example:
Author | Date | word |
---|---|---|
AuthorA | 12/01/01 | word1A |
AuthorB | 12/01/01 | word1b |
AuthorA | 12/01/01 | word2A |
I want to extract every word of each Author, and put it into a dataframe where each column represents a different author.
For example:
AuthorA | AuthorB | AuthorC |
---|---|---|
word1A | word1B | word1C |
word2A | word2B | word2C |
... | ... | ... |
I'm running into issues because each author has a different number of words, so the columns have different lengths. I've been using dplyr to try to extract the columns I need. What is the easiest way to do this?
Upvotes: 1
Reputation: 160447
library(dplyr)
library(tidyr) # pivot_wider
select(dat, -Date) %>%
distinct() %>%
group_by(Author) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(id_cols = rn, names_from = "Author", values_from = "word")
# # A tibble: 2 x 3
# rn AuthorA AuthorB
# <int> <chr> <chr>
# 1 1 word1A word1b
# 2 2 word2A <NA>
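Since `dat` is not defined in the snippet above, here is a minimal self-contained sketch that rebuilds it from the three sample rows in the question and runs the same pipeline. Naming `id_cols` explicitly and dropping the helper column `rn` at the end are my additions, on the assumption that only the author columns are wanted in the result.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question; `dat` is the name used in the answer.
dat <- data.frame(
  Author = c("AuthorA", "AuthorB", "AuthorA"),
  Date   = c("12/01/01", "12/01/01", "12/01/01"),
  word   = c("word1A", "word1b", "word2A")
)

wide <- dat %>%
  select(-Date) %>%
  distinct() %>%
  group_by(Author) %>%
  mutate(rn = row_number()) %>%  # position of each word within its author
  ungroup() %>%
  pivot_wider(id_cols = rn, names_from = Author, values_from = word) %>%
  select(-rn)                    # assumption: the index is not needed afterwards

wide
```

Authors with fewer words are padded with `NA`, which is what lets columns of different lengths share one frame.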
A small note about this: one premise of data.frames is that each row holds values that relate to one another. With pivoted data, this relationship can be difficult to trace back to the original data.
In this data, the only relationship between "word1A" and "word1b", for instance, is that each was the first word listed for its author in the starting frame, so the pairing may be arbitrary. That may be fine, but be careful not to read too much significance into it.
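If the row-wise pairing carries no meaning at all, one alternative (not part of the pipeline above, just a base R sketch) is to keep the words in a named list, which allows a different length per author without `NA` padding. The name `words_by_author` is made up for illustration.

```r
# Sample data copied from the question.
dat <- data.frame(
  Author = c("AuthorA", "AuthorB", "AuthorA"),
  Date   = c("12/01/01", "12/01/01", "12/01/01"),
  word   = c("word1A", "word1b", "word2A")
)

# One character vector per author; lengths may differ freely.
words_by_author <- split(dat$word, dat$Author)

# Drop repeated words within each author, mirroring distinct() above.
words_by_author <- lapply(words_by_author, unique)

lengths(words_by_author)
```

A list like this cannot be printed as a neat table, but it also never implies that the first word of one author belongs with the first word of another.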
Additionally, I chose to remove Date from the data, as it did not seem relevant. The previous paragraphs suggest perhaps another view: the first word in each author column could be the most frequent, the earliest, or the most recent, depending on how you order the initial frame.
You can implement each of those "significance" methods with something like:

- "most-frequent"

        dat %>%
          count(Author, word) %>%
          arrange(desc(n)) %>%
          distinct() %>% ...

- "earliest"

        dat %>%
          arrange(Date) %>%
          distinct() %>% ...

- "most-recent"

        dat %>%
          arrange(desc(Date)) %>%
          distinct() %>% ...
(For both of the Date-related sorts, you need to convert to a sortable Date object. I'm assuming that the current format is something like "%m/%d/%y", but either way the values need to be sortable, otherwise the last two bullets won't work for you.)
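As a rough sketch of the "earliest" variant, the code below converts `Date` with `as.Date()` and then reuses the pivot steps from the first snippet. It rests on several assumptions: the format really is `"%m/%d/%y"`, the elided `...` stands for the same `group_by()`/`row_number()`/`pivot_wider()` steps, and `distinct(Author, word)` is swapped in for `distinct()` so that a word repeated on a later date is kept only once. The first date is changed from the question's sample so the sort has a visible effect.

```r
library(dplyr)
library(tidyr)

# Illustrative data: same rows as the question, but the first date is altered
# (made up) so that ordering by Date changes the result.
dat <- data.frame(
  Author = c("AuthorA", "AuthorB", "AuthorA"),
  Date   = c("12/02/01", "12/01/01", "12/01/01"),
  word   = c("word1A", "word1b", "word2A")
)

earliest <- dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%y")) %>%  # assumed format
  arrange(Date) %>%
  distinct(Author, word) %>%     # keeps the first (earliest) row per pair
  group_by(Author) %>%
  mutate(rn = row_number()) %>%  # rank within author, earliest first
  ungroup() %>%
  pivot_wider(id_cols = rn, names_from = Author, values_from = word)

earliest
```

Replacing `arrange(Date)` with `arrange(desc(Date))` should give the "most-recent" ordering under the same assumptions.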
Upvotes: 3