Make new dataframe from rows of another

Question

I have a dataframe with 3 columns: Author, Date and word. E.g.

Author	Date	word
AuthorA	12/01/01	word1A
AuthorB	12/01/01	word1b
AuthorA	12/01/01	word2A

I want to extract every word of each Author, and put it into a dataframe where each column represents a different author.

E.g.

AuthorA	AuthorB	AuthorC
word1A	word1B	word1C
word2A	word2B	word2C
...	...	...

I'm running into issues with each Author having a different number of words, and therefore a different length of column. I've been using dplyr to try and extract the columns I need. What is the easiest way of doing this?

r2evans · Accepted Answer

library(dplyr)
library(tidyr) # pivot_wider
select(dat, -Date) %>%
  distinct() %>%
  group_by(Author) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = rn, names_from = "Author", values_from = "word")
# # A tibble: 2 x 3
#      rn AuthorA AuthorB
#   <int> <chr>   <chr>  
# 1     1 word1A  word1b 
# 2     2 word2A  <NA>   
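
For anyone who wants to run this end to end, here is a minimal sketch that builds the sample rows from the question and then drops the helper column so the result has one column per author. The frame name `dat` follows the code above; the result names `wide` and `authors_only` are only illustrative. Naming `id_cols` explicitly is a cautious choice, since recent tidyr releases no longer accept it positionally.

```r
library(dplyr)
library(tidyr)

# Sample rows copied from the question.
dat <- data.frame(
  Author = c("AuthorA", "AuthorB", "AuthorA"),
  Date   = c("12/01/01", "12/01/01", "12/01/01"),
  word   = c("word1A", "word1b", "word2A")
)

wide <- dat %>%
  select(-Date) %>%
  distinct() %>%
  group_by(Author) %>%
  mutate(rn = row_number()) %>%   # position of each word within its author
  ungroup() %>%
  pivot_wider(id_cols = rn, names_from = Author, values_from = word)

# Drop the helper index to get one column per author.
# Authors with fewer words are padded with NA at the bottom.
authors_only <- select(wide, -rn)
authors_only
```

The NA padding is what lets authors with different word counts share one frame: every column has the same number of rows, and the shorter ones simply end in missing values.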

A small note about this: one premise of data.frames is that each row has data that correlates with the other values on that row. With pivoted data, this correlation can be difficult to trace back to the original data and its relationships.

In the case of this data, the only correlation of "word1A" with "word1b", for instance, is that they were both the first word listed for their respective authors within the starting frame, so the relationship may be arbitrary. This may be fine, but be careful not to attach too much significance to it.

Additionally, I chose to remove Date from the data, as it did not seem relevant. The previous paragraphs suggest perhaps another view: the first word in each author-column could suggest either the most-frequent, the earliest, or the most-recent, depending on how you order the initial frame.

You can implement each of those "significance" methods with something like:

"most-frequent"

dat %>%
  count(Author, word) %>%
  arrange(desc(n)) %>%
  distinct() %>% ...

"earliest"

dat %>%
  arrange(Date) %>%
  distinct() %>% ...

"most-recent"

dat %>%
  arrange(desc(Date)) %>%
  distinct() %>% ...

(Both of the Date-related sorts require converting the column to a sortable Date object first. I'm assuming that the current format is something like "%m/%d/%y", but whatever it is, the values must sort chronologically, otherwise the "earliest" and "most-recent" orderings won't work for you.)
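
To make the three orderings concrete, here is a rough sketch under a few assumptions: the sample rows are invented so that the ordering actually changes the result, the date format is taken to be "%m/%d/%y", and `words_wide()` is a hypothetical helper that wraps the numbering and pivoting steps. It uses `distinct(Author, word)` rather than a bare `distinct()` so that a word repeated on several dates is kept only at its first position in the chosen order.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample rows; the dates are invented so that ordering matters.
dat <- data.frame(
  Author = c("AuthorA", "AuthorB", "AuthorA", "AuthorA"),
  Date   = c("12/03/01", "12/01/01", "12/01/01", "12/02/01"),
  word   = c("late", "only", "early", "late")
)

# Hypothetical helper: number the words within each author in their current
# row order, then spread the authors into columns.
words_wide <- function(x) {
  x %>%
    distinct(Author, word) %>%      # keeps the first occurrence of each pair
    group_by(Author) %>%
    mutate(rn = row_number()) %>%
    ungroup() %>%
    pivot_wider(id_cols = rn, names_from = Author, values_from = word)
}

# Convert the strings to Date objects so they sort chronologically.
# The format "%m/%d/%y" is an assumption; adjust it to match the real data.
dat_dated <- dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%y"))

earliest    <- dat_dated %>% arrange(Date) %>% words_wide()
most_recent <- dat_dated %>% arrange(desc(Date)) %>% words_wide()

most_frequent <- dat %>%
  count(Author, word) %>%
  arrange(desc(n)) %>%
  words_wide()

earliest$AuthorA       # "early" "late"
most_recent$AuthorA    # "late"  "early"
most_frequent$AuthorA  # "late"  "early"
```

In each result the first row of an author's column is that author's earliest, most recent, or most frequent word, which gives the row position a meaning beyond the accident of the original row order.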
