Gregor Hamill

Reputation: 23

Make new dataframe from rows of another

I have a dataframe with three columns: Author, Date, and word. For example:

Author Date word
AuthorA 12/01/01 word1A
AuthorB 12/01/01 word1b
AuthorA 12/01/01 word2A

I want to extract every word of each Author, and put it into a dataframe where each column represents a different author.

For example:

AuthorA AuthorB AuthorC
word1A word1B word1C
word2A word2B word2C
... ... ...

I'm running into issues with each Author having a different number of words, and therefore a different length of column. I've been using dplyr to try and extract the columns I need. What is the easiest way of doing this?

Upvotes: 1

Views: 60

Answers (1)

r2evans

Reputation: 160447

library(dplyr)
library(tidyr) # pivot_wider
select(dat, -Date) %>%
  distinct() %>%
  group_by(Author) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = rn, names_from = "Author", values_from = "word")
# # A tibble: 2 x 3
#      rn AuthorA AuthorB
#   <int> <chr>   <chr>  
# 1     1 word1A  word1b 
# 2     2 word2A  <NA>   
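
For anyone who wants to run this end to end, here is a minimal sketch. It assumes `dat` is built from the three rows shown in the question (the construction of `dat` and the name `wide` are mine, not from the question), and it drops the helper column `rn` at the end so that only the author columns remain, as in the requested layout.

```r
library(dplyr)
library(tidyr)

# Assumed sample data: the three rows from the question.
dat <- data.frame(
  Author = c("AuthorA", "AuthorB", "AuthorA"),
  Date   = c("12/01/01", "12/01/01", "12/01/01"),
  word   = c("word1A", "word1b", "word2A"),
  stringsAsFactors = FALSE
)

wide <- dat %>%
  select(-Date) %>%
  distinct() %>%
  group_by(Author) %>%
  mutate(rn = row_number()) %>%   # position of each word within its author
  ungroup() %>%
  pivot_wider(id_cols = rn, names_from = Author, values_from = word) %>%
  select(-rn)                     # keep only the per-author columns

wide
```

Authors with fewer words are padded with `NA`, which is how the differing column lengths are handled.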

A small note about this: one premise of data.frames is that each row has data that correlates with the other values on that row. With pivoted data, this correlation can be difficult to trace back to the original data and its relationships.

In the case of this data, the only correlation of "word1A" with "word1b", for instance, is that they were both the first word listed for their respective authors within the starting frame, so the relationship may be arbitrary. This may be fine, but be careful not to attach too much significance to it.

Additionally, I chose to remove Date from the data, as it did not seem relevant. The previous paragraphs suggest perhaps another view: the first word in each author-column could represent the most-frequent, the earliest, or the most-recent word, depending on how you order the initial frame.

You can implement each of those "significance" methods with something like

  • "most-frequent"

    dat %>%
      count(Author, word) %>%
      arrange(desc(n)) %>%
      distinct() %>% ...
    
  • "earliest"

    dat %>%
      arrange(Date) %>%
      distinct() %>% ...
    
  • "most-recent"

    dat %>%
      arrange(desc(Date)) %>%
      distinct() %>% ...
    

(For both of the Date-related sorts, you first need to convert Date to a sortable Date object. I'm assuming that the current format is something like "%m/%d/%y"; whatever the format, the values need to be sortable, otherwise the last two bullets won't work for you.)
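
As a rough sketch of the "earliest" variant with the date conversion included: the sample dates below are invented so that the ordering is visible, the format string is the assumed "%m/%d/%y", and `ordered` is just an illustrative name. Because Date is still in the frame at this point, I use `distinct(Author, word)` here (my choice, not part of the snippets above) so that a word repeated on several dates is kept only at its earliest occurrence.

```r
library(dplyr)

# Hypothetical sample data; the dates are made up for illustration.
dat <- data.frame(
  Author = c("AuthorA", "AuthorB", "AuthorA"),
  Date   = c("12/03/01", "12/01/01", "12/01/01"),
  word   = c("word1A", "word1b", "word2A"),
  stringsAsFactors = FALSE
)

ordered <- dat %>%
  # Assumes month/day/two-digit-year strings; adjust `format` if yours differ.
  mutate(Date = as.Date(Date, format = "%m/%d/%y")) %>%
  arrange(Date) %>%
  # Keep the first (earliest) row for each Author/word pair; drops Date.
  distinct(Author, word)

ordered
```

From here the remaining steps are the same as in the main pipeline (`group_by(Author)`, `row_number()`, `pivot_wider()`). For "most-recent", swap `arrange(Date)` for `arrange(desc(Date))`.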

Upvotes: 3
