Marco
Marco

Reputation: 2827

Merge data by group if available ignore otherwise

I have journal data overview with year, issue and title information. I have scraped some_content and am looking for a possible merge. There are two problems for me:

The data pretty much looks like this:

# A title like "A" can re-occur in different issues, in different years.
# A title is unique within one year-issue.
overview <- data.frame(Year = c(rep(2018,4), rep(2019,4)), 
                        Issue = c(1,1,1,2,1,2,3,3), 
                        Title = c("A", "B", "F", "A", "F", "L", "A", "F"))

  Year Issue Title
1 2018     1     A
2 2018     1     B
3 2018     1     F
4 2018     2     A
5 2019     1     F
6 2019     2     L
7 2019     3     A
8 2019     3     F

# The scraped titles include punctuation, like .  !  ?
some_content <- data.frame(Year = c(2018, 2018, 2019, 2019, 2019), 
                             Issue = c(1,1,2,3,3), 
                             Title = c("A.", "B!", "L?", "A.", "F"),
                             Content = c("helloworld", NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1    A. helloworld
2 2018     1    B!       <NA>
3 2019     2    L?      match
4 2019     3    A.      lorem
5 2019     3     F       <NA>

Let me tell you the story of "helloworld". In year 2018, issue 1 there are multiple title. Title A in overview definitely corresponds to A. in some_content even though it is not completely identical. Whenever a title per year-issue combination from overview can be string detected in the year-issue combination from some_content the Content from some_content should be merged to the overview dataframe. The result should look like this:

merge_data <- data.frame(Year = c(rep(2018,4), rep(2019,4)), 
                         Issue = c(1,1,1,2,1,2,3,3), 
                         Title = c("A", "B", "F", "A", "F", "L", "A", "F"),
                         Content = c("helloworld", NA, NA, NA, NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1     A helloworld
2 2018     1     B       <NA>
3 2018     1     F       <NA>
4 2018     2     A       <NA>
5 2019     1     F       <NA>
6 2019     2     L      match
7 2019     3     A      lorem
8 2019     3     F       <NA>

Upvotes: 0

Views: 43

Answers (1)

Quinten
Quinten

Reputation: 41533

First I would suggest removing the punctuations using this:

some_content$Title <- gsub("[[:punct:]]", "", some_content$Title)

After that you can do a simple left_join like this:

library(dplyr)
left_join(overview, some_content, by = c("Year", "Issue", "Title"))

Output:

  Year Issue Title    Content
1 2018     1     A helloworld
2 2018     1     B       <NA>
3 2018     1     F       <NA>
4 2018     2     A       <NA>
5 2019     1     F       <NA>
6 2019     2     L      match
7 2019     3     A      lorem
8 2019     3     F       <NA>

Upvotes: 1

Related Questions