R - Remove all line breaks between repeating character

Question

I am cleaning data for a sentiment analysis, using a large dataset of news articles stored in a data frame. I need one article per row, so I am looking for a way to remove the line breaks between each '======' and the next '======', repeated throughout the entire data frame. After the content has "collapsed onto itself", I would like the publisher and date columns to remain.

df <-  matrix(c("======","NA","NA","Daily Bugle Dec 31","Daily Bugle", "Dec 31" ,"Wookies are","NA","NA",". recreationally", "NA","NA", "using drugs at a", "NA", "NA", "higher rate than", "NA", "NA","ever before.", "NA", "NA","======", "NA", "NA" ),ncol=3,byrow=TRUE)
colnames(df) <- c("content","publisher","date")
df <- as.data.frame(df)
df[ df == "NA" ] <- NA

Gives this:

content              publisher   date
======                        
Daily Bugle Dec 31   Daily Bugle Dec 31
Wookies are
. recreationally
using drugs at a              
higher rate than              
ever before.                  
======

I would like something like this:

content                                           publisher     date
======
Wookies are recreationally using drugs at a hig... Daily Bugle Dec 31           
======
Article 2
======
Article 3
======

Hope this was clear. I am relatively new to R.

Ronak Shah · Accepted Answer

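Every article starts with a '======' row, so a running count of those rows (cumsum(grepl(...))) gives each row its article number.
After removing the delimiter rows, drop the first value of content in each article: it is the header line (here "Daily Bugle Dec 31"), whose information is already in the publisher and date columns.
Keep the first value of publisher and date, since that header row is the only one where they are not NA.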
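
To see why the running count works as an article number, here is a minimal base R sketch. The `content` vector is a made-up stand-in for `df$content` with two articles (the "Daily Planet" rows are hypothetical, not from the question), and `is_delim` and `article_no` are illustrative names.

```r
# Hypothetical mini-column standing in for df$content (two articles).
content <- c("======", "Daily Bugle Dec 31", "Wookies are", "ever before.",
             "======", "Daily Planet Jan 1", "Second article text",
             "======")

is_delim   <- grepl("===", content, fixed = TRUE)  # TRUE on delimiter rows
article_no <- cumsum(is_delim)                     # running count of delimiters seen so far

# Each row now carries the number of the article it belongs to.
data.frame(content, is_delim, article_no)
```

The trailing '======' gets its own number (3 here), but that group contains only the delimiter row, so it disappears once the delimiter rows are filtered out.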

library(dplyr)

df %>%
  mutate(article_no = cumsum(grepl('===', content))) %>%
  filter(!grepl('===', content)) %>%
  group_by(article_no) %>%
  summarise(content = paste0(content[-1], collapse = ''), 
            publisher = publisher[1], 
            date = date[1])

#  article_no content                                                                 publisher   date  
#                                                                                   
#1          1 Wookies are. recreationallyusing drugs at ahigher rate thanever before. Daily Bugle Dec 31
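
Note that collapse = '' glues the lines together without spaces ("recreationallyusing"). If spaces between the original lines are wanted, as in the question's desired output, one option is to join with a space instead. The following is a base R sketch of the same idea rather than the answer's dplyr code. It rebuilds the question's data with data.frame() for brevity, and `collapse_article` is a hypothetical helper name.

```r
# Rebuild the question's example data (one article).
df <- data.frame(
  content   = c("======", "Daily Bugle Dec 31", "Wookies are", ". recreationally",
                "using drugs at a", "higher rate than", "ever before.", "======"),
  publisher = c(NA, "Daily Bugle", rep(NA, 6)),
  date      = c(NA, "Dec 31", rep(NA, 6)),
  stringsAsFactors = FALSE
)

is_delim   <- grepl("===", df$content, fixed = TRUE)  # delimiter rows
article_no <- cumsum(is_delim)                        # article number per row

# Drop delimiter rows, then split the remaining rows by article.
pieces <- split(df[!is_delim, ], article_no[!is_delim])

# collapse_article() is a hypothetical helper, not part of the answer above.
collapse_article <- function(d) {
  data.frame(
    content   = paste(d$content[-1], collapse = " "),  # join body lines with a space
    publisher = d$publisher[1],                        # header row carries the metadata
    date      = d$date[1],
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(pieces, collapse_article))
result$content
# "Wookies are . recreationally using drugs at a higher rate than ever before."
```

The stray ". " before "recreationally" comes from the example data itself, so it survives the join. In the dplyr version, the equivalent change is replacing collapse = '' with collapse = ' '.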
