Reputation: 23
I am cleaning data for a sentiment analysis, using a large dataset of news articles stored in a data frame. I need one article per row, so I am looking for a way to remove the line breaks between the first '======' and the second '======', repeated throughout the entire data frame. After the content has been collapsed into a single row, I would like the publisher and date columns to be kept.
df <- matrix(c("======","NA","NA","Daily Bugle Dec 31","Daily Bugle", "Dec 31" ,"Wookies are","NA","NA",". recreationally", "NA","NA", "using drugs at a", "NA", "NA", "higher rate than", "NA", "NA","ever before.", "NA", "NA","======", "NA", "NA" ),ncol=3,byrow=TRUE)
colnames(df) <- c("content","publisher","date")
df <- as.data.frame(df)
df[ df == "NA" ] <- NA
Gives this:
content             publisher    date
======              <NA>         <NA>
Daily Bugle Dec 31  Daily Bugle  Dec 31
Wookies are         <NA>         <NA>
. recreationally    <NA>         <NA>
using drugs at a    <NA>         <NA>
higher rate than    <NA>         <NA>
ever before.        <NA>         <NA>
======              <NA>         <NA>
I would like something like this:
content                                             publisher    date
======
Wookies are recreationally using drugs at a hig...  Daily Bugle  Dec 31
======
Article 2
======
Article 3
======
Hope this was clear. I am relatively new to R.
Upvotes: 1
Views: 74
Reputation: 4949
To help you, first I need to prepare some data.
library(tidyverse)
articles = read.table(
header = TRUE,sep = ",",text="
content,publisher,date
======,NA,NA
Daily News Dec 27,Daily News,Dec 27
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 28,Daily News,Dec 28
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 30,Daily News,Dec 30
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily Bugle Dec 31,Daily Bugle,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Weekly News Dec 31,Weekly News,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA") %>%
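as_tibble() %>%
# read.table() already reads NA fields as missing values (na.strings = "NA" by default),
# so this mutate is only a safeguard
mutate(publisher = ifelse(publisher=="NA", NA, publisher),
date = ifelse(date=="NA", NA, date))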
articles
output
# A tibble: 52 x 3
content publisher date
<chr> <chr> <chr>
1 ====== NA NA
2 Daily News Dec 27 Daily News Dec 27
3 Wookies are NA NA
4 . recreationally NA NA
5 using drugs at a NA NA
6 higher rate than NA NA
7 using drugs at a NA NA
8 higher rate than NA NA
9 using drugs at a NA NA
10 higher rate than NA NA
# ... with 42 more rows
I hope this matches the format of your data. As I read it, it contains five articles.
Now let's add a conversion function and a simple mutate step.
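# Collapse one article block (separator, header, text lines, separator) into a single row
fConvert = function(data) tibble(
  # row 2 is the header row that holds publisher and date
  publisher = data$publisher[2],
  date = data$date[2],
  # rows 3..(n-1) are the text lines; rows 1 and n are the "======" separators
  content = data %>% slice(3:(nrow(.)-1)) %>%
    pull(content) %>% paste(collapse = " ")
)
articles %>% mutate(
  # running count of header rows, shifted up by one row so that each opening
  # "======" gets the id of the article that follows it
  idArticle = ifelse(!is.na(publisher),1, 0) %>%
    cumsum() %>% lead(default=.[length(.)])
) %>% group_by(idArticle) %>%
  nest() %>%
  group_modify(~fConvert(.x$data[[1]]))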
output
# A tibble: 5 x 4
# Groups: idArticle [5]
idArticle publisher date content
<dbl> <chr> <chr> <chr>
1 1 Daily News Dec 27 Wookies are . recreationally using drugs at a higher rate than using drugs at a higher rate than u~
2 2 Daily News Dec 28 Wookies are . recreationally using drugs at a higher rate than ever before. ever before. ever befo~
3 3 Daily News Dec 30 Wookies are . recreationally using drugs at a higher rate than ever before. ever before.
4 4 Daily Bugle Dec 31 Wookies are . recreationally using drugs at a higher rate than ever before.
5 5 Weekly News Dec 31 Wookies are . recreationally higher rate than ever before.
As you can see, I was able to extract five articles, despite their different lengths, and glue all the lines of each one together into a single content value. Hope that's what you meant.
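The grouping step relies on a running count of header rows that is then shifted up by one row. Here is a minimal base R sketch of that idea on a toy vector. The vector and the names running and id_article are illustrative only, and c(x[-1], tail(x, 1)) is used as a stand-in for dplyr's lead() with a default value:

```r
# Toy publisher column for two articles, each wrapped in separator rows.
# Row layout: sep, header, text, sep, sep, header, text, text, sep
publisher <- c(NA, "A", NA, NA, NA, "B", NA, NA, NA)

# Step 1: count the header rows seen so far (a header has a non-NA publisher).
running <- cumsum(!is.na(publisher))

# Step 2: shift the counts up by one row, so the opening separator joins the
# article whose header follows it; the last row keeps the final count.
id_article <- c(running[-1], tail(running, 1))

print(rbind(running, id_article))
```

After the shift, every block of separator, header, text lines and closing separator shares one id, which is what lets the conversion function assume a fixed row layout inside each group.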
Upvotes: 1
Reputation: 388982
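Create a column that increments each time content contains '===', so that it can be used as an article number. After dropping the separator rows, paste together the content for each article and keep the first value of publisher and date.
library(dplyr)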
df %>%
  # article number: increases at every separator row
  mutate(article_no = cumsum(grepl('===', content))) %>%
  # drop the separator rows themselves
  filter(!grepl('===', content)) %>%
  group_by(article_no) %>%
  # content[-1] skips the header row; collapse with a space so words do not run together
  summarise(content = paste0(content[-1], collapse = ' '),
            publisher = publisher[1],
            date = date[1])
#   article_no content                                                                     publisher   date
#        <int> <chr>                                                                       <chr>       <chr>
# 1          1 Wookies are . recreationally using drugs at a higher rate than ever before. Daily Bugle Dec 31
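When every article has both an opening and a closing '======' row, counting separators gives article numbers that skip (1, 3, 5, ...), but the groups stay distinct, so the approach still works. Below is a rough base R sketch of the same steps on a hypothetical two-article data frame. The names df2, is_sep, body and result are made up for illustration, and the rows reuse strings from the examples above:

```r
# Hypothetical input: two articles, each wrapped in separator rows.
df2 <- data.frame(
  content = c("======", "Daily Bugle Dec 31", "Wookies are", "ever before.", "======",
              "======", "Weekly News Dec 31", "higher rate than", "ever before.", "======"),
  publisher = c(NA, "Daily Bugle", NA, NA, NA, NA, "Weekly News", NA, NA, NA),
  date = c(NA, "Dec 31", NA, NA, NA, NA, "Dec 31", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number the rows by how many separators have been seen so far.
is_sep <- grepl("===", df2$content, fixed = TRUE)
article_no <- cumsum(is_sep)

# Drop the separator rows; the remaining rows carry ids 1 and 3.
body <- df2[!is_sep, ]
body$article_no <- article_no[!is_sep]

# Per article: skip the header row, join the text lines with spaces,
# and keep the publisher and date from the header row.
result <- do.call(rbind, lapply(split(body, body$article_no), function(g) {
  data.frame(
    content = paste(g$content[-1], collapse = " "),
    publisher = g$publisher[1],
    date = g$date[1],
    stringsAsFactors = FALSE
  )
}))

print(result)
```

If consecutive article numbers are wanted, one option is to renumber afterwards, for example with match(x, unique(x)) on the id column.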
Upvotes: 3