Rollo99

Reputation: 1613

How to subset a text by 2 sentences in R?

I have the following dataframe:

df = data.frame(Text = c("This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.", "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.", "SO is great. You can get many things solve. Additional paragraph."), stringsAsFactors = FALSE)

I have used the following to split the text into sentences:

library(textshape)

split_sentence(df$Text)

However, I would like to subset the "Text" column by 2 sentences, i.e. keep only the first 2 sentences of each row, so as to get a list like:

This is great.
A really great place to be.
Good morning.
There are very skilled programmers here. 
SO is great.
You can get many things solve.

Can anyone help me?

Thanks!

Upvotes: 0

Views: 200

Answers (3)

hello_friend

Reputation: 5788

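Base R solution. Note that n can be set to any positive integer: the code keeps n sentences, skips the next n, and repeats that retain / skip pattern. Because all rows are pasted together before splitting, the pattern runs across the whole column rather than restarting at each row.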

# Number of sentences to keep before removing the same number of sentences: n => integer scalar 
n <- 2

# Paste all rows together, split into sentences, then keep n and skip n: res => character vector
res <- subset(data.frame(sentences = unlist(strsplit(paste0(df$Text, collapse = " "), "(?<=\\.)\\s+", perl = TRUE)),
                         stringsAsFactors = FALSE),
              ceiling(seq_along(sentences) / n) %% 2 == 1)[ , 1, drop = TRUE]

# Print the result to console: character vector => stdout (console)
res

# Data (run this first):
df = data.frame(Text = c("This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.", "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.", "SO is great. You can get many things solve. Additional paragraph."), stringsAsFactors = FALSE)
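
If the retain / skip pattern should restart at every row instead of running across the pasted column, one possible variation is sketched below. The helper name keep_skip and the toy input x are assumptions made for illustration; they are not part of the answer above.

```r
# Hypothetical helper (not part of the answer above): apply the
# keep-n / skip-n pattern separately within each element of `text`.
keep_skip <- function(text, n = 2) {
  # Split each element after a period that is followed by whitespace
  sentences <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)
  # Within each element, keep sentences whose block number (blocks of n) is odd
  lapply(sentences, function(s) s[ceiling(seq_along(s) / n) %% 2 == 1])
}

# Toy input (assumed for illustration): three sentences per element
x <- c("A. B. C.", "D. E. F.")
res <- unlist(keep_skip(x, n = 2))
res
# [1] "A." "B." "D." "E."
```

On the same toy input, pasting the elements together first would keep "A.", "B.", "E." and "F.", because the skip block then spans the boundary between the two elements.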

Upvotes: 1

mt1022

Reputation: 17299

Another option with strsplit and head:

unlist(lapply(strsplit(df$Text, '(?<=\\.)\\s*', perl = TRUE), head, 2))
# [1] "This is great."                           "A really great place to be."             
# [3] "Good morning."                            "There are very skilled programmers here."
# [5] "SO is great."                             "You can get many things solve."    
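
The same idea can be wrapped in a small function so that the number of sentences is a parameter. This is only a sketch: the name first_sentences is hypothetical, and the pattern is widened (as an assumption) to also split after "?" and "!", which the one-liner above does not do.

```r
# Hypothetical wrapper (name and widened pattern are assumptions):
# return the first n sentences of each element of `text` as a list.
first_sentences <- function(text, n = 2) {
  # Split after ".", "!" or "?" when followed by whitespace
  pieces <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)
  # Keep at most n sentences per element
  lapply(pieces, head, n)
}

out <- first_sentences(c("Is it? Yes! Done. Extra.", "One. Two. Three."), n = 2)
unlist(out)
# [1] "Is it?" "Yes!"   "One."   "Two."
```

Returning a list keeps one entry per input row; unlist() flattens it when a single character vector is wanted.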

Upvotes: 3

Ronak Shah

Reputation: 389055

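You could split Text into one row per sentence and then keep only the first 2 sentences from each original row. With dplyr and tidyr you can do this as follows (note that the separator consumes the periods, so they are missing from the output):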

library(dplyr)

df %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(Text, sep = '\\.\\s*') %>%
  group_by(row) %>%
  slice(1:2) %>%
  ungroup %>%
  select(-row)

#  Text                                   
#  <chr>                                  
#1 This is great                          
#2 A really great place to be             
#3 Good morning                           
#4 There are very skilled programmers here
#5 SO is great                            
#6 You can get many things solve        

Upvotes: 2
