Reputation: 11

How to extract strings from a full text using R?

I am stuck on a problem. I have more than 3,000 observations, and each observation is a full text. For example:

text="Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。
It is a divorce dispute, according to 《marriage law》on June 21, 2016。"

Now I want to extract the information for the plaintiff and the defendant, and I also want to know whether the full text contains the phrase "《marriage law》" (T for yes, F for no).

Thus, I want to have the following results:

text	plaintiff	defendant	law
Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。It is a divorce dispute, according to 《marriage law》on June 21, 2016。	The plaintiff X, female, born on May, 1980, lives in X County, X Province。	The defendant X, male, born on May, 1971, lives in X County, X Province。	T

I tried several times, but it did not work. Many thanks for your kind help!

Follow up:

Thank you for your answers. However, the difficulty is that the whole text may contain many sentences that start with "the plaintiff" and end with the punctuation mark "。". How can I extract only the first occurrence, the sentence with the plaintiff's birth and residence information? The order is not fixed, but the punctuation is always used.

For example, the whole text may also have a sentence like "the plaintiff declares that he is wrong。" The pattern given in the previous answer would also extract this sentence, which I do not want.

Upvotes: 1

Answers (2)

Kat

Reputation: 18754

Update

With the additional information you've provided, see if this works for you.

This assumes that there is only one sentence each for the plaintiff and the defendant. I've added .* after 'rovince' (as in Province) so that if Province is not the end of the sentence, the entire sentence is still captured. I left off the P so that inconsistent capitalization doesn't matter.

I've used [^。]+ to match anything except the full stop 。, so that part of the pattern cannot run past the end of a sentence.

It still assumes that the sentence begins with "The plaintiff" (or defendant).

If this does not work, you'll really need to provide several more examples of potential content.

library(tidyverse)

td3 <- data.frame(oText = text) %>% 
  extract(into = c('plaintiff', 'defendant'), remove = FALSE, col = oText,
          regex = "^.*(The plaintiff[^。]+rovince.*。).*(The defendant[^。]+rovince.*。).*") %>% 
  mutate(law = str_detect(oText, 'marriage law'))
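
If the "rovince" anchor turns out to be too strict, one possible alternative is to split the text into sentences and keep the first one that both starts with the party name and mentions a marker word. This is only a sketch in base R: the helper name first_sentence_with, the marker "born", and the sample text are assumptions made for illustration, not part of the question's data.

```r
# Hypothetical helper (not from the answer above): return the first sentence
# that starts with `who` and also contains `marker`, or NA if there is none.
first_sentence_with <- function(txt, who, marker = "born") {
  # Split on the full stop 。 and trim stray spaces or newlines
  sentences <- trimws(strsplit(txt, "。", fixed = TRUE)[[1]])
  hit <- sentences[startsWith(sentences, who) &
                     grepl(marker, sentences, fixed = TRUE)]
  # Put the full stop back, because strsplit() drops the separator
  if (length(hit) == 0) NA_character_ else paste0(hit[1], "。")
}

# Made-up sample where an unwanted plaintiff sentence comes first
sample_text <- "X Court。The plaintiff declares that he is wrong。The plaintiff X, female, born on May, 1980, lives in X County。The defendant X, male, born on May, 1971, lives in X County。"

plaintiff <- first_sentence_with(sample_text, "The plaintiff")
defendant <- first_sentence_with(sample_text, "The defendant")
```

Because the check is applied sentence by sentence, the order of the sentences does not matter, but it does rely on every biographical sentence containing the marker word.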

Originally...

How tight are the patterns you've shown here? Is the plaintiff always in the second sentence? Does the defendant's description always follow the plaintiff? Is punctuation always used?

Here's a method that works with this data. This method does not assume any given order, but it does assume punctuation was used.

In the regex you see 'The plaintiff' (or defendant) followed by .*, which matches anything, and then ?, which makes the match lazy so that it stops at the first place where the lookahead succeeds. The lookahead, written (?= ), marks where the match should stop without including that character in the result. Your sentences end with 。, the full-width full stop used in Chinese text (I'm assuming this was translated), rather than an ASCII period.

If you have periods or other regex special characters in your real data that you want to match literally, you'll have to escape them. In this regex, the period followed by the asterisk means "any character, any number of times", so to search for a literal period or asterisk you'd have to escape it so that the regex engine knows you mean the character itself.

library(tidyverse)
library(stringi)

tdf <- data.frame(oText = text) %>% 
  mutate(plaintiff = stri_extract_first_regex(oText, 'The plaintiff.*?(?=(。))'),
         defendant = stri_extract_first_regex(oText, 'The defendant.*?(?=(。))'),
         law = str_detect(oText, 'marriage law'))
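
The difference between the greedy and lazy forms, and the effect of escaping, can be seen on a small made-up string. This sketch uses base R with perl = TRUE instead of stringi so that it runs without extra packages; the sample string and variable names are assumptions for illustration.

```r
x <- "A plaintiff. B. C."

# Greedy: .* grabs as much as possible, so the match runs to the last period
greedy <- regmatches(x, regexpr("plaintiff.*(?=\\.)", x, perl = TRUE))
greedy  # "plaintiff. B. C"

# Lazy: .*? grabs as little as possible, so the match stops before the first period
lazy <- regmatches(x, regexpr("plaintiff.*?(?=\\.)", x, perl = TRUE))
lazy    # "plaintiff"

# An unescaped . matches any character; an escaped \\. matches only a literal period
unescaped <- sub(".", "!", "a.b")    # "!.b"
escaped   <- sub("\\.", "!", "a.b")  # "a!b"
```

Note that the lookahead leaves the closing punctuation out of the match, so the extracted sentence does not include the final 。 or period.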

If the patterns are strict, you could probably use tidyr::separate to make this even easier.
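
As a rough sketch of that idea, assuming every text really does follow the order court, plaintiff, defendant, everything else, the split could look like this. The column names and the sample text are made up for illustration.

```r
library(tidyr)

# Made-up sample in the strict order: court, plaintiff, defendant, the rest
sample_df <- data.frame(
  oText = "X Court。The plaintiff X, female。The defendant X, male。It is a divorce dispute。The plaintiff declares that he is wrong。"
)

parts <- separate(
  sample_df, col = oText,
  into = c("court", "plaintiff", "defendant", "rest"),
  sep = "。",
  extra = "merge",  # keep all remaining sentences together in `rest`
  remove = FALSE    # keep the original text column
)

parts$plaintiff
parts$defendant
```

The separator is dropped from each piece, so the 。 would have to be pasted back on if it is needed in the output.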

Upvotes: 1

Andre Wildberg

Reputation: 19231

An approach using str_extract and sub. The substitution removes any follow-up sentences, if they exist, so the detected plaintiff and defendant can only be one sentence long (with 。 as the separator).

library(dplyr)
library(stringr)

tibble(text) %>% 
  mutate(plaintiff = sub("(。).*", "\\1", str_extract(text, "The plaintiff.*。")), 
         defendant = sub("(。).*", "\\1", str_extract(text, "The defendant.*。")), 
         law = grepl("《marriage law》", text)) %>% 
  print(Inf)
# A tibble: 1 × 4
  text                                                     plain…¹ defen…² law  
  <chr>                                                    <chr>   <chr>   <lgl>
1 "Ganluo County People's Court of X Province。The plaint… The pl… The de… TRUE 
# … with abbreviated variable names ¹plaintiff, ²defendant

full output

# A tibble: 1 × 4
  text                                                                          
  <chr>                                                                         
1 "Ganluo County People's …
  plaintiff                                                                  
  <chr>                                                                      
1 The plaintiff X, female, born on May, 1980, lives in X County, X Province。
  defendant                                                                
  <chr>                                                                    
1 The defendant X, male, born on May, 1971, lives in X County, X Province。
  law  
  <lgl>
1 TRUE

extended data

text <- "Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。The plaintiff wuen weofioi woe fowie fowie fowei f。The defendant wuen weofioi woe fowie fowie fowei f。The plaintiff wuen weofioi woe fowie fowie fowei f。The defendant wuen weofioi woe fowie fowie fowei f。The plaintiff wuen weofioi woe fowie fowie fowei f。The plaintiff wuen weofioi woe fowie fowie fowei f。\nIt is a divorce dispute, according to 《marriage law》on June 21, 2016。"
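
To see what the sub step contributes, here is a small base R sketch on a shortened, made-up text. It uses regmatches and regexpr as a stand-in for str_extract, which is an assumption made so the example runs without stringr.

```r
x <- "X Court。The plaintiff A, born 1980。The defendant B, born 1971。The plaintiff says no。"

# Step 1: the greedy pattern runs from the first "The plaintiff" to the last 。
greedy <- regmatches(x, regexpr("The plaintiff.*。", x))

# Step 2: keep everything up to the first 。 and drop whatever follows it
first_only <- sub("(。).*", "\\1", greedy)
first_only  # "The plaintiff A, born 1980。"
```

This picks the first sentence that starts with "The plaintiff", so it relies on the biographical sentence appearing before any other plaintiff sentence.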

Upvotes: 1
