How to separate string from numbers in R?

I have a wild and crazy text file, the head of which looks like this:

2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted> 💘
2016-07-01 02:59:34 <name redacted> 🍑🍑🍑
2016-07-01 03:02:48 <name > British security is a little more rigorous...

It goes on for a while; it's a big file. I'm doing natural language processing, and I feel like the file is going to be difficult to annotate with the coreNLP package as it stands. So I'm curious how I would strip off at least the dates, if not the dates and the names.

But I guess I would need the names, since, eventually, I would like to be able to be like, this person said this 50 times, whereas this person said this 75 times, and so on, but that's getting a little ahead of myself, probably.

Would this require a regular expression? I'm working in R.

I haven't tried anything yet, since I don't know where to start. How would I write code in R that reads only the text, that is, the meaningfully put-together phrases and sentences?

Upvotes: 0

Answers (4)

Jerry

Reputation: 1

With some help, I was able to figure it out.

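# Read the raw file, marking the strings as UTF-8
a <- readLines("hangouts-conversation-6.txt", encoding = "UTF-8")
# Wrap names that follow "***" in angle brackets, like the other lines
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
c <- gsub(b, "\\1<\\2> ", a)
# Capture groups: date, time, name, message text
d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
# Prototype giving the column names and types for strcapture();
# stringsAsFactors = FALSE keeps the columns as character rather than factor
e <- data.frame(date = character(),
                time = character(),
                name = character(),
                text = character(),
                stringsAsFactors = FALSE)
f <- strcapture(d, c, e)
# Drop the first row
f <- f[-1, ]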

The first row was all NAs, hence the last line, which drops it.
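
For anyone who wants to try this without the original file, here is a minimal sketch of the same strcapture approach. The sample vector is copied from the question, and the `[^>]*` name pattern is my own assumption, not part of the code above; it is used so that a name like `<name >` still matches. Non-matching lines are dropped by testing for NA instead of by position.

```r
# Sample lines from the question; a real file would come from readLines().
lines <- c(
  "2016-07-01 02:50:35 <name redacted> hey",
  "2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh",
  "not a chat line"
)

# Four capture groups: date, time, name (anything except ">"), message text.
pattern <- "^([0-9-]{10}) ([0-9:]{8}) <([^>]*)>\\s*(.*)$"

# The prototype fixes the column names and types of the result.
proto <- data.frame(date = character(), time = character(),
                    name = character(), text = character(),
                    stringsAsFactors = FALSE)

parsed <- strcapture(pattern, lines, proto)

# Lines that do not match come back as rows of NA; drop them by test.
parsed <- parsed[!is.na(parsed$date), ]
print(parsed)
```

Filtering on `is.na()` may be safer than `f[-1, ]` when it is not known in advance which lines fail to match.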

Upvotes: 0

bJust

Reputation: 161

Using a base R regular expression with the gsub function, it is possible to extract each piece of information. Suppose we use this file as an example:

2016-07-01 02:50:35 <name1 surname1> hey
2016-07-01 02:51:26 <name1 surname1> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name1 surname1> thinking about
2016-07-01 02:52:07 <name2 surname2> nothing crappy 
2016-07-01 02:52:20 <name2 surname2> plane went by pretty fast
2016-07-01 02:54:08 <name2 surname2> no idea
2016-07-01 02:54:17 <name2 surname2> just know it's london
2016-07-01 02:56:44 <name1 surname1> you are probably asleep
2016-07-01 02:58:45 <name1 surname1> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name2 surname2> x
2016-07-01 02:59:34 <name1 surname2> y
2016-07-01 03:02:48 <name2 > British security is a little more rigorous...

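Now, in the R console, you read the file as plain text and process the lines with a regex. The second argument of gsub is the replacement; here it is a backreference such as "\\1", which keeps only the corresponding captured group of the pattern.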

your_data <- readLines(your_text_file)              # Read the file as plain text lines (your_text_file is the path)
pattern <- "(.*) <(\\S*) (\\S*)>(.*)"               # Groups: date and time, first name, surname, message
times <- gsub(pattern, "\\1", your_data)            # Get date and time
person_name <- gsub(pattern, "\\2 \\3", your_data)  # Get name
message <- gsub(pattern, "\\4", your_data)          # Get message (keeps its leading space)
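
As a rough sketch of where this could lead, the three vectors can be put into a data frame and the messages counted per person, which is what the question eventually wants. The inline sample lines, the `chat` data frame, and the use of `trimws` and `table` are my additions for illustration.

```r
# A few sample lines in the same shape as the example file above.
your_data <- c(
  "2016-07-01 02:50:35 <name1 surname1> hey",
  "2016-07-01 02:52:07 <name2 surname2> nothing crappy",
  "2016-07-01 02:56:44 <name1 surname1> you are probably asleep"
)
pattern <- "(.*) <(\\S*) (\\S*)>(.*)"

# One row per message; trimws() removes the space left in front of the text.
chat <- data.frame(
  time    = gsub(pattern, "\\1", your_data),
  name    = gsub(pattern, "\\2 \\3", your_data),
  message = trimws(gsub(pattern, "\\4", your_data)),
  stringsAsFactors = FALSE
)

# How many messages did each person send?
counts <- table(chat$name)
print(counts)

# How often did one person send one particular message?
hey_count <- sum(chat$name == "name1 surname1" & chat$message == "hey")
print(hey_count)
```

Note that the first group `(.*)` is greedy, so a message that itself contains something like ` <a b>` would be split in the wrong place; anchoring the date and time explicitly would avoid that.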

Upvotes: 0

Calum You

Reputation: 15072

Using your example pasted text, we can do the following. Note that your description of the way the text behaves when copy pasted suggests to me that there are actually newline characters \n in the text, but without a reproducible example it's hard to say.

Split the single long string into lines by splitting on the boundary before a date. If you have people regularly typing dates into messages, you can extend the pattern to include the time and name. If people are typing all of that into messages too then it's gonna be complicated, but hopefully it will only affect a few messages. Having actual line breaks in the data would fix this.
Put the lines into a dataframe column and split on the space that precedes an opening angle bracket < or follows a closing angle bracket > to separate datetime, name and message.

library(tidyverse)
text <- "2016-07-01 23:59:27 <John Doe> We're both signing off at the same time2016-07-02 00:00:04 <John Doe> :-)2016-07-02 00:00:28 <John Doe> I live you supercalagraa...phragrlous...esp..dociois2016-07-02 00:12:23 <Jane Doe> I love you :)2016-07-02 08:57:33"
text %>%
  str_split("(?=\\d{4}-\\d{2}-\\d{2})") %>%
  pluck(1) %>%
  enframe(name = NULL, value = "message") %>%
  separate(message, c("datetime", "name", "message"), sep = "\\s(?=<)|(?<=>)\\s", extra = "merge")
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1,
#> 6].
#> # A tibble: 6 x 3
#>   datetime           name      message                                     
#>   <chr>              <chr>     <chr>                                       
#> 1 ""                 <NA>      <NA>                                        
#> 2 2016-07-01 23:59:… <John Do… We're both signing off at the same time     
#> 3 2016-07-02 00:00:… <John Do… :-)                                         
#> 4 2016-07-02 00:00:… <John Do… I live you supercalagraa...phragrlous...esp…
#> 5 2016-07-02 00:12:… <Jane Do… I love you :)                               
#> 6 2016-07-02 08:57:… <NA>      <NA>

Created on 2019-05-16 by the reprex package (v0.2.1)
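
Extending the split pattern to include the time and the opening of the name, as mentioned in the first step, could look roughly like this. The sample string and the `header` variable are made up for illustration, and the sketch assumes every message header has the exact shape "date time <".

```r
library(stringr)

# Hypothetical input: the first message contains a bare date in its text.
text <- "2016-07-01 23:59:27 <John Doe> see you on 2016-07-05 then2016-07-02 00:00:04 <John Doe> :-)"

# Split only before a full "date time <" header, not before any bare date.
header <- "(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} <)"
pieces <- str_split(text, header)[[1]]

# The split can leave an empty string at the start; drop it.
pieces <- pieces[pieces != ""]
print(pieces)
```

With the date-only lookahead, the first message would be cut in two at "2016-07-05"; with the longer lookahead it stays whole.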

Upvotes: 0

Emma

Reputation: 27723

This may not need a regular expression, but if you wish to use one, this expression might help you do that simply:

(.*)(\s<name.*)

RegEx

If this wasn't your desired expression, you can modify/change your expressions in regex101.com. You can add more boundaries if necessary.

RegEx Circuit

You can also visualize your expressions in jex.im:

JavaScript Demo

const regex = /(.*)(\s<name.*)/gm;
const str = `2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted> 💘
2016-07-01 02:59:34 <name redacted> 🍑🍑🍑
2016-07-01 03:02:48 <name > British security is a little more rigorous...`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
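
Since the question is about R, here is a small sketch of how the same expression might be used there. The sample lines come from the question; the variable names are only illustrative. The literal `<name` in the pattern matches only because the names are redacted to start with "name", so real data would need a more general pattern.

```r
lines <- c(
  "2016-07-01 02:50:35 <name redacted> hey",
  "2016-07-01 03:02:48 <name > British security is a little more rigorous..."
)

# Same two groups as (.*)(\s<name.*); backslashes are doubled in an R string.
pattern <- "(.*)(\\s<name.*)"

# Group 1 is the date and time, group 2 is the name plus the message.
timestamps <- sub(pattern, "\\1", lines)
rest <- trimws(sub(pattern, "\\2", lines))

print(timestamps)
print(rest)
```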

Upvotes: 1
