Reputation: 1
I have a wild and crazy text file, the head of which looks like this:
2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted> π
2016-07-01 02:59:34 <name redacted> πππ
2016-07-01 03:02:48 <name > British security is a little more rigorous...
It goes on for a while. It's a big file. But I feel like it's going to be difficult to annotate with the coreNLP library or package. I'm doing natural language processing. In other words, I'm curious as to how I would shave off, say, at least the dates, if not the dates and the names.
But I guess I would need the names, since, eventually, I would like to be able to be like, this person said this 50 times, whereas this person said this 75 times, and so on, but that's getting a little ahead of myself, probably.
Would this require a regular expression? I'm working in R.
I haven't tried anything yet, since I don't know where to start. How would I write code in R that selectively reads only the text, that is, the meaningfully put-together phrases and sentences?
Upvotes: 0
Views: 90
Reputation: 1
With some help, I was able to figure it out.
> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)
> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> e <- data.frame(date = character(),
+ time = character(),
+ name = character(),
+ text = character(),
+ stringsAsFactors = FALSE)
> f <- strcapture(d, c, e)
> f <- f [-c(1),]
The first line was all NAs, hence the last line with the -c(1), which drops it.
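Once the lines are in a data frame like this, the per-person counts mentioned in the question follow from base R. The sketch below is only an illustration: the sample lines, the names John Doe and Jane Doe, and the search word "plane" are made up, and the lines are assumed to already be in the cleaned `<First Last>` format.

```r
# Hypothetical sample lines, already in the "<First Last>" format.
lines <- c(
  "2016-07-01 02:50:35 <John Doe> hey",
  "2016-07-01 02:51:26 <John Doe> waiting for plane to Edinburgh",
  "2016-07-01 02:52:07 <Jane Doe> nothing crappy has happened"
)

# Same capture pattern as above: date, time, name, text.
pattern <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
proto <- data.frame(date = character(), time = character(),
                    name = character(), text = character(),
                    stringsAsFactors = FALSE)
parsed <- strcapture(pattern, lines, proto)

# Number of messages per person.
counts <- table(parsed$name)

# Number of messages per person that contain a given word.
said <- tapply(grepl("plane", parsed$text, fixed = TRUE), parsed$name, sum)
```

`counts` holds one entry per name, and `said` holds how many of each person's messages contain the word.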
Upvotes: 0
Reputation: 161
Using base R regular expressions with the gsub function, it is possible to extract each piece of information. Take this file as an example:
2016-07-01 02:50:35 <name1 surname1> hey
2016-07-01 02:51:26 <name1 surname1> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name1 surname1> thinking about
2016-07-01 02:52:07 <name2 surname2> nothing crappy
2016-07-01 02:52:20 <name2 surname2> plane went by pretty fast
2016-07-01 02:54:08 <name2 surname2> no idea
2016-07-01 02:54:17 <name2 surname2> just know it's london
2016-07-01 02:56:44 <name1 surname1> you are probably asleep
2016-07-01 02:58:45 <name1 surname1> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name2 surname2> x
2016-07-01 02:59:34 <name1 surname2> y
2016-07-01 03:02:48 <name2 > British security is a little more rigorous...
Now, in the R console, read the file as plain text and process the lines with a regex. The second argument of gsub is the replacement; the backreferences \\1 to \\4 in it pick out the capture groups of the pattern.
your_data <- readLines(your_text_file) # Read the file as plain text lines
pattern <- "(.*) <(\\S*) (\\S*)>(.*)" # Groups: 1 = date and time, 2 and 3 = name, 4 = message
times <- gsub(pattern, "\\1", your_data) # Get date and time
person_name <- gsub(pattern, "\\2 \\3", your_data) # Get name
message <- trimws(gsub(pattern, "\\4", your_data)) # Get message, without the leading space
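One thing to watch with this approach: gsub returns a line unchanged when the pattern does not match it, so a stray line would end up as its own "name". A minimal sketch of a guard, assuming a made-up second line that has no timestamp or name:

```r
# Hypothetical input: one well-formed line and one that does not fit the format.
your_data <- c(
  "2016-07-01 02:50:35 <name1 surname1> hey",
  "a line with no timestamp or name"
)
pattern <- "(.*) <(\\S*) (\\S*)>(.*)"

# TRUE where the pattern matches, FALSE otherwise.
ok <- grepl(pattern, your_data)

# Extract the name only for matching lines; mark the others as NA.
person_name <- ifelse(ok, gsub(pattern, "\\2 \\3", your_data), NA_character_)
```

Lines flagged FALSE in `ok` can then be inspected or dropped before any counting.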
Upvotes: 0
Reputation: 15072
Using your example pasted text, we can do the following. Note that your description of the way the text behaves when copy pasted suggests to me that there are actually newline characters \n in the text, but without a reproducible example it's hard to say.
Split the single long string into lines by splitting on the boundary before a date. If you have people regularly typing dates into messages, you can extend the pattern to include the time and name. If people are typing that into messages then it's gonna be complicated, but hopefully will only affect a few messages. This would be fixed by having line delineations.
Put the lines into a dataframe column and split on spaces that either precede an opening angle bracket < or follow a closing angle bracket > to separate the datetime, name, and message.
library(tidyverse)
text <- "2016-07-01 23:59:27 <John Doe> We're both signing off at the same time2016-07-02 00:00:04 <John Doe> :-)2016-07-02 00:00:28 <John Doe> I live you supercalagraa...phragrlous...esp..dociois2016-07-02 00:12:23 <Jane Doe> I love you :)2016-07-02 08:57:33"
text %>%
str_split("(?=\\d{4}-\\d{2}-\\d{2})") %>%
pluck(1) %>%
enframe(name = NULL, value = "message") %>%
separate(message, c("datetime", "name", "message"), sep = "\\s(?=<)|(?<=>)\\s", extra = "merge")
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1,
#> 6].
#> # A tibble: 6 x 3
#>   datetime           name      message
#>   <chr>              <chr>     <chr>
#> 1 ""                 <NA>      <NA>
#> 2 2016-07-01 23:59:… <John Do… We're both signing off at the same time
#> 3 2016-07-02 00:00:… <John Do… :-)
#> 4 2016-07-02 00:00:… <John Do… I live you supercalagraa...phragrlous...esp…
#> 5 2016-07-02 00:12:… <Jane Do… I love you :)
#> 6 2016-07-02 08:57:… <NA>      <NA>
Created on 2019-05-16 by the reprex package (v0.2.1)
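If dates do turn up inside messages, the split boundary can be tightened to require the date, the time, and the opening < together, as suggested above. The following is a sketch in base R rather than stringr; the input string is made up, and it marks each message start with a newline before splitting instead of using a lookahead.

```r
# Hypothetical input: messages run together, and one message body contains a date.
text <- paste0(
  "2016-07-01 23:59:27 <John Doe> see you on 2016-07-04",
  "2016-07-02 00:00:04 <Jane Doe> :-)"
)

# A message starts only where a date, a time, and an opening "<" appear together.
stamp <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} <)"

# Put a newline in front of every message start, then split on the newlines.
marked <- gsub(stamp, "\n\\1", text)
lines <- strsplit(marked, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(lines)]  # drop the empty piece before the first message
```

The bare date inside the first message is left alone because it is not followed by a time and a name.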
Upvotes: 0
Reputation: 27723
This may not need a regular expression, but if you wish to use one, this expression might help you do that simply:
(.*)(\s<name.*)
If this isn't the expression you wanted, you can modify it at regex101.com. You can add more boundaries if necessary.
You can also visualize your expressions at jex.im. The following JavaScript snippet tests the expression against the sample text:
const regex = /(.*)(\s<name.*)/gm;
const str = `2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted> π
2016-07-01 02:59:34 <name redacted> πππ
2016-07-01 03:02:48 <name > British security is a little more rigorous...`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
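Since the question is about R, here is a rough sketch of the same expression applied with base R's sub. The backslash has to be doubled inside an R string, and the two sample lines are copied from the question; the variable names are only illustrative.

```r
# Two sample lines from the question.
x <- c(
  "2016-07-01 02:50:35 <name redacted> hey",
  "2016-07-01 03:02:48 <name > British security is a little more rigorous..."
)

# Same expression as above, with the backslash escaped for an R string.
pattern <- "(.*)(\\s<name.*)"

# Group 1 is the date and time; group 2 is the name plus the message.
stamp <- sub(pattern, "\\1", x)
rest <- trimws(sub(pattern, "\\2", x))  # trimws removes the leading space matched by \\s
```

Note that the expression relies on the literal `<name` placeholder, so it would need adjusting for real names.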
Upvotes: 1