Michael Harris

Reputation: 1

Creating a data frame from looping through text

Thanks in advance! I have been trying this for a few days and I am stuck. I am trying to loop through a text file (imported as a list) and build a data frame from it. A new row should start whenever a list item contains a day of the week, and that item should go in the first column (V1). The comments that follow should go in the second column (V2), which may mean concatenating several strings together. I am trying to use a conditional with grepl(), but I am lost on the logic after setting up the initial data frame.

Here is some example text I am bringing into R (it is Facebook data from a text file). The bracketed numbers indicate each item's position in the list. It is a lengthy file (50K+ lines), but I already have the date column set up.

[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3]Sunday, August 14, 2016 at 9:17am EDT

[4]Michael shared Jason post.

[5]This bird is a lot smarter than the majority of political posts I have read recently here

[6]Sunday, August 14, 2016 at 8:44am EDT

[7]Michael and Kurt are now friends.

The end result would be a data frame where each line containing a day of the week starts a new row, and the list items that follow it are concatenated into the second column. So the final data frame would be:

Row 1 ([1] in V1 and [2] in V2)

Row 2 ([3] in V1 and [4],[5] in V2)

Row 3 ([6] in V1 and [7] in V2)

Here is the start of my code, and I can get V1 to populate correctly, but not the second column of the data frame.

### Read in the text file
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt")

### Remove empty lines from the text file
temp <- temp[temp!=""]

### Create the temp char file as a list file
tmp <- as.list(temp)

### A days vector for searching through the list of days.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday")
df <- {}

### Loop through the list
for (n in 1:length(tmp)){

    ### Search to see if there is a day in the list item
    for(i in 1:length(days)){
            if(grepl(days[i], tmp[n])==1){
    ### Bind the row to the df if there is a day in the list item
                    df<- rbind(df, tmp[n])
            }
    }
### I know this is wrong, I am trying to create a vector to concatenate and add to the data frame, but I am struggling here.    
d <- c(d, tmp[n])
}

Upvotes: 0

Views: 96

Answers (1)

alistaire

Reputation: 43344

Here's an option using the tidyverse:

library(tidyverse)

text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3]Sunday, August 14, 2016 at 9:17am EDT

[4]Michael shared Jason post.

[5]This bird is a lot smarter than the majority of political posts I have read recently here

[6]Sunday, August 14, 2016 at 8:44am EDT

[7]Michael and Kurt are now friends."

df <- tibble(lines = read_lines(text)) %>%    # read data, set up a tibble (tibble() replaces the deprecated data_frame())
    filter(lines != '') %>%    # filter out empty lines
    # set grouping by cumulative number of rows with weekdays in them;
    # weekdays() needs dates, so pass seven consecutive days (names follow the session locale)
    group_by(grp = cumsum(grepl(paste(weekdays(Sys.Date() + 0:6), collapse = '|'), lines))) %>%
    # collapse each group to two columns
    summarise(V1 = lines[1], V2 = list(lines[-1]))

df
## # A tibble: 3 × 3
##     grp                                          V1        V2
##   <int>                                       <chr>    <list>
## 1     1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]>
## 2     2    [3]Sunday, August 14, 2016 at 9:17am EDT <chr [2]>
## 3     3    [6]Sunday, August 14, 2016 at 8:44am EDT <chr [1]>

This approach uses a list column for V2, which is probably the best option for preserving your data, but if you need a single string per row you can collapse each element with paste or toString.
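
As a minimal sketch of that collapsing step, the snippet below works on a stand-in list shaped like V2 (the `V2` object here is a hypothetical example with shortened text, not the result from the pipeline above). It only uses base R, so it should apply equally to a list column pulled out of a tibble or a data frame.

```r
# Hypothetical list column shaped like V2: one character vector per row
V2 <- list(
  "[2] Football time!!",
  c("[4]Michael shared Jason post.", "[5]This bird is a lot smarter"),
  "[7]Michael and Kurt are now friends."
)

# toString() joins the pieces of each element with ", "
V2_comma <- vapply(V2, toString, character(1))

# paste(collapse = ...) does the same but lets you pick the separator
V2_space <- vapply(V2, paste, character(1), collapse = " ")
```

Using vapply with character(1) rather than sapply makes the result a plain character vector of the same length as the list, so it can be assigned straight back to a column.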


Roughly equivalent base R:

df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE)

df <- df[df$V2 != '', , drop = FALSE]

# weekdays() needs dates, so pass seven consecutive days (names follow the session locale)
df$grp <- cumsum(grepl(paste(weekdays(Sys.Date() + 0:6), collapse = '|'), df$V2))

df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]})

df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]})

df
##   grp                                          V1
## 1   1 [1] Thursday, August 25, 2016 at 3:57pm EDT
## 2   2    [3]Sunday, August 14, 2016 at 9:17am EDT
## 3   3    [6]Sunday, August 14, 2016 at 8:44am EDT
##                                                                                                                                                                   V2
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
## 2                                        [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here
## 3                                                                                                                               [7]Michael and Kurt are now friends.
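
If the goal is exactly the two-column layout from the question, with the comments pasted into one string, the same cumsum(grepl()) grouping idea can be sketched without an explicit loop. This is only an illustration under some assumptions: `temp` stands in for the output of readLines() (shortened sample text, without the bracketed list numbers), the day names are hard-coded in English as in the question's `days` vector, and a date line is assumed to begin with a day name followed by a comma.

```r
# Sample lines standing in for readLines() on the real file (assumed content)
temp <- c(
  "Thursday, August 25, 2016 at 3:57pm EDT",
  "",
  "Football time!! We need to make plans!!!!",
  "Sunday, August 14, 2016 at 9:17am EDT",
  "Michael shared Jason post.",
  "This bird is a lot smarter than the majority of political posts",
  "Sunday, August 14, 2016 at 8:44am EDT",
  "Michael and Kurt are now friends."
)
temp <- temp[temp != ""]    # drop empty lines

days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

# Assumption: a date line starts with a day name followed by a comma
is_date <- grepl(paste0("^(", paste(days, collapse = "|"), "),"), temp)

# Each date line bumps the counter, so every line after it shares its group
grp <- cumsum(is_date)

# Fixing the factor levels keeps one slot per date line, so a date with
# no comments ends up as NA instead of shifting the rows
comment_grp <- factor(grp[!is_date], levels = seq_len(sum(is_date)))

result <- data.frame(
  V1 = temp[is_date],
  V2 = as.vector(tapply(temp[!is_date], comment_grp, paste, collapse = " ")),
  stringsAsFactors = FALSE
)
```

Anchoring the pattern at the start of the line is a design choice to avoid treating a comment that merely mentions a day (say, "See you Sunday") as a new row; if the real lines carry a prefix, the pattern would need adjusting.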

Upvotes: 1
