Lionette

Reputation: 83

How do I use the new line character in R?

I need to make a table out of a text file in R so I can do statistics on it. My text file contains special characters like "$" and also the newline character (the paragraph mark in Microsoft Word, equivalent to ^p in Word's find/replace).

I read this post, but it did not answer my question. For example, my text file is like:

-The$data 1 is taken on Aug, 2009 at UBC
and is significant with p value <0.01

-The$data 2 is taken on Sep, 2012 at SFU
and is  not significant with p value > 0.06

-....

Using multiple find/replace operations with gsub, I want to make a table like this:

1,Aug,2009,UBC,,p value <0.01
2,Sep,2012,SFU,not,p value > 0.06

It would also be helpful to know of any package/function for extracting a table from a text file.
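For reference, what Word shows as ^p corresponds to the escape sequence "\n" inside an R string, so a first gsub pass might look like this (sample string taken from the data above):

```r
# "\n" is the newline character inside an R string; gsub() can
# replace it, e.g. to re-join a sentence wrapped across two lines:
x <- "-The$data 1 is taken on Aug, 2009 at UBC\nand is significant with p value <0.01"
gsub("\n", " ", x, fixed = TRUE)
# "-The$data 1 is taken on Aug, 2009 at UBC and is significant with p value <0.01"
```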

Upvotes: 0

Views: 481

Answers (1)

r2evans

Reputation: 160447

Regex solutions are incredibly sensitive to the structure of the sentences, and since yours have irregular spacing I'm inferring that they are either human-generated or produced by an irregular/inconsistent process. Deviations from this pattern will certainly cause portions to break.

As such, I'm making this as specific and robust as possible so that (1) columns are preserved even if not found, and (2) miscreant sentences don't gum up the works.

I assume that you would read in your data with something like:

dat <- readLines("path/to/file.txt")

so for sample data, I'm going to use

dat <- strsplit("-The$data 1 is taken on Aug, 2009 at UBC
and is significant with p value <0.01

-The$data 2 is taken on Sep, 2012 at SFU
and is  not significant with p value > 0.06

-This$datum is different from the others
and is not significant", "[\n\r]")[[1]]

From here, I'll use a trick of cumsum(grepl(...)) to find instances where I know a line is starting, then group the following lines together.

cumsum(grepl("^-", dat))
# [1] 1 1 1 2 2 2 3 3
combined <- unlist(as.list(by(dat, cumsum(grepl("^-", dat)), paste, collapse = "\n")), use.names=FALSE)
combined
# [1] "-The$data 1 is taken on Aug, 2009 at UBC\nand is significant with p value <0.01\n"      
# [2] "-The$data 2 is taken on Sep, 2012 at SFU\nand is  not significant with p value > 0.06\n"
# [3] "-This$datum is different from the others\nand is not significant"                       

Now that the lines are grouped logically, here's a verbose but (I believe) mostly robust method for parsing out the columns you desire. (I should note that it is certainly feasible to write a single regex that tries to capture everything; the challenge is deciding whether you want to capture whatever is present or simply fail outright when something is not right. I'm leaning towards saving what you can and determining later which pattern is falling short; if you would rather discard an entire record when one small portion of the pattern doesn't match, then this can likely be reduced to a single pattern.)

patterns <- c(
  "(?<=data )[0-9]+(?= is taken)",
  "(?<=taken on )\\w+(?=, 2)",
  "(?<=, )2[0-9]{3}\\b",
  "(?<= at )\\w+(?=\n)",
  "(?<=and is ).*(?=significant)",
  "(?<=significant with).*"
)

lapply(patterns, function(ptn) {
  trimws(sapply(regmatches(combined, gregexpr(ptn, combined, perl = TRUE)), `length<-`, 1))
})
# [[1]]
# [1] "1" "2" NA 
# [[2]]
# [1] "Aug" "Sep" NA   
# [[3]]
# [1] "2009" "2012" NA    
# [[4]]
# [1] "UBC" "SFU" NA   
# [[5]]
# [1] ""    "not" "not"
# [[6]]
# [1] "p value <0.01"  "p value > 0.06" NA              
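For completeness, the single-pattern alternative mentioned above might look like the following sketch: one regexec() call captures all fields at once, but an all-or-nothing match means the deviant third sentence yields nothing instead of partial columns.

```r
# Same three grouped sentences as above, inlined to keep this self-contained:
combined <- c(
  "-The$data 1 is taken on Aug, 2009 at UBC\nand is significant with p value <0.01\n",
  "-The$data 2 is taken on Sep, 2012 at SFU\nand is  not significant with p value > 0.06\n",
  "-This$datum is different from the others\nand is not significant"
)
# One regex with six capture groups: number, month, year, acronym,
# optional "not", and the trailing p-value text.
ptn <- paste0(
  "data ([0-9]+) is taken on (\\w+), (2[0-9]{3}) at (\\w+)\n",
  "and is *(not)? *significant with(.*)"
)
m <- regmatches(combined, regexec(ptn, combined))
# Only the first two sentences yield captures; the third (no "with ...")
# returns character(0), dropping the whole record.
lengths(m)
```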

That output can easily be captured, named, and frame-ized with something like:

as.data.frame(setNames(
  lapply(patterns, function(ptn) {
    trimws(sapply(regmatches(combined, gregexpr(ptn, combined, perl = TRUE)), `length<-`, 1))
  }),
  c("number", "month", "year", "acronym", "not", "pvalue")),
  stringsAsFactors = FALSE)
#   number month year acronym not         pvalue
# 1      1   Aug 2009     UBC      p value <0.01
# 2      2   Sep 2012     SFU not p value > 0.06
# 3   <NA>  <NA> <NA>    <NA> not           <NA>
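Since every column comes out of regmatches as character, a hypothetical follow-up (not part of the original answer) is to let base R's type.convert() re-type each column, so number and year become integer while the text columns stay character:

```r
# Illustrative frame with the same shape as the parsed output above:
df <- data.frame(
  number = c("1", "2", NA),
  month  = c("Aug", "Sep", NA),
  year   = c("2009", "2012", NA),
  stringsAsFactors = FALSE
)
# type.convert() (utils, applied column-wise to a data.frame) infers each
# column's natural type; as.is = TRUE keeps strings character, not factor.
df2 <- type.convert(df, as.is = TRUE)
str(df2)
# number and year become integer (with NA preserved), month stays character
```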

Upvotes: 1
