xyy

Reputation: 547

How to vectorize the `paste` function in R?

Let's say I have a vector of strings (lines) that I scraped from a .txt webpage using readLines. Some of the lines will start with "<", generally indicating the start of a new paragraph. Some of the lines will start with a letter, generally indicating that it's connected to the line before it. I want to be able to concatenate lines that belong to the same paragraph.

My plan is to locate the lines that start with "<" and concatenate each such line with the lines that follow it, up to the next line that starts with "<". Essentially, I end up with a list of index ranges. For example, I may want to concatenate lines[1:3], lines[4:4], lines[5:9], and so on. Is there a way to vectorize this? I cannot just do paste(lines[begin.index:end.index]) with vectors of indices, but that may give you an idea of what I'm hoping to achieve.

Here's some dummy data as an example, since my actual data is too long:

[1] "<P> sampletextsampletext"
[2] "sampletextsampletext</P>" 
[3] "<P> sampletext"
[4] "sampletext"
[5] "sampletext</P>"
[6] "<P> sampletext </P>"

I would want to concatenate lines 1 and 2 together, and lines 3, 4, and 5 together, while line 6 stays the same.

Upvotes: 0

Views: 1632

Answers (2)

IRTFM

Reputation: 263331

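This is a base R solution. The two grep calls return the starting and stopping lines of each paragraph, and mapply passes each start/stop pair to a function that collapses those lines together with paste. It assumes every paragraph has exactly one opening <P> and one closing </P>, so that the two index vectors have the same length: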

> txt <- scan(what="")
1: "<P> sampletextsampletext"
2: "sampletextsampletext</P>" 
3: "<P> sampletext"
4: "sampletext"
5: "sampletext</P>"
6: "<P> sampletext </P>"
7: 
Read 6 items
> grep("<P>", txt)
[1] 1 3 6
> grep("</P>", txt)
[1] 2 5 6
> mapply( function(x,y) paste( txt[x:y], collapse=" "), grep("<P>", txt), grep("</P>", txt) )
[1] "<P> sampletextsampletext sampletextsampletext</P>"
[2] "<P> sampletext sampletext sampletext</P>"         
[3] "<P> sampletext </P>" 
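
If the closing tags are not reliable, the grouping can also be driven by the start lines alone, which is closer to the plan described in the question. The following is only a sketch under that assumption: it treats every line beginning with "<" as the start of a new paragraph (so a line consisting of just "</P>" would wrongly open a group), and the names `group` and `paragraphs` are introduced here for illustration.

```r
lines <- c("<P> sampletextsampletext",
           "sampletextsampletext</P>",
           "<P> sampletext",
           "sampletext",
           "sampletext</P>",
           "<P> sampletext </P>")

# TRUE for each line that starts with "<"; cumsum() turns those flags
# into a running group id, so continuation lines inherit the id of the
# most recent start line
group <- cumsum(grepl("^<", lines))

# split the lines by group id and collapse each group into one string;
# unname() drops the group labels from the result
paragraphs <- unname(vapply(split(lines, group), paste, character(1), collapse = " "))
```

Here `paste` is still called once per group, but the group ids are computed in a single vectorized step instead of from paired start and stop indices.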

Upvotes: 1

alistaire

Reputation: 43334

If you're trying to separate HTML nodes, it's better to use functions that understand HTML. This has the advantage of keeping you from needing to manually find start and end tags, too.

# read in data
lines <- c("<P> sampletextsampletext",
           "sampletextsampletext</P>" ,
           "<P> sampletext",
           "sampletext",
           "sampletext</P>",
           "<P> sampletext </P>")

# load a simple HTML scraping/parsing package
library(rvest)

# find all `<p>` tags and their contents
lines %>% paste(collapse = '') %>% read_html() %>% html_nodes('p')
# {xml_nodeset (3)}
# [1] <p> sampletextsampletextsampletextsampletext</p>
# [2] <p> sampletextsampletextsampletext</p>
# [3] <p> sampletext </p>
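
If what you need in the end is a character vector rather than a node set, one possible follow-up is to pass the nodes to `html_text()`. This is a sketch, assuming the rvest package is installed; collapsing with a space instead of an empty string is a choice made here so that the joined lines do not run together, and `paragraphs` is a name introduced for illustration.

```r
library(rvest)

lines <- c("<P> sampletextsampletext",
           "sampletextsampletext</P>",
           "<P> sampletext",
           "sampletext",
           "sampletext</P>",
           "<P> sampletext </P>")

# join the lines with a space, parse the result as HTML,
# select the paragraph nodes, and extract their text
# (trim = TRUE strips leading and trailing whitespace)
paragraphs <- lines %>%
  paste(collapse = " ") %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text(trim = TRUE)
```

Note that this returns the text without the surrounding tags, unlike the mapply approach, which keeps them.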

Upvotes: 4
