civy

Reputation: 423

Keep text inside HTML in rvest

I would like to know how I can keep only the text between the tags after using rvest to select nodes by a specific attribute on a given URL. This is the node set I get as output:

{xml_nodeset (11)}
 [1] <td id="open">1.1041</td>
 [2] <td id="open">1.1043</td>
 [3] <td id="open">1.1049</td>
 [4] <td id="open">1.1043</td>
 [5] <td class="right" id="open">47.617</td>
 [6] <td class="left" id="open">MA</td>

Ideally I want to isolate the contained text and get this

[1] 1.1041
[2] 1.1043
[3] 1.1049
[4] 1.1043
[5] 47.617
[6] MA

But so far, using the html_text function, I get a vector of strings with quotes ("") around each value, which is not what I want:

[1] "1.1041" "1.1043" "1.1049" "1.1043" "47.617" "MA"  

Upvotes: 1

Views: 149

Answers (1)

thepule

Reputation: 1751

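html_text returns a character vector, so every value is a string. That's why you get quotes around the numbers when the result is printed. The quotes are only part of how R displays strings, not part of the data. The last value, MA, also means the vector cannot be purely numeric.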
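If the goal is only to display the values without quotes, base R can do that without changing the data. The following is a minimal sketch; the variable name txt is just an illustration, and the values are copied from the question rather than scraped.

```r
# Values as html_text would return them (copied from the question)
txt <- c("1.1041", "1.1043", "1.1049", "1.1043", "47.617", "MA")

# Print without quotes; txt itself is still a character vector
print(txt, quote = FALSE)

# Same display, via a "noquote" wrapper object
noquote(txt)

# One value per line, with no [1] index prefix
cat(txt, sep = "\n")
```

None of these calls modify txt; they only change how it is shown.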

You can convert everything to numeric, but the last value would be coerced to NA.

q <- c("1.1041", "1.1043", "1.1049", "1.1043", "47.617", "MA")
as.numeric(q)

# The output of the previous command is:
[1]  1.1041  1.1043  1.1049  1.1043 47.6170      NA
Warning message:
NAs introduced by coercion 

So you have to decide what format you want your data in.
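As a rough sketch of two possible choices, assuming the same six values as above: either drop the entries that are not numbers, or keep the original text next to the converted value so nothing is lost. The names num, num_only and res are only illustrative.

```r
q <- c("1.1041", "1.1043", "1.1049", "1.1043", "47.617", "MA")

# Convert quietly; entries that are not numbers become NA
num <- suppressWarnings(as.numeric(q))

# Option 1: keep only the values that parsed as numbers
num_only <- num[!is.na(num)]

# Option 2: keep the original text alongside the numeric value
res <- data.frame(text = q, value = num, stringsAsFactors = FALSE)
```

Option 1 gives a plain numeric vector of five values; option 2 keeps MA as text with NA in the value column.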

Upvotes: 1
