NexifySummer

Reputation: 27

Extracting numbers from very long string into vector

I have a fairly long string (about 50k characters), available here:

https://gist.github.com/anonymous/9de31de2e6fc9888f3debeda4698b739

I want to extract the numbers (always 1 or 2 digits), which always sit between "'>" and "<", and add them to a vector in the order they appear.

For example:

><td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td>

should produce the vector [13, 9].

I couldn't even get the string into R. I tried assigning it in this form:

mystring <- "text here"

When I pressed Enter, the console just showed a + prompt, so I think some of the symbols in the text were breaking the command.

Upvotes: 2

Views: 74

Answers (2)

alistaire

Reputation: 43364

Since it's HTML that you're trying to parse, it's best to use an HTML parsing package like rvest:

library(rvest)

url <- 'https://gist.githubusercontent.com/anonymous/9de31de2e6fc9888f3debeda4698b739/raw/c07c2d6c6f00060806b15ec57ed06d4a4e0d9d74/gistfile1.txt' 

url %>% read_html() %>% html_nodes('td.td-val') %>% html_text() %>% as.integer()

which returns

   [1] 13  9  8  8  1  2  0  8 11  2 13  5 13  4  4  5  4  7  3  8 10 13  1  7 14 13 10  2  0  8
  [31] 13  0 10  5 11  9  3  1  4  3  5 12  4 14  1  9 13  5  9  7 12 10  2 10 14  4 11 11 13  8
  [61]  8 10 10 12 12  6  8 13  7  2  2  9 10  9 13  3 14 14  0 14  4 11 14  6 10  2  0  0 10 14
  [91]  2  8  3  6 14  6  1  9 11 12  1 12  4  0  7  9  2 10  1 12  0  8  0  9  3 11 11  0  8  5
 [121]  0  6  1  9  8 10  7  4  7  0  3 12 10 11 11  8  4 11  1  5 12  2 14  9 12  8  1  9 14 13
 [151]  8  2  1  5  7  9 14 14 12  3  6  3  9  0  6  9  3  3 10  3  8  6  9  2  4 12  2  2 14  7
 [181] 12  8  0  8 12  2 12  9  6  8  9  9  3  7  9  0  6 13  0 12  3 14 12  4  8  9 14  4  5  9
 [211]  6  3  2  5  1  2  0  5  0  5  9  0 12 14 11 11  7  4 12  1 14  2 13  3 13  2  0 12 13  6
 [241]  5  3 13  9 12  2 11  6  8 12  9  6 13  9  0  0  4  2  1  0  0  3  0  3  7  9 11  1  8 10
 [271] 11 13 12  9 10  8 10  3  7 12  4  9  0  4 14  1  7  0  7  1  2  6  0  6  6  1  0  9  4  8
 [301]  0  7 13  8 11  4  1 12  1 14 11 13  9 12  8  2  8  7 12 13 12  5  8  5 10  2  7  5  9 12
 [331] 12 13  8  7  6  4 12 13  4  9 12  2  0 11  8  9  1 10  5 10  9 11 10  1  8  1 12 10  9  5
 [361]  7 10  5  2  7 12  4 10  6  9  0  6  0  4 13  7  0  8  3  3 11  8  4 12 10  5  7  1 11  3
 [391]  1 11  7 14 13 13 14  4  2 11  2 12  3  6 14 10  6 13  9 12  4 13 10  3  9 11  8  4  8 10
 [421]  9  6  3  6  7  5 11  0  2  7  6 11 11 13 13 12  7  9  6  9  5 12 14  3 13 10  1  2  7  1
 [451] 14  1  0  7  8 13  6  3  9 12  2  2  2  7 11  1  2 14  6 13 11  3  6 11  5  9  0  9 13 10
 [481] 11 13  3 12 12  3  7  6  5 14  3  9 10  6 13  5  7  4  5 12  8 14  5  6  8  7  0  0  2  1
 [511]  1  9 13 13  5  6 10  8  0  2  3  4  4  5 14 13  5  2  2  4  6  5  9  6 14  8  4 12  4  6
 [541]  9  1  4  2  4  9  1  7  1 10  0  1  1  8  6  5  8  4  9 11 14  2  3  8  2 11  3  7 11  2
 [571]  4  9  5  3  4  1  4  8 13  4  8  8  1  7  2  7  3 11 13  1 13  7  9  3  7  7  4 12  9 14
 [601] 11  9  2 12 12 14 10  4 12 11 12 10 14  3 11  6 12  3  6  3 11  8 10  2  6  3  1 11  2  6
 [631]  0  8 12  5  5  3  6  2 14 11  7 14 14  8 11  2  7  0 10  2  0  4  8  9  8  3  2 13  4 10
 [661]  2  5 13  2  2 12 12  0 10  4  1  5 13  3 10  3 11  2  5  3  9  6 11  0  8 12  0 11  2 11
 [691]  7  8  1  3  4 14  4  4  9  5 12  7  6  9 12 13  2 11  1 11 12  0  4  6 10  8  5 14  7  6
 [721]  4  7  2  5  2 14  3  8 10  6 14  7 14  3  2  6  5  0  3  0 12  0 12  3  5  5  8  5 14  6
 [751] 10 14  5  2  3 11  3  4  3 11  4  2  0 11 11 13  4  0  6 14  2  6  9 10  4  9  5  7  1 13
 [781]  8  3 13  3 10  4  8  1  3 11  2  8  5 10  7  6 10 14 14  2  2 12  8  4 13  7 11 13  4  5
 [811]  7  2  3  8 14  3  9 12  6  2  6  0  3  5  8  8  0 14 13 13  7 10  9  6  1  0  4  8  6  8
 [841] 14  1  9  0  9  2  7 10  8  5 10  7  1  8  2 13  3  1  8 12 12  2  5  6  3  9  4  5  4 13
 [871]  6  3 10  7  9  2  1 12  1 11  0 10  0 11  8  8  0  7  0 11 10  3 14  6  9 11 11  0 12  1
 [901] 10 13  1  7  7  2  0  3 13  9  2  4 12  3  0 11  1  8  8 13 12  6  8 13  8  1 13 11  2  9
 [931] 11  8 10  8  3 14  6 14  7  6  7 10  3 11  3 13 11  3  9 13  8 10  8  7 12  4 11 12 12  9
 [961]  6 10  2  8 13  7 11  5  7 12 10 14  1  6  7  6  7  2  3  5 13  6 10  9  5  2  0  1 11  8
 [991]  9  5  1  3  3  1 12  1 13  2 14  5  7  1 10  9  0  9 11 10  6  2  7 12 10  6  2 10 13  4
[1021]  9  9 14  4  4  5  7 13 13 13  6  7 12  1  6 11 12 14  4 11  6  4 10  0  9 12 10 10 13  8
[1051]  3  3  0  8  5 14 10  3  7  5  0 14  5  6 10 14  7  4  8  9  1  6 14  1 14  5  5 14  4 11
[1081] 12 14  9 13 14 13  2 13 11  9 14  2  1  9  8 11 13 11 14 13  3  4  9  6  9  6 10 13  1 12
[1111] 10 14 11  5  8  9  3  5  6 14  1 11 10 12  7  7  2 13 13 12 12  4  3 14  6  4  2  5  9  4
[1141] 14 11  6  4 11  6  4  4  8  2  2  5 14  1  7 11  8  9 11 11 10  6 14  3  0  3  8  8 14 13
[1171] 10  6 10  4  9 12  0  9  2  9 13 12  1 12  3  5  5  3 12  2  1  5  1  0 10  7  3 10 14 13
[1201] 11  8  0 10 12  9  4  5  4  8  5  6  2 11  7  5  5  8  4  9  9 10 14  3  7  9  1  9  9  8
[1231]  1  8 11  5  2  4  9 14 14  6 10  7  4 14  6  5  1  4  3  8 13 10  5  1  8  8  6  8  7  1
[1261] 14  4  4  7  2 12 10  8 10  5  6  7  2  3  5 13  1  2  9  8  5 14  1 11  9  5  8 12 13  0
[1291]  4  2  0  8  8  2  5  3 13 11  5 11 14 14  9 12  4  5  9  3 13 14  1  5 10  4  9  6  5  8
[1321]  7  5  7  3 14  8  4  8  4  6  5  8 11  0 14 13  2 13 12 13  3  4  7  8 11  4 14 12  3  6
[1351] 11  8  8  9  6  7  4  3 10  9  2  9 12 12  0  1 10  9  8  0 12  9  3 14 13  7  8 12 10  9
[1381] 10 10  2 11
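To see what the selector is doing without fetching the gist, here is a minimal sketch that runs the same pipeline on the two-cell example from the question. The `snippet` string and the `<table><tr>` wrapper around it are assumptions added for illustration (the wrapper only makes the fragment well-formed HTML); they are not part of the original data.

```r
library(rvest)

# Hypothetical stand-in for the gist contents: the two cells from the
# question, wrapped in <table><tr> so the fragment is well-formed HTML.
snippet <- "<table><tr><td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td></tr></table>"

values <- read_html(snippet) %>%   # parse the literal HTML string
  html_nodes("td.td-val") %>%      # select <td> elements with class td-val
  html_text() %>%                  # pull out the text inside each cell
  as.integer()                     # convert "13", "9" to integers

values
```

Because the nodes are returned in document order, the resulting vector keeps the order in which the numbers appear in the HTML.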

Upvotes: 3

akuiper

Reputation: 215137

You can use readLines to import the string from the URL, which you can get by clicking the Raw button on the gist page.

mystring <- readLines("https://gist.githubusercontent.com/anonymous/9de31de2e6fc9888f3debeda4698b739/raw/c07c2d6c6f00060806b15ec57ed06d4a4e0d9d74/gistfile1.txt")
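The same approach should work for a copy of the text saved on disk, which avoids pasting it into the console at all. A `+` prompt means R is still waiting for the expression to be completed (for example, a quote that was never closed), and a very long pasted line may also run into the console's input length limit. The sketch below is only an illustration: the temporary file and the `sample_html` string are hypothetical stand-ins for a saved copy of the gist.

```r
# Hypothetical stand-in for the saved gist file: write a short sample
# to a temporary file so the example is self-contained.
sample_html <- "<td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td>"
path <- tempfile(fileext = ".txt")
writeLines(sample_html, path)

# Read the file back; readLines returns one element per line, so
# collapse the lines into a single string.
mystring <- paste(readLines(path, warn = FALSE), collapse = "")

nchar(mystring)
```

Reading from a file or URL means the quotes and angle brackets in the text never have to be escaped inside an R string literal.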

A regular expression like the following should give you all the numbers you want:

library(stringr)
# Extract every ">digits<" match, then strip the surrounding angle brackets
num <- gsub(">|<", "", str_extract_all(mystring, ">\\d+<", simplify = TRUE))

head(as.vector(num))
[1] "13" "9"  "8"  "8"  "1"  "2" 
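Note that the result above is a character vector. As a rough alternative, here is a base R sketch that needs no packages and returns integers directly. It assumes, as the question states, that every wanted number has 1 or 2 digits and sits between "'>" and "<"; the `html` string is just the example from the question, not the full gist.

```r
# Example input taken from the question (not the full gist)
html <- "<td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td>"

# Match 1-2 digits that are preceded by "'>" and followed by "<".
# Lookbehind and lookahead need the PCRE engine, hence perl = TRUE.
matches <- regmatches(html, gregexpr("(?<='>)\\d{1,2}(?=<)", html, perl = TRUE))

# regmatches returns a list with one element per input string;
# flatten it and convert to integers, keeping the original order.
nums <- as.integer(unlist(matches))
nums
```

The lookarounds keep the delimiters out of the match, so there is no separate step to strip ">" and "<" afterwards.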

Upvotes: 2
