Gathering the correct number of digits for numbers when text mining

Question

I need to search for specific information within a set of documents that follows the same standard layout.

After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.

One piece of data I have to collect is the Total Power, which appears as follows:

TotalPower: 986559. (UoPow)

Since I had already correctly selected this excerpt, I created the following function, which takes the characters between positions n and m, where n and m are counted from the right end of the string.

substrRight <- function(x, n,m){
 substr(x, nchar(x)-n+1, nchar(x)-m)
}

It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.

So I wrote:

TotalP = substrRight(myDf[i],17,9)        [1]

where myDf is a character vector with all the relevant observations.

Line [1], after I loop over all my observations, gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" the 9 as the last digit.

The code seems to work fine for the rest of the data. This number (986559) is the highest in the data and the only one of order of magnitude 10^5.

How can I make sure that I will gather all digits in every number?

Thank you for the help.

akrun · Accepted Answer

We can extract the digits before a . by using a regex lookahead:

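library(stringr)
str1 <- "TotalPower:  986559. (UoPow)"  # sample line from the question
str_extract(str1, "\\d+(?=\\.)")        # backslashes must be doubled in R strings
#[1] "986559"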

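The \d+ matches one or more digits, and the lookahead (?=\.) requires that they be followed by a literal . without including it in the match. Because the pattern does not depend on character positions, it works no matter how many digits the number has.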
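The same idea can be applied to the whole character vector at once, without a loop. This is a minimal base R sketch rather than the code from the answer: the sample lines in myDf are made up to follow the layout described in the question, and regexpr/regmatches stand in for str_extract.

```r
# Hypothetical sample lines following the layout described in the question
myDf <- c("TotalPower:  986559. (UoPow)",
          "TotalPower:  12345. (UoPow)",
          "TotalPower:  7. (UoPow)")

# Find a run of digits that is immediately followed by a literal dot.
# perl = TRUE is required for the lookahead (?=...).
m <- regexpr("[0-9]+(?=\\.)", myDf, perl = TRUE)

# Pull out the matched text and convert it to numbers
TotalP <- as.numeric(regmatches(myDf, m))
print(TotalP)
```

Here TotalP holds 986559, 12345 and 7, so the number of digits no longer matters. Note that regmatches drops elements with no match, so if some lines might lack the pattern, the result can be shorter than myDf.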
