Reputation: 69
This is my first time scraping text from a PDF document. I'm presenting what I think is the most useful format for the data for what I'm doing, but I could be wrong. Once I cleaned the PDF text, I formatted it into a tibble
(below).
I tried utilizing strsplit(dmt, \\s+)
to split the character string into three separate columns but that just separated everything completely. I had used str_squish()
to eliminate the spaces in the middle text portion of the string but that did not help the pattern matching.
The first numeric part of the character string sometimes ends with a )
or a number
. Here is what I'm working with:
dmt
# A tibble: 612 x 1
datamatrixtest[,1]
<chr>
1 110.05 Human Service Vehicle Inspection Reqd 6
2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
5 341.04(1) Non-Registration of Vehicle 10,125
6 341.04(2) Improper Registration of Vehicle 4
7 341.15(1) Fail/Display Vehicle License Plates 2,010
8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
10 341.15(2) Improperly Attached License Plates 7
# ... with 602 more rows
Ideally, I could utilize strsplit
with an accurate pattern match to put the data into three separate columns.
dmt
# A tibble: 612 x 3
statute offense cases
<chr> <chr> <num>
1 110.05 Human Service Vehicle Inspection Reqd 6
2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
5 341.04(1) Non-Registration of Vehicle 10,125
6 341.04(2) Improper Registration of Vehicle 4
7 341.15(1) Fail/Display Vehicle License Plates 2,010
8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
10 341.15(2) Improperly Attached License Plates 7
Upvotes: 2
Views: 190
Reputation: 388907
Based upon your data, we can also use tidyr::extract
by defining different capture groups.
library(dplyr)
library(tidyr)
df %>%
extract(datamatrixtest, into = c("statute", "offense", "cases"),
regex = "(.*?)\\s(.*?)(\\d.*)") %>%
mutate_all(trimws)
# statute offense cases
# <chr> <chr> <chr>
# 1 110.05 Human Service Vehicle Inspection Reqd 6
# 2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
# 3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
# 4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
# 5 341.04(1) Non-Registration of Vehicle 10,125
# 6 341.04(2) Improper Registration of Vehicle 4
# 7 341.15(1) Fail/Display Vehicle License Plates 2,010
# 8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
# 9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
#10 1341.15(2) Improperly Attached License Plates 7
Here, we define three groups, first starts from beginning of the text till the first whitespace is encountered, second one starts from where the first one ends till a number is encountered and the third one takes everything from the number till the end of the sentence.
data
df <- structure(list(datamatrixtest = c("110.05 Human Service
Vehicle Inspection Reqd 6",
"23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1",
"23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1",
"341.03 Oper Veh After Sus/Rev or Can of Reg 8,862",
"341.04(1) Non-Registration of Vehicle 10,125",
"341.04(2) Improper Registration of Vehicle 4",
"341.15(1) Fail/Display Vehicle License Plates 2,010",
"341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3",
"341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2",
"1341.15(2) Improperly Attached License Plates 7"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1
Reputation: 15072
I am assuming that your data is essentially as presented, with multiple spaces between columns. In other words, check that your dmt
is comparable to the one I've created below. In that case, we can just split each line on any section of more than one space with \\s{2,}
like this. If your data is not like this, or if any individual field happens to contain multiple spaces, then use dput
and head
to provide a sample so we can find a more precise pattern that will work.
library(tidyverse)
dmt <- read_lines(
"110.05 Human Service Vehicle Inspection Reqd 6
23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
341.04(1) Non-Registration of Vehicle 10,125
341.04(2) Improper Registration of Vehicle 4
341.15(1) Fail/Display Vehicle License Plates 2,010
341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
1341.15(2) Improperly Attached License Plates 7"
) %>%
enframe(name = NULL, value = "line")
dmt %>%
separate(line, c("statute", "offense", "cases"), sep = "\\s{2,}") %>%
mutate(cases = cases %>% str_remove_all(",") %>% as.integer)
#> # A tibble: 10 x 3
#> statute offense cases
#> <chr> <chr> <int>
#> 1 110.05 Human Service Vehicle Inspection Reqd 6
#> 2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
#> 3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
#> 4 341.03 Oper Veh After Sus/Rev or Can of Reg 8862
#> 5 341.04(1) Non-Registration of Vehicle 10125
#> 6 341.04(2) Improper Registration of Vehicle 4
#> 7 341.15(1) Fail/Display Vehicle License Plates 2010
#> 8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
#> 9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
#> 10 1341.15(2) Improperly Attached License Plates 7
Created on 2019-09-23 by the reprex package (v0.3.0)
Upvotes: 1