mtotof
mtotof

Reputation: 69

Spreading character string into multiple columns with strsplit pattern matching

This is my first time scraping text from a PDF document. I'm presenting what I think is the most useful format for the data for what I'm doing, but I could be wrong. Once I cleaned the PDF text, I formatted it into a tibble (below).

I tried utilizing strsplit(dmt, \\s+) to split the character string into three separate columns but that just separated everything completely. I had used str_squish() to eliminate the spaces in the middle text portion of the string but that did not help the pattern matching.

The first numeric part of the character string sometimes ends with a ) or a number. Here is what I'm working with:

dmt
# A tibble: 612 x 1
   datamatrixtest[,1]                                             
   <chr>                                                          
 1 110.05          Human Service Vehicle Inspection Reqd         6
 2 23.33(12)(b)    ATV-Fail/Stop for Law Enforce. Official       1
 3 23.33(6)(a)     ATV-Fail/Display Lighted Headlamp             1
 4 341.03          Oper Veh After Sus/Rev or Can of Reg      8,862
 5 341.04(1)       Non-Registration of Vehicle              10,125
 6 341.04(2)       Improper Registration of Vehicle              4
 7 341.15(1)       Fail/Display Vehicle License Plates       2,010
 8 341.15(1m)(a)   Fail/Attach Rear Regis. Decal/Tag             3
 9 341.15(1m)(b)   Fail/Attach Front Regis. Decal/Tag            2
10 341.15(2)       Improperly Attached License Plates            7
# ... with 602 more rows

Ideally, I could utilize strsplit with an accurate pattern match to put the data into three separate columns.

dmt
# A tibble: 612 x 3
   statute         offense                                    cases
   <chr>           <chr>                                       <num>        
 1 110.05          Human Service Vehicle Inspection Reqd         6
 2 23.33(12)(b)    ATV-Fail/Stop for Law Enforce. Official       1
 3 23.33(6)(a)     ATV-Fail/Display Lighted Headlamp             1
 4 341.03          Oper Veh After Sus/Rev or Can of Reg      8,862
 5 341.04(1)       Non-Registration of Vehicle              10,125
 6 341.04(2)       Improper Registration of Vehicle              4
 7 341.15(1)       Fail/Display Vehicle License Plates       2,010
 8 341.15(1m)(a)   Fail/Attach Rear Regis. Decal/Tag             3
 9 341.15(1m)(b)   Fail/Attach Front Regis. Decal/Tag            2
10 341.15(2)       Improperly Attached License Plates            7

Upvotes: 2

Views: 190

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388907

Based upon your data, we can also use tidyr::extract by defining different capture groups.

library(dplyr)
library(tidyr)

df %>%
   extract(datamatrixtest, into = c("statute", "offense", "cases"),
           regex = "(.*?)\\s(.*?)(\\d.*)") %>%
    mutate_all(trimws)


#   statute       offense                                 cases 
#   <chr>         <chr>                                   <chr> 
# 1 110.05        Human Service Vehicle Inspection Reqd   6     
# 2 23.33(12)(b)  ATV-Fail/Stop for Law Enforce. Official 1     
# 3 23.33(6)(a)   ATV-Fail/Display Lighted Headlamp       1     
# 4 341.03        Oper Veh After Sus/Rev or Can of Reg    8,862 
# 5 341.04(1)     Non-Registration of Vehicle             10,125
# 6 341.04(2)     Improper Registration of Vehicle        4     
# 7 341.15(1)     Fail/Display Vehicle License Plates     2,010 
# 8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag       3     
# 9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag      2     
#10 1341.15(2)    Improperly Attached License Plates      7  

Here, we define three groups, first starts from beginning of the text till the first whitespace is encountered, second one starts from where the first one ends till a number is encountered and the third one takes everything from the number till the end of the sentence.

data

df <- structure(list(datamatrixtest = c("110.05          Human Service 
Vehicle Inspection Reqd         6", 
"23.33(12)(b)    ATV-Fail/Stop for Law Enforce. Official       1", 
"23.33(6)(a)     ATV-Fail/Display Lighted Headlamp             1", 
"341.03          Oper Veh After Sus/Rev or Can of Reg      8,862", 
"341.04(1)       Non-Registration of Vehicle              10,125", 
"341.04(2)       Improper Registration of Vehicle              4", 
"341.15(1)       Fail/Display Vehicle License Plates       2,010", 
"341.15(1m)(a)   Fail/Attach Rear Regis. Decal/Tag             3", 
"341.15(1m)(b)   Fail/Attach Front Regis. Decal/Tag            2", 
"1341.15(2)       Improperly Attached License Plates            7"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 1

Calum You
Calum You

Reputation: 15072

I am assuming that your data is essentially as presented, with multiple spaces between columns. In other words, check that your dmt is comparable to the one I've created below. In that case, we can just split each line on any section of more than one space with \\s{2,} like this. If your data is not like this, or if any individual field happens to contain multiple spaces, then use dput and head to provide a sample so we can find a more precise pattern that will work.

library(tidyverse)
dmt <- read_lines(
"110.05          Human Service Vehicle Inspection Reqd         6
23.33(12)(b)    ATV-Fail/Stop for Law Enforce. Official       1
23.33(6)(a)     ATV-Fail/Display Lighted Headlamp             1
341.03          Oper Veh After Sus/Rev or Can of Reg      8,862
341.04(1)       Non-Registration of Vehicle              10,125
341.04(2)       Improper Registration of Vehicle              4
341.15(1)       Fail/Display Vehicle License Plates       2,010
341.15(1m)(a)   Fail/Attach Rear Regis. Decal/Tag             3
341.15(1m)(b)   Fail/Attach Front Regis. Decal/Tag            2
1341.15(2)       Improperly Attached License Plates            7"
) %>%
  enframe(name = NULL, value = "line")

dmt %>%
  separate(line, c("statute", "offense", "cases"), sep = "\\s{2,}") %>%
  mutate(cases = cases %>% str_remove_all(",") %>% as.integer)
#> # A tibble: 10 x 3
#>    statute       offense                                 cases
#>    <chr>         <chr>                                   <int>
#>  1 110.05        Human Service Vehicle Inspection Reqd       6
#>  2 23.33(12)(b)  ATV-Fail/Stop for Law Enforce. Official     1
#>  3 23.33(6)(a)   ATV-Fail/Display Lighted Headlamp           1
#>  4 341.03        Oper Veh After Sus/Rev or Can of Reg     8862
#>  5 341.04(1)     Non-Registration of Vehicle             10125
#>  6 341.04(2)     Improper Registration of Vehicle            4
#>  7 341.15(1)     Fail/Display Vehicle License Plates      2010
#>  8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag           3
#>  9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag          2
#> 10 1341.15(2)    Improperly Attached License Plates          7

Created on 2019-09-23 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions