Willians Rujano

Reputation: 19

Parsing text in R without a separator

I need ideas for parsing this text, and I want to do it as automatically as possible.

This is the text:

text <-  "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"

I need this result:

a            b
JOHN DEERE   PMWF2126
NEW HOLLAND  441702A1
HIFI         WE 2126
CUMMINS      4907485

This is only an example; the real data contains different brands and item IDs.

I tried:

str_split(text, " ")

[[1]]
 [1] "JOHN"     "DEERE:"   "PMWF2126" "NEW"      "HOLLAND:" "441702A1" "HIFI:"    "WE"       "2126"    
[10] "CUMMINS:" "4907485" 

Thanks!

Edit:

Thanks for your answers, they were very helpful.

But there is another case, where the item ID can also end with a letter:

text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 JCB: 32917302 WIX: 46490 TURBO: TR25902 HIFI: SA 16080 CATERPILLAR: 4431570 KOMATSU: Z7602BXK06 KOMATSU: Z7602BX106 KOMATSU: YM12991012501 KOMATSU: YM12991012500 KOMATSU: YM11900512571 KOMATSU: 6001851320 KOMATSU: 6001851300 KOMATSU: 3EB0234790 KOMATSU: 11900512571"

Upvotes: 1

Views: 65

Answers (3)

Chris Ruehlemann

Reputation: 21400

We can use separate_rows and separate from tidyr for this task:

library(tidyverse)
data.frame(text) %>%
  # separate into rows:
  separate_rows(text, sep = "(?<=\\d)\\s") %>%
  # separate into columns:
  separate(text,
           into = c("a", "b"),
           sep = ":\\s")
# A tibble: 4 × 2
  a           b       
  <chr>       <chr>   
1 JOHN DEERE  PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI        WE 2126 
4 CUMMINS     4907485 

The split point for separate_rows uses the look-behind (?<=\\d) to assert that the whitespace \\s on which the string is split must be preceded by a digit (\\d).

Data:

text <-  "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
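
To see the look-behind split in isolation, here is a minimal base R sketch of the same two steps. It is not the tidyr code above, only an illustration that uses strsplit instead; the names parts, fields and result are made up for this example, and it assumes every ID ends in a digit.

```r
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"

# Step 1: split on whitespace that is preceded by a digit.
# Look-behind assertions need perl = TRUE in base R.
parts <- strsplit(text, "(?<=\\d)\\s", perl = TRUE)[[1]]

# Step 2: split each piece into brand and ID on the colon plus whitespace.
fields <- do.call(rbind, strsplit(parts, ":\\s"))
result <- data.frame(a = fields[, 1], b = fields[, 2],
                     stringsAsFactors = FALSE)
print(result)
```

The space inside "WE 2126" is left alone because it is preceded by a letter, not a digit.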

Upvotes: 1

G. Grothendieck

Reputation: 269546

Since the second field always ends in a digit and the first field does not, replace each digit followed by a space with that digit and a newline, and then use read.table with a colon separator. (The _ placeholder used with |> requires R 4.2 or later.)

text |>
  gsub("(\\d) ", "\\1\n", x = _) |>
  read.table(text = _, sep = ":", strip.white = TRUE)

giving

           V1       V2
1  JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3        HIFI  WE 2126
4     CUMMINS  4907485

If in your data the second field can contain a digit but the first cannot, and the digit is not necessarily at the end of the last word of field two but may be anywhere in that word, we can use this variation, which gives the same result here. gsubfn is like gsub except that the second argument can be a function instead of a replacement string; the function receives the match (or the capture groups, if the pattern has any) as input, and its output replaces the entire match. The function can be written in formula notation, as is done here.

library(gsubfn)
text |> 
  gsubfn("\\w+", ~ if (grepl("[0-9]", x)) paste(x, "\n") else x, x = _) |>
  read.table(text = _, sep = ":", strip.white = TRUE)
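
If installing gsubfn is not an option, the same idea (put a newline after any word that contains a digit) can be approximated with plain gsub. This is a sketch of a substitute technique, not the gsubfn call above; the names lines and result are illustrative, and it assumes a "word" is a run of \w characters directly followed by a space.

```r
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"

# A word (\w characters) that contains at least one digit, followed by a space:
# keep the word and replace the space with a newline.
lines <- gsub("(\\w*\\d\\w*) ", "\\1\n", text, perl = TRUE)

# Read the newline-separated records, splitting fields on the colon.
# colClasses keeps both columns as character vectors.
result <- read.table(text = lines, sep = ":", strip.white = TRUE,
                     colClasses = "character")
print(result)
```

Like the gsubfn version, this breaks on IDs such as VP-2726/S, where the word containing the digit is followed by punctuation rather than a space.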

Upvotes: 0

Wimpel

Reputation: 27732

This solution assumes (as in your sample data) that the second value always ends with a digit and the first value does not. If this is not the case, you'll have to adapt the regex (?<=[0-9] )(?=[A-Z]) so that the split point lies between the two parenthesised parts.

text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"

lapply(
  strsplit(
    unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)), 
    ":"), trimws)

[[1]]
[1] "JOHN DEERE" "PMWF2126"  

[[2]]
[1] "NEW HOLLAND" "441702A1"   

[[3]]
[1] "HIFI"    "WE 2126"

[[4]]
[1] "CUMMINS" "4907485"

The key part is strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE). It looks for positions where a digit followed by a space, (?<=[0-9] ), is immediately followed by a new part starting with a capital letter, (?=[A-Z]). These positions are then used as split points.
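
For the case in the question's edit, where an ID can end in a letter or contain punctuation (EF903R, VP-2726/S), one possible adaptation is to anchor on the brand names instead of the digits. The sketch below is only an illustration under stated assumptions: brand names consist solely of capital letters and single spaces, IDs never contain a colon, and the last word of an ID is not made up of letters only. The text is a shortened version of the example from the edit, and the names brand, pattern, pairs, fields and result are made up for this example.

```r
text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 HIFI: SA 16080 KOMATSU: Z7602BXK06"

# Assumption: a brand is one or more all-caps words separated by single spaces.
brand <- "[A-Z]+(?: [A-Z]+)*"

# One record = brand, colon, then as little text as possible (.+?) up to
# the point where the next " BRAND:" starts or the string ends.
pattern <- paste0(brand, ": .+?(?= ", brand, ":|$)")
pairs <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# Split each record into brand and ID on the literal ": ".
fields <- do.call(rbind, strsplit(pairs, ": ", fixed = TRUE))
result <- data.frame(a = fields[, 1], b = fields[, 2],
                     stringsAsFactors = FALSE)
print(result)
```

An ID whose trailing words are all capital letters would be ambiguous under these assumptions, because those words would be read as the start of the next brand.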


Upvotes: 0
