Reputation: 857
I have a csv file with 6 columns and one of the columns has text separated by comma, e.g., BOLT, RD HD SQ SHORT NECK, METRIC.
When I read this file in R there is overflow from this column and subsequently data moves to a new line.
Below I am pasting few lines
014003051906,ETN5080 ,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F 014003051906,ETN5967 ,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F 014003051906,ETN64267 ,0470,TILT UNIT SENSOR,1.000,F
014003065376,03M7184 ,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G 014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G 014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G
The last two line is where the problem lies. "NUT, METRIC, HEX FLANGE" should come under one variable.
How can this be resolved?
Upvotes: 2
Views: 3800
Reputation: 12819
data <- readLines(con = textConnection("014003051906,ETN5080 ,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F
014003051906,ETN5967 ,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F
014003051906,ETN64267 ,0470,TILT UNIT SENSOR,1.000,F
014003065376,03M7184 ,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G
014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G
014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G"))
pattern <- "^([^,]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*)$"
library(stringr)
str_match(data, pattern)[, - 1]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "014003051906" "ETN5080 " "0450" "BOLT KIT UPPER SHAFT WITH 5 SPEED" "1.000" "F"
# [2,] "014003051906" "ETN5967 " "0460" "SENSOR SENSOR FH BACKSHAFT SPEED" "1.000" "F"
# [3,] "014003051906" "ETN64267 " "0470" "TILT UNIT SENSOR" "1.000" "F"
# [4,] NA NA NA NA NA NA
# [5,] "014003065376" "03M7184 " "0020" "BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc" "4.000" "G"
# [6,] "014003065376" "03M7386 " "0090" "BOLT, RD HD SQ SHORT NECK, METRIC" "18.000" "G"
# [7,] "014003065376" "14M7296 " "0090" "NUT, METRIC, HEX FLANGE" "14.000" "G"
Edit:
Regex explanations for beginners, in plain words so please forgive inaccuracies:
^
and terminal $
mean start and end of string. str_match()
will extract). .
means any character, and .*
means any amount of any characters. [^,]
means any character that is not a comma. When put together, it means : start of string
- substring without a comma
- comma
(repeated 3 times) - substring possibly containing commas
- comma
- substring without a comma
- comma
- substring without a comma
- end of string
, and only the parenthesized groups are extracted.
Upvotes: 11