darkage
darkage

Reputation: 857

Reading a CSV file with commas in a column

I have a csv file with 6 columns and one of the columns has text separated by comma, e.g., BOLT, RD HD SQ SHORT NECK, METRIC.

When I read this file in R there is overflow from this column and subsequently data moves to a new line.

Below I am pasting few lines

014003051906,ETN5080 ,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F 014003051906,ETN5967 ,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F 014003051906,ETN64267 ,0470,TILT UNIT SENSOR,1.000,F

014003065376,03M7184 ,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G 014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G 014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G

The last two line is where the problem lies. "NUT, METRIC, HEX FLANGE" should come under one variable.

How can this be resolved?

Upvotes: 2

Views: 3800

Answers (1)

Aurèle
Aurèle

Reputation: 12819

data <- readLines(con = textConnection("014003051906,ETN5080 ,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F
014003051906,ETN5967 ,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F
014003051906,ETN64267 ,0470,TILT UNIT SENSOR,1.000,F

014003065376,03M7184 ,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G
014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G
014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G"))

pattern <- "^([^,]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*)$"

library(stringr)
str_match(data, pattern)[, - 1]
#      [,1]           [,2]        [,3]   [,4]                                     [,5]     [,6]
# [1,] "014003051906" "ETN5080 "  "0450" "BOLT KIT UPPER SHAFT WITH 5 SPEED"      "1.000"  "F" 
# [2,] "014003051906" "ETN5967 "  "0460" "SENSOR SENSOR FH BACKSHAFT SPEED"       "1.000"  "F" 
# [3,] "014003051906" "ETN64267 " "0470" "TILT UNIT SENSOR"                       "1.000"  "F" 
# [4,] NA             NA          NA     NA                                       NA       NA  
# [5,] "014003065376" "03M7184 "  "0020" "BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc" "4.000"  "G" 
# [6,] "014003065376" "03M7386 "  "0090" "BOLT, RD HD SQ SHORT NECK, METRIC"      "18.000" "G" 
# [7,] "014003065376" "14M7296 "  "0090" "NUT, METRIC, HEX FLANGE"                "14.000" "G" 

Edit:
Regex explanations for beginners, in plain words so please forgive inaccuracies:

  • Initial ^ and terminal $ mean start and end of string.
  • Parens are for grouping (groups that str_match() will extract).
  • . means any character, and .* means any amount of any characters.
  • [^,] means any character that is not a comma.

When put together, it means : start of string - substring without a comma - comma (repeated 3 times) - substring possibly containing commas - comma - substring without a comma - comma - substring without a comma - end of string, and only the parenthesized groups are extracted.

Upvotes: 11

Related Questions