biotech
biotech

Reputation: 727

Parse CSV file with commas inside fields

I just received a file from a colleague and don't know how to parse this:

Input:

key,value1,"value2,hello"

Desired output:

key,value2

Perl or Python are the languages I understand.

Thanks, Bernardo

Upvotes: 1

Views: 449

Answers (5)

Dave Cross
Dave Cross

Reputation: 69264

The standard Perl module Text::ParseWords can be used to handle CSV files.

#!/usr/bin/perl

use strict;
use warnings;

use Text::ParseWords;

while (<DATA>) {
  my @fields = parse_line(',', 0, $_);

  # Do something useful with the data in @fields
  print join ' | ', @fields;
}
__DATA__
key,value1,"value2,hello"

Upvotes: 3

vks
vks

Reputation: 67968

(.*?)\,.*?\"(.*?)\,.*

You can try this.

See demo.

http://regex101.com/r/rI6jZ0/2

Upvotes: 0

magnetometer
magnetometer

Reputation: 656

If you want to use regular expressions for this task, the following should work:

(\S+,)\d+,\"(\d+),\S+\"

(\S+,) is a first capturing group that selects the first key, including the comma. This is followed by some numbers, a comma and a quote \d+,\". The second capturing group (\d+) selects the second value, which is followed by a comma, a string and a quote: ,\D+\"

But as others already wrote, there other solutions that don't involve regular expressions.

Upvotes: 0

asontu
asontu

Reputation: 4659

Best way to do it with regex:

[^,"]+|"(?:[^"]|"")+"

Regular expression visualization

Debuggex Demo

Upvotes: 0

Heinzi
Heinzi

Reputation: 172270

This is valid CSV syntax, so you can just use a CSV parser.

You did not specify which language you are using, but most have a CSV parser readily available in the class library (for example, TextFieldParser in .NET) or as an external component (for example, CSVParser in Apache Commons for Java).

If you do want to re-invent the wheel (which I do not recommend), the algorithm is quite simple:

result = "", inQuotes = false
read next character
if end-of-line:
    if inQuotes:
        throw error (unmatched quotes)
    yield result
    return
else if character = '"':
    invert inQuotes
else if character = ',' and not inQuotes:
    yield result
    result = ""
else:
    result += character

Upvotes: 2

Related Questions