Reputation: 21

Parse csv file with line breaks and additional commas

My current csv file looks like the following:

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, ,"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5555)"  
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave  
Boston, MA 23123  
(67.11111,- 49.5567)"  
131, Kyle, Engineer, 3A-bhbh, , "Barnes St  
San Francisco, CA 34654  
(65.11111, 55.985432)"

In some cases field5 does not have a value. Also, field6 is within quotes and has line breaks. For example: field6 for the first row of data is actually

"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5575)"

I need to write a script to parse this file and return 12.111, 43.557 in place of the current value of field6 such that the final csv file will look like

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, , "12.111, 43.555"  
121, Bob, Teacher, 2A-abcd, 345, "67.111,- 49.556"  
131, Kyle, Engineer, 3A-bhbh, , "65.111, 55.985"

I have looked at cvsparser but my understanding is that it works only if the entire data row is on one line without any line breaks. Also, I cannot simply use a comma to split the rows because there are addresses which have multiple commas within them. Any suggestions on how I could parse this csv file?

Upvotes: 1

Answers (4)

rns

Reputation: 761

For such ‘unstructured csv’ format you can use Marpa::R2, a Perl interface to Marpa, a general BNF parser.

The data can be described in BNF as this ::= that (~ operator defines lexical rules). Parens in ::= rules, e.g. (header [\n]) mean ‘don't include in parse result.’

The parser returns a data structure (array of arrays in [ id, child1, child2 ... ] format), from which data can be extracted.

You can also define semantic actions as Perl sub's in the same or separate package to process the data.

An example script and its output, based on your data, is below.

script:

use 5.010;
use strict;
use warnings;

use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Deepcopy = 1;

use Marpa::R2;

my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),

    :default ::= action => [ name, value]
    lexeme default = action => [ name, value] latm => 1

    csv ::= (header [\n]) lines

    header ::= column+ separator => column_sep
    column_sep ~ ', '
    column ~ 'field' [1-6]

    lines       ::= line+ separator => [\n]
    line        ::= fields1_5 (',') field6
    field_sep   ~ ','
    fields1_5   ::= field1_5+ separator => field_sep
    field1_5    ~ num | word | code
    field6      ~ address

    num ~ [\d]+
    word ~ [A-Za-z]+
    code  ~ num word '-' word
    address ~ '"' address_chars '"'
    address_chars ~ [^\"]+ #"

    :discard ~ space
    space ~ ' '

END_OF_SOURCE
} );

my $input = <<EOI;
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
EOI

say Dumper ${ $g->parse( \$input, { trace_terminals => 0 } ) };

output:

[
  'csv',
  [
    'lines',
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '111'
        ],
        [
          'field1_5',
          'John'
        ],
        [
          'field1_5',
          'Doctor'
        ],
        [
          'field1_5',
          '1A-jrd'
        ]
      ],
      [
        'field6',
        '"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '121'
        ],
        [
          'field1_5',
          'Bob'
        ],
        [
          'field1_5',
          'Teacher'
        ],
        [
          'field1_5',
          '2A-abcd'
        ],
        [
          'field1_5',
          '345'
        ]
      ],
      [
        'field6',
        '"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '131'
        ],
        [
          'field1_5',
          'Kyle'
        ],
        [
          'field1_5',
          'Engineer'
        ],
        [
          'field1_5',
          '3A-bhbh'
        ]
      ],
      [
        'field6',
        '"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"'
      ]
    ]
  ]
]

Upvotes: 1

Sobrique

Reputation: 53498

You need a CSV parser for that sort of data. I would suggest perl and Text::CSV:

Something like this:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV; 

my $csv = Text::CSV -> new( { 'binary' => 1, eol => "\n" } ); 

open ( my $input_fh, '<', "sample.csv" ) or die $!; 

my $header = $csv -> getline ( $input_fh );
$csv -> print ( \*STDOUT, $header );

while ( my $row = $csv -> getline ( $input_fh ) ) { 
    $row -> [5] =~ s,.*\(,\(,ms;
    $csv -> print ( \*STDOUT, $row );
}

Given source data of:

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, ,"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd,345,"Moore Ave  
Boston, MA 23123  
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, ,"Barnes St  
San Francisco, CA 34654  
(65.11111, 55.985432)"

Outputs:

field1," field1"," field3"," field4"," field5"," field6  "
111," John"," Doctor"," 1A-jrd"," ","(12.11111, 43.5555)"
121," Bob"," Teacher"," 2A-abcd",345,"(67.11111,- 49.5567)"
131," Kyle"," Engineer"," 3A-bhbh"," ","(65.11111, 55.985432)"

Hopefully it's clear how you could further modify 'field6' to meet your spec precisely.

Upvotes: 0

hspandher

Reputation: 16753

You can use csv library for this

import csv

with open('myfile.csv') as myfile:
     csv_file = csv.reader(myfile, delimiter = ',')

Now you have the rows, do whatever you wish.

Upvotes: 0

FooTheBar

Reputation: 857

You can't. As you allow commas in field 6, this is a valid file

A,B,C,D,,E a,b,c,d,e,f

And you have no way to find out if this file contains one or two entries, as the field 6 for the first data set can be 'E' or 'E \n a,b,c,d,e,f'

Upvotes: 0

Parse csv file with line breaks and additional commas

Answers (4)

Related Questions