Reputation: 21
My current csv file looks like the following:
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
In some cases field5 does not have a value. Also, field6 is within quotes and has line breaks. For example: field6 for the first row of data is actually
"Tuft St
Peoria, IL 54345
(12.11111, 43.5575)"
I need to write a script to parse this file and return 12.111, 43.557 in place of the current value of field6 such that the final csv file will look like
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, , "12.111, 43.555"
121, Bob, Teacher, 2A-abcd, 345, "67.111,- 49.556"
131, Kyle, Engineer, 3A-bhbh, , "65.111, 55.985"
I have looked at cvsparser but my understanding is that it works only if the entire data row is on one line without any line breaks. Also, I cannot simply use a comma to split the rows because there are addresses which have multiple commas within them. Any suggestions on how I could parse this csv file?
Upvotes: 1
Views: 203
Reputation: 761
For such ‘unstructured csv’ format you can use Marpa::R2, a Perl interface to Marpa, a general BNF parser.
The data can be described in BNF as this ::= that
(~
operator defines lexical rules). Parens in ::=
rules, e.g. (header [\n])
mean ‘don't include in parse result.’
The parser returns a data structure (array of arrays in [ id, child1, child2 ... ]
format), from which data can be extracted.
You can also define semantic actions as Perl sub
's in the same or separate package to process the data.
An example script and its output, based on your data, is below.
script:
use 5.010;
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Deepcopy = 1;
use Marpa::R2;
my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),
:default ::= action => [ name, value]
lexeme default = action => [ name, value] latm => 1
csv ::= (header [\n]) lines
header ::= column+ separator => column_sep
column_sep ~ ', '
column ~ 'field' [1-6]
lines ::= line+ separator => [\n]
line ::= fields1_5 (',') field6
field_sep ~ ','
fields1_5 ::= field1_5+ separator => field_sep
field1_5 ~ num | word | code
field6 ~ address
num ~ [\d]+
word ~ [A-Za-z]+
code ~ num word '-' word
address ~ '"' address_chars '"'
address_chars ~ [^\"]+ #"
:discard ~ space
space ~ ' '
END_OF_SOURCE
} );
my $input = <<EOI;
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
EOI
say Dumper ${ $g->parse( \$input, { trace_terminals => 0 } ) };
output:
[
'csv',
[
'lines',
[
'line',
[
'fields1_5',
[
'field1_5',
'111'
],
[
'field1_5',
'John'
],
[
'field1_5',
'Doctor'
],
[
'field1_5',
'1A-jrd'
]
],
[
'field6',
'"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"'
]
],
[
'line',
[
'fields1_5',
[
'field1_5',
'121'
],
[
'field1_5',
'Bob'
],
[
'field1_5',
'Teacher'
],
[
'field1_5',
'2A-abcd'
],
[
'field1_5',
'345'
]
],
[
'field6',
'"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"'
]
],
[
'line',
[
'fields1_5',
[
'field1_5',
'131'
],
[
'field1_5',
'Kyle'
],
[
'field1_5',
'Engineer'
],
[
'field1_5',
'3A-bhbh'
]
],
[
'field6',
'"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"'
]
]
]
]
Upvotes: 1
Reputation: 53498
You need a CSV parser for that sort of data. I would suggest perl and Text::CSV:
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV -> new( { 'binary' => 1, eol => "\n" } );
open ( my $input_fh, '<', "sample.csv" ) or die $!;
my $header = $csv -> getline ( $input_fh );
$csv -> print ( \*STDOUT, $header );
while ( my $row = $csv -> getline ( $input_fh ) ) {
$row -> [5] =~ s,.*\(,\(,ms;
$csv -> print ( \*STDOUT, $row );
}
Given source data of:
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd,345,"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, ,"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
Outputs:
field1," field1"," field3"," field4"," field5"," field6 "
111," John"," Doctor"," 1A-jrd"," ","(12.11111, 43.5555)"
121," Bob"," Teacher"," 2A-abcd",345,"(67.11111,- 49.5567)"
131," Kyle"," Engineer"," 3A-bhbh"," ","(65.11111, 55.985432)"
Hopefully it's clear how you could further modify 'field6' to meet your spec precisely.
Upvotes: 0
Reputation: 16753
You can use csv
library for this
import csv
with open('myfile.csv') as myfile:
csv_file = csv.reader(myfile, delimiter = ',')
Now you have the rows, do whatever you wish.
Upvotes: 0
Reputation: 857
You can't. As you allow commas in field 6, this is a valid file
A,B,C,D,,E a,b,c,d,e,f
And you have no way to find out if this file contains one or two entries, as the field 6 for the first data set can be 'E' or 'E \n a,b,c,d,e,f'
Upvotes: 0