Tomasz Zieliński

Reputation: 16367

Python regex for reading CSV-like rows

I want to parse incoming CSV-like rows of data. Values are separated by commas (and there may be leading and trailing whitespace around the commas), and can be quoted either with ' or with ". For example, this is a valid row:

    data1, data2  ,"data3'''",  'data4""',,,data5,

but this one is malformed:

    data1, data2, da"ta3", 'data4',

-- quotation marks can only be prepended or trailed by spaces.

Such malformed rows should be recognized - ideally the malformed value would somehow be marked within the row, but it's also acceptable if the regex simply fails to match the whole row.

I'm trying to write a regex able to parse this, using either match() or findall(), but every regex I come up with has problems with some edge cases.

So, maybe someone with experience in parsing something similar could help me with this? (Or maybe this is too complex for a regex and I should just write a function.)

EDIT1:

The csv module is not of much use here:

    >>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
    [['2', ' "dat', 'a1"', " 'dat", "a2'", '']]

    >>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
    [['2', 'dat,a1', "'dat", "a2'", '']]

-- unless this can be tuned?

EDIT2: A few language edits - I hope it's more valid English now

EDIT3: Thank you for all the answers. I'm now pretty sure that a regular expression is not a good idea here, as (1) covering all edge cases can be tricky and (2) the writer's output is not regular. Having written that, I've decided to check out the pyparsing module mentioned in the answers and either use it, or write a custom FSM-like parser.
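
For illustration, here is a minimal sketch of what such a hand-rolled FSM-like parser could look like (my own sketch; parse_row is just a made-up name). It accepts the valid example above and raises an error on the malformed one:

    def parse_row(line):
        """Return a list of values, or raise ValueError for a malformed row."""
        fields, i, n = [], 0, len(line)
        while True:
            while i < n and line[i] in ' \t':        # skip leading whitespace
                i += 1
            if i < n and line[i] in '\'"':           # quoted field
                quote = line[i]
                i += 1
                start = i
                while i < n and line[i] != quote:
                    i += 1
                if i == n:
                    raise ValueError('unterminated quote at column %d' % start)
                value = line[start:i]
                i += 1                                # skip the closing quote
                while i < n and line[i] in ' \t':     # only spaces may follow
                    i += 1
                if i < n and line[i] != ',':
                    raise ValueError('garbage after closing quote at column %d' % i)
            else:                                     # bare (unquoted) field
                start = i
                while i < n and line[i] != ',':
                    if line[i] in '\'"':
                        raise ValueError('stray quote at column %d' % i)
                    i += 1
                value = line[start:i].rstrip()
            fields.append(value)
            if i == n:
                return fields
            i += 1                                    # skip the comma

    print(parse_row(r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""))
    # -> ['data1', 'data2', "data3'''", 'data4""', '', '', 'data5', '']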

Upvotes: 13

Views: 17666

Answers (6)

knipknap

Reputation: 6184

If your goal is to convert the data to XML (or JSON, or YAML), look at this example of a Gelatin syntax that produces the following output:

<xml>
  <line>
    <column>data1</column>
    <column>data2  </column>
    <column>data3'''</column>
    <column>data4""</column>
    <column/>
    <column/>
    <column>data5</column>
    <column/>
  </line>
</xml>

Note that Gelatin also has a Python API:

from Gelatin.util import compile, generate_to_file

# Compile the Gelatin syntax definition, then use it to transform the
# CSV input into the XML output shown above.
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')

Upvotes: 0

John Machin

Reputation: 82992

It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.

It evidently contains rules like the following (a rough code sketch follows the list):

- If a field contains any commas or single quotes, quote it with double quotes.
- Else, if the field contains any double quotes, quote it with single quotes. (Note: the result is still valid if you swap double and single in the above 2 clauses.)
- Else, don't quote it.
- The resultant field may have spaces (or other whitespace?) prepended or appended.
- The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).
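
For concreteness, a writer following those guessed rules might look roughly like this (a hypothetical sketch; write_field and write_row are made-up names, and it ignores the unspecified cases listed next):

import os

def write_field(value):
    # Quote with double quotes if the field contains commas or single quotes,
    # with single quotes if it contains double quotes, otherwise leave it bare.
    if ',' in value or "'" in value:
        return '"%s"' % value
    elif '"' in value:
        return "'%s'" % value
    return value

def write_row(values):
    # Join the fields with commas and terminate with the platform's newline.
    return ','.join(write_field(v) for v in values) + os.linesep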

What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.

You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.

How is your data encoded?

Upvotes: 2

Peter Hansen

Reputation: 22087

Although it would likely be possible with some combination of pre-processing, use of the csv module, post-processing, and regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).

In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...

You can likely find examples of pyparsing-based CSV parsing easily, and this question may be enough to get you started.
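
For example, a pyparsing grammar for the question's format could look roughly like this (my own sketch; the names bare, value and row are mine, and it assumes the classic pyparsing names QuotedString, Optional and delimitedList are available):

from pyparsing import (QuotedString, Regex, Optional, delimitedList,
                       StringEnd, ParseException)

# A field is a double-quoted string, a single-quoted string, a bare value
# (no quotes or commas inside), or empty. Whitespace around fields is
# skipped by pyparsing automatically.
bare = Regex(r"""[^,'"\s](?:[^,'"]*[^,'"\s])?""")
value = Optional(QuotedString('"') | QuotedString("'") | bare, default='')

row = delimitedList(value, delim=',') + StringEnd()

good = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""
bad = r"""data1, data2, da"ta3", 'data4',"""

print(row.parseString(good).asList())
# -> ['data1', 'data2', "data3'''", 'data4""', '', '', 'data5', '']

try:
    row.parseString(bad)
except ParseException as err:
    print('malformed row: %s' % err)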

Upvotes: 7

onaclov2000

Reputation: 5841

This probably sounds too simple, but from the looks of things you are looking for a string that contains [a-zA-Z0-9]["']+[a-zA-Z0-9]. Without in-depth testing against the data, what you're really looking for is a single or double quote (or any combination of the two) in between letters (you could also add numbers there).

Based on what you were asking, it really doesn't matter that it's a CSV; what matters is that you have data that doesn't conform. I believe you can find it just by searching for a letter, then one or more " or ' characters, and then another letter.

Now are you looking to get a "quantity" or just a printout of the line that contains it so you know which ones to go back and fix?

I'm sorry, I don't know Python regexes, but in Perl it would look something like this:

# Look for one or more letters/numbers, then at least one ' or ",
# then one or more letters/numbers
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
    # Prints the line if the above regex is found
    print $line;
}

Just convert that for when you look at each line; a rough Python translation is sketched below.
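
Here is that sketch (the file name 'input.txt' is just a placeholder):

import re

# Look for one or more letters/numbers, at least one ' or ",
# and one or more letters/numbers
pattern = re.compile(r"""[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+""")

with open('input.txt') as f:
    for line in f:
        if pattern.search(line):
            # Print the line if the above regex is found anywhere in it
            print(line.rstrip('\n'))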

I'm sorry if I misunderstood the question

I hope it helps!

Upvotes: 1

Max Shawabkeh

Reputation: 38643

While the csv module is the right answer here, a regex that could do this is quite doable:

import re

r = re.compile(r'''
    \s*                # Any whitespace.
    (                  # Start capturing here.
      [^,"']+?         # Either a series of non-comma non-quote characters.
      |                # OR
      "(?:             # A double-quote followed by a string of characters...
          [^"\\]|\\.   # That are either non-quotes or escaped...
       )*              # ...repeated any number of times.
      "                # Followed by a closing double-quote.
      |                # OR
      '(?:[^'\\]|\\.)*'# Same as above, for single quotes.
    )                  # Done capturing.
    \s*                # Allow arbitrary space before the comma.
    (?:,|$)            # Followed by a comma or the end of a string.
    ''', re.VERBOSE)

line = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""

print r.findall(line)

# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']

EDIT: To validate lines, you can reuse the regex above with small additions:

import re

r_validation = re.compile(r'''
    ^(?:    # Capture from the start.
      # Below is the same regex as above, but condensed.
      # One tiny modification is that it allows empty values:
      # the first plus is replaced by an asterisk.
      \s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
    )*$    # And don't stop until the end.
    ''', re.VERBOSE)

line1 = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""

if r_validation.match(line1):
    print 'Line 1 is valid.'
else:
    print 'Line 1 is INvalid.'

if r_validation.match(line2):
    print 'Line 2 is valid.'
else:
    print 'Line 2 is INvalid.'

# Prints:
#    Line 1 is valid.
#    Line 2 is INvalid.

Upvotes: 12

pwdyson

Reputation: 1177

Python has a standard library module to read csv files:

import csv

reader = csv.reader(open('file.csv'))

for line in reader:
    print line

For your example input this prints

['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']

EDIT:

You need to add skipinitialspace=True to allow spaces before the double quotation marks in the extra examples you provided. Not sure about the single quotes yet.

>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]

>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
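For what it's worth, csv.reader accepts only one quotechar at a time, so single-quoted fields can only be handled by switching that parameter, e.g. on a hypothetical variant of the input that uses single quotes throughout:

>>> list(csv.reader(StringIO("""2,'dat,a1','dat,a2',"""), quotechar="'"))
[['2', 'dat,a1', 'dat,a2', '']]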

Upvotes: 4
