a.s

Reputation: 39

Split csv with regex pattern (repetition?)

I'm new to regex and I have a problem. It looks easy, but whatever I try doesn't work.

I have two lines like these:

aaa,bbb,111,22.3,2021-01-01 4:4:4.444

ccc,ddd,555,66.7,2021-02-02 8:8:8.888

This regex does what I want: (.+),(.+),(.+),(.+),(.+) => 2 matches with 5 groups each

*Match 0 :*

 group 1 = aaa
 
 group 2 = bbb
 
 ...
 
 group 5 = 2021-01-01 4:4:4.444

*Match 1 :*
 
 group 1 = ccc
 
 ...

But with more than 5 "fields" this gets complicated. How can I get the same result with something like (.+),"n repetitions"(.+)? Or with something else? I tried {n} and *, but neither gives the expected result. I also tried some regexes from other posts:

None of the modifications I tested produce the same matches as my first simple regex, (.+),(.+),(.+),(.+),(.+)

Edit: I'll go with a Python solution in the end. Thank you all.

Upvotes: 2

Views: 208

Answers (2)

pho

Reputation: 25489

An easy way to do this would be to create the regex using str.join().

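import re

num_cols = 5

# Build "(.+),(.+),...,(.+)" with one capture group per column
re_str = ','.join(['(.+)'] * num_cols)
rexp = re.compile(re_str)

teststr = """aaa,bbb,111,22.3,2021-01-01 4:4:4.444
ccc,ddd,555,66.7,2021-02-02 8:8:8.888"""

# Returns one tuple of num_cols groups per matching line
rexp.findall(teststr)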

This gives:

[('aaa', 'bbb', '111', '22.3', '2021-01-01 4:4:4.444'),
 ('ccc', 'ddd', '555', '66.7', '2021-02-02 8:8:8.888')]

You can change num_cols to make your regex match any number of columns in your csv.

Keep in mind that this approach will not account for quotes in the CSV, which are supposed to indicate that the commas within the quote are not column separators. If you want good, easy CSV parsing, just use the csv module.
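For comparison, here is a minimal sketch of the csv-module route. The third row with a quoted field is a made-up example (it is not in the question), added only to show how quotes are handled:

```python
import csv
import io

# The first two rows come from the question; the third row is a
# hypothetical addition that contains a quoted field with a comma in it.
teststr = """aaa,bbb,111,22.3,2021-01-01 4:4:4.444
ccc,ddd,555,66.7,2021-02-02 8:8:8.888
"x,y",eee,999,1.5,2021-03-03 1:1:1.111"""

# csv.reader accepts any iterable of lines; io.StringIO wraps the
# string so it can be read like a file.
rows = list(csv.reader(io.StringIO(teststr)))

for row in rows:
    print(row)
```

Each row comes back as a list of strings, and the quoted "x,y" stays a single field. No column count is needed up front.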

Another caveat is that if your text has more than num_cols columns, your matched result will merge them so that you end up with num_cols groups per match. For example, if we have six columns in our teststr but num_cols = 5:

teststr =  """aaa,bbb,111,22.3,2021-01-01 4:4:4.444,123
ccc,ddd,555,66.7,2021-02-02 8:8:8.888,456"""

the code above gives:

[('aaa,bbb', '111', '22.3', '2021-01-01 4:4:4.444', '123'),
 ('ccc,ddd', '555', '66.7', '2021-02-02 8:8:8.888', '456')]
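If the merging is a problem, one possible variation (a sketch, not part of the code above) is to replace .+ with a character class that cannot cross a comma and to require each line to match in full. The names row_re, good and bad are made up for this illustration, and the test string mixes a six-column line with a five-column one:

```python
import re

num_cols = 5

# [^,\n]+ cannot match a comma, so a group can never swallow two columns.
row_re = re.compile(','.join([r'([^,\n]+)'] * num_cols))

# Hypothetical input: the first line has six columns, the second has five.
teststr = """aaa,bbb,111,22.3,2021-01-01 4:4:4.444,123
ccc,ddd,555,66.7,2021-02-02 8:8:8.888"""

good, bad = [], []
for line in teststr.splitlines():
    # fullmatch succeeds only if the whole line is exactly num_cols fields
    m = row_re.fullmatch(line)
    if m:
        good.append(m.groups())
    else:
        bad.append(line)

print(good)
print(bad)
```

With this variation the six-column line ends up in bad instead of being silently merged. Like the original, it still does not understand quoted fields.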

Upvotes: 1

Eikeike

Reputation: 23

You could try this:

([^,]+),?

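It matches a run of characters that contains no comma, optionally followed by a comma. Applied repeatedly (for example with a find-all function), it gives one match per field, however many fields a line has.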
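A short sketch of how this pattern might be used from Python, assuming re.findall and the sample lines from the question (field_re and rows are names chosen for this illustration):

```python
import re

field_re = re.compile(r'([^,]+),?')

teststr = """aaa,bbb,111,22.3,2021-01-01 4:4:4.444
ccc,ddd,555,66.7,2021-02-02 8:8:8.888"""

# Split into lines first: [^,] also matches a newline, so running the
# pattern on the whole string would glue the last field of one line to
# the first field of the next.
rows = [field_re.findall(line) for line in teststr.splitlines()]

for row in rows:
    print(row)
```

With a single capture group, findall returns a list of field strings per line. Note that empty fields (two commas in a row) are skipped by this pattern, and quoted commas are not handled.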

Upvotes: 0
