Subhasis Das

Reputation: 1677

Extracting string between quotes split across multiple lines in Python

I have a file containing multiple entries. Each entry is of the following form:

"field1","field2","field3","field4","field5"

All of the fields are guaranteed not to contain any quotes; however, they can contain commas. The problem is that field4 can be split across multiple lines. So an example file can look like:

"john","male US","done","Some sample text
across multiple lines. There
can be many lines of this","foo bar baz"
"jane","female UK","done","fields can have , in them","abc xyz"

I want to extract the fields using Python. If the field were not split across multiple lines, this would be simple: see "Extract string from between quotations". But I can't seem to find a simple way to do this in the presence of multiline fields.

EDIT: There are actually five fields. Sorry about the confusion if any. The question has been edited to reflect this.

Upvotes: 5

Views: 2352

Answers (4)

blakev

Reputation: 4469

If you control the input to this file, sanitize it beforehand by replacing each \n with a placeholder (for example, the literal text [\n]) before putting the values into a comma-separated list.

Or, instead of saving the strings as they are, save them in raw form (as with r-strings), so that a line break is written as the two characters \n.

Then use the csv module to parse the file quickly with a predefined delimiter, encoding, and quotechar.
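As a rough sketch of the csv step with the delimiter and quotechar spelled out: the sample text is copied from the question, and the names SAMPLE and parse_entries are made up for illustration. It reads from an in-memory string; a real script would open the file with newline='' instead.

```python
import csv
import io

# Sample data copied from the question (hypothetical stand-in for the real file).
SAMPLE = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)


def parse_entries(stream):
    """Return a list of rows, each row being a list of field strings."""
    # delimiter and quotechar are passed explicitly here; these values are
    # also the csv module's defaults.
    reader = csv.reader(stream, delimiter=",", quotechar='"')
    return [row for row in reader]


# newline="" keeps the line breaks untouched so csv can handle them itself.
rows = parse_entries(io.StringIO(SAMPLE, newline=""))
for row in rows:
    print(row)
```

With this input, the line breaks inside the fourth field of the first entry are kept as part of that field, so no sanitizing pass is needed unless you want the field on one line.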

Upvotes: 0

Birei

Reputation: 36282

I think that the csv module can solve this problem. It correctly handles newlines inside quoted fields:

import csv

# newline='' lets the csv module handle line breaks inside quoted fields itself
with open('infile', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        for field in row:
            print('-- {}'.format(field))

It yields:

-- john
-- male US
-- done
-- Some sample text
across multiple lines. There
can be many lines of this
-- foo bar baz
-- jane
-- female UK
-- done
-- fields can have , in them
-- abc xyz

Upvotes: 6

Mark R. Wilkins

Reputation: 1302

The answer from the question you linked worked for me:

import re

with open("test.txt") as f:
    text = f.read()

# Capture everything between each pair of double quotes, including newlines
string_list = re.findall(r'"([^"]*)"', text)

At this point, string_list contains your strings. These strings can still have line breaks in them, but you can use

new_strings = [s.replace("\n", " ") for s in string_list]

to clean that up.
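The regex gives one flat list of strings, so the entries still have to be rebuilt from it. A minimal sketch, assuming every entry has exactly five quoted fields as the question states; SAMPLE, FIELDS_PER_ENTRY and extract_entries are illustrative names, not part of the original answer.

```python
import re

# Sample data copied from the question (hypothetical stand-in for the real file).
SAMPLE = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)

FIELDS_PER_ENTRY = 5  # assumption taken from the question's description


def extract_entries(text):
    """Find all quoted strings, flatten line breaks, and group them by entry."""
    # [^"]* also matches newlines, so multiline fields are captured whole.
    fields = re.findall(r'"([^"]*)"', text)
    cleaned = [field.replace("\n", " ") for field in fields]
    # Slice the flat list into consecutive chunks of five fields.
    return [
        cleaned[i:i + FIELDS_PER_ENTRY]
        for i in range(0, len(cleaned), FIELDS_PER_ENTRY)
    ]


entries = extract_entries(SAMPLE)
for entry in entries:
    print(entry)
```

This relies on the guarantee that fields never contain quotes; if an entry had a missing or unquoted field, the grouping would drift, which is one reason the csv module may be the safer choice.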

Upvotes: 1

Vivek

Reputation: 920

Try:

awk -F',' '/pattern if needed/ {print $0}' fname

Upvotes: 0
