Reputation: 1677
I have a file containing multiple entries. Each entry is of the following form:
"field1","field2","field3","field4","field5"
All of the fields are guaranteed not to contain any quotes; however, they can contain commas (,). The problem is that field4 can be split across multiple lines. So an example file can look like:
"john","male US","done","Some sample text
across multiple lines. There
can be many lines of this","foo bar baz"
"jane","female UK","done","fields can have , in them","abc xyz"
I want to extract the fields using Python. If the field were not split across multiple lines, this would be simple: see "Extract string from between quotations". But I can't find a simple way to do this in the presence of multiline fields.
EDIT: There are actually five fields. Sorry about the confusion if any. The question has been edited to reflect this.
Upvotes: 5
Views: 2352
Reputation: 4469
If you control the input to this file, sanitize it beforehand by replacing each \n with a placeholder (for example [\n]) before putting the values into a comma-separated list.
Alternatively, store the strings with their newlines escaped (as the two characters \n, the way they would appear in an r-string literal) instead of as real line breaks.
Then use the csv module to parse the file quickly with predefined separators, encoding, and quotechar.
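A minimal sketch of the sanitizing idea, assuming you really do control the code that writes the file. The `[\n]` placeholder and the helper names `write_sanitized` and `read_restored` are made up for illustration; any marker that cannot occur in the real data would work.

```python
import csv
import io

PLACEHOLDER = "[\\n]"  # hypothetical marker; must never occur in real field values


def write_sanitized(rows, out):
    """Write rows as fully quoted CSV with embedded newlines replaced by PLACEHOLDER."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow([field.replace("\n", PLACEHOLDER) for field in row])


def read_restored(src):
    """Read the sanitized CSV back and restore the original newlines."""
    return [[field.replace(PLACEHOLDER, "\n") for field in row]
            for row in csv.reader(src)]


rows = [
    ["john", "male US", "done", "Some sample text\nacross multiple lines.", "foo bar baz"],
    ["jane", "female UK", "done", "fields can have , in them", "abc xyz"],
]

# An in-memory buffer stands in for the real file here.
buffer = io.StringIO()
write_sanitized(rows, buffer)
sanitized_text = buffer.getvalue()  # one physical line per entry

restored = read_restored(io.StringIO(sanitized_text))
```

With the newlines replaced, every entry occupies exactly one physical line, so even line-oriented tools can process the file.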
Upvotes: 0
Reputation: 36282
I think that the csv module can solve this problem. It correctly handles newlines inside quoted fields:
import csv

# newline='' lets the csv module handle line breaks inside quoted fields itself
with open('infile', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        for field in row:
            print('-- {}'.format(field))
It yields:
-- john
-- male US
-- done
-- Some sample text
across multiple lines. There
can be many lines of this
-- foo bar baz
-- jane
-- female UK
-- done
-- fields can have , in them
-- abc xyz
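To work with whole entries rather than printing field by field, each row can be unpacked into five variables. This is a small sketch using the sample data from the question held in memory; the variable names `name`, `info`, `status`, `text`, and `tag` are invented here, since the question only calls them field1 to field5.

```python
import csv
import io

# Sample data from the question, held in memory instead of a file.
sample = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)

# newline="" keeps the embedded line breaks untouched, as it would for open().
records = list(csv.reader(io.StringIO(sample, newline="")))

# Each record is a list of exactly five strings (hypothetical names below).
for name, info, status, text, tag in records:
    print(name, "->", repr(text))
```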
Upvotes: 6
Reputation: 1302
The answer from the question you linked worked for me:
import re

with open("test.txt") as f:
    text = f.read()
# Capture everything between each pair of double quotes, including line breaks
string_list = re.findall(r'"([^"]*)"', text)
At this point, string_list contains your strings. These strings can still have line breaks in them, but you can use
new_string_list = [s.replace("\n", " ") for s in string_list]
to clean that up.
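Note that the regex approach produces one flat list of fields, not one list per entry. Since the question states that every entry has exactly five fields, the flat list can be sliced into groups of five. This is only a sketch: it relies on the guarantee that no field contains a quote, and the constant `FIELDS_PER_ENTRY` is a name chosen for this example.

```python
import re

FIELDS_PER_ENTRY = 5  # from the question: every entry has five fields

text = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines.","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)

# Every quoted run becomes one field; [^"]* also matches line breaks.
fields = re.findall(r'"([^"]*)"', text)

# Slice the flat list into consecutive groups of five fields.
entries = [
    fields[i:i + FIELDS_PER_ENTRY]
    for i in range(0, len(fields), FIELDS_PER_ENTRY)
]
```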
Upvotes: 1