Python script using regex (re) to remove extra newlines

Question

I have a tab-delimited text file in which some values may contain newlines, like this:

        col1    col2    col3

row1    val1    "Some text
containing newlines. Yup, possibly
more than one..."        val3
row2    val4    "val5"    val6

Note:

Any text value that contains newlines is guaranteed to be enclosed in double quotes in the input.
The number of rows and columns may vary.
Any value may be text or a number, and may or may not contain newlines.

I am trying to write a small Python script using re in order to:

get rid of the extra newlines inside values (but preserve the ones at the end of each row)
enclose every single value in double quotes

It would be great to have it in a form like this:

def normalize_format(data, delimiter='\t'):
    data = re.sub(_DESIRED_REGEX_, r'"\1"', data)
    return data

where data is the entire file contents as a single string and _DESIRED_REGEX_ is the pattern I would like help figuring out.

Using re is not mandatory, but a short and elegant solution would be appreciated :)

Tim Pietzcker · Accepted Answer

You should be using the csv module instead:

import csv

# Python 3: open both files in text mode with newline="" so the csv module
# handles newlines embedded in quoted fields itself (Python 2 used "rb"/"wb").
with open("mycsv.csv", newline="") as infile, open("newcsv.csv", "w", newline="") as outfile:
    reader = csv.reader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t", quoting=csv.QUOTE_ALL)
    # Now you can remove all the newlines within fields
    # and write them back to a new CSV file:
    for row in reader:
        writer.writerow([field.replace("\n", " ") for field in row])

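If you want this in the function form from the question, taking the whole file contents as one string, the same csv approach can be wrapped around in-memory buffers. This is only a minimal sketch, assuming Python 3. The use of io.StringIO and the lineterminator="\n" setting are my assumptions and not part of the answer above; the name normalize_format and its parameters come from the question.

```python
import csv
import io


def normalize_format(data, delimiter="\t"):
    """Replace newlines inside values with spaces and quote every value."""
    # newline="" keeps StringIO from translating line endings, so the csv
    # module sees embedded newlines exactly as they appear in the input.
    reader = csv.reader(io.StringIO(data, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    # Assumption: rows should end with "\n" (the csv default is "\r\n").
    writer = csv.writer(out, delimiter=delimiter, quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    for row in reader:
        # Same per-field cleanup as in the file-based version.
        writer.writerow([field.replace("\n", " ") for field in row])
    return out.getvalue()
```

Called on the sample rows from the question, this should return each row on a single line with every value wrapped in double quotes. No regex is involved; the csv reader does the work of deciding which newlines belong to a quoted value and which ones end a row.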