Eric Gopak

Reputation: 1843

Python script using regex (re) to remove extra newlines

I have a tab-delimited text file that may have some values containing newlines, like this:

        col1    col2    col3

row1    val1    "Some text
containing newlines. Yup, possibly
more than one..."        val3
row2    val4    "val5"    val6

Note:

I am trying to write a small Python script using re in order to:

It would be great to have it in a form like this:

import re

def normalize_format(data, delimiter='\t'):
    # _DESIRED_REGEX_ is a placeholder for the pattern I am looking for
    data = re.sub(_DESIRED_REGEX_, r'"\1"', data)
    return data

Here data is the entire file contents as a single string, and _DESIRED_REGEX_ is the pattern I would like help figuring out.

Using re is not mandatory, but a short and elegant solution would be appreciated :)

Upvotes: 2

Views: 159

Answers (1)

Tim Pietzcker

Reputation: 336108

You should use the csv module instead. It already understands quoted fields that span several lines:

import csv

# Python 3: open in text mode with newline="" so the csv module handles
# line endings itself (on Python 2 the modes were "rb" and "wb").
with open("mycsv.csv", newline="") as infile, open("newcsv.csv", "w", newline="") as outfile:
    reader = csv.reader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t", quoting=csv.QUOTE_ALL)
    # Replace every newline inside a field with a space
    # and write the rows to a new file:
    for row in reader:
        writer.writerow([field.replace("\n", " ") for field in row])
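
If you want this in the function form from the question, where data is the whole file as one string, a minimal sketch could wrap the same csv calls around io.StringIO. This assumes Python 3 and that the file fits in memory; the lineterminator="\n" setting is my own choice here (the csv default is "\r\n"), so adjust it to whatever your files need.

```python
import csv
import io


def normalize_format(data, delimiter="\t"):
    """Return data with newlines inside quoted fields replaced by spaces."""
    # The csv reader keeps reading lines while it is inside a quoted field,
    # so a multi-line value arrives as a single field.
    reader = csv.reader(io.StringIO(data), delimiter=delimiter)
    out = io.StringIO()
    # lineterminator="\n" is an assumption for this sketch, not a requirement.
    writer = csv.writer(out, delimiter=delimiter, quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    for row in reader:
        writer.writerow([field.replace("\n", " ") for field in row])
    return out.getvalue()
```

As in the file-based version, every field comes back quoted because of csv.QUOTE_ALL; switch to csv.QUOTE_MINIMAL if you only want quotes where they are needed.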

Upvotes: 2
