gen_Eric
gen_Eric

Reputation: 227310

Fix CSV file with new lines

I ran a query on a MS SQL database using SQL Server Management Studio, and some the fields contained new lines. I selected to save the result as a csv, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.

Some of these fields with new lines are wrapped in quotes, but some aren't, I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they only contain one new line, thanks Microsoft, that's useful).

When I try to open this CSV in Excel, some of the rows are wrong because of the new lines, it thinks that one row is two rows.

How can I fix this?

I was thinking I could use a regex. Maybe something like:

/,[^,]*\n[^,]*,/

Problem with this is it matches the last element of one line and the 1st of the next line.

Here is an example csv that demonstrates the issue:

field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7

8

9",10,11
a,b,c,d,e

Upvotes: 0

Views: 1350

Answers (2)

Alan Moore
Alan Moore

Reputation: 75252

A simple regex replacement won't work, but here's a solution based on preg_replace_callback:

function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}

$row_regex = '~^(?:(?:(?:"[^"*]")+|[^,]*)(?:,|$)){5}$~m';

$result=preg_replace_callback($row_regex, 'add_quotes', $source);

The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.

If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:

preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);

$row_regex = '~^(?:(?:(?:"[^"*]")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';

Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)

See an online demo

Upvotes: 2

srini.venigalla
srini.venigalla

Reputation: 5145

If you want a java programmatic solution, open the file using the OpenCSV library. If it is a manual operation, then open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a perl command to cleanup the CRLFs.

Upvotes: 0

Related Questions