Ben

Reputation: 149

Solving "error: literal newline found in data" in Postgres?

I have a data set with 32 million tuples, and I'm using COPY FROM to copy those tuples into a table. Within the first 7 million or so there are 3 improperly-formatted tuples. In these cases, instead of the next record being below the current record, it is instead appended to the end of the current record on the same line. So instead of

record1
record2

it was

record1record2

I fixed this by navigating to where the line breaks should have been and pressing "Enter" once per instance to insert an ordinary newline. Once I fixed those lines, COPY read them in completely fine. To me, this implied that "Enter" was a valid way to insert a newline, since COPY accepted those "Enter"-generated newline characters without complaint.

However, later on when I passed the 7 millionth tuple, I encountered:

ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
Context: COPY time_raw, line 7308000

SQL State: 22P04

I looked in the data at that record, and found that it was no different from the one above or below it. Just to make sure, I pressed "Backspace" at the beginning of record 7308000 to move it up one line to the end of record 7307999, just like the above format record1record2. Then I pressed "Enter" to ensure that the newline character that started record 7308000 would be identical to the character that COPY had previously taken in with no trouble.

In that same vein, the newline character that ended record 7307999 was the exact same newline character I used to separate the earlier incorrectly-formatted records (once again, COPY took this newline character in with no problems). To cover my bases, I did the same to record 7308001, making sure that the newline that ended record 7308000 would be identical to the newline that I typed earlier, which COPY accepted.

However, upon saving and trying to COPY from the file again, I got:

ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
Context: COPY time_raw, line 7307999

So apparently the error had moved up, and the newline character that COPY previously took in suddenly became invalid. Once again, I looked in the data and moved the beginning of 7307999 up to the end of 7307998, and then pressed "Enter" to insert a newline character that COPY had been recognizing as valid for over 7 million records. So at this point, I am confident that the end of 7307998 is a newline character that is supposed to have been proven to work in COPY earlier on. I run the query again and now I get:

ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
Context: COPY time_raw, line 7307998

The error has moved up again. To my knowledge, I am just putting in the same "Enter" newline characters that I did earlier, and for some reason past 7308000 COPY reads them as invalid.

These are lines 7307996-7308000 of the data set (56 columns, most of them irrelevant here):

2012-02-23T13:10:03.1769237+00:00   9863996 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.1869189+00:00   9863997 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.1969230+00:00   9863998 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.2069124+00:00   9863999 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.2169261+00:00   9864000 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

I added one "Enter" in the stackoverflow editor after each record in order to display them one after another, but the point is that the relevant middle line (7307998) is formatted exactly like the others, and I made sure that it was surrounded by newlines that were already shown to be valid characters by the COPY function. Every time I try to ensure that the newlines surrounding a record are the same as the ones I used earlier, the error goes up a line. I have no idea what to do in this situation.

My query is:

copy time_raw from E'C:\\Users\\bozon92\\Documents\\YorkU\\Summer 2016\\4080 Project\\Other Files\\allData.txt';

I have tried appending "with (format csv, delimiter '\t')", but that tells me the COPY delimiter must be a single one-byte character. Using "with (format csv, delimiter ' ')" instead (' ' being a literal tab character rather than \t), I get the same kind of error, just worded slightly differently:

ERROR:  unquoted newline found in data
HINT:  Use quoted CSV field to represent newline.
CONTEXT:  COPY time_raw, line 7307998

It's just "unquoted" instead of "literal".

As a note, I have been told that I can just cut the data at 7 million tuples, which I will probably do eventually, but I want to know why this happened so I can avoid it in the future. Those particular records look completely fine and have exactly the same format as the records before and after them, so how can I deal with this literal newline issue? I have no idea how to approach it because I can't find even a trace of what is wrong.
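For anyone trying to reproduce this: a dump of the raw bytes around the suspect line makes invisible differences (\r versus \n, stray NUL bytes) show up immediately. A minimal sketch in Python; the path and line number are placeholders for the real file:

```python
# Print the trailing bytes of each line around a 1-based line number,
# so invisible characters (\r, \n, \x00) become visible via repr().
def show_line_endings(path, target_line, context=2):
    with open(path, "rb") as f:                  # binary mode: no decoding
        for lineno, raw in enumerate(f, start=1):
            if abs(lineno - target_line) <= context:
                print(lineno, repr(raw[-10:]))   # last few bytes + terminator
            if lineno > target_line + context:
                break

# show_line_endings("allData.txt", 7308000)
```

If one of the printed lines ends in b'\r\n' while the others end in b'\n' (or vice versa), that mismatch is exactly what COPY is complaining about.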

Upvotes: 3

Views: 13240

Answers (5)

dario

Reputation: 141

On Linux, I solved this with these three commands:

    sed -i 's/\//g' FILENAME
    sed -i -e 's/\r//g' FILENAME
    sed -i 's/\x00//g' FILENAME
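For reference, the second and third commands strip carriage returns (\r) and NUL bytes, which are the usual culprits behind COPY newline errors. A small sketch (sample filename is mine, not the answer's) showing the effect and how to verify that no \r bytes remain:

```shell
# Build a small sample with Windows-style CRLF line endings (illustration only)
printf 'record1\r\nrecord2\r\n' > sample.txt

# Strip the carriage returns in place, as in the answer above (GNU sed)
sed -i 's/\r//g' sample.txt

# Show the remaining bytes; there should be no \r left
od -c sample.txt
```

Note that -i edits the file in place, so keep a backup of the original data first.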

Upvotes: 1

BJW

Reputation: 105

Using Atom or a similar text editor, use the regex search to replace [\r\n]+ with \n. It may lock up on files with more than 1M rows, though.
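For a file too large for an editor, the same cleanup can be done as a stream so memory use stays constant. A sketch in Python (file names are placeholders); unlike the [\r\n]+ regex above, this keeps blank lines instead of collapsing them:

```python
# Rewrite every line terminator (\n, \r\n, or a line's trailing \r run)
# as a plain \n, reading one line at a time so the whole file never
# has to fit in memory.
def normalize_newlines(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:                        # binary mode: splits on b"\n"
            dst.write(line.rstrip(b"\r\n") + b"\n")
```

This also guarantees the last record ends with a newline, which editors sometimes drop.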

Upvotes: 0

Your dump might end with a bare \r (0x0D) instead of \r\n (0x0D 0x0A); that was my case, and it causes this error. Make sure it doesn't.
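A quick way to check for this case is to look at the file's final bytes (the sample file here just simulates the broken case):

```shell
# Simulate the broken case: a dump whose last record ends in a bare \r
printf 'record1\r' > sample.txt

# Show the final two bytes; a clean dump ends in \n (or \r \n),
# the broken case described above ends in a bare \r
tail -c 2 sample.txt | od -c
```

If the od output shows \r as the last byte, trim it or append the missing \n before running COPY.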

Upvotes: 0

gdexlab

Reputation: 199

If you have access to the source system that pulls the tuples, you have the option of replacing all newline characters when you pull the data:

SELECT regexp_replace(input_field, '[\n\r]+', ' ', 'g')

This puts a space in place of each run of end-of-line characters. The 'g' flag makes the replacement global; without it, regexp_replace only replaces the first match. This fixed the ERROR: literal newline found in data for my situation.

Upvotes: 2

mvp

Reputation: 116187

According to this code snippet from the PostgreSQL source, copy.c:

 /* Process \n */
 if (c == '\n' && (!cstate->csv_mode || !in_quote))
 {
     if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
         ereport(ERROR,
             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
             !cstate->csv_mode ?
             errmsg("literal newline found in data") :
             errmsg("unquoted newline found in data"),
             !cstate->csv_mode ?
             errhint("Use \"\\n\" to represent newline.") :
             errhint("Use quoted CSV field to represent newline.")));
     cstate->eol_type = EOL_NL;      /* in case not set yet */
     /* If reach here, we have found the line terminator */
     break;
 }

It means that COPY locks onto the line-ending style (LF, CR, or CRLF) of the first line it reads, and raises this error whenever it later finds a bare 0x0A byte that doesn't fit that style; e.g. your data contains "abcNxyz", where N is actually a byte with value 0x0A.

The solution is to escape such embedded newlines, i.e. use the string "abc\nxyz" instead. You should be able to find all spurious newlines and replace them with \n using some script, perhaps in Python or Perl.
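One hedged sketch of such a script, assuming the file's real records end in CRLF (which the eol_type check in the quoted code implies when this error fires); the file names and that assumption are mine, not the answer's:

```python
import re

# Any LF not preceded by CR is assumed to be a stray byte inside a field;
# rewrite it as the two characters backslash + n, which COPY's text
# format reads as an escaped newline. Real record terminators (CRLF)
# are left untouched.
def escape_stray_lf(src_path, dst_path):
    with open(src_path, "rb") as f:
        data = f.read()              # fine for a few GB; stream if larger
    fixed = re.sub(rb"(?<!\r)\n", rb"\\n", data)
    with open(dst_path, "wb") as f:
        f.write(fixed)
```

If the file instead uses plain LF endings, the same idea applies with the roles reversed: find the bytes that break the dominant pattern and escape those.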

Upvotes: 4
