Luis

Reputation: 3497

Read csv files with newline characters between columns

I read and parse CSV files with pandas; the columns are separated by a semicolon ;. Some of the files contain (for unknown reasons) a '\r\n' sequence after some columns, which makes pandas.read_csv split one record into two rows. I want to escape those characters and append the "second row" to the "first" one; otherwise the parsing becomes difficult afterwards.

I can identify those stray newlines because they are followed by plain numbers, whereas a correct first column contains a time like 00:00:00. Is it possible to do that with pandas.read_csv?


Example

With a correct file my code would look like this:

data = io.StringIO( ''' a; b; c; d 
                    x10; 20; 30; 40
                    x11; 21; 31; 41
                    x12; 22; 32; 42
                    x13; 23; 33; 43
                    x14; 24; 34; 44
                    x15; 25; 35; 45
                 ''' )

pd.read_csv( data, sep=';' )

Output:

     a   b   c   d
0  x10  20  30  40
1  x11  21  31  41
2  x12  22  32  42
3  x13  23  33  43
4  x14  24  34  44
5  x15  25  35  45

Problem

With a damaged file, it looks like this:

data = io.StringIO( ''' a; b; c; d 
                        x10; 20; 30; 40
                        x11; 21; 31; 41
                        x12; 22; 
                        32; 42
                        x13; 23; 33; 43
                        x14; 24; 34; 44
                        x15; 25; 35; 45
                     ''' )

pd.read_csv( data, sep=';' )

Output:

     a   b    c     d
0  x10  20   30  40.0
1  x11  21   31  41.0
2  x12  22        NaN
3   32  42  NaN   NaN
4  x13  23   33  43.0
5  x14  24   34  44.0
6  x15  25   35  45.0

The expected output, however, is the first one in both cases. In this example, I would think of replacing \r\n\d\d with just the \d\d part, to get rid of those newline characters before or while constructing the DataFrame in pandas.
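A minimal sketch of that idea, assuming the raw text is already available as a string (the sample string below is made up for illustration):

import re

raw = ' a; b; c; d\r\nx12; 22; \r\n32; 42\r\nx13; 23; 33; 43'
# Replace '\r\n' followed by two digits with just the digits,
# dropping the stray line break but keeping the numbers
fixed = re.sub( r'\r\n(\d\d)', r'\1', raw )
print( fixed )
#  a; b; c; d
# x12; 22; 32; 42
# x13; 23; 33; 43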

If possible, I would like to avoid fixing the files first and to avoid writing an extra script just to check all of them before reading with pandas, because new files arrive regularly.


Is it possible to replace part of a string when reading csv files in pandas?

Is there another approach to this kind of problem?


Using Python 3.6.8, pandas 0.24.2

Upvotes: 0

Views: 1886

Answers (2)

Luis

Reputation: 3497

I took parts of ivallesp's answer and came up with a solution that keeps (re-joins) the broken lines instead of dropping them.

I post it here as documentation for my future self (who often forgets such things) and also for others who might come across similar problems.


Bad file, with broken lines

import io
import re

import pandas as pd

infile = io.StringIO( ''' a; b; c; d 
                        x10; 20; 30; 40
                        x11; 21; 31; 41
                        x12; 22; 
                        32; 42
                        x13; 23; 33; 43
                        x14; 24; 34; 44
                        x15; 25; 35; 45
                     ''' )

# The lines are joined with '\n', with surrounding whitespace stripped
data = '\n'.join( [ item.strip() for item in infile ] )
# Now data is not a file stream, but a single string with \n s in between

# Remove every newline that is NOT followed by a valid row start
# (an 'x' plus two digits); this re-joins the broken lines
data = re.sub( r'\n(?!x\d\d)', '', data )

# Now data is a file stream again
data = io.StringIO( data )

# Fed to pandas.read_csv
pd.read_csv( data, sep=';' )
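As a quick illustration of the negative lookahead (a made-up one-liner, not part of the original files): only newlines that do not start a valid row are removed.

import re

s = 'x12; 22; \n32; 42\nx13; 23; 33; 43'
# The first newline is followed by '32' (not x\d\d) and gets removed;
# the second is followed by 'x13' and is kept
print( re.sub( r'\n(?!x\d\d)', '', s ) )
# x12; 22; 32; 42
# x13; 23; 33; 43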

Variation

With an actual file on disk (not io.StringIO) I had to make a small modification and delete the .strip(); I am not sure why. Other than that, it also works when joining on nothing (''.join(...)), since lines read from a file in text mode already end with '\n'.

Lastly, my actual file has time in the first column, in the form of 00:00, 00:05 and so on. So I actually used it like this:

import io
import re

import pandas as pd

with open( 'broken_rows_file.csv', 'r' ) as infile:
    data = ''.join( [ item for item in infile ] )

# Every newline NOT followed by a time like ##:## gets removed
data = re.sub( r'\n(?!\d\d:\d\d)', '', data )
data = io.StringIO( data )

df = pd.read_csv( data, sep=';' )
df

       a    b   c   d
0   00:10   20  30  40
1   00:11   21  31  41
2   00:12   22  32  42
3   00:13   23  33  43
4   00:14   24  34  44
5   00:15   25  35  45

Upvotes: 0

ivallesp

Reputation: 2212

That's something that happens commonly with big CSVs. The way I would solve it is to read the file with Python, check that the number of separators in each row matches what you expect, and otherwise remove the row. Once the raw data is corrected, you can load it into pandas using StringIO. An example based on your broken example:

import io

import pandas as pd

# We load the file (filepath points to the broken csv)
filestream = open(filepath)

# Now we keep only the rows with the expected number of separators
# (3 semicolons for 4 columns); broken rows are removed
data = filter(lambda l: l.count(";") == 3, filestream)

# Now we convert back to StringIO; each kept line still ends with
# '\n', so we join on the empty string
stream = io.StringIO("".join(data))

# And finally we read with pandas
pd.read_csv(stream, sep=';')
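Note that, unlike re-joining, this filter drops the damaged rows entirely (in the example above, both the x12 line and the stray 32; 42 continuation disappear), so it trades some data loss for simplicity.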

Upvotes: 1
