Reputation: 3497
I read and parse CSV files using pandas, with columns separated by a semicolon (`;`). Some of the files have (for unknown reasons) a '\r\n' sequence after some columns, which makes pandas.read_csv split one record into two rows. I want to get rid of those characters and append the "second row" back onto the "first" one, otherwise the parsing gets difficult afterwards. I can identify the spilled-over lines because they start with numbers, while a correct first column contains a time like 00:00:00. Is it possible to do that with pandas.read_csv?
Example
With a correct file my code would look like this:
data = io.StringIO( ''' a; b; c; d
x10; 20; 30; 40
x11; 21; 31; 41
x12; 22; 32; 42
x13; 23; 33; 43
x14; 24; 34; 44
x15; 25; 35; 45
''' )
pd.read_csv( data, sep=';' )
Output:
     a   b   c   d
0  x10  20  30  40
1  x11  21  31  41
2  x12  22  32  42
3  x13  23  33  43
4  x14  24  34  44
5  x15  25  35  45
Problem
With a damaged file, it looks like this:
data = io.StringIO( ''' a; b; c; d
x10; 20; 30; 40
x11; 21; 31; 41
x12; 22;
32; 42
x13; 23; 33; 43
x14; 24; 34; 44
x15; 25; 35; 45
''' )
pd.read_csv( data, sep=';' )
Output:
     a   b     c     d
0  x10  20  30.0  40.0
1  x11  21  31.0  41.0
2  x12  22   NaN   NaN
3   32  42   NaN   NaN
4  x13  23  33.0  43.0
5  x14  24  34.0  44.0
6  x15  25  35.0  45.0
The expected output, however, is the first one in both cases. In this example, I would think to replace a \r\n that is followed by two digits, i.e. substitute \r\n(\d\d) with \1, to get rid of those newline characters before/while constructing the DataFrame in pandas.
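A minimal sketch of that substitution (an assumption on my part: the broken continuation lines never themselves start with 'x' plus two digits, so a negative lookahead can tell them apart; the lookahead only checks and does not consume, so an empty replacement keeps the digits):

```python
import io
import re

import pandas as pd

# Damaged sample: the 'x12' row is split across two lines
raw = '''a; b; c; d
x10; 20; 30; 40
x11; 21; 31; 41
x12; 22;
32; 42
x13; 23; 33; 43
'''

# Remove every newline that is NOT followed by the start of a real row
fixed = re.sub(r'\n(?!x\d\d)', '', raw)

df = pd.read_csv(io.StringIO(fixed), sep=';')
```

The broken `x12` fragment is rejoined onto its row, so the DataFrame comes out with four complete rows.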
If possible, I would like to avoid fixing the files beforehand, and avoid writing an extra script just to check all of them before reading with pandas, because new files arrive regularly. Is it possible to replace part of a string while reading CSV files in pandas? Is there another approach to this kind of problem?
Using Python 3.6.8, pandas 0.24.2.
Upvotes: 0
Views: 1886
Reputation: 3497
I took parts of ivallesp's answer and came up with a solution that keeps (repairs) the broken lines instead of dropping them.
I post it here as documentation for my future self (who often forgets such things) and for others who might come across similar problems.
Bad file, with broken lines
infile = io.StringIO( ''' a; b; c; d
x10; 20; 30; 40
x11; 21; 31; 41
x12; 22;
32; 42
x13; 23; 33; 43
x14; 24; 34; 44
x15; 25; 35; 45
''' )
import re

# The lines are joined with a '\n', with surrounding whitespace stripped
data = '\n'.join( [ item.strip() for item in infile ] )
# Now data is not a file stream but a single string, with '\n's in between.
# Remove every newline that is NOT followed by the start of a real row
# ('x' plus two digits). The lookahead only checks, it does not consume,
# so an empty replacement joins the broken line back onto its row.
data = re.sub( r'\n(?!x\d\d)', '', data )
# Now data is a file stream again
data = io.StringIO( data )
# Fed to pandas.read_csv
pd.read_csv( data, sep=';' )
Variation
With an actual file on disk (not io.StringIO) I had to make a small modification and delete the .strip(), not sure why. Other than that, it also works when joining on nothing (''.join(...)).
Lastly, my actual file has times in the first column, in the form 00:00, 00:05 and so on, so I actually used it like this:
import io
import re

import pandas as pd

with open( 'broken_rows_file.csv', 'r' ) as infile:
    data = ''.join( [ item for item in infile ] )
# Every newline that is NOT followed by a ##:## time stamp is removed
data = re.sub( r'\n(?!\d\d:\d\d)', '', data )
data = io.StringIO( data )
df = pd.read_csv( data, sep=';' )
df
a b c d
0 00:10 20 30 40
1 00:11 21 31 41
2 00:12 22 32 42
3 00:13 23 33 43
4 00:14 24 34 44
5 00:15 25 35 45
Upvotes: 0
Reputation: 2212
That's something that happens commonly with big CSVs. The way I would solve it is to read the file with plain Python and check that the number of separators in each line matches what you expect, otherwise remove the row. Once the raw data is corrected, you can load it into pandas using StringIO. An example based on your damaged sample:
import io

import pandas as pd

# We load the file; each line keeps its trailing newline
with open(filepath) as filestream:
    # Keep only the lines with exactly three ';' separators
    data = filter(lambda l: l.count(";") == 3, filestream)
    # Convert back to a StringIO (the lines already end in '\n')
    stream = io.StringIO("".join(data))
# And finally we read with pandas
pd.read_csv(stream, sep=';')
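Applied to the damaged sample from the question (here as an in-memory stream rather than a file on disk), the filter silently discards both fragments of the broken row, so the x12 data is lost rather than repaired:

```python
import io

import pandas as pd

# The damaged sample from the question, as an in-memory stream
filestream = io.StringIO('''a; b; c; d
x10; 20; 30; 40
x11; 21; 31; 41
x12; 22;
32; 42
x13; 23; 33; 43
''')

# Keep only lines with exactly three ';' separators, i.e. full rows;
# both fragments of the broken 'x12' row are dropped
data = filter(lambda l: l.count(';') == 3, filestream)
df = pd.read_csv(io.StringIO(''.join(data)), sep=';')
```

Whether that is acceptable depends on the use case; if the broken rows must be kept, a rejoining approach like the one above is needed instead.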
Upvotes: 1