Reputation: 460
So I have a CSV that looks a bit like this:
1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...
When I try to generate a DataFrame with the following code:
df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)
it only adds rows with 3 columns to the df (rows 1, 3 and 5 from above).
The rest are considered 'bad lines', giving me the following message:
Skipping line 17467: expected 3 fields, saw 9
How do I create a data frame that includes all data in my csv, possibly just filling in the empty cells with null? Or do I have to declare the max row length prior to adding to the df?
Thanks!
Upvotes: 20
Views: 35264
Reputation: 879143
If you know that the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)
yields
   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN
If you have an upper limit, N, on the number of columns, then you can have Pandas read N columns and then use dropna to drop completely empty columns:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)
yields
   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN
Note that this could drop columns from the middle of the data set (not just columns from the right-hand side) if they are completely empty.
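If no upper limit is known in advance, one option is to scan the data once to find the widest row and pass that many names. This is only a minimal sketch: it assumes the delimiter is '|' padded with spaces as in the question and that no field contains a '|' of its own. The names raw and n_cols are illustrative, and an in-memory string stands in for the file.

```python
import io
import pandas as pd

# Sample data from the question; in practice this would be the file contents.
raw = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

# First pass: the widest row decides how many column names to supply.
n_cols = max(line.count('|') + 1 for line in raw.splitlines())

# Second pass: the regex separator also swallows the padding around each '|',
# which requires the (slower) Python parsing engine.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s*\|\s*',
    engine='python',
    header=None,
    names=list(range(n_cols)),
)
print(df)
```

For a file on disk, the first pass could iterate over open('data.csv') line by line instead of holding everything in a string.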
Upvotes: 13
Reputation: 41
import pandas as pd
colnames = [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)
Change 9 in colnames to the number x if the code gives the error
Skipping line 17467: expected 3 fields, saw x
Upvotes: 0
Reputation: 1670
Consider using Python's csv module to do the heavy lifting for importing data and format grooming. You can implement a custom dialect to handle varying csv-ness.
import csv
import pandas as pd
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""
with open('test1.csv', 'w') as f:
    f.write(csv_data)
csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data=data)
Gives you a csv import dialect and the following DataFrame:
   0           1    2     3     4     5     6
0  1  01-01-2019  724  None  None  None  None
1  2  01-01-2019  233   436  None  None  None
2  3  01-01-2019  345  None  None  None  None
3  4  01-01-2019  803   933   943   923   954
4  5  01-01-2019  454  None  None  None  None
Left as an exercise is handling the whitespace padding in the input file.
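One way to handle that padding is to strip each cell as the rows are read. This is a minimal sketch, assuming every field can safely be trimmed. It uses an in-memory string instead of the temporary file and passes the delimiter straight to csv.reader rather than going through a registered dialect.

```python
import csv
import io
import pandas as pd

csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

# Strip the spaces surrounding each '|'-separated cell while reading.
reader = csv.reader(io.StringIO(csv_data), delimiter='|')
data = [[cell.strip() for cell in row] for row in reader]

# The DataFrame constructor pads the shorter rows with missing values.
df = pd.DataFrame(data)
print(df)
```

All values are still strings at this point; pd.to_numeric can convert individual columns afterwards if numbers are needed.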
Upvotes: 3
Reputation: 14063
Read fixed width should work:
from io import StringIO
import pandas as pd
s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''
pd.read_fwf(StringIO(s), header=None)
   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN
or with a delimiter param:
s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''
pd.read_fwf(StringIO(s), header=None, delimiter='|')
   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN
Note that for your actual file you will not use StringIO; just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)
Upvotes: 2
Reputation: 7509
Add extra columns (empty or otherwise) to the top of your csv file. Pandas takes the first row as the default size, and any shorter row below it is padded with NaN values. Example:
file.csv:
a,b,c,d,e
1,2,3
3
2,3,4
code:
>>> import pandas as pd
>>> pd.read_csv('file.csv')
   a    b    c    d    e
0  1  2.0  3.0  NaN  NaN
1  3  NaN  NaN  NaN  NaN
2  2  3.0  4.0  NaN  NaN
Upvotes: 1
Reputation: 59519
If using only pandas, read in the lines first and deal with the separator afterwards.
import pandas as pd
df = pd.read_csv('data.csv', header=None, sep='\n')
df = df[0].str.split(r'\s\|\s', expand=True)
   0           1    2     3     4     5     6
0  1  01-01-2019  724  None  None  None  None
1  2  01-01-2019  233   436  None  None  None
2  3  01-01-2019  345  None  None  None  None
3  4  01-01-2019  803   933   943   923   954
4  5  01-01-2019  454  None  None  None  None
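Some newer pandas releases appear to reject sep='\n' in read_csv, so the same idea can also be written by collecting the lines in plain Python first. This is a minimal sketch under that assumption: raw stands in for the file contents, and the pattern is slightly looser than the one above so that missing padding around '|' is tolerated.

```python
import pandas as pd

# Sample data from the question; in practice this would be the file contents.
raw = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

# One Series element per line, with no separator handling yet.
lines = pd.Series(raw.splitlines())

# Split on '|' together with any surrounding whitespace; a multi-character
# pattern is treated as a regular expression by default.
df = lines.str.split(r'\s*\|\s*', expand=True)
print(df)
```

For a real file, the lines could come from open('data.csv').read().splitlines() instead. The resulting cells are strings, with missing values where a row is shorter than the widest one.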
Upvotes: 15