Reputation: 1498
Suppose I have a dataset which has the following header:
<id> <timestamp> <N> <1> <2> ... <N>
In this dataset, every row has a column "N", a number that determines how many columns that are numerically labeled after it. For example, I have a row like the following:
5 142323151.14 800 5.3564 5.4534 ... 7.4839 (800 columns after the 3rd column)
It is guaranteed that all rows have the same number of columns.
How can I read this CSV file with Pandas read_csv
and label the columns currectly? Is it possible to be done in a single call? I am learning Pandas, so I wonder how Pandas can do a task that would require multiple lines of Python code to do.
Thanks for your help!
EDIT: I tried
pd.read_csv('file.csv', names=['id','timestamp','count',...],
delimiter=' ',
header=None)
and I don't know what to put in the ... part
Upvotes: 2
Views: 2309
Reputation: 210832
UPDATE:
What if I have several trailing columns, such as <1> <2> ... , how do you use the value of n here?
In [320]: df
Out[320]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 5 1.423232e+08 8 1 2 3 4 5 6 7 8 1000 1000
1 6 1.423236e+08 8 11 22 33 44 55 66 77 88 1000 1000
In [321]: ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + [11, 12]
Out[321]: ['id', 'timestamp', 'n', 1, 2, 3, 4, 5, 6, 7, 8, 11, 12]
In [322]: df.columns = ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + [11, 12]
In [323]: df
Out[323]:
id timestamp n 1 2 3 4 5 6 7 8 11 12
0 5 1.423232e+08 8 1 2 3 4 5 6 7 8 1000 1000
1 6 1.423236e+08 8 11 22 33 44 55 66 77 88 1000 1000
if you can predefine trailing column names, you can do this:
In [328]: trailing_cols = ['max','min']
In [329]: ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + trailing_cols
Out[329]: ['id', 'timestamp', 'n', 1, 2, 3, 4, 5, 6, 7, 8, 'max', 'min']
OLD answer:
I'd do it this way:
first read your CSV not specifying column names:
df = pd.read_csv('file.csv', delim_whitespace=True, header=None)
In [271]: df
Out[271]:
0 1 2 3 4 5 6 7 8 9 10
0 5 1.423232e+08 800 1 2 3 4 5 6 7 8
1 5 1.423232e+08 800 11 22 33 44 55 66 77 88
now we can rename columns as follows:
In [272]: df.columns = ['id', 'timestamp', 'n'] + (df.columns[3:].values - 2).tolist()
In [273]: df
Out[273]:
id timestamp n 1 2 3 4 5 6 7 8
0 5 1.423232e+08 800 1 2 3 4 5 6 7 8
1 5 1.423232e+08 800 11 22 33 44 55 66 77 88
Upvotes: 4
Reputation: 2978
You don't need to specify the number of columns in read_csv. You can just call
df = pd.read_csv('file.csv')
And it should read all the columns in (and rows as well). You can pass header=0
if you'd like but pandas can usually infer this.
Upvotes: 0