zkytony
zkytony

Reputation: 1498

Python Pandas: read csv with N columns where N is specified in the some other column

Suppose I have a dataset which has the following header:

<id>  <timestamp>  <N>  <1>  <2> ... <N>

In this dataset, every row has a column "N", a number that determines how many columns that are numerically labeled after it. For example, I have a row like the following:

5 142323151.14 800 5.3564 5.4534 ... 7.4839 (800 columns after the 3rd column)

It is guaranteed that all rows have the same number of columns.

How can I read this CSV file with Pandas read_csv and label the columns currectly? Is it possible to be done in a single call? I am learning Pandas, so I wonder how Pandas can do a task that would require multiple lines of Python code to do.

Thanks for your help!

EDIT: I tried

pd.read_csv('file.csv', names=['id','timestamp','count',...],
                        delimiter=' ',
                        header=None)

and I don't know what to put in the ... part

Upvotes: 2

Views: 2309

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

UPDATE:

What if I have several trailing columns, such as <1> <2> ... , how do you use the value of n here?

In [320]: df
Out[320]:
   0             1   2   3   4   5   6   7   8   9   10    11    12
0   5  1.423232e+08   8   1   2   3   4   5   6   7   8  1000  1000
1   6  1.423236e+08   8  11  22  33  44  55  66  77  88  1000  1000

In [321]: ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + [11, 12]
Out[321]: ['id', 'timestamp', 'n', 1, 2, 3, 4, 5, 6, 7, 8, 11, 12]

In [322]: df.columns = ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + [11, 12]

In [323]: df
Out[323]:
   id     timestamp  n   1   2   3   4   5   6   7   8    11    12
0   5  1.423232e+08  8   1   2   3   4   5   6   7   8  1000  1000
1   6  1.423236e+08  8  11  22  33  44  55  66  77  88  1000  1000

if you can predefine trailing column names, you can do this:

In [328]: trailing_cols = ['max','min']

In [329]: ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + trailing_cols
Out[329]: ['id', 'timestamp', 'n', 1, 2, 3, 4, 5, 6, 7, 8, 'max', 'min']

OLD answer:

I'd do it this way:

first read your CSV not specifying column names:

df = pd.read_csv('file.csv', delim_whitespace=True, header=None)

In [271]: df
Out[271]:
   0             1    2   3   4   5   6   7   8   9   10
0   5  1.423232e+08  800   1   2   3   4   5   6   7   8
1   5  1.423232e+08  800  11  22  33  44  55  66  77  88

now we can rename columns as follows:

In [272]: df.columns = ['id', 'timestamp', 'n'] + (df.columns[3:].values - 2).tolist()

In [273]: df
Out[273]:
   id     timestamp    n   1   2   3   4   5   6   7   8
0   5  1.423232e+08  800   1   2   3   4   5   6   7   8
1   5  1.423232e+08  800  11  22  33  44  55  66  77  88

Upvotes: 4

ashishsingal
ashishsingal

Reputation: 2978

You don't need to specify the number of columns in read_csv. You can just call

df = pd.read_csv('file.csv')

And it should read all the columns in (and rows as well). You can pass header=0 if you'd like but pandas can usually infer this.

Upvotes: 0

Related Questions