TY Lim

Reputation: 623

Limit columns of CSV read into dataframe based on first rows

I have a CSV file (actually tab-separated) whose rows have a variable number of columns. The first few rows all have the same length, let's say x, and subsequent rows are variable (usually longer, let's say y).

I would like to read this file into a pandas DataFrame with the number of columns equal to the length of the first few rows, i.e. x columns; the later rows should keep their values up to column x and drop those after.

Is there an elegant way to do this in the pd.read_csv command?

I have tried using error_bad_lines=False, but this results in the later rows being dropped entirely, when what I need is to truncate them at column x. Note that I don't know ahead of time what x is, so I can't simply hardcode it either.

EDIT - this is a truncated example, of course. Here the first few rows have 6 values after the label and the last two have 10:

Time    0   1   2   3   4   5
alpha   1.35393 1.35393 1.35393 1.35393 1.35393 1.35393
dn  0.015   0.015   0.015   0.015   0.015   0.015
Mu  16.3012 16.3012 16.3012 16.3012 16.3012 16.3012
roll    114.29  114.29  114.29  114.29  114.29  114.29  114.29  114.29  114.29  114.29
timesens    50  50  50  50  50  50  50  50  50  50

I would like a DataFrame that has all the rows, but only columns 0:6.

Upvotes: 0

Views: 129

Answers (1)

piterbarg

Reputation: 8219

I put your sample data into data1.csv and used the following approach: read each line as a separate DataFrame, then concatenate them all together.

import pandas as pd
from io import StringIO

dfs = []
with open('data1.csv') as f:
    for line in f:
        # parse each line on its own, so rows of different widths don't clash
        io_string = StringIO(line)
        df = pd.read_csv(io_string, delim_whitespace=True, header=None)
        dfs.append(df)

# concat aligns on column labels, padding the shorter rows with NaN
whole_df = pd.concat(dfs).reset_index(drop=True)
whole_df

produces


|    | 0        |         1 |         2 |         3 |         4 |         5 |         6 |      7 |      8 |      9 |     10 |
|---:|:---------|----------:|----------:|----------:|----------:|----------:|----------:|-------:|-------:|-------:|-------:|
|  0 | Time     |   0       |   1       |   2       |   3       |   4       |   5       | nan    | nan    | nan    | nan    |
|  1 | alpha    |   1.35393 |   1.35393 |   1.35393 |   1.35393 |   1.35393 |   1.35393 | nan    | nan    | nan    | nan    |
|  2 | dn       |   0.015   |   0.015   |   0.015   |   0.015   |   0.015   |   0.015   | nan    | nan    | nan    | nan    |
|  3 | Mu       |  16.3012  |  16.3012  |  16.3012  |  16.3012  |  16.3012  |  16.3012  | nan    | nan    | nan    | nan    |
|  4 | roll     | 114.29    | 114.29    | 114.29    | 114.29    | 114.29    | 114.29    | 114.29 | 114.29 | 114.29 | 114.29 |
|  5 | timesens |  50       |  50       |  50       |  50       |  50       |  50       |  50    |  50    |  50    |  50    |

You get NaNs in the columns past x for the shorter rows; you can post-process to drop those columns once you have the DataFrame, if you wish:

whole_df.dropna(axis=1)
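If you'd rather avoid parsing line by line, a sketch of an alternative (assuming the same tab-separated layout; the inline `raw` string here stands in for your `data1.csv`): read the rows with the stdlib `csv` module, take x from the width of the first row, and truncate the longer rows before building the DataFrame:

```python
import csv
import pandas as pd
from io import StringIO

# sample rows from the question; in practice use open('data1.csv', newline='')
raw = (
    "Time\t0\t1\t2\t3\t4\t5\n"
    "alpha\t1.35393\t1.35393\t1.35393\t1.35393\t1.35393\t1.35393\n"
    "roll\t114.29\t114.29\t114.29\t114.29\t114.29\t114.29\t114.29\t114.29\t114.29\t114.29\n"
)

with StringIO(raw) as f:
    rows = list(csv.reader(f, delimiter='\t'))

x = len(rows[0])                              # width of the first row - no hardcoding
df = pd.DataFrame([row[:x] for row in rows])  # truncate longer rows at column x
```

Note that `csv.reader` yields strings, so you'd convert the numeric columns afterwards (e.g. with `pd.to_numeric`) if you need them as numbers.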

Upvotes: 1
