Reputation: 623
I have a CSV (actually tab-separated) with rows with a variable number of columns. The first few rows are the same length, let's say x
, and then subsequent rows are variable (usually longer, let's say y
).
I would like to read this file into a pandas
dataframe with number of columns = the length of the first few rows, i.e. x
columns, and for the later rows, including values up to column x
and dropping those after.
Is there an elegant way to do this in the pd.read_csv
command?
I have tried using error_bad_lines=False
but this results in the later rows being dropped, when what I need is to truncate them at col x
. Note that I don't know ahead of time what x
is so can't simply hardcode it either.
EDIT - this is a truncated example of course - here the first few rows have 6 values and the last two have 10 values:
Time 0 1 2 3 4 5
alpha 1.35393 1.35393 1.35393 1.35393 1.35393 1.35393
dn 0.015 0.015 0.015 0.015 0.015 0.015
Mu 16.3012 16.3012 16.3012 16.3012 16.3012 16.3012
roll 114.29 114.29 114.29 114.29 114.29 114.29 114.29 114.29 114.29 114.29
timesens 50 50 50 50 50 50 50 50 50 50
I would like a dataframe that has all rows, and columns from 0:6.
Upvotes: 0
Views: 129
Reputation: 8219
I put your sample data into data1.csv
and used the following. Basically reading each line as a separate dataframe and then concatenating together
from io import StringIO
dfs = []
with open('data1.csv') as f:
for line in f:
io_string = StringIO(line)
df = pd.read_csv(io_string, delim_whitespace=True, header = None)
dfs.append(df)
whole_df = pd.concat(dfs).reset_index(drop = True)
whole_df
produces
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---:|:---------|----------:|----------:|----------:|----------:|----------:|----------:|-------:|-------:|-------:|-------:|
| 0 | Time | 0 | 1 | 2 | 3 | 4 | 5 | nan | nan | nan | nan |
| 1 | alpha | 1.35393 | 1.35393 | 1.35393 | 1.35393 | 1.35393 | 1.35393 | nan | nan | nan | nan |
| 2 | dn | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | nan | nan | nan | nan |
| 3 | Mu | 16.3012 | 16.3012 | 16.3012 | 16.3012 | 16.3012 | 16.3012 | nan | nan | nan | nan |
| 4 | roll | 114.29 | 114.29 | 114.29 | 114.29 | 114.29 | 114.29 | 114.29 | 114.29 | 114.29 | 114.29 |
| 5 | timesens | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
You get nan
s for columns after x
, you can post-process to drop those columns once you have the df, if you wish:
whole_df.dropna(axis=1)
Upvotes: 1