Reputation: 495
I'm new to using Pandas dataframes. I have data in a .csv like this:
foo, 1234,
bar, 4567
stuff, 7894
New Entry,,
morestuff,1345
I'm reading it into the dataframe with
df = pd.read_csv
But what I really want is a new dataframe (or a way of splitting the current one) every time I have a "New Entry" line (obviously without including it). How could this be done?
Upvotes: 0
Views: 371
Reputation: 76927
1) One approach is to do it on the fly: read the file line by line and check for the NewEntry break.
2) Another way, if the dataframe already exists, is to find the NewEntry rows and slice the dataframe into multiple ones, stored in a dict such as dff = {}.
df
col1 col2
0 foo 1234
1 bar 4567
2 stuff 7894
3 NewEntry NaN
4 morestuff 1345
Find the NewEntry rows, then add [-1] and [len(df.index)] as boundary conditions
import numpy as np
rows = [-1] + np.where(df['col1']=='NewEntry')[0].tolist() + [len(df.index)]
[-1, 3, 5]
Create the dict of dataframes
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
Dict of dataframes {0: dataframe1, 1: dataframe2}
dff
{0: col1 col2
0 foo 1234
1 bar 4567
2 stuff 7894, 1: col1 col2
4 morestuff 1345}
Dataframe 1
dff[0]
col1 col2
0 foo 1234
1 bar 4567
2 stuff 7894
Dataframe 2
dff[1]
col1 col2
4 morestuff 1345
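Approach 1 above (splitting while reading the file) is not shown in code, so here is a minimal sketch of how it might look. The function name split_on_marker is hypothetical, the column names col1/col2 are taken from the example above, and the marker text "New Entry" comes from the question's CSV; adjust it if your file uses a different spelling. Values are kept as strings here, so convert them afterwards (for example with pd.to_numeric) if you need numbers.

```python
import csv
import io

import pandas as pd


def split_on_marker(lines, marker="New Entry", names=("col1", "col2")):
    """Build one DataFrame per block of rows separated by `marker` rows.

    `lines` is any iterable of text lines (an open file or io.StringIO).
    This is an illustrative helper, not part of pandas.
    """
    frames, rows = [], []
    for fields in csv.reader(lines, skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        if fields[0].strip() == marker:
            # Marker row: close the current block and start a new one.
            if rows:
                frames.append(pd.DataFrame(rows, columns=list(names)))
            rows = []
        else:
            # Pad short rows and drop extra trailing fields so every
            # row has exactly len(names) values.
            rows.append((fields + [None] * len(names))[:len(names)])
    if rows:
        frames.append(pd.DataFrame(rows, columns=list(names)))
    return frames


sample = "foo, 1234,\nbar, 4567\nstuff, 7894\nNew Entry,,\nmorestuff,1345\n"
frames = split_on_marker(io.StringIO(sample))
```

With a real file you would pass the open file object instead of io.StringIO, e.g. open(path, newline="").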
Upvotes: 1
Reputation: 394051
Using your example data, which I concatenated 3 times after loading (I named the columns 'a', 'b', 'c' for convenience), we find the indices where 'New Entry' occurs and then produce a list of tuples of these positions, stepwise, to mark the beginning and end of each range.
We can then iterate over this list of tuples, slice the original df, and append each slice to a list:
In [22]:
import io
import pandas as pd
t="""foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=['a','b','c'])
df = pd.concat([df]*3, ignore_index=True)
df
Out[22]:
a b c
0 foo 1234 NaN
1 bar 4567 NaN
2 stuff 7894 NaN
3 New Entry NaN NaN
4 morestuff 1345 NaN
5 foo 1234 NaN
6 bar 4567 NaN
7 stuff 7894 NaN
8 New Entry NaN NaN
9 morestuff 1345 NaN
10 foo 1234 NaN
11 bar 4567 NaN
12 stuff 7894 NaN
13 New Entry NaN NaN
14 morestuff 1345 NaN
In [30]:
idx = df[df['a'] == 'New Entry'].index
# Pair each marker position with the next one to get (start, end) ranges
idx_list = [(0,idx[0])]
idx_list = idx_list + list(zip(idx, idx[1:]))
idx_list
Out[30]:
[(0, 3), (3, 8), (8, 13)]
In [31]:
df_list = []
for i in idx_list:
    print(i)
    if i[0] == 0:
        df_list.append(df[i[0]:i[1]])
    else:
        df_list.append(df[i[0]+1:i[1]])
df_list
(0, 3)
(3, 8)
(8, 13)
Out[31]:
[ a b c
0 foo 1234 NaN
1 bar 4567 NaN
2 stuff 7894 NaN, a b c
4 morestuff 1345 NaN
5 foo 1234 NaN
6 bar 4567 NaN
7 stuff 7894 NaN, a b c
9 morestuff 1345 NaN
10 foo 1234 NaN
11 bar 4567 NaN
12 stuff 7894 NaN]
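Note that the ranges above end at the last 'New Entry' row, so the trailing row 14 (morestuff) does not appear in the output. As a possible alternative, here is a sketch that labels each row with a running count of the markers seen so far and then groups on that label, which also picks up the rows after the final marker. The names is_marker and group_id are just illustrative; the data and column names are the same as above.

```python
import io

import pandas as pd

t = """foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=["a", "b", "c"])
df = pd.concat([df] * 3, ignore_index=True)

# True on the separator rows, False everywhere else.
is_marker = df["a"] == "New Entry"

# Each marker bumps the running count, so all rows between two
# markers share the same group id.
group_id = is_marker.cumsum()

# Drop the marker rows, then collect one DataFrame per group id.
df_list = [
    group for _, group in df[~is_marker].groupby(group_id[~is_marker])
]
```

This gives four dataframes for the concatenated example, with the last one holding only row 14.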
Upvotes: 1