Reputation: 971
I have the following data frame, which was obtained using this code:
df1 = df.groupby('id')['x,y'].apply(lambda x: rdp(x.tolist(), 5.0)).reset_index()
(rdp here is the Ramer-Douglas-Peucker line-simplification function from the rdp package.)
The resulting data frame:
id x,y
0 1 [(0, 0), (1, 2)]
1 2 [(1, 3), (1, 2)]
2 3 [(2, 5), (4, 6)]
Is it possible to get something like this:
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
Here, each list of coordinates from the previous data frame is split into separate rows against its respective id.
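For reference, df1 can be built directly like this (a minimal sketch that skips the groupby/rdp step and just reproduces the frame shown above):
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]]
})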
Upvotes: 1
Views: 1767
Reputation: 294258
'id' column
I use the str.len method to quickly count the number of elements in each row's sub-list. This is convenient because we can pass the result directly to the repeat method of df1['id'], which will repeat each element by the corresponding amount from the lengths we passed.
'x,y' column
np.concatenate is the obvious tool to push all the sub-lists together. However, in this case the sub-lists are lists of tuples, and np.concatenate will not treat these as lists of objects. So instead I use the sum method, which uses the underlying sum method on lists and will in turn concatenate them.
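To see why sum rather than np.concatenate, here is a small standalone illustration (made-up sub-lists mirroring the shape of df1['x,y']):
import numpy as np

lists = [[(0, 0), (1, 2)], [(1, 3), (1, 2)]]

# np.concatenate coerces the tuples into rows of a 2-D numeric array,
# so the tuple objects are lost:
print(np.concatenate(lists))
# [[0 0]
#  [1 2]
#  [1 3]
#  [1 2]]

# summing the lists concatenates them and keeps the tuples intact:
print(sum(lists, []))
# [(0, 0), (1, 2), (1, 3), (1, 2)]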
pandas
If we stick with pandas, we can keep the code cleaner. Use repeat with str.len and sum:
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
id x,y
0 1 (0, 0)
0 1 (1, 2)
1 2 (1, 3)
1 2 (1, 2)
2 3 (2, 5)
2 3 (4, 6)
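Note that the index repeats (0, 0, 1, 1, 2, 2) because repeat on a Series carries over df1's original index. If you want the fresh 0-5 index from the question's expected output, one extra reset_index(drop=True) at the end does it:
pd.DataFrame({
    'id': df1['id'].repeat(df1['x,y'].str.len()),
    'x,y': df1['x,y'].sum()
}).reset_index(drop=True)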
numpy
We can speed this approach up by using the underlying numpy arrays and the equivalent numpy methods.
NOTE: this is equivalent logic!
pd.DataFrame({
'id': df1['id'].values.repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()
})
We can speed it up even more by skipping the str.len
method and calculating the lengths with a list comprehension.
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
small data
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 351 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 590 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 498 µs per loop
larger data
df1 = pd.concat([df1.head(3)] * 100, ignore_index=True)
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 579 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 841 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 704 µs per loop
Upvotes: 2
Reputation: 862601
You can use the DataFrame constructor with stack:
df2 = (pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='x,y'))
print(df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
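To see what stack is doing here, look at the intermediate wide frame (a small sketch using the same df1):
wide = pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])
print(wide)
#          0       1
# id
# 1   (0, 0)  (1, 2)
# 2   (1, 3)  (1, 2)
# 3   (2, 5)  (4, 6)
# stack() then moves columns 0 and 1 into the index, giving one tuple per row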
numpy
A numpy solution uses numpy.repeat with the lengths of the sub-lists computed by str.len; the 'x,y' column is flattened by numpy.ndarray.sum:
df2 = pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
print(df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
Timings:
In [54]: %timeit pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id']).stack().reset_index(level=1, drop=True).reset_index(name='x,y')
1000 loops, best of 3: 1.49 ms per loop
In [55]: %timeit pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 562 µs per loop
#piRSquared solution
In [56]: %timeit pd.DataFrame({'id': df1['id'].repeat(df1['x,y'].str.len()), 'x,y': df1['x,y'].sum() })
1000 loops, best of 3: 712 µs per loop
Upvotes: 5