Reputation: 23
I've started using Pandas recently and have been stumbling over this issue for a few days. I have a dataframe with interval information that looks a bit like this:
import pandas as pd

df = pd.DataFrame({'RangeBegin': [1, 3, 5, 10, 12, 42, 65],
                   'RangeEnd': [2, 4, 7, 11, 41, 54, 100],
                   'Var1': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
                   'Var2': ['A', 'A', 'B', 'B', 'B', 'B', 'A']})
   RangeBegin  RangeEnd Var1 Var2
0           1         2    A    A
1           3         4    A    A
2           5         7    A    B
3          10        11    B    B
4          12        41    B    B
5          42        54    B    B
6          65       100    A    A
It is sorted by RangeBegin. The idea is to end up with something like this instead:
   RangeBegin  RangeEnd Var1 Var2
0         1.0       4.0    A    A
2         5.0       7.0    A    B
3        10.0      54.0    B    B
6        65.0     100.0    A    A
Where every "duplicate" (matching Var1 and Var2) row with contiguous ranges is aggregated into a single row. I'm thinking of expanding this algorithm to detect and deal with overlaps, but I'd like to get this working properly first.
I've got a solution working by using iterrows to build a new dataframe row by row, but it takes far too long on my real dataset and I'd like to use a more vectorized implementation. I've looked into groupby but can't find a set of keys (or a function to apply to said groups) that would make this work.
Here's my current implementation as it stands:
import pandas as pd

def test():
    df = pd.DataFrame({'RangeBegin': [1, 3, 5, 10, 12, 42, 65],
                       'RangeEnd': [2, 4, 7, 11, 41, 54, 100],
                       'Var1': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
                       'Var2': ['A', 'A', 'B', 'B', 'B', 'B', 'A']})
    print(df)
    i = 0
    cols = df.columns
    aggData = pd.DataFrame(columns=cols)
    for rowIndex, rowData in df.iterrows():
        # if our new dataframe is empty or its last row is not contiguous, append it
        if aggData.empty or not duplicateContiguousRow(cols, rowData, aggData.loc[i]):
            aggData = aggData.append(rowData)
            i = rowIndex
        # otherwise, modify the last row
        else:
            aggData.loc[i, 'RangeEnd'] = rowData['RangeEnd']
    print(aggData)
def duplicateContiguousRow(cols, row, aggDataRow):
    # first bool: are the ranges contiguous?
    contiguousBool = aggDataRow['RangeEnd'] + 1 == row['RangeBegin']
    if not contiguousBool:
        return False
    # second bool: is this row a duplicate (minus range columns)?
    duplicateBool = True
    for col in cols:
        if not duplicateBool:
            break
        elif col not in ['RangeBegin', 'RangeEnd']:
            # NaN != NaN, so also treat two NaNs as a match
            duplicateBool = duplicateBool and (row[col] == aggDataRow[col]
                or (row[col] != row[col] and aggDataRow[col] != aggDataRow[col]))
    return duplicateBool
EDIT: This question just got asked while I was writing this one. The answer looks promising
Upvotes: 2
Views: 1238
Reputation: 6784
You can use groupby for this purpose, after first detecting the consecutive segments:
import numpy as np

df['block'] = ((df['Var1'].shift(1) != df['Var1']) |
               (df['Var2'].shift(1) != df['Var2'])).astype(int).cumsum()
df.groupby(['Var1', 'Var2', 'block']).agg(
    {'RangeBegin': np.min, 'RangeEnd': np.max}).reset_index()
will result in:
  Var1 Var2  block  RangeBegin  RangeEnd
0    A    A      1           1         4
1    A    A      4          65       100
2    A    B      2           5         7
3    B    B      3          10        54
You could then sort by block
to restore the original order.
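Putting the two steps together, a self-contained sketch might look like the following (it uses the string aggregators 'min'/'max' rather than np.min/np.max, which avoids the numpy import, and sort_values('block') to restore the original row order):

```python
import pandas as pd

df = pd.DataFrame({'RangeBegin': [1, 3, 5, 10, 12, 42, 65],
                   'RangeEnd': [2, 4, 7, 11, 41, 54, 100],
                   'Var1': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
                   'Var2': ['A', 'A', 'B', 'B', 'B', 'B', 'A']})

# A new block starts whenever Var1 or Var2 differs from the previous row;
# cumsum of the boolean change markers gives each run its own block id.
change = (df['Var1'] != df['Var1'].shift(1)) | (df['Var2'] != df['Var2'].shift(1))
df['block'] = change.cumsum()

# Collapse each block to a single row spanning its full range.
out = (df.groupby(['Var1', 'Var2', 'block'], as_index=False)
         .agg({'RangeBegin': 'min', 'RangeEnd': 'max'})
         .sort_values('block')
         .reset_index(drop=True))
print(out)
```

Note that, like the code above, this keys only on changes in Var1/Var2 between adjacent rows; it does not verify that the ranges are actually contiguous (RangeEnd + 1 == RangeBegin) the way the question's duplicateContiguousRow does, so that check would need to be folded into the block detection if it matters for your data.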
Upvotes: 7