Reputation: 63
I have a DataFrame
, X
, in the following format:
Value1 Value2 Value3 Month Area
1 1 3 0 1
4 10 45 1 1
.. .. .. .. ..
15 11 10 34 1
10 21 12 35 1
1 2 2 0 2
3 4 4 1 2
.. .. .. .. ..
27 39 21 34 2
23 42 32 35 2
.. .. .. .. ..
From the DataFrame
X
, I want to create a DataFrame
Y
, with the same columns except for Month
and Area
. The rows of Y
, need to be relative to the Area
column but have the column values for the row of X+1
.
For example, the first row of Y
will contain the value of the second row of X
, if the area is 1.
Value1 Value2 Value3 (Index) (Area)
4 10 45 0 1
.. .. .. .. ..
10 21 12 34 1
3 4 4 35 2
.. .. .. .. ..
23 42 32 69 2
.. .. .. .. ..
Then, I need to remove the rows in X
, which contains the last Month
Value, for every Area
.
The DataFrame
X
would then be converted into something like the following:
Value1 Value2 Value3 Month Area
1 1 3 0 1
4 10 45 1 1
.. .. .. .. ..
15 11 10 34 1
1 2 2 0 2
3 4 4 1 2
.. .. .. .. ..
27 39 21 34 2
.. .. .. .. ..
This is my code at the moment:
#Define a list of indexes of X to Drop
to_drop = list()
prev = None
y_index = 0
#Iterate through X
for index, row in x.iterrows():
if not prev is None:
#If the Area is the same as before
if row['Area'] == prev :
#Append to the Y dataframe and increase y's index
y.loc[y_index] = row[y_columns]
y_index+=1
#We found a new index. We want to delete the previous one, which will have the last
#month value for the previous area
else: to_drop.append(prev_index)
prev_index= index
prev = row['Area']
print("Removing some rows...")
x = x.drop(x.index[to_drop])
However, when I check the new DataFrames
, they seem to be losing a lot of information, such as full Areas
being wiped out.
Any idea for a better approach, and what I could be doing wrong?
Please note, this is an explicative example. The Dataset is made of several thousand Area
values, with around 30
Month
Values.
Thanks in advance.
Upvotes: 2
Views: 410
Reputation: 153460
df['Month'] = df['Month'].shift()
df = df.dropna(subset=['Month'], axis=0)
df[df.duplicated('Area', keep='last')]
Does drop_duplicates
work for you?
df.drop_duplicates(['Area'], keep='first')
Output:
Value1 Value2 Value3 Month Area
0 1 1 3 0 1
2 1 2 2 0 2
And,
df.drop_duplicates(['Area'], keep='last')
Output:
Value1 Value2 Value3 Month Area
1 4 10 45 1 1
3 3 4 4 1 2
Update, drops the last records in each group:
df[df.duplicated('Area', keep='last')]
Upvotes: 0