Reputation: 41
I have a dataframe with 11 000k rows. There are multiple columns but I am interested only in 2 of them: Name and Value. One name can repeat itself multiple times among rows. I want to calculate the average value for each name and create a new dataframe with the average value for each name. I don't really know how to walk through rows and how to calculate the average. Any help will be highly appreciated. Thank you!
Name DataType TimeStamp Value Quality
Food Float 2019-01-01 13:00:00 105.75 122
Food Float 2019-01-01 17:30:00 11.8110352 122
Food Float 2019-01-01 17:45:00 12.7932892 122
Water Float 2019-01-01 14:01:00 16446.875 122
Water Float 2019-01-01 14:00:00 146.875 122
RangeIndex: 11140487 entries, 0 to 11140486
Data columns (total 6 columns):
Name object
Value object
This is what I have and I know it is really noob ish but I am having a difficult time walking through rows.
for i in range(0, len(df):
if((df.iloc[i]['DataType']!='Undefined')):
print df.loc[df['Name'] == df.iloc[i]['Name'], df.iloc[i]['Value']].mean()
Upvotes: 2
Views: 9395
Reputation: 148910
You should avoid as much as possible to iterate rows in a dataframe, because it is very unefficient...
groupby
is the way to go when you want to apply the same processing to various groups of rows identified by their values in one or more columns. Here what you want is (*):
df.groupby('TagName')['Sample_value'].mean().reset_index()
it gives as expected:
TagName Sample_value
0 Steam 1.081447e+06
1 Utilities 3.536931e+05
Details on the magic words:
groupby
: identifies the column(s) used to group the rows (same values)['Sample_values']
: restrict the groupby object to the column of interestmean()
: computes the mean per groupreset_index()
: by default the grouping columns go into the index, which is fine for the mean operation. reset_index
make them back normal columnsUpvotes: 1
Reputation: 41
You don't need to walk through the rows, you can just take all of the fields that match your criteria
d = {'col1': [1,2,1,2,1,2], 'col2': [3, 4,5,6,7,8]}
df = pd.DataFrame(data=d)
#iterate over all unique entries in col1
for entry in df["col1"].unique():
# get all the col2 values where col1 is the current iter of col1 entries
meanofcurrententry=df[df["col1"]==entry]["col2"].mean()
print(meanofcurrententry)
This is not a full solution, but I think it helps more to understand the necessary logic. You still need to wrap it up into your own dataframe, however it hopefuly helps to understand how to use the indexing
Upvotes: 3
Reputation: 3066
It sounds like the groupby()
functionality is what you want. You define the column where your groups are and then you can take the mean()
of each group. An example from the documentation:
df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
In your case it would be something like this:
df.groupby('TagName')['Samples_value'].mean()
Edit: So, I applied the code to your provided input dataframe and following is the output:
TagName
Steam 1.081447e+06
Utilities 3.536931e+05
Name: Sample_value, dtype: float64
Is this what you are looking for?
Upvotes: 2