Reputation: 43
Hi, I'm doing some ML with tweets for my Bachelor thesis. To normalize the data in my DataFrame, I implemented this function:
from datetime import timedelta
from typing import List, Tuple

import pandas as pd

def avg_value_over_time(df, day_column: str, target_columns: List[Tuple[str, str]], day_span: int = 90):
    """
    Averages a certain value over time
    :param df: The dataframe
    :param day_column: The column with the day (datetime)
    :param target_columns: The columns that should be averaged over time (list of tuples of shape (column_name, new_avg_column_name))
    :param day_span: The number of days over which should be averaged
    :return: A DataFrame holding the newly created averaged columns
    """
    def apply_func(x):
        day = x[day_column]
        # Window covering the day_span days before this row's day, ending one day earlier
        fr, to = day - timedelta(days=day_span), day - timedelta(days=1)
        d = select_by_days(df, day_column, fr, to)
        # One mean per (source column, new column name) pair
        dct = {n2: d[n1].mean() for n1, n2 in target_columns}
        return pd.Series(dct)
    return df.apply(apply_func, axis=1)
The function works well when I call it like this:
numMSA_agg = avg_value_over_time(gt_df, 'Date',[('NumMentionsAvg','NumMentionsAvgNorm'), ('NumSourcesAvg', 'NumSourcesAvgNorm'),('NumArticlesAvg', 'NumArticlesAvg')])
Now I'm looking for a way to iterate over my columns and pass them into the function. Because of the word counting from the scikit library, I have more than 3000 columns, and I don't want to add them all to the call by hand. So I need a way to pass every column to the function and create, for each one, a new column named after the old column plus the string "Norm". Many thanks in advance.
Upvotes: 1
Views: 57
Reputation: 882
Here is a minimalistic example to showcase a solution to your question.
If your dataframe had four columns, of which you wanted to normalise three using your function, then the following code snippet would do the trick without having to manually supply the target_columns argument:
cols = df.columns  # Suppose this is ['Date', 'NumMentionsAvg', 'NumSourcesAvg', 'NumArticlesAvg']
cols = cols[1:]  # Drops the first column, leaving ['NumMentionsAvg', 'NumSourcesAvg', 'NumArticlesAvg']
target_columns_argument = []
for i in cols:
    target_columns_argument.append((i, i + 'Norm'))
In this example, target_columns_argument would look like:
[('NumMentionsAvg', 'NumMentionsAvgNorm'),
('NumSourcesAvg', 'NumSourcesAvgNorm'),
('NumArticlesAvg', 'NumArticlesAvgNorm')]
So, then you can call your function as follows:
numMSA_agg = avg_value_over_time(gt_df, 'Date', target_columns = target_columns_argument)
This works just as well with 3000 columns, as long as the slice in the second line of the snippet selects exactly the columns you wish to normalise.
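If the date column is not guaranteed to be first, slicing by position can pick up the wrong columns. A minimal sketch of the same idea that excludes the date column by name instead; build_target_columns and toy_df are hypothetical names used only for illustration, with toy_df standing in for your gt_df:

```python
import pandas as pd


def build_target_columns(df, day_column, suffix="Norm"):
    """Pair every column except day_column with its suffixed name."""
    # Hypothetical helper: filters by column name, so column order does not matter
    return [(col, f"{col}{suffix}") for col in df.columns if col != day_column]


# Toy frame standing in for gt_df (assumed layout: one date column plus numeric columns)
toy_df = pd.DataFrame({
    "NumMentionsAvg": [1.0, 2.0],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "NumSourcesAvg": [3.0, 4.0],
})

target_columns_argument = build_target_columns(toy_df, "Date")
print(target_columns_argument)
```

The resulting list can be passed as target_columns exactly as in the call above. Since your function returns only the new columns, something like gt_df.join(numMSA_agg) should attach them to the original frame, assuming the index is unchanged and none of the new names already exist in gt_df.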
Upvotes: 2