Working with a Pandas DF with an implemented function

Question

Hi, I'm doing some ML on tweets for my bachelor's thesis. To normalize the data in my DataFrame, I implemented this function:

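from datetime import timedelta
from typing import List, Tuple

import pandas as pd

# Note: select_by_days (used below) is a helper that is not shown in this post.
def avg_value_over_time(df, day_column: str, target_columns: List[Tuple[str, str]], day_span: int = 90):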
    """
    Averages a certain value over time
    :param df: The dataframe
    :param day_column: The column with the day (datetime)
    :param target_columns: The columns that should be averaged over time (list of tuples of shape (column_name, new_avg_column_name))
    :param day_span: The number of days over which should be averaged
    :return: A new created column with the averaged values
    """

    def apply_func(x):
        day = x[day_column]
        fr, to = day - timedelta(days=day_span), day - timedelta(days=1)
        d = select_by_days(df, day_column, fr, to)

        dct = {n2: d[n1].mean() for n1, n2 in target_columns}

        return pd.Series(dct)

    return df.apply(apply_func, axis=1)

The function works well with the following call:

numMSA_agg = avg_value_over_time(gt_df, 'Date', [('NumMentionsAvg', 'NumMentionsAvgNorm'), ('NumSourcesAvg', 'NumSourcesAvgNorm'), ('NumArticlesAvg', 'NumArticlesAvgNorm')])

Now I'm looking for a way to iterate over my columns and pass them to this function. Because of scikit-learn's word counting I have more than 3000 columns, and I don't want to list them all by hand in the function call. So I need a way to feed every column into the function and then create, for each one, a new column named after the old column plus the string "Norm". Many thanks in advance.

Ishwar Venugopal · Accepted Answer

Here is a minimal example of one way to solve this.

Suppose your dataframe has four columns and you want to normalise three of them with your function. The following snippet builds the target_columns argument for you, so you don't have to write it out by hand:

cols = df.columns  # Suppose this is ['Date', 'NumMentionsAvg', 'NumSourcesAvg', 'NumArticlesAvg']

cols = cols[1:]  # Drops 'Date', leaving ['NumMentionsAvg', 'NumSourcesAvg', 'NumArticlesAvg']

target_columns_argument = []

for i in cols:
    # Pair each column with its new name: old name + 'Norm'
    target_columns_argument.append((i, i + 'Norm'))

In this example, target_columns_argument would look like:

[('NumMentionsAvg', 'NumMentionsAvgNorm'),
 ('NumSourcesAvg', 'NumSourcesAvgNorm'),
 ('NumArticlesAvg', 'NumArticlesAvgNorm')]

So, then you can call your function as follows:

numMSA_agg = avg_value_over_time(gt_df, 'Date', target_columns=target_columns_argument)

This works just as well with 3000 columns, as long as the slice in the second line (cols[1:] here) selects exactly the columns you want to normalise.
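
If the date column is not guaranteed to be first, selecting the columns by name may be safer than slicing by position. Below is a minimal sketch of that idea on a tiny made-up dataframe. Note that select_by_days is a hypothetical stand-in here, since the question does not show the real helper, and the sample values and day_span=2 are chosen only for illustration.

```python
from datetime import timedelta
from typing import List, Tuple

import pandas as pd


def select_by_days(df, day_column, fr, to):
    # Hypothetical stand-in for the asker's helper (not shown in the question):
    # keeps the rows whose date lies in the inclusive range [fr, to].
    return df[(df[day_column] >= fr) & (df[day_column] <= to)]


def avg_value_over_time(df, day_column: str,
                        target_columns: List[Tuple[str, str]],
                        day_span: int = 90):
    # Same logic as the function in the question.
    def apply_func(x):
        day = x[day_column]
        fr, to = day - timedelta(days=day_span), day - timedelta(days=1)
        d = select_by_days(df, day_column, fr, to)
        return pd.Series({n2: d[n1].mean() for n1, n2 in target_columns})

    return df.apply(apply_func, axis=1)


# Made-up sample data, for illustration only.
gt_df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "NumMentionsAvg": [1.0, 3.0, 5.0],
    "NumSourcesAvg": [2.0, 4.0, 6.0],
})

# Every column except the date column, selected by name instead of position.
value_cols = gt_df.columns.drop("Date")

# One (old_name, new_name) tuple per column.
target_columns_argument = [(c, c + "Norm") for c in value_cols]

numMSA_agg = avg_value_over_time(gt_df, "Date", target_columns_argument, day_span=2)
print(numMSA_agg)
```

The first row has no earlier days to average over, so its values come out as NaN. If the dataframe also contains non-numeric columns, gt_df.select_dtypes(include="number").columns could be used in place of gt_df.columns.drop("Date").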
