Krblaze

Reputation: 5

Faster way to iterate over columns in pandas

I have the following task.

I have this data:

import pandas as pd
import numpy as np

data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        'email': ['[email protected]', '[email protected]', np.nan, '[email protected]', np.nan, np.nan, '[email protected]', '[email protected]'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']}
df = pd.DataFrame(data)

For every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), I have to extract the matching personal information and write it to a file.

E.g. for most_visited_airport and Heathrow, I need to output three files containing the names, emails and phones of the people whose most visited airport is Heathrow.

Currently, I have this code to do the operation for both columns and all the values:

# Feature columns are the ones whose name contains 'most'
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, 'email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')

Is it possible to do this in a more elegant and perhaps faster way? I currently have a small dataset, but I am not sure this code will perform well on big data. My particular concern is the nested loops.

Thank you in advance!

Upvotes: 0

Views: 686

Answers (2)

D.Vinogradov

Reputation: 103

You can do it this way:

cols = df.filter(regex='most').columns

def func_current_cols_to_csv(most_col):
    # Write one CSV per (value, attribute) pair,
    # e.g. most_visited_airport_Heathrow_name.csv
    for value in df[most_col].dropna().unique():
        for attr in ('name', 'phone', 'email'):
            df.loc[df[most_col] == value, attr].dropna().to_csv(
                f'{most_col}_{value}_{attr}.csv', index=False)

for col in cols:
    func_current_cols_to_csv(col)

Also, in the to_csv options you can keep the index instead of passing index=False, but do not forget to reset it before writing.
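
For example, a minimal sketch of that variant for a single column/value pair (the Heathrow name file, chosen for illustration):

subset = df.loc[df['most_visited_airport'] == 'Heathrow', 'name'].dropna()
# Reset the row labels so the index written to the file runs 0, 1, 2, ...
subset.reset_index(drop=True).to_csv('most_visited_airport_Heathrow_name.csv')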

Upvotes: 0

Mad Physicist

Reputation: 114518

You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:

for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', 'email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
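
Note that groupby skips NaN group keys by default (its dropna parameter defaults to True), so rows with a missing value in the feature column are excluded automatically, matching the dropna() calls in the question. A quick sketch of the groups this produces for one column of the sample data:

for key, group in df.groupby('most_visited_airport'):
    print(key, group['name'].tolist())
# Beijing ['Chris', 'Susan']
# Heathrow ['Todd', 'Jackie', 'Rick']
# Tokyo ['Richard', 'Joe']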

Upvotes: 1
