Reputation: 75
I have multiple time-series dataframes on which I keep repeating the same steps: naming the columns, dropping columns, adding columns, performing operations on columns, running numpy.select on columns, and removing columns (lately I have been using a second dataframe to hold the columns I no longer need).
Is there any way to put these steps into a function, so that I don't have to keep copying and pasting the code to get my data ready?
Example (partly pseudocode):
import numpy as np
import pandas as pd

cols = ['date','open','high','low','close','volume']
df = pd.read_csv('data.csv', sep='\t', names=cols)
# drop the columns I don't need
dcol = ['volume']
df.drop(dcol, axis=1, inplace=True)
# add a computed column ('operation' stands in for real column names)
df.insert(loc=5, column='name1', value=(df['operation'] - df['operation']))
# second dataframe that gets the extra columns
df2 = df.copy()
df2.insert(loc=6, column='name2', value=(df['operation'] - df['operation']))
# cond1, cond2, value1 and value2 are placeholders
conditions = [(cond1), (cond2)]
values1 = [(value1), (value2)]
values2 = [(value1), (value2)]
values3 = [(value1), (value2)]
# and finally three of these
df['randomname'] = np.select(conditions, values1)
So, is there a fast way to do this? Or do I just need to pull myself up by my bootstraps...
Upvotes: 2
Views: 1665
Reputation: 75
Done! So, we want to create functions that do our tedious tasks and then pass those functions to df.pipe(). (I'm using pseudocode again for the operations.)
(reading the data; no functions or pipe yet)
import numpy as np
import pandas as pd
cols = ['date','open','high','low','close','volume']
df = pd.read_csv('yourdata.csv', sep='\t', names=cols)
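def insert_cols(data):
    # add a computed column in place ('a' and 'b' are placeholder column names)
    data.insert(loc=6, column='yourCol', value=(data['a'] - data['b']))
    return data  # returning the dataframe is what lets .pipe() calls be chained

def numpy_cols(data):
    # pseudocode: cond1/cond2 are boolean conditions on data, value1/value2 the matching values
    conditions = [data(cond1), data(cond2)]
    values1 = [data(value1), data(value2)]
    data['yourCol2'] = np.select(conditions, values1)
    return data

def drop_cols(data):
    dcols = ['volume', 'randomName']
    data.drop(dcols, axis=1, inplace=True)
    return data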
(apply them to df, and to any other dataframes with the same format!)
df.pipe(insert_cols)
df.pipe(numpy_cols)
df.pipe(drop_cols)
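To make the numpy_cols pseudocode concrete, here is a minimal sketch on made-up prices. The column name 'direction' and the labels 'up', 'down' and 'flat' are my own assumptions for illustration; swap in your own conditions and values.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real CSV (values are made up)
df = pd.DataFrame({
    'open':  [10.0, 11.0, 12.0],
    'close': [10.5, 10.0, 12.0],
})

def numpy_cols(data):
    # Each condition is a boolean Series; for every row np.select takes
    # the value paired with the first condition that is True.
    conditions = [data['close'] > data['open'], data['close'] < data['open']]
    values = ['up', 'down']
    # default covers rows where no condition matches
    data['direction'] = np.select(conditions, values, default='flat')
    return data

df = df.pipe(numpy_cols)
print(df)
```

Keeping the default the same type as the values (strings here) avoids mixing numbers and text in one column.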
It should also be possible to chain the .pipe() calls so that all the functions are applied in one statement, but I could not get that to work.
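Chaining only works when every piped function returns the dataframe: a function that modifies its argument in place and returns nothing makes df.pipe(f) evaluate to None, and the next .pipe() call then fails. A minimal sketch of a working chain, assuming toy data and a hypothetical 'range' column (neither is from the original post):

```python
import pandas as pd

# Toy data standing in for the real CSV (values are made up)
df = pd.DataFrame({
    'high':   [12.0, 13.0],
    'low':    [10.0, 11.5],
    'volume': [100, 200],
})

def insert_cols(data):
    data = data.copy()  # work on a copy so the caller's frame is untouched
    data['range'] = data['high'] - data['low']
    return data         # hand the frame on to the next step

def drop_cols(data):
    # drop() without inplace=True already returns a new dataframe
    return data.drop(columns=['volume'])

# Each .pipe() receives whatever the previous function returned
clean = df.pipe(insert_cols).pipe(drop_cols)
print(clean)
```

Returning a new frame from each step also means the result has to be assigned (clean = ...), since df itself is left unchanged.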
I used these websites, and the first one is what made it click:
https://data-flair.training/blogs/pandas-function-applications/
https://towardsdatascience.com/using-pandas-pipe-function-to-improve-code-readability-96d66abfaf8
https://www.kdnuggets.com/2021/01/cleaner-data-analysis-pandas-pipes.html
EDIT: I ended up putting all the necessary cleaning and manipulation into a single function and piping that one function to the dataframe. It works the same way but takes a single line (no need to call .pipe() multiple times).
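The single-function approach might look roughly like this. The function name prepare, the 'change' column and the toy frames are assumptions for illustration only; the point is that one function can be piped to every dataframe that shares the same layout.

```python
import pandas as pd

def prepare(data):
    # hypothetical helper: all cleaning steps live in one place
    out = data.drop(columns=['volume'])         # returns a new frame
    out['change'] = out['close'] - out['open']  # derived column
    return out

# Toy frames with the same format (values are made up)
frames = {
    'a': pd.DataFrame({'open': [1.0, 2.0], 'close': [2.0, 1.0], 'volume': [5, 6]}),
    'b': pd.DataFrame({'open': [3.0], 'close': [3.5], 'volume': [7]}),
}

# One line per dataframe, or one comprehension for all of them
prepared = {name: frame.pipe(prepare) for name, frame in frames.items()}
print(prepared['a'])
```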
Upvotes: 2