helloworldnoob

Reputation: 75

Python Pandas Dataframe: Fast way to clean and manipulate data?

I have multiple time series dataframes on which I keep repeating the same steps: naming the columns, dropping columns, adding columns, performing operations on columns, applying numpy.select to columns, and removing columns (lately I have been using a second dataframe to hold the columns I no longer need).

Is there any way to create a function that does these things, so I don't have to keep copying and pasting the code to get my data ready?

Slightly pseudocode example:

cleaning

import pandas as pd

cols = ['date','open','high','low','close','volume']
df = pd.read_csv('data.csv',sep='\t',names=cols)
dcol=['volume']
df.drop(dcol,axis=1,inplace=True)

multiple of these

df.insert(loc=5,column='name1',value=(df['operation']-df['operation']))

second df (used for hiding the values from the main df)

df2 = df.copy()

again, multiple of these

df2.insert(loc=6,column='name2',value=(df['operation']-df['operation']))

using numpy to select values from df2 to insert them into main df

import numpy as np
conditions = [(cond1),(cond2)]
values1 = [(value1),(value2)]
values2 = [(value1),(value2)]
values3 = [(value1),(value2)]

# and finally three of these
df['randomname'] = np.select(conditions,values1)

So, is there a fast way to do this? Or do I just need to pull myself up by my bootstraps...

Upvotes: 2

Views: 1665

Answers (1)

helloworldnoob

Reputation: 75

Done! The idea is to wrap each tedious task in a function and then apply those functions to the dataframe with df.pipe(). (The operations are pseudocode again.)

standard for my csv files

(not using functions or pipe here)

import numpy as np
import pandas as pd

cols = ['date','open','high','low','close','volume']
df = pd.read_csv('yourdata.csv',sep='\t',names=cols)
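
Since every file shares the same layout, the loading step could itself be wrapped in a function. This is a minimal sketch, assuming headerless, tab-separated files; load_prices is a hypothetical helper name, and the sample rows are made up so the example runs without a file on disk:

```python
import io

import pandas as pd

COLS = ['date', 'open', 'high', 'low', 'close', 'volume']

def load_prices(source):
    """Read one headerless, tab-separated file (path or file-like object)."""
    # load_prices is a hypothetical helper, not part of pandas.
    return pd.read_csv(source, sep='\t', names=COLS, parse_dates=['date'])

# Made-up sample data standing in for 'yourdata.csv'.
sample = io.StringIO(
    "2021-01-04\t10.0\t11.0\t9.5\t10.5\t1000\n"
    "2021-01-05\t10.5\t10.8\t9.9\t10.0\t1200\n"
)
df = load_prices(sample)
```

Passing a real path such as 'yourdata.csv' instead of the StringIO object should work the same way.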

insert columns, as many as you wish

def insert_cols(data):
    # insert() modifies the frame in place; returning it lets .pipe() calls be chained
    data.insert(loc=6,column='yourCol',value=(data['a']-data['b']))
    return data

using numpy.select in function

def numpy_cols(data):
    # cond1, cond2, value1 and value2 are placeholders for your own expressions
    conditions = [data(cond1),data(cond2)]
    values1 = [data(value1),data(value2)]

    data['yourCol2'] = np.select(conditions,values1)
    return data
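
To make the placeholders concrete, here is a small sketch of np.select on made-up data. The function name add_direction, the column name 'direction', the conditions, and the labels are all assumptions chosen for illustration, not the conditions from my actual data:

```python
import numpy as np
import pandas as pd

def add_direction(data):
    # Conditions are checked in order; the first one that is True wins.
    conditions = [
        data['close'] > data['open'],
        data['close'] < data['open'],
    ]
    labels = ['up', 'down']
    # default covers rows where no condition matches (close == open).
    data['direction'] = np.select(conditions, labels, default='flat')
    return data

prices = pd.DataFrame({'open': [1.0, 2.0, 3.0], 'close': [1.5, 1.0, 3.0]})
prices = prices.pipe(add_direction)
```

The default is given as a string here so that it matches the type of the labels.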

dropping columns in function

def drop_cols(data):
    dcols = ['volume','randomName']
    data.drop(dcols,axis=1,inplace=True)
    return data

and finally, applying functions through pipe to the df

(and other dfs with same format!)

df.pipe(insert_cols)
df.pipe(numpy_cols)
df.pipe(drop_cols)

There is a way to apply all your functions in one go by chaining pipe calls, but I could not get it to work. (Chaining only works when every function returns the dataframe.)
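
As a sketch of the chained form, each function below returns a dataframe, so the .pipe() calls can be strung together. The function names, column names, and operations are hypothetical, picked only so the example runs:

```python
import pandas as pd

def add_spread(data):
    # assign returns a new frame, leaving the input untouched.
    return data.assign(spread=data['high'] - data['low'])

def drop_volume(data):
    # drop without inplace=True also returns a new frame.
    return data.drop(columns=['volume'])

raw = pd.DataFrame({
    'high': [11.0, 10.5],
    'low': [9.5, 10.0],
    'volume': [1000, 1200],
})

# Each pipe receives the result of the previous one.
cleaned = raw.pipe(add_spread).pipe(drop_volume)
```

Because neither function modifies its input, raw still holds the original columns afterwards.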

I used these websites, and the first one is what made it click:

https://data-flair.training/blogs/pandas-function-applications/

https://towardsdatascience.com/using-pandas-pipe-function-to-improve-code-readability-96d66abfaf8

https://www.kdnuggets.com/2021/01/cleaner-data-analysis-pandas-pipes.html

EDIT: I ended up putting all the necessary cleaning and manipulation into a single function and piping that one function to the dataframe. It works the same way but takes a single line, with no need to call .pipe() multiple times.
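
The single-function approach might look like the sketch below, applied to several dataframes that share a format. The function name clean, its to_drop parameter, and the sample frames are assumptions for illustration; .pipe() forwards extra keyword arguments to the function it calls:

```python
import pandas as pd

def clean(data, to_drop=('volume',)):
    # Work on a copy so the caller's frame is not modified.
    data = data.copy()
    # Hypothetical derived column standing in for the real operations.
    data['spread'] = data['high'] - data['low']
    return data.drop(columns=list(to_drop))

frames = {
    'a': pd.DataFrame({'high': [2.0], 'low': [1.0], 'volume': [10]}),
    'b': pd.DataFrame({'high': [5.0], 'low': [3.0], 'volume': [20]}),
}

# The same function handles every frame with one pipe call each.
cleaned = {
    name: frame.pipe(clean, to_drop=['volume'])
    for name, frame in frames.items()
}
```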

Upvotes: 2
