Reputation: 491
I have a CSV config file:
Column name;Function;Args
Region;function1;arg1
Country;function2;arg1, arg2
email;function3;arg1
...
And I want to apply a specific function from my config file to a specific column in my CSV file (fileIn, a large file > 1 GB), using dask, pandas, or the standard csv module:
Region;Country;name
Europe;Slovakia;Mark
Asia;china;Steeve
...
Is there a clean way to iterate over the config file?
df = pd.read_csv(fileIn, sep=';', low_memory=True, chunksize=1000000, error_bad_lines=False)
for chunk in df:
    chunk['Region'] = chunk['Region'].apply(lambda x: MyClass.function1(x, arg1))
    chunk['Country'] = chunk['Country'].apply(lambda x: MyClass.function2(x, arg1, arg2))
    chunk['email'] = chunk['email'].apply(lambda x: MyClass.function3(x, arg1))
    chunk['Region'].to_csv(fileOut, index=False, header=True, sep=';')
...
Here is an example of one of the functions named in my config file:
import re

def function1(value, replaceWith):
    text = re.sub(r'[^ ]', replaceWith, value)
    return text
Upvotes: 1
Views: 1063
Reputation: 9941
You can set up a dict with functions and apply it to the chunk dataframe on each iteration.
Here's some code, please see comments for explanations:
# set up functions, for example:
# - f1 to uppercase
# - f2 to lowercase
# - f3 to reverse the string
def f1(x):
    return x.upper()

def f2(x):
    return x.lower()

def f3(x):
    return x[::-1]

# map the function names used in the config to the functions themselves
fns = {
    'function1': f1,
    'function2': f2,
    'function3': f3,
}
# read config and set dict mapping column to a function
# here using the following `config.csv`:
# Column name;Function
# Region;function1
# Country;function2
# name;function3
df_config = pd.read_csv('config.csv', sep=';')
col_fns = df_config.set_index('Column name')['Function'].map(fns).to_dict()
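For the three-row config above, that chain produces a plain dict mapping each column name to its function. A quick self-contained check (the config is inlined via `io.StringIO` here instead of read from a file):

```python
import io

import pandas as pd

def f1(x):
    return x.upper()

def f2(x):
    return x.lower()

def f3(x):
    return x[::-1]

fns = {'function1': f1, 'function2': f2, 'function3': f3}

# inline stand-in for config.csv
config = io.StringIO(
    "Column name;Function\n"
    "Region;function1\n"
    "Country;function2\n"
    "name;function3\n")
df_config = pd.read_csv(config, sep=';')
col_fns = df_config.set_index('Column name')['Function'].map(fns).to_dict()
# col_fns now maps 'Region' -> f1, 'Country' -> f2, 'name' -> f3
```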
# read and process the csv in chunks
fileIn = 'file.csv'
fileOut = 'out.csv'
chunkSize = 1

# note: error_bad_lines is deprecated in pandas >= 1.3; use on_bad_lines='skip' there
df = pd.read_csv(
    fileIn, sep=';', low_memory=True, chunksize=chunkSize, error_bad_lines=False)
for i, chunk in enumerate(df):
    # transform accepts a dict of column -> function and applies each one
    chunk_processed = chunk.transform(col_fns)
    chunk_processed.to_csv(
        fileOut, index=False, header=(i == 0), sep=';', mode='w' if i == 0 else 'a')

# read the first 100 lines of the processed csv to check the result
pd.read_csv(fileOut, sep=';', nrows=100)
Output:
Region Country name
0 EUROPE slovakia kraM
1 ASIA china eveetS
P.S. You'll need to change chunkSize to something reasonable, of course; it is only 1 here for testing, because the input file has just two lines and we want to exercise the multi-chunk path.
Update: If the functions you need to apply have some parameters, you can create wrapper functions:
import re

# set up functions
def f(x, replaceWith):
    return re.sub(r'[^ ]', replaceWith, x)

# map the function names used in the config to the functions themselves
fns = {
    'function1': f,
    'function2': f,
    'function3': f,
}
# read config and set dict mapping column to a function
# here using the following `config.csv`:
# Column name;Function;args
# Region;function1;A
# Country;function2;B
# name;function3;C
df_config = pd.read_csv('config.csv', sep=';')
# bind each row's function and argument via default arguments --
# a plain `lambda x: fns[r['Function']](x, r['args'])` would close over
# the variable `r` and end up using the *last* row for every column
col_fns = {r['Column name']: lambda x, fn=fns[r['Function']], arg=r['args']: fn(x, arg)
           for _, r in df_config.iterrows()}
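A Python pitfall to watch for when building lambdas in a loop or comprehension: they capture the loop variable by reference, not by value, so all of them can end up seeing its final value. Binding the needed values as default arguments (or via `functools.partial`) freezes them per iteration. A minimal illustration of the difference:

```python
# late binding: every lambda shares the same `i`,
# which is 2 after the loop finishes
late = [lambda: i for i in range(3)]

# default-argument binding: the current value of `i` is
# frozen into each lambda at creation time
bound = [lambda i=i: i for i in range(3)]

print([f() for f in late])   # [2, 2, 2]
print([f() for f in bound])  # [0, 1, 2]
```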
...
# output both the raw and the processed values
for i, chunk in enumerate(df):
    chunk_processed = chunk.transform(col_fns)  # apply each column's function
    chunk_out = pd.concat(
        [chunk, chunk_processed.add_suffix('_processed')], axis=1)
    chunk_out.to_csv(
        fileOut, index=False, header=(i == 0), sep=';', mode='w' if i == 0 else 'a')
Output:
Region Country name Region_processed Country_processed name_processed
0 Europe Slovakia Mark AAAAAA BBBBBBBB CCCC
1 Asia china Steeve AAAA BBBBB CCCCCC
Upvotes: 1