siiraaj

Reputation: 491

Python mapping two csv files

I have a config file (CSV):

Column name;Function;Args
Region;function1;arg1
Country;function2;arg1, arg2
email;function3;arg1
...

I want to apply the function specified in my config file to the corresponding column of my CSV file (`fileIn`, a large file > 1 GB) using dask, pandas, or the standard csv module:

Region;Country;name
Europe;Slovakia;Mark
Asia;china;Steeve
...

Is there a clean way to iterate over the config file?

df = pd.read_csv(fileIn, sep=';', low_memory=True, chunksize=1000000, error_bad_lines=False)

for chunk in df:
    chunk['Region'] = chunk['Region'].apply(lambda x: MyClass.function1(x, arg1))
    chunk['Country'] = chunk['Country'].apply(lambda x: MyClass.function2(x, arg1, arg2))
    chunk['email'] = chunk['email'].apply(lambda x: MyClass.function3(x, arg1))

    chunk['Region'].to_csv(fileOut, index=False, header=True, sep=';')
    ...


Here is an example of one of the functions referenced in my config file:

import re

def function1(value, replaceWith):
    text = re.sub(r'[^ ]', replaceWith, value)
    return text

Upvotes: 1

Views: 1063

Answers (1)

perl

Reputation: 9941

You can set up a dict mapping each column to its function and apply it to the chunk dataframe on each iteration.

Here's some code; see the comments for explanations:

import pandas as pd

# set up functions, for example
# - f1 to uppercase
# - f2 to lowercase
# - f3 to reverse string
def f1(x):
    return x.upper()

def f2(x):
    return x.lower()

def f3(x):
    return x[::-1]

# set up dict mapping function name in config to a function
fns = {
    'function1': f1,
    'function2': f2,
    'function3': f3,
}

# read config and set dict mapping column to a function
# here using the following `config.csv`:
#   Column name;Function
#   Region;function1
#   Country;function2
#   name;function3
df_config = pd.read_csv('config.csv', sep=';')
col_fns = df_config.set_index('Column name')['Function'].map(fns).to_dict()

# read and process csv in chunks
fileIn = 'file.csv'
fileOut = 'out.csv'
chunkSize = 1

df = pd.read_csv(
    fileIn, sep=';', low_memory=True, chunksize=chunkSize, on_bad_lines='skip')

for i, chunk in enumerate(df):
    chunk_processed = chunk.copy()
    for col, fn in col_fns.items():  # apply each column's function elementwise
        chunk_processed[col] = chunk_processed[col].apply(fn)
    chunk_processed.to_csv(
        fileOut, index=False, header=(i==0), sep=';', mode='w' if i==0 else 'a')

# read the first 100 lines of the processed csv to test
pd.read_csv(fileOut, sep=';', nrows=100)

Output:

   Region   Country    name
0  EUROPE  slovakia    kraM
1    ASIA     china  eveetS

P.S. You'll need to change chunkSize to something reasonable, of course; it's only set to 1 here for testing, since the input file has just two data lines and we want to exercise multiple chunks.
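To sanity-check the column-to-function mapping without touching the large file, the same dict can be built from an in-memory config. This is a minimal sketch using the three example functions above; `io.StringIO` stands in for `config.csv` purely for illustration:

```python
import io
import pandas as pd

def f1(x):
    return x.upper()

def f2(x):
    return x.lower()

def f3(x):
    return x[::-1]

fns = {'function1': f1, 'function2': f2, 'function3': f3}

# in-memory stand-in for config.csv
config_text = (
    "Column name;Function\n"
    "Region;function1\n"
    "Country;function2\n"
    "name;function3\n"
)
df_config = pd.read_csv(io.StringIO(config_text), sep=';')
col_fns = df_config.set_index('Column name')['Function'].map(fns).to_dict()

# each value is a plain callable, ready for Series.apply
print(col_fns['Region']('Europe'))  # -> EUROPE
```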


Update: If the functions you need to apply take extra arguments, you can create per-column wrappers:

import re

# set up functions
def f(x, replaceWith):
    return re.sub(r'[^ ]', replaceWith, x)

# set up dict mapping function name in config to a function
fns = {
    'function1': f,
    'function2': f,
    'function3': f,
}

# read config and set dict mapping column to a function
# here using the following `config.csv`:
#   Column name;Function;args
#   Region;function1;A
#   Country;function2;B
#   name;function3;C
df_config = pd.read_csv('config.csv', sep=';')
# bind each row's function and args as default arguments; a bare
# `lambda x: fns[r['Function']](x, r['args'])` would late-bind `r`,
# so every column would end up using the last row's function and args
col_fns = {r['Column name']: (lambda x, fn=fns[r['Function']], a=r['args']: fn(x, a))
           for _, r in df_config.iterrows()}

...

# making it output both raw and processed values:
for i, chunk in enumerate(df):
    chunk_processed = chunk.copy()
    for col, fn in col_fns.items():  # apply each column's function elementwise
        chunk_processed[col] = chunk_processed[col].apply(fn)
    chunk_out = pd.concat(
        [chunk, chunk_processed.add_suffix('_processed')], axis=1)
    chunk_out.to_csv(
        fileOut, index=False, header=(i==0), sep=';', mode='w' if i==0 else 'a')

Output:

   Region   Country    name Region_processed Country_processed name_processed
0  Europe  Slovakia    Mark           AAAAAA          BBBBBBBB           CCCC
1    Asia     china  Steeve             AAAA             BBBBB         CCCCCC
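A side note on lambdas built in a comprehension like the one constructing `col_fns`: Python closures late-bind loop variables, so a lambda that references the loop variable directly will see its final value once the loop ends. Binding the per-row values as default arguments sidesteps this. A minimal illustration (the names `rows`, `late`, `bound` are just for this example):

```python
# two (name, arg) pairs, analogous to rows of the config
rows = [('a', 'X'), ('b', 'Y')]

# `late` closes over the loop variable `arg`: every lambda shares the
# same cell, which ends up holding the last value, 'Y'
late = {name: (lambda s: s + arg) for name, arg in rows}

# `bound` freezes each iteration's value via a default argument
bound = {name: (lambda s, arg=arg: s + arg) for name, arg in rows}

print(late['a']('q'), bound['a']('q'))  # -> qY qX
```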

Upvotes: 1
