Martin Bouhier
Martin Bouhier

Reputation: 361

Incorporating new data into an existing dataframe

With my code I can join 2 databases in one. Now, I need to do the same with another database file.

archivo1:

Fecha       Cliente     Impresiones Impresiones 2   Revenue
20/12/17    Jose        1312        35              $12
20/12/17    Martin      12          56              $146
20/12/17    Pedro       5443        124             $1,256
20/12/17    Esteban     667         1235            $1

archivo2:

Fecha       Cliente     Impresiones Impresiones 2   Revenue
21/12/17    Jose        25          5               $2
21/12/17    Martin      6347        523             $123
21/12/17    Pedro       2368        898             $22
21/12/17    Esteban     235         99              $7,890

archivo:

Fecha       Cliente     Impresiones Impresiones 2   Revenue
22/12/17    Peter       55          5               $2
22/12/17    Juan        634527      523             $123
22/12/17    Pedro       836         898             $22
22/12/17    Esteban     125         99              $7,890

I have this results:

enter image description here

The problem is that I need to add the new database(archivo) into the Data.xlsx file and it will look like:

enter image description here

Code:

import pandas as pd
import pandas.io.formats.excel
import numpy as np

# Leemos ambos archivos y los cargamos en DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")

df = pd.concat([df1, df2])\
       .set_index(['Cliente', 'Fecha'])\
       .stack()\
       .unstack(-2)\
       .sort_index(ascending=[True, False])

i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())

idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
df.index = df.index.droplevel()

# Creamos el xlsx de salida
pandas.io.formats.excel.header_style = None

with pd.ExcelWriter("Data.xlsx",
                    engine='xlsxwriter',
                    date_format='dd/mm/yyyy',
                    datetime_format='dd/mm/yyyy') as writer:

    df.to_excel(writer, sheet_name='Sheet1')

Upvotes: 1

Views: 45

Answers (1)

cs95
cs95

Reputation: 403050

Extending my comment as an answer, I'd recommend creating a function that will reshape your dataframes to conform to a given format. I'd recommend doing this simply because it is much easier to just reshape your data, rather than reshape new entries to conform to the existing structure. This is because your current structure is a format that makes it extremely hard to work with (take it from me).

So, the easiest thing to do would be to create a function -

def process(dfs):    
    df = pd.concat(dfs)\
           .set_index(['Cliente', 'Fecha'])\
           .stack()\
           .unstack(-2)\
           .sort_index(ascending=[True, False])

    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)

    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)

    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()

    return df

Now, load your dataframes -

df_list = []
for file in ['archivo1.xlsx', 'archivo2.xlsx', ...]:
    df_list.append(pd.read_excel(file))

Now, call the process function with your df_list -

df = process(df_list)
df

Fecha        20/12/17 21/12/17
Esteban                       
Revenue            $1   $7,890
Impresiones2     1235       99
Impresiones       667      235
Jose                          
Revenue           $12       $2
Impresiones2       35        5
Impresiones      1312       25
Martin                        
Revenue          $146     $123
Impresiones2       56      523
Impresiones        12     6347
Pedro                         
Revenue        $1,256      $22
Impresiones2      124      898
Impresiones      5443     2368

Save df to a new excel file. Repeat the process for every new dataframe that enters the system.

In summary, your entire code listing would look like this -

import pandas as pd
import pandas.io.formats.excel
import numpy as np


def process(dfs):    
    df = pd.concat(dfs)\
           .set_index(['Cliente', 'Fecha'])\
           .stack()\
           .unstack(-2)\
           .sort_index(ascending=[True, False])

    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)

    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)

    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()

    return df


if __name__ == '__main__':
    df_list = []
    for file in ['archivo1.xlsx', 'archivo2.xlsx']:
        df_list.append(pd.read_excel(file))

    df = process(df_list)


    with pd.ExcelWriter("test.xlsx",
                        engine='xlsxwriter',
                        date_format='dd/mm/yyyy',
                        datetime_format='dd/mm/yyyy') as writer:                
        df.to_excel(writer, sheet_name='Sheet1') 

The alternative to this tedious process is to change your dataset structure, and reconsider a more viable alternative that makes it much easier to add new data to existing data without having to keep reshaping everything from scratch. This is something you'll have to sit down and think about.

Upvotes: 3

Related Questions