arqchicago
arqchicago

Reputation: 309

updating data from csv files using Python dataframe

I am extracting data from many CSV files and saving it into one file. This process is repetitive and every so often, I have new versions of the files and so the underlying data needs to be updated. I am having a hard time getting the data that is desired. Here is an example of what I am trying to do:

old file (/old/Test1.csv)

tests   col1    col2    col3
test1     11       8       5
test3      9       5       7
test5     12       9      -1

new file (/new/Test1.csv)

tests   col2    col3    col4
test1      8       7      15
test3      5       9      10
test7      1       4       9
test9     11      10      12

Note that in the new file, col1 is absent, there is a new column col4, row test5 is absent and there are two new rows test7 and test9. The desired output should:

  1. Contain all columns and rows from test1.csv and test2.csv
  2. Include updated data in test2.csv for (row,col) pairs from test1.csv
  3. If there are no updates to (row,col) pairs from test1.csv, then the data in test1.csv should be used.
  4. Any empty cells in the updated file should be filled with 0.

For the data shown above, this should be the updated file (Test1_update.csv)

tests    col1    col2    col3    col4
test1      11       8       7      15
test3       9       5       9      10
test5      12       9      -1       0 
test7       0       1       4       9
test9       0      11      10      12       

I am able to get to this point using the code below:

tests    col1    col2    col3    col4
test1     Nan       8       7      15
test3     Nan       5       9      10
test5     Nan     Nan     Nan     Nan 
test7     Nan       1       4       9
test9     Nan      11      10      12


import pandas as pd
import numpy as np

df1 = pd.read_csv('\\dir\\test1.csv', index_col=0)
df2 = pd.read_csv('\\dir\\test2.csv', index_col=0)

new_index = list(set(list(df1.index.values)+list(df2.index.values)))
new_cols = list(set(list(df1.columns.values)+list(df2.columns.values)))

df3 = pd.DataFrame(index=new_index, columns=new_cols)
df4 = df2.reindex(df3.index)
df4 = df4.join(df3, rsuffix='_P')
df4 = df4.loc[:,~df4.columns.str.endswith('_P')]
print df4

Upvotes: 3

Views: 7508

Answers (1)

arqchicago
arqchicago

Reputation: 309

I was able to get the desired data frame.

import pandas as pd
import numpy as np

df1 = pd.read_csv('\\dir\\test1.csv', index_col=0)
df2 = pd.read_csv('\\dir\\test2.csv', index_col=0)

new_index = list(set(list(df1.index.values)+list(df2.index.values)))
df2 = df2.reindex(new_index)
df2 = df2.join(df1, rsuffix='_P')
df2 = df2.loc[:,~df2.columns.str.endswith('_P')].fillna(df1).fillna(0)
df2.sort_index(inplace=True)
print df2.to_string()


       col2  col3  col4  col1                        
test1     8     7    15    11
test3     5     9    10     9
test5     9    -1     0    12
test7     1     4     9     0
test9    11    10    12     0

Upvotes: 2

Related Questions