Reputation: 305
I have an excel file with two sheets. I would like to copy 3 columns from the first sheet to the second sheet.
Note:
I have tried many methods. My best attempt so far is :
df_new_sheet2 = pd.concat([df_old_sheet2, df_three_of_sheet1], axis=1, join_axes=[df_old_sheet2.index])
But this is not the desired output.
If pandas can not do this, could you please suggest some other python package which could work ?
In case I'm not describing the problem clearly enough, I upload a pic which may help more or less. Thx for your answers~
UPDATE[2017.07.24]:
I Finally find my fault!
insert one column with index number and then , follow the resolution of b2002, things gonna to be well. :)
Upvotes: 3
Views: 8773
Reputation: 914
This method uses pandas and xlsxwriter.
Setup (create demo excel file):
import pandas as pd
df1 = pd.DataFrame({'1_A': [1,2,3,4], '1_B': [5,4,6,5],
'1_C': [8,7,9,0], '1_D': [9,7,8,5], '1_E': [2,4,9,8]})
df2 = pd.DataFrame({'1_A': [5,4,1,3], '1_B': [55,2,3,4]})
setup_dict = {'Sheet_1': df1, 'Sheet_2': df2}
with pd.ExcelWriter('excel_file.xlsx',
engine='xlsxwriter') as writer:
for ws_name, df_sheet in setup_dict.items():
df_sheet.to_excel(writer, sheet_name=ws_name)
(Start here to read an existing excel file)
#Read your excel file, use "sheetname=None" to create a dictionary of
#worksheet dataframes. (Note: future versions of pandas will use
#"sheet_name" vs. "sheetname").
#Replace 'excel_file.xlsx' with the actual path to your file.
ws_dict = pd.read_excel('excel_file.xlsx', sheetname=None)
#Modify the Sheet_2 worksheet dataframe:
#(or, create a new worksheet by assigning concatenated df to a new key,
#such as ws_dict['Sheet_3'] = ...)
ws_dict['Sheet_2'] = pd.concat([ws_dict['Sheet_2'][['1_A','1_B']],
ws_dict['Sheet_1'][['1_A','1_B','1_C']]],
axis=1)
#Write the ws_dict back to disk as an excel file:
#(replace 'excel_file.xlsx' with your desired file path.)
with pd.ExcelWriter('excel_file.xlsx',
engine='xlsxwriter') as writer:
for ws_name, df_sheet in ws_dict.items():
df_sheet.to_excel(writer, sheet_name=ws_name)
Other methods may be used to combine the columns such as a join (with different suffixes representing the original worksheets, for example) since all of the worksheets are converted to dataframes when the excel file is read.
EDIT (for new worksheet and unique column names...)
ws_dict = pd.read_excel('excel_file.xlsx', sheetname=None)
#Modify the Sheet_2 worksheet dataframe:
#(or, create a new worksheet by assigning concatenated df to a new key,
#such as ws_dict['Sheet_3'] = ...)
ws_dict['Sheet_3'] = ws_dict['Sheet_2'][['1_A','1_B']].join(ws_dict['Sheet_1'][['1_A','1_B','1_C']],
lsuffix='_sh2', rsuffix='_sh1', how='outer')
#Write the ws_dict back to disk as an excel file:
#(replace 'excel_file.xlsx' with your desired file path.)
with pd.ExcelWriter('excel_file.xlsx',
engine='xlsxwriter') as writer:
for ws_name, df_sheet in ws_dict.items():
df_sheet.to_excel(writer, sheet_name=ws_name)
Upvotes: 2
Reputation: 107707
If using Excel and Python for Windows (and if not, for future readers), consider an SQL solution with an ODBC connection to the JET/ACE Engine which can query Excel workbooks, its own Access databases, even text files (csv/tab/txt). This engine which are .dll files is installed by default with Windows machines or MS Office. This approach avoids opening any workbook.
Simply run an INNER JOIN
on the sheets and use panda's read_sql()
to import query resultset directly into a dataframe. Connection can use pyodbc
or pypyodbc
modules. And since you work in SQL, SELECT
needed columns, rename them, filter with WHERE
, JOIN
or UNION
other worksheets and in other workbooks, even aggregate with GROUP BY
:
import pyodbc
import pandas as pd
strfile = "C:\Path\To\Workbook.xlsx"
conn = pyodbc.connect(r'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};' + \
'DBQ={};'.format(strfile), autocommit=True)
strSQL = " SELECT s1.[1_A] As s1_1_A, s1.[1_B] As s1_1_B," + \
" s2.[1_A] AS s2_1_A, s2.[1_B] As s2_1_B, s2.[1_C] As s2_1_C" + \
" FROM [Sheet1$] s1" + \
" INNER JOIN [Sheet2$] s2 ON s1.[index] = s2.[index]"
df = pd.read_sql(strSQL, conn)
conn.close()
Upvotes: 1