mabel echo
mabel echo

Reputation: 3

import multiple Excel files to pandas and export to multiple Stata files

  1. My raw Excel files are:

[excel_1.xlsx,excel_2.xlsx,...,excel_12.xlsx].

At first I want to import them into dataframes and then append them into a big dataframe, then df.to_dta, but python shows error and said:

MemoryError

I guess the problem is that the appended dataframe is too big.

  1. So I thought I could transform each Excel file to each Stata file, which is:

[excel_1.xlsx,excel_2.xlsx,...,excel_12.xlsx]

to

[excel_1.dta,excel_2.dta,...,excel_12.dta]

and append them in Stata, but I don't know how to do that.

  1. My original code was
import pandas as pd
IO = 'excel_1.xlsx'

df = pd.read_excel(io=IO, skiprows = [1,2] ,
                           dtype={"Opnprc": "str","Hiprc": "str","Loprc": "str","Clsprc": "str","Dnshrtrd": "str","Dnvaltrd": "str","Dsmvosd": "str",
                                  "Dsmvtll": "str","Dretwd": "str","Dretnd": "str","Adjprcwd": "str","Adjprcnd": "str","Markettype": "str",
                                  "Trdsta": "str"})

df.to_stata('excel1.dta')

I guess a for loop should work, but I don't know how to do that.

(the append code:

import os
import pandas as pd


cwd = os.path.abspath('D:\\onedrive\\test2') 
files = os.listdir(cwd) 
print(files) 
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file, skiprows = [1,2] ,
                           dtype={"Opnprc": "str","Hiprc": "str","Loprc": "str","Clsprc": "str","Dnshrtrd": "str","Dnvaltrd": "str","Dsmvosd": "str",
                                  "Dsmvtll": "str","Dretwd": "str","Dretnd": "str","Adjprcwd": "str","Adjprcnd": "str","Markettype": "str",
                                  "Trdsta": "str"}), ignore_index=True) 
df.head() 

df.to_stata('test.dta')

Upvotes: 0

Views: 317

Answers (1)

S V Praveen
S V Praveen

Reputation: 421

Here is how to transform each Excel file to a Stata file using a for loop in python3.

import pandas as pd
IO = 'excel_{}.xlsx'
num_files = 12

for i in range(1, num_files + 1):
    df = pd.read_excel(
            io=IO.format(i), 
            skiprows = [1,2] ,
            dtype={"Opnprc": "str","Hiprc": "str","Loprc": "str","Clsprc": "str","Dnshrtrd": "str","Dnvaltrd": "str","Dsmvosd": "str",
                   "Dsmvtll": "str","Dretwd": "str","Dretnd": "str","Adjprcwd": "str","Adjprcnd": "str","Markettype": "str",
                   "Trdsta": "str"})
    df.to_stata('excel_{}.dta'.format(i))
    

Upvotes: 1

Related Questions