Reading specific column from excel file using pandas

Question

I have a folder with full of excel files and subfolders.Subfolders also contain excel files.

Using os.walk() i have been able to go through each subfolder and get the data from that excel file.My problem is , i want to take only mobile numbers column from all excel files and store them in database. Issue with that is , column name for mobiles number is different in each excel file and there are thousand excel files.

So i decided to get columns by its datatypes and length. (e.g. if column datatype is integer and length of each cell in that particular column is 10 then i would push that data.)

But i don't understand how to do this. I looped though columns and rows but all i got is repeatative data or error. Can someone please help me out with this ? Would be glad if you helps cause i am doing this since two days but failing. Thanks in advance.

 def file_access(file_path_list):
    for path in file_path_list:
        if path.endswith(('xlsx' , 'xls' , 'XLS')):
            print '------------------------------

'
            folder = path.split('/')[-2]
            sheet = path.split('/')[-1]
            print folder , sheet

            df = pd.read_excel(path, sheet_name=None, header=None)

            # Here i am trying to get data from df object but failing to do so.
            # You can suggest me code from this stage.

ALFAFA · Accepted Answer

To obtain just phone number in many excel files you can try the following code:

import pandas as pd
import xlrd
import os

mydir = (os.getcwd()).replace('\','/') + '/'

#Get all excel files include subdir
filelist=[]
for path, subdirs, files in os.walk(mydir):
    for file in files:
        if (file.endswith('.xlsx') or file.endswith('.xls') or file.endswith('.XLS')):
            filelist.append(os.path.join(path, file))
number_of_files=len(filelist)

# Get data of cells from excel
data=[]
for i in range(number_of_files):
    #df.append(pd.read_excel(r''+ mydir +filelist[i]))
    df=pd.read_excel(r''+filelist[i])
    l=len(df.iloc[0])
    for n in range(l):
        if len(str(df.iloc[0][n])) >= 10:
            data.append(df.iloc[:][df.axes[1][n]])
            break
res=[]
for i in range(len(data)):
    res.append(data[i].values.tolist())
print(res)

To get all phone data from different column's name, it's used the length of phone number to differentiate to other column data. For this case i has used the length of phone number used in my country that is 11 in length (ex: 82330403045).

Output:

>>> data
[0    82330403045
1    82330403046
2    82330403047
3    82330403048
Name: Phone, dtype: int64, 0    82330403049
1    82330403050
2    82330403051
3    82330403052
Name: PhoneCell, dtype: int64]

>>> res
[[82330403045, 82330403046, 82330403047, 82330403048], [82330403049, 82330403050, 82330403051, 82330403052], [82330403049, 82330403050, 82330403051, 82330403052], [82330403045, 82330403046, 82330403047, 82330403048], [82330403049, 82330403050, 82330403051, 82330403052], [82330403049, 82330403050, 82330403051, 82330403052]]

You can use this output data for your database.

My Excel files:

Book1.xlsx file:

Book2.xlsx file:

Reading specific column from excel file using pandas

Answers (1)

Related Questions