Ruchir
Ruchir

Reputation: 13

Column selection in Python

I am trying to find solution to below given problem but seems like I am going wrong with the approach

I have a set of Excel with some columns like ISBN, Title etc. The columns names in Excel are not properly formatted. ISBN is named as ISBN in some of the Excel files while it is named as ISBN-13, Alias, ISBN13 etc. in others. Similarly for Title and other columns.

I have read all these Excels as data frame in python using read Excel and used str.contains to find the columns based on substring. Please find code below:

searchfor = ['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 
13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 
DIGITS)','EAN','ALIAS','ITEMCODE']


searchfor1 = ['TITLE','BOOK NAME','NAME','TITLE 
NAME','TITLES','BOOKNAME','BKDESC','PRODUCT NAME','ITEM DESCRIPTION','TITLE 
18','COMPLETETITLE']

for f, i in zip(files_txt1, num1): 
df = pd.read_excel(f,encoding='sys.getfilesystemencoding()') 
df.columns = df.columns.str.upper() 
df1['Isbn'] = df[df.columns[df.columns.str.contains('|'.join(searchfor))]]
df1['Title']= 
df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor1))]]

The code works fine if I have excel with text present in list. However throws error in case excel does not have any columns with name similar to list. Also code does not work for ISBN.

Please see detailed error below:

--------------------------------------------------------------------------- ValueError                                Traceback (most recent call last) C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
    376                 result = np.empty(len(values), dtype=object)
--> 377                 result[:] = values
    378             except ValueError:

ValueError: could not broadcast input array from shape (31807,0) into shape (31807)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last) C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value)    2375             try:
-> 2376                 value = Series(value)    2377             except:

C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    247                 data = _sanitize_array(data, index, dtype, copy,
--> 248                                        raise_cast_failure=True)
    249 

C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)    3028         else:
-> 3029             subarr = _asarray_tuplesafe(data, dtype=dtype)    3030 

C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
    379                 # we have a list-of-list
--> 380                 result[:] = [tuple(x) for x in values]
    381 

ValueError: cannot copy sequence with size 0 to array axis with dimension 31807

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last) <ipython-input-23-9e043c13fef2> in <module>()
     11     df.columns = df.columns.str.upper()
     12     #print(list(df.columns))
---> 13     df1['Isbn'] = df[df.columns[df.columns.str.contains('|'.join(searchfor))]]
     14     df1['Title'] = df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor1))]]
     15     df1['Curr'] = df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor2))]]

C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)    2329         else:    2330        
# set column
-> 2331             self._set_item(key, value)    2332     2333     def _setitem_slice(self, key, value):

C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)    2394         """    2395 
-> 2396         self._ensure_valid_index(value)    2397         value = self._sanitize_column(key, value)    2398         NDFrame._set_item(self, key, value)

C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value)    2376                 value = Series(value)    2377             except:
-> 2378                 raise ValueError('Cannot set a frame with no defined index '    2379                                  'and a value that cannot be converted to a '    2380                                'Series')

ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

Upvotes: 1

Views: 145

Answers (2)

Nitin Sharma
Nitin Sharma

Reputation: 140

This would work as long as you have no match or exactly 1 match

searchfor = ['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 DIGITS)','EAN','ALIAS','ITEMCODE']
searchfor1 = ['TITLE','BOOK NAME','NAME','TITLE NAME','TITLES','BOOKNAME','BKDESC','PRODUCT NAME','ITEM DESCRIPTION','TITLE 18','COMPLETETITLE']

for f, i in zip(files_txt1, num1): 
    df = pd.read_excel(f,encoding='sys.getfilesystemencoding()') 
    df.columns = df.columns.str.upper()

    cols = df.columns

    is_isbn = cols.isin(searchfor)
    df1['Isbn'] = df[cols[is_isbn]] if is_isbn.any() else None

    is_title = cols.isin(searchfor1)
    df1['Title'] = df[cols[is_title]] if is_title.any() else None

Upvotes: 0

Karn Kumar
Karn Kumar

Reputation: 8816

You don't need this all, if you know your columns beforehand, just try at teh time of creating dataFrame and exporting the File into Pandas itself that way you will reduce the memory usage significantly as well.

df = pd.read_csv(file_name, usecols=['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 DIGITS)','EAN','ALIAS','ITEMCODE']).fillna('')

Upvotes: 1

Related Questions