Reputation: 13
I am trying to find solution to below given problem but seems like I am going wrong with the approach
I have a set of Excel with some columns like ISBN, Title etc. The columns names in Excel are not properly formatted. ISBN is named as ISBN in some of the Excel files while it is named as ISBN-13, Alias, ISBN13 etc. in others. Similarly for Title and other columns.
I have read all these Excels as data frame in python using read Excel and used str.contains
to find the columns based on substring. Please find code below:
searchfor = ['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN
13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13
DIGITS)','EAN','ALIAS','ITEMCODE']
searchfor1 = ['TITLE','BOOK NAME','NAME','TITLE
NAME','TITLES','BOOKNAME','BKDESC','PRODUCT NAME','ITEM DESCRIPTION','TITLE
18','COMPLETETITLE']
for f, i in zip(files_txt1, num1):
df = pd.read_excel(f,encoding='sys.getfilesystemencoding()')
df.columns = df.columns.str.upper()
df1['Isbn'] = df[df.columns[df.columns.str.contains('|'.join(searchfor))]]
df1['Title']=
df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor1))]]
The code works fine if I have excel with text present in list. However throws error in case excel does not have any columns with name similar to list. Also code does not work for ISBN.
Please see detailed error below:
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
376 result = np.empty(len(values), dtype=object)
--> 377 result[:] = values
378 except ValueError:
ValueError: could not broadcast input array from shape (31807,0) into shape (31807)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last) C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value) 2375 try:
-> 2376 value = Series(value) 2377 except:
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
247 data = _sanitize_array(data, index, dtype, copy,
--> 248 raise_cast_failure=True)
249
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure) 3028 else:
-> 3029 subarr = _asarray_tuplesafe(data, dtype=dtype) 3030
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
379 # we have a list-of-list
--> 380 result[:] = [tuple(x) for x in values]
381
ValueError: cannot copy sequence with size 0 to array axis with dimension 31807
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last) <ipython-input-23-9e043c13fef2> in <module>()
11 df.columns = df.columns.str.upper()
12 #print(list(df.columns))
---> 13 df1['Isbn'] = df[df.columns[df.columns.str.contains('|'.join(searchfor))]]
14 df1['Title'] = df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor1))]]
15 df1['Curr'] = df[df.columns[df.columns.to_series().str.contains('|'.join(searchfor2))]]
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value) 2329 else: 2330
# set column
-> 2331 self._set_item(key, value) 2332 2333 def _setitem_slice(self, key, value):
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value) 2394 """ 2395
-> 2396 self._ensure_valid_index(value) 2397 value = self._sanitize_column(key, value) 2398 NDFrame._set_item(self, key, value)
C:\Users\Ruchir_Kumar_Jha\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value) 2376 value = Series(value) 2377 except:
-> 2378 raise ValueError('Cannot set a frame with no defined index ' 2379 'and a value that cannot be converted to a ' 2380 'Series')
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
Upvotes: 1
Views: 145
Reputation: 140
This would work as long as you have no match or exactly 1 match
searchfor = ['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 DIGITS)','EAN','ALIAS','ITEMCODE']
searchfor1 = ['TITLE','BOOK NAME','NAME','TITLE NAME','TITLES','BOOKNAME','BKDESC','PRODUCT NAME','ITEM DESCRIPTION','TITLE 18','COMPLETETITLE']
for f, i in zip(files_txt1, num1):
df = pd.read_excel(f,encoding='sys.getfilesystemencoding()')
df.columns = df.columns.str.upper()
cols = df.columns
is_isbn = cols.isin(searchfor)
df1['Isbn'] = df[cols[is_isbn]] if is_isbn.any() else None
is_title = cols.isin(searchfor1)
df1['Title'] = df[cols[is_title]] if is_title.any() else None
Upvotes: 0
Reputation: 8816
You don't need this all, if you know your columns beforehand, just try at teh time of creating dataFrame and exporting the File into Pandas itself that way you will reduce the memory usage significantly as well.
df = pd.read_csv(file_name, usecols=['ISBN13','BAR CODE','ISBN NO#','ISBN','ISBN1','ISBN 13','ISBN_13','ITEM','ISBN NUMBER','ISBN No','ISBN-13','ISBN (13 DIGITS)','EAN','ALIAS','ITEMCODE']).fillna('')
Upvotes: 1