Reputation: 167
I have code that runs through each row of a series and turns it into bigrams or trigrams. The code is the following:
def splitting(txt,gram=2):
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    if gram==2:
        return map(tuple,set(map(frozenset,list(nltk.bigrams(txlis)))))
    else:
        return map(tuple,set(map(frozenset,list(nltk.trigrams(txlis)))))
#pdb.set_trace()
print len(namedat)
prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))
The error occurs on the last line, when I apply the function to series data called namedat, which looks something like this:
0 inter-burgo ansan
1 dogo glory condo
2 w hotel
3 onyang grand hotel
4 onyang hot spring hotel
5 onyang cheil hotel (ex. onyang palace hotel)
6 springhill suites paso robles atascadero
7 best western plus colony inn
8 hesse
9 ibis styles aachen city
10 pullman aachen quellenhof
11 mercure aachen europaplatz
12 leonardo hotel aachen
13 aquis grana cityhotel
14 buschhausen
... ...
[166295 rows x 1 columns]
ValueError: could not broadcast input array from shape (2) into shape (1) when using df.apply
I tried debugging, and the texts and bigrams are all generated successfully; there seems to be no issue with the function called splitting. I am out of ideas on how to go about solving this. Please help.
The complete error message:
Traceback (most recent call last):
  File "data_playground.py", line 163, in <module>
    main()
  File "data_playground.py", line 156, in main
    createparams(db.hotelbeds_properties,"hotelbeds")
  File "data_playground.py", line 139, in createparams
    prop_params = analyze(prop_subdf)
  File "data_playground.py", line 110, in analyze
    prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4877, in apply
    ignore_failures=ignore_failures)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4990, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 461, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 6173, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4642, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4604, in construction_error
    raise e
ValueError: could not broadcast input array from shape (2) into shape (1)
An example of what my code does: it takes a row from the table shown above, for example:
name shaba boutique hotel
Name: 166278, dtype: object
and then returns bigrams made from it
[(u'shaba', u'boutique'), (u'boutique', u'hotel')]
If I do a simple for loop (using iterrows), the function works and I get a list. I do not understand why the apply function fails.
Upvotes: 0
Views: 3042
Reputation: 550
The reason for this error is that df.apply(axis=1) expects a single value back for each row so that it can build a series from the results; you can read more about it here. Your code returns the result of map(tuple, ...), which is a list with more than one element for any row that has more than two words. You can try this out on a small, fake dataframe and see that it works as is:
namedat_s = pd.Series(['inter-burgo ansan', 'glory condo', 'w hotel'])
namedat = pd.DataFrame(namedat_s)
...but put 'dogo' back in, and you'll get the error again. This is a good example of why single long lines of code are not always useful, especially if you are just starting.
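To see why the three-row sample works while the longer name breaks it, it helps to count how many bigrams each name produces. The following is a minimal sketch in plain Python 3; the bigrams helper is a hypothetical stand-in for nltk.bigrams, and stop-word filtering and punctuation stripping are left out for brevity:

```python
def bigrams(tokens):
    """Return adjacent word pairs as a list (stand-in for nltk.bigrams)."""
    return list(zip(tokens, tokens[1:]))

names = ['inter-burgo ansan', 'glory condo', 'w hotel', 'dogo glory condo']

# Number of bigrams each name yields
counts = {name: len(bigrams(name.split())) for name in names}

for name, n in counts.items():
    print(name, '->', n)
```

Every name in the small sample yields exactly one bigram, while 'dogo glory condo' yields two, which appears to match the shapes (2) and (1) in the error message.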
If you had tried this, you probably would have found the answer sooner:
# find_ngrams(tokens, n) is not shown here; it is assumed to be a helper
# that yields n-grams the same way nltk.bigrams / nltk.trigrams do
def splitting(txt,gram=2):
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    # print every intermediate step of the one-liner
    print 1, txlis
    print 2, find_ngrams(txlis,2)
    print 3, list(find_ngrams(txlis,2))
    print 4, map(frozenset,list(find_ngrams(txlis,2)))
    print 5, set(map(frozenset,list(find_ngrams(txlis,2))))
    print 6, map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    # the length of the value that is about to be returned
    print len(map(tuple,set(map(frozenset,list(find_ngrams(txlis,2))))))
    if gram==2:
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    else:
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,3)))))
You'd see that the error happens, as you said, not in the splitting function, but in what happens after the return, and knowing what is being returned would give you a big clue as to why.
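One possible way around it, offered as a sketch rather than a definitive fix: apply the function to the column itself (a series) instead of to the dataframe with axis=1, so that each returned list is stored as a single object value. The column name 'name', the helper ngrams_for, and the tiny STOP_WORDS set are assumptions made for illustration; nltk is replaced by zip so the snippet is self-contained, and the frozenset de-duplication is left out to keep it short. It is written for Python 3.

```python
import re

import numpy as np
import pandas as pd

STOP_WORDS = {'the', 'a', 'of'}  # assumed stand-in for stop_wrds


def ngrams_for(text, gram=2):
    """Return the list of n-grams for one string, or NaN if it has no words."""
    tokens = re.sub(r'[^\w\s]', '', text).split()
    if not tokens:
        return np.nan
    words = [w for w in tokens if w.lower() not in STOP_WORDS]
    # zip over shifted copies yields adjacent n-tuples
    return list(zip(*(words[i:] for i in range(gram))))


# Assumed layout: a dataframe with a single column called 'name'
namedat = pd.DataFrame({'name': ['inter-burgo ansan', 'dogo glory condo', 'w hotel']})

# Series.apply calls the function once per value and keeps each list as one cell
prop_data = namedat['name'].apply(ngrams_for).to_frame('bigrams')
print(prop_data)
```

Because the function now receives a plain string rather than a row, the .str accessor and tolist()[0] from the original are no longer needed.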
Upvotes: 1