ValueError: could not broadcast input array from shape (2) into shape (1) when using df.apply

Question

I have code that runs through each row in a Series and turns it into bigrams or trigrams. The code is the following:

def splitting(txt,gram=2):
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    if gram==2:
        return map(tuple,set(map(frozenset,list(nltk.bigrams(txlis)))))
    else:
        return map(tuple,set(map(frozenset,list(nltk.trigrams(txlis)))))

#pdb.set_trace()
print len(namedat)
prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))

The error occurs on the last line, when I apply the function to a one-column DataFrame called namedat that looks something like this:

0                                       inter-burgo ansan
1                                        dogo glory condo
2                                                 w hotel
3                                      onyang grand hotel
4                                 onyang hot spring hotel
5            onyang cheil hotel (ex. onyang palace hotel)
6                springhill suites paso robles atascadero
7                            best western plus colony inn
8                                                  hesse 
9                                 ibis styles aachen city
10                              pullman aachen quellenhof
11                             mercure aachen europaplatz
12                                  leonardo hotel aachen
13                                  aquis grana cityhotel
14                                            buschhausen
...                                                   ...
[166295 rows x 1 columns]

ValueError: could not broadcast input array from shape (2) into shape (1) when using df.apply

I tried debugging, and the texts and bigrams are all generated successfully; there seems to be no issue inside the splitting function itself. I am out of ideas on how to solve this. Please help.

The complete error message:

Traceback (most recent call last):
  File "data_playground.py", line 163, in <module>
    main()
  File "data_playground.py", line 156, in main
    createparams(db.hotelbeds_properties,"hotelbeds")
  File "data_playground.py", line 139, in createparams
    prop_params = analyze(prop_subdf)
  File "data_playground.py", line 110, in analyze
    prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4877, in apply
    ignore_failures=ignore_failures)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4990, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 461, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 6173, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4642, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4604, in construction_error
    raise e
ValueError: could not broadcast input array from shape (2) into shape (1)

An example of what my code does: it takes a row from the table shown above, for example:

name    shaba boutique hotel
Name: 166278, dtype: object

and then returns bigrams made from it

[(u'shaba', u'boutique'), (u'boutique', u'hotel')]

If I do a simple for loop (using iterrows), the function works and I get a list. I do not understand why the apply function fails.

Jeff Ellen · Accepted Answer

The reason for this error is that df.apply(axis=1) expects a single value back for each row so that it can build a Series from the results (see the pandas documentation for DataFrame.apply). Your code returns the result of map(tuple, ...), which is a list with more than one element for any row that still has more than two words after the stop words are removed. You can try this on a small, fake DataFrame and see that your code works on it as is:

namedat_s = pd.Series(['inter-burgo ansan', 'glory condo', 'w hotel'])
namedat = pd.DataFrame(namedat_s)

...but put 'dogo' back in, and you'll get the error again. This is a good example of why single long lines of code are not always useful, especially if you are just starting.
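To make the length difference concrete, here is a minimal sketch in plain Python 3 (no pandas or nltk needed). The helper find_ngrams is a hypothetical stand-in, written on the assumption that it does roughly what nltk.bigrams does, and unique_bigrams is just an illustrative name for the one-line return expression. It also assumes none of the words are stop words.

```python
def find_ngrams(words, n):
    """Hypothetical helper: consecutive n-word tuples from a list of words."""
    return zip(*[words[i:] for i in range(n)])


def unique_bigrams(words):
    # Same pipeline as the return statement in splitting():
    # frozenset makes ('a', 'b') and ('b', 'a') compare equal, set() drops
    # the duplicates, and tuple() turns each frozenset back into a tuple.
    return list(map(tuple, set(map(frozenset, find_ngrams(words, 2)))))


one = unique_bigrams(["glory", "condo"])          # two words   -> 1 bigram
two = unique_bigrams(["dogo", "glory", "condo"])  # three words -> 2 bigrams
lengths = (len(one), len(two))
```

The first row hands apply a list of length 1, which happens to match the single column of namedat; the second hands it a list of length 2, which does not. Note also that a frozenset is unordered, so the words inside each returned tuple are not guaranteed to keep their original order.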

If you had tried this, you probably would have found the answer sooner:

def splitting(txt,gram=2):
    # find_ngrams is not defined in this answer; it presumably yields the
    # n-grams of a word list, like nltk.bigrams / nltk.trigrams
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    # print every intermediate step of the one-line return expression
    print 1, txlis
    print 2, find_ngrams(txlis,2)
    print 3, list(find_ngrams(txlis,2))
    print 4, map(frozenset,list(find_ngrams(txlis,2)))
    print 5, set(map(frozenset,list(find_ngrams(txlis,2))))
    print 6, map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    # number of bigrams that apply receives back for this row
    print len(map(tuple,set(map(frozenset,list(find_ngrams(txlis,2))))))
    if gram==2:
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    else:
        # any other value of gram: trigrams
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,3)))))

You'd see that the error happens, as you said, not in the splitting function but in what happens after the return, and knowing what is being returned would give you a big clue as to why.
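One possible way around it, sketched here for Python 3 and a reasonably recent pandas, is to apply the function to the column itself (a Series) instead of to the DataFrame with axis=1, so that each returned list is stored as a single object. The names STOP_WORDS and bigrams_for_name are hypothetical stand-ins for your stop_wrds and splitting, and the sketch simplifies your logic: it uses zip instead of nltk, skips the frozenset de-duplication, and keeps word order.

```python
import re

import pandas as pd

STOP_WORDS = {"the", "a", "of"}  # hypothetical stand-in for stop_wrds


def bigrams_for_name(name):
    """Return the list of bigrams for one hotel name (a plain string)."""
    cleaned = re.sub(r"[^\w\s]", "", name)  # strip punctuation
    words = [w for w in cleaned.split() if w.lower() not in STOP_WORDS]
    return list(zip(words, words[1:]))      # consecutive word pairs


namedat = pd.DataFrame(
    {"name": ["inter-burgo ansan", "dogo glory condo", "w hotel"]}
)

# Series.apply calls the function once per string and keeps whatever it
# returns as one cell, so lists of different lengths are not a problem.
prop_data = namedat["name"].apply(bigrams_for_name)
```

Each element of prop_data is then a list of tuples, which is the same thing your iterrows loop was giving you. How DataFrame.apply(axis=1) treats list results has changed between pandas versions, so the original error may not reproduce on a newer install.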
