Brian Feeny

Reputation: 451

How to get function output to add columns to my Dataframe

I have a function that produces an output like so when I pass it a name:

W2V('aamir')

array([ 0.12135 , -0.99132 ,  0.32347 ,  0.31334 ,  0.97446 , -0.67629 ,
        0.88606 , -0.11043 ,  0.79434 ,  1.4788  ,  0.53169 ,  0.95331 ,
       -1.1883  ,  0.82438 , -0.027177,  0.70081 ,  0.87467 , -0.095825,
       -0.5937  ,  1.4262  ,  0.2187  ,  1.1763  ,  1.6294  ,  0.91717 ,
       -0.086697,  0.16529 ,  0.19095 , -0.39362 , -0.40367 ,  0.83966 ,
       -0.25251 ,  0.46286 ,  0.82748 ,  0.93061 ,  1.136   ,  0.85616 ,
        0.34705 ,  0.65946 , -0.7143  ,  0.26379 ,  0.64717 ,  1.5633  ,
       -0.81238 , -0.44516 , -0.2979  ,  0.52601 , -0.41725 ,  0.086686,
        0.68263 , -0.15688 ], dtype=float32)

I have a data frame that has an index Name and a single column Y:

df1

    Y
Name    
aamir   0
aaron   0
... ...
zulema  1
zuzana  1

I wish to run my function on each value of Name and have it create columns like so:

    0   1   2   3   4   5   6   7   8   9   ... 40  41  42  43  44  45  46  47  48  49
Name                                                                                    
aamir   0.12135 -0.99132    0.32347 0.31334 0.97446 -0.67629    0.88606 -0.11043    0.794340    1.47880 ... 0.647170    1.56330 -0.81238    -0.445160   -0.29790    0.52601 -0.41725    0.086686    0.68263 -0.15688
aaron   -1.01850    0.80951 0.40550 0.09801 0.50634 0.22301 -1.06250    -0.17397    -0.061715   0.55292 ... -0.144960   0.82696 -0.51106    -0.072066   0.43069 0.32686 -0.00886    -0.850310   -1.31530    0.71631
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zulema  0.56547 0.30961 0.48725 1.41000 -0.76790    0.39908 0.86915 0.68361 -0.019467   0.55199 ... 0.062091    0.62614 0.44548 -0.193820   -0.80556    -0.73575    -0.30031    -1.278900   0.24759 -0.55541
zuzana  -1.49480    -0.15111    -0.21853    0.77911 0.44446 0.95019 0.40513 0.26643 0.075182    -1.34340    ... 1.102800    0.51495 1.06230 -1.587600   -0.44667    1.04600 -0.38978    0.741240    0.39457 0.22857

What I have done is real messy, but works:

names = df1.index.to_list()

Lst = []
for name in names:
    Lst.append(W2V(name).tolist())
wv_df = pd.DataFrame(index=names, data=Lst)
wv_df.index.name = "Name"
wv_df.sort_index(inplace=True)

df1 = df1.merge(wv_df, how='inner', left_index=True, right_index=True)

I am hoping there is a way to use .apply() or similar but I have not found how to do this. I am looking for an efficient way.

Update:

I modified my function to convert a Series argument to a string:

if isinstance(w, pd.core.series.Series):
    w = w.to_string()

Although this appears to work at first, the data is wrong. If I pass aamir to my function you can see the result. Yet when I do it with apply the numbers are totally different:

df1

    Name    Y
0   aamir   0
1   aaron   0
... ... ...
7942    zulema  1
7943    zuzana  1

df3 = df1.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')

    0   1   2   3   4   5   6   7   8   9   ... 40  41  42  43  44  45  46  47  48  49
0   0.075014    0.824769    0.580976    0.493415    0.409894    0.142214    0.202602    -0.599501   -0.213184   -0.142188   ... 0.627784    0.136511    -0.162938   0.095707    -0.257638   0.396822    0.208624    -0.454204   0.153140    0.803400
1   0.073664    0.868665    0.574581    0.538951    0.394502    0.134773    0.233070    -0.639365   -0.194892   -0.110557   ... 0.722513    0.147112    -0.239356   -0.046832   -0.237434   0.321494    0.206583    -0.454038   0.251605    0.918388
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7942    -0.002117   0.894570    0.834724    0.602266    0.327858    -0.003092   0.197389    -0.675813   -0.311369   -0.174356   ... 0.690172    -0.085517   -0.000235   -0.214937   -0.290900   0.361734    0.290184    -0.497177   0.285071    0.711388
7943    -0.047621   0.850352    0.729225    0.515870    0.439999    0.060711    0.226026    -0.604846   -0.344891   -0.128396   ... 0.557035    -0.048322   -0.070075   -0.265775   -0.330709   0.281492    0.304157    -0.552191   0.281502    0.750304
7944 rows × 50 columns

You can see that the first row is aamir, and the first value (column 0) my function returns for it is 0.12135 (see the top of my post). Yet with apply, that value appears to be 0.075014.

EDIT:

It appears it passes in Name aamir rather than aamir. How can I get it to just send the Name itself aamir?

Upvotes: 0

Views: 445

Answers (5)

Pierre D

Reputation: 26251

I would do simply:

newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)

Example

To start with, let us make some function w2v(name). In the following, we compute a consistent hash of any string. Then we use that hash as a (temporary) seed for np.random and draw a random vector of size 50:

import numpy as np
import pandas as pd

from contextlib import contextmanager


@contextmanager
def temp_seed(seed):
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)
        

mask = (1 << 32) - 1

def w2v(name, size=50):
    fingerprint = int(pd.util.hash_array(np.array([name]))[0])
    with temp_seed(fingerprint & mask):
        return np.random.uniform(-1, 1, size)

For instance:

>>> w2v('aamir')
array([ 0.65446901, -0.92765123, -0.78188552, -0.62683782, -0.23946784,
        0.31315156,  0.22802972, -0.96076167,  0.62577993, -0.59024811,
        0.76365736,  0.93033898, -0.56155296,  0.4760905 , -0.92760642,
        0.00177959, -0.22761559,  0.81929959,  0.21138229, -0.49882747,
       -0.97637984, -0.19452496, -0.91354933,  0.70473533, -0.30394358,
       -0.47092087, -0.0329302 , -0.93178517,  0.79118799,  0.98286834,
       -0.16024194, -0.02793147, -0.52251214, -0.70732759,  0.10098142,
       -0.24880249,  0.28930319, -0.53444863,  0.37887522,  0.58544068,
        0.85804119,  0.67048213,  0.58389158, -0.19889071, -0.04281131,
       -0.62506126,  0.42872395, -0.12821543, -0.52458052, -0.35493892])

Now, we use the expression given as solution:

df = pd.DataFrame([0,0,1,1], index=['aamir', 'aaron', 'zulema', 'zuzana'])
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)

>>> newdf
              0         1         2         3         4         5         6    ...
aamir   0.654469 -0.927651 -0.781886 -0.626838 -0.239468  0.313152  0.228030   ...
aaron  -0.380524 -0.850608 -0.914642 -0.578885  0.177975 -0.633761 -0.736234   ...
zulema -0.250957  0.882491 -0.197833 -0.707652  0.754575  0.731236 -0.770831   ... 
zuzana -0.641296  0.065898  0.466784  0.652776  0.391865  0.918761  0.022798   ...
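To attach the vectors to the original frame, as the question asks, a join on the shared index should be enough. The following is a minimal sketch, assuming a hypothetical deterministic stand-in `toy_w2v` in place of the real function and a label column named `Y` as in the question. Note that the example `df` above has its single column labeled 0, which would clash with the vector column 0 in a join, so a named column is used here.

```python
import numpy as np
import pandas as pd


def toy_w2v(name, size=3):
    # Hypothetical stand-in for the real W2V: seeds a generator from the
    # character codes so the same name always gives the same vector.
    seed = sum(ord(c) for c in name)
    return np.random.default_rng(seed).uniform(-1, 1, size)


df1 = pd.DataFrame(
    {"Y": [0, 0, 1, 1]},
    index=pd.Index(["aamir", "aaron", "zulema", "zuzana"], name="Name"),
)

# Same expression as in the answer, then a join on the index.
vectors = pd.DataFrame(df1.index.to_series().apply(toy_w2v).tolist(), index=df1.index)
result = df1.join(vectors)

print(result.shape)  # one Y column plus `size` vector columns
```

Because both frames share the same index, no sorting or merge keys are needed.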

Upvotes: 0

azlefty

Reputation: 11

I don't know if this is any better than the other suggestions, but I would use apply to create another n-column dataframe (where n is the length of the array returned by the W2V function) and then concatenate it to the original dataframe.

This first section generates toy versions of your W2V function and your dataframe.

# substitute your W2V function for this:
from random import random

import pandas as pd

n = 5
def W2V(name: str):
    return [random() for i in range(n)]

# substitute your 2-column dataframe for this:
df1 = pd.DataFrame(data={'Name':['aamir', 'aaron', 'zulema', 'zuzana'],
                         'Y': [0, 0, 1, 1]},
                   index=list(range(4)))

df1 is

     Name  Y
0   aamir  0
1   aaron  0
2  zulema  1
3  zuzana  1

You want to make a second dataframe that applies W2V to every name in the first dataframe. To generate your column numbers, I'm just using a list comprehension that generates [0, 1, ..., n-1], where n is the length of the array returned by W2V.

df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']),
                                    index=[i for i in range(n)]),
                axis=1)

My random-valued df2 is

          0         1         2         3         4
0  0.242761  0.415253  0.940213  0.074455  0.444372
1  0.935781  0.968155  0.850091  0.064548  0.737655
2  0.204053  0.845252  0.967767  0.352254  0.028609
3  0.853164  0.698195  0.292238  0.982009  0.402736

Then concatenate the new dataframe to the old one:

df3 = pd.concat([df1, df2], axis=1)

df3 is

     Name  Y         0         1         2         3         4
0   aamir  0  0.242761  0.415253  0.940213  0.074455  0.444372
1   aaron  0  0.935781  0.968155  0.850091  0.064548  0.737655
2  zulema  1  0.204053  0.845252  0.967767  0.352254  0.028609
3  zuzana  1  0.853164  0.698195  0.292238  0.982009  0.402736

Alternatively, you could do both steps in one line as:

df1 = pd.concat([df1, 
                 df1.apply(lambda x: pd.Series(W2V(x['Name']), 
                                               index=[i for i in range(n)]), 
                           axis=1)], 
                axis=1)

Upvotes: 1

Vitalizzare

Reputation: 7240

Let's say we have some function which transforms a string into a vector of a fixed size, for example:

import numpy as np

def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)

Also a data frame is given with a meaningful index and junk data:

import pandas as pd

names = pd.Index(['aamir','aaron','zulema','zuzana'], name='Name')
df = pd.DataFrame(index=names).assign(Y=0)

When we apply a function to a DataFrame row by row, i.e. with axis=1, its argument is a row passed as a Series whose name is the index label of that row. So we could do something like this:

output = df.apply(lambda row: W2V(row.name), axis=1, result_type='expand')

With result_type='expand', returned vectors will be transformed into columns, which is the required output.
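As a quick sanity check of that behavior, here is a small sketch. It assumes a hypothetical deterministic `toy_w2v` in place of W2V, so that the row produced by apply can be compared against a direct call.

```python
import numpy as np
import pandas as pd


def toy_w2v(name: str, size: int = 4) -> np.ndarray:
    # Hypothetical stand-in for W2V: a deterministic vector built from
    # the sum of the character codes of the name.
    total = float(sum(ord(c) for c in name))
    return np.array([total * (i + 1) for i in range(size)])


names = pd.Index(["aamir", "aaron"], name="Name")
df = pd.DataFrame(index=names).assign(Y=0)

# row.name is the index label of the row, i.e. the name itself.
out = df.apply(lambda row: toy_w2v(row.name), axis=1, result_type="expand")

print(out.shape)          # (2, 4)
print(list(out.columns))  # [0, 1, 2, 3]
# Each row should match a direct call on its index label.
print(np.allclose(out.loc["aamir"].to_numpy(), toy_w2v("aamir")))
```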


P.S. As an option:

df = pd.DataFrame.from_dict({n: W2V(n) for n in names}, orient='index')

P.P.S. IMO The behavior you describe means that your function can operate not only on str, but also on some common sequence, for example on a Series of strings. In case of the code:

df.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')

the function W2V receives not "a name" as a string but pd.Series(["a name"]). If we do not check the type of the passed parameter inside the function, then we can get a silent error, which in this case appears as different output data.
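To see this directly, one can pass a function that only records its argument. The sketch below uses a hypothetical `spy` function in place of W2V and a two-row frame in place of the real data.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["aamir", "aaron"]})

seen = []  # records what apply() actually hands to the function


def spy(arg):
    # Hypothetical stand-in for W2V: it only records its argument.
    seen.append((type(arg).__name__, arg.to_string(), arg["Name"]))
    return 0


df.apply(spy, axis=1)

kind, as_text, value = seen[0]
print(kind)           # the row arrives as a Series, not a str
print(repr(as_text))  # the text includes the label "Name" as well as the value
print(value)          # selecting the field gives the bare name
```

So calling to_string() on the row embeds the column label in the text, which would explain the different vectors. Selecting the field instead, e.g. `lambda row: W2V(row['Name'])`, or applying to the column itself with `df['Name'].apply(...)`, passes just the name.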

Upvotes: 1

Scott Boston

Reputation: 153500

You can try something like this using map and np.vstack with a dataframe constructor then join:

df.join(pd.DataFrame(np.vstack(df.index.map(W2V)), index=df.index))

Output:

   Y  0  1  2  3  4  5  6  7  8  9
A  0  4  0  2  1  0  0  0  0  3  3
B  1  4  0  0  4  4  3  4  3  4  3
C  2  1  5  5  5  3  3  1  3  5  0
D  3  3  5  1  3  4  2  3  1  0  1
E  4  4  0  2  4  4  0  3  3  4  2
F  5  4  3  5  1  0  2  3  2  5  2
G  6  4  5  2  0  0  2  4  3  4  3
H  7  0  2  5  2  3  4  3  5  3  1
I  8  2  2  0  1  4  2  4  1  0  4
J  9  0  2  3  5  0  3  0  2  4  0

Using @Vitalizzare's function:

def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)

df = pd.DataFrame({'Y': np.arange(10)}, index = [*'ABCDEFGHIJ'])
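If `df.index.map(W2V)` gives trouble in your pandas version when the function returns arrays, a plain list comprehension passed to np.vstack is an equivalent variant. This is a sketch only, with a hypothetical `toy_w2v` that is seeded from character codes so the output is reproducible across runs.

```python
import numpy as np
import pandas as pd


def toy_w2v(name: str, size: int = 5) -> np.ndarray:
    # Hypothetical stand-in for W2V, seeded from the character codes.
    seed = sum(ord(c) for c in name)
    return np.random.default_rng(seed).integers(0, 5, size, endpoint=True)


df = pd.DataFrame({"Y": np.arange(3)}, index=[*"ABC"])

# Stack one vector per index label into a 2-D array, then join on the index.
matrix = np.vstack([toy_w2v(name) for name in df.index])
joined = df.join(pd.DataFrame(matrix, index=df.index))

print(joined.shape)  # (3, 6): Y plus five vector columns
```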

Upvotes: 0

NFeruch - FreePalestine

Reputation: 1154

I am assuming the names are the index and that there is a useless column called 0. I think this may be the solution, but there is no way to know without your function or the names.

df.reset_index().drop(0, axis=1).apply(my_func, axis=1, result_type='expand')

Upvotes: 0
