AntiPawn79
AntiPawn79

Reputation: 384

Working with comparing dataframes and series and generating new dataframes on the fly in python pandas

I am creating a function that compares a dataframe (DF) to a series (S) and eventually returns a new dataframe. The common column is 'name'. I want the function to return a dataframe with the same number of rows as the series (S) and the same number of columns as the df. The function will search name columns in the df and find all of the matching names in the series (S). If a match is found I want a new row of a new dataframe to be created that matches the df row for that specific name. If a match is not found I want a new row to be created for the result dataframe regardless but to include all 0.0 for the cells for that particular row. I've been trying to figure this out for the past 6 hours. I'm having issues with broadcasting I believe. Here is what I have tried.

Here is some sample data

Series:

  S[500:505]
  500                 Nanotechnology
  501                          Music
  502       Logistics & Supply Chain
  503    Computer & Network Security
  504              Computer Software
  Name: name, dtype: object

DataFrame: NOTE: there is a column called name which is also industries. So row =0 here is Defense & Space in the name column.

          Defense & Space  Computer Software  Internet  Semiconductors  \
  0              1.0                0.0       0.0             0.0   
  1              0.0                1.0       0.5             0.5   
  2              0.0                0.5       1.0             0.5   
  3              0.0                0.5       0.5             1.0   
  4              0.5                0.0       0.0             0.0   


S.shape = (31454,)
df.shape = (100,101)

Generate an empty dataframe with all zeros

all_zeros = np.zeros((len(S),len(df.columns)))

Put the numpy array into a dataframe

result = pd.DataFrame(data = all_zeros, columns=df.columns, index = range(len(s)))

I don't want the name column to be in the final result

result = result.drop('name', axis=1)

Build a function to be used in a lambda to set the new values for the result dataframe

def set_cell_values(row):
    return df.iloc[1,:]

Here is the part where I set the new values for the new dataframe

for index in range(len(df)):
    names_are_equal = df['name'][index] == result['name']
    map(lambda x: set_cell_values(row), result[names_are_equal]))

To me this makes sense but it seems not to be working. Is there an easy way to make this work that I am unaware of? The map is there because I needed to broadcast the df row into the new dataframe at several rows (not just once).

Upvotes: 0

Views: 218

Answers (1)

Adriano Almeida
Adriano Almeida

Reputation: 5356

Don,
So, let's go:

# with this tables 
In [66]: S
Out[66]:
0    aaa
1    bbb
2    ccc
3    ddd
4    eee
Name: name, dtype: object

In [84]: df
Out[84]:
    a   b   c name
0  39  71  55  aaa
1   9  57   6  bbb
2  72  22  52  iii
3  68  97  81  jjj
4  30  64  78  kkk

# transform the series to a dataframe
Sd = pd.DataFrame(S)
# merge them with outer join (will keep both tables columns and values).
# fill the NAs with 0
In [86]: pd.merge(Sd,df, how='outer').fillna(0)
Out[86]:
  name   a   b   c
0  aaa  39  71  55
1  bbb   9  57   6
2  ccc   0   0   0
3  ddd   0   0   0
4  eee   0   0   0
5  iii  72  22  52
6  jjj  68  97  81
7  kkk  30  64  78

is that it?

Upvotes: 1

Related Questions