Ram

Reputation: 75

Group by in pandas dataframe and unioning a numpy array column

I have a CSV file where one of the columns looks like a numpy array. The first few lines look like the following

first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]

I load this CSV into a pandas DataFrame with the following code:

data = pd.read_csv('myfile.csv', converters = {'first': np.float64, 'second': np.int64, 'third': np.array})

Now I want to group by the 'first' column and union the 'third' column. After doing this, my dataframe should look like

170.0, [19 23 234 376]
162.0, [1 2 3 4]

How do I achieve this? I tried multiple approaches, like the following, but nothing seems to produce the result above.

group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))

Upvotes: 2

Views: 2838

Answers (1)

TimCera

Reputation: 592

With your current CSV file, the 'third' column comes in as a string instead of a list.

There might be nicer ways to convert to a list, but here goes...

from ast import literal_eval

import numpy as np
import pandas as pd

data = pd.read_csv('test_groupby.csv')

# Convert to a string representation of a list,
# e.g. "[19 234 376]" -> "[19,234,376]"...
data['third'] = data['third'].str.replace(' ', ',')

# Convert each string to an actual list...
data['third'] = data['third'].apply(literal_eval)

group_data = data.groupby('first')

# Two details matter here:
# - use x.values instead of x, since each group's x is a Series
# - wrap the result in list(...) so the aggregation returns a
#   single object per group (np.array should work here, but...?)
ans = group_data.aggregate(
      {'third': lambda x: list(np.unique(
                               np.concatenate(x.values)))})

print(ans)
                    third
first                    
162          [1, 2, 3, 4]
170    [19, 23, 234, 376]
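Alternatively, you can do the parsing up front by giving `read_csv` a converter for the 'third' column. This is only a sketch, assuming the whitespace-separated `[19 234 376]` format shown in the question; `parse_array` is a helper name I made up, and `io.StringIO` just keeps the example self-contained:

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the question's CSV so the example is self-contained
csv_text = """first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
"""

def parse_array(s):
    # "[19 234 376]" -> array([19, 234, 376])
    return np.array(s.strip('[]').split(), dtype=int)

data = pd.read_csv(io.StringIO(csv_text),
                   converters={'third': parse_array})

# Union per group: concatenate each group's arrays,
# then dedupe and sort
ans = data.groupby('first')['third'].apply(
    lambda s: sorted(set(np.concatenate(s.values))))
print(ans)
```

Returning a plain `sorted(...)` list from the lambda sidesteps the question of whether pandas will accept an ndarray as an aggregated value.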

Upvotes: 4
