bobby

Reputation: 78

Pandas group by values in list (in series)

I am trying to group by the items in a list stored in a DataFrame Series. The dataset being used is the Stack Overflow 2020 Survey.

The layout is roughly as follows:

           ... LanguageWorkedWith ... ConvertedComp ...
Respondent
    1               Python;C              50000
    2                C++;C                70000

I essentially want to use groupby on the unique values in the lists of languages worked with, and apply a mean aggregator function to ConvertedComp, like so:

LanguageWorkedWith
        C++                 70000
        C                   60000
      Python                50000

I have actually managed to achieve the desired output but my solution seems somewhat janky and being new to Pandas, I believe that there is probably a better way.

My solution is as follows:

import pandas as pd

# read csv
sos = pd.read_csv("developer_survey_2020/survey_results_public.csv", index_col='Respondent')

# separate string into list of strings, disregarding unanswered responses
temp = sos["LanguageWorkedWith"].dropna().str.split(';')

# create new DataFrame with respondent index and rows populated with known languages
langs_known = pd.DataFrame(temp.tolist(), index=temp.index)

# stack columns as rows, dropping old column names
stacked_responses = langs_known.stack().reset_index(level=1, drop=True)

# Re-indexing sos DataFrame to match stacked_responses dimension
# Concatenate reindex series to ConvertedComp series columnwise
reindexed_pays = sos["ConvertedComp"].reindex(stacked_responses.index)
stacked_with_pay = pd.concat([stacked_responses, reindexed_pays], axis='columns')

# Remove rows with no salary data
# Renaming columns
stacked_with_pay.dropna(how='any', inplace=True)
stacked_with_pay.columns = ["LWW", "Salary"]

# Group by LWW and apply median
lang_ave_pay = stacked_with_pay.groupby("LWW")["Salary"].median().sort_values(ascending=False).head()

Output:

LWW
Perl     76131.5
Scala    75669.0
Go       74034.0
Rust     74000.0
Ruby     71093.0
Name: Salary, dtype: float64

which matches the value calculated when selecting a specific language: sos.loc[sos["LanguageWorkedWith"].str.contains('Perl').fillna(False), "ConvertedComp"].median()
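One caveat with this kind of spot check: str.contains matches substrings, so it is fine for 'Perl' but would also count 'C++' respondents when checking 'C'. A minimal sketch of an exact-token check, run on toy data rather than the survey file (the helper name median_pay_for and the sample values are hypothetical):

```python
import pandas as pd

# Toy data in the same layout as the survey columns (values are made up).
sos = pd.DataFrame(
    {
        "LanguageWorkedWith": ["Python;C", "C++;C", "C++", None],
        "ConvertedComp": [50000.0, 70000.0, 90000.0, 65000.0],
    },
    index=pd.Index([1, 2, 3, 4], name="Respondent"),
)


def median_pay_for(df, language):
    """Median pay of respondents whose language list contains `language` exactly."""
    # Drop unanswered rows, then split each answer into its individual languages.
    tokens = df["LanguageWorkedWith"].dropna().str.split(";")
    # True only where the language appears as a whole token.
    uses_language = tokens.apply(lambda langs: language in langs)
    respondents = uses_language[uses_language].index
    return df.loc[respondents, "ConvertedComp"].median()


exact = median_pay_for(sos, "C")

# Substring matching also picks up respondent 3, who only listed C++.
substring_mask = sos["LanguageWorkedWith"].str.contains("C", regex=False, na=False)
substring = sos.loc[substring_mask, "ConvertedComp"].median()

print(exact, substring)
```

On this toy frame the two medians differ, because the substring mask includes a respondent who never listed C.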

Any tips on how to improve/functions that provide this functionality/etc would be appreciated!

Upvotes: 2

Views: 120

Answers (1)

r-beginners

Reputation: 35135

Build a data frame containing only the target columns, split the language string into one column per language, and combine the result with the salary. Next, convert the data from wide format to long format using melt. Then group by language name and take the median. See the pandas melt docs.

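# keep only the two columns of interest
lww = sos[["LanguageWorkedWith","ConvertedComp"]]
# split the language string into one column per language and put the salary alongside
lwws = pd.concat([lww['ConvertedComp'], lww['LanguageWorkedWith'].str.split(';', expand=True)], axis=1)
lwws.reset_index(drop=True, inplace=True)
# wide to long: one row per (salary, language) pair
df_long = pd.melt(lwws, id_vars='ConvertedComp', value_vars=lwws.columns[1:], var_name='lang', value_name='lang_name')
# median salary per language, highest first
df_long.groupby('lang_name')['ConvertedComp'].median().sort_values(ascending=False).head()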

lang_name
Perl     76131.5
Scala    75669.0
Go       74034.0
Rust     74000.0
Ruby     71093.0
Name: ConvertedComp, dtype: float64
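As an alternative to melt, the same reshaping can probably be done more compactly with DataFrame.explode (available since pandas 0.25), which turns each list element into its own row while repeating the index and the other columns. A minimal sketch on toy data shaped like the example in the question, not the survey file itself (the sample values are made up):

```python
import pandas as pd

# Toy data mirroring the layout shown in the question (values are illustrative).
sos = pd.DataFrame(
    {
        "LanguageWorkedWith": ["Python;C", "C++;C", None],
        "ConvertedComp": [50000.0, 70000.0, 65000.0],
    },
    index=pd.Index([1, 2, 3], name="Respondent"),
)

lang_pay = (
    sos[["LanguageWorkedWith", "ConvertedComp"]]
    .dropna()  # drop rows missing either the languages or the salary
    .assign(LanguageWorkedWith=lambda d: d["LanguageWorkedWith"].str.split(";"))
    .explode("LanguageWorkedWith")  # one row per (respondent, language)
    .groupby("LanguageWorkedWith")["ConvertedComp"]
    .median()
    .sort_values(ascending=False)
)

print(lang_pay)
```

Because explode keeps each respondent's salary next to every language they listed, no separate reindex or concat step is needed before the groupby.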

Upvotes: 1
