Reputation: 693
I have a dataframe like this:
col1 col2
[abc, bcd, dog] [[.4], [.5], [.9]]
[cat, bcd, def] [[.9], [.5], [.4]]
The numbers in the col2 lists describe the elements (based on list index location) in col1. So ".4" in col2 describes "abc" in col1.
I want to create 2 new columns: one that pulls only the elements in col1 that are >= .9 in col2, and the other column as the number in col2; so ".9" for both rows.
Result:
col3 col4
[dog] .9
[cat] .9
I think a route that removes the nested list from col2 is fine, but that's harder than it sounds. I've been trying for an hour to remove those brackets.
Attempts:
spec_chars3 = ["[", "]"]
for char in spec_chars3:  # didn't work, turned everything to nan
    df1['avg_jaro_company_word_scores'] = df1['avg_jaro_company_word_scores'].str.replace(char, '')
df.col2.str.strip('[]')  # didn't work b/c the nested list is still in a list, not a string
I haven't even figured out how to pull out the list index number and filter col1 on that
Upvotes: 0
Views: 684
Reputation: 62503
If the columns are str type, they need to be converted to list type:
Use .applymap with ast.literal_eval.
If only one column is str type, then use df[col] = df[col].apply(literal_eval).
The lists in each column can then be expanded with pandas.DataFrame.explode.
explode casts values from lists to scalars (i.e. [0.4] to 0.4).
To combine df with df_new, use df.join(df_new, rsuffix='_extracted').
Tested in python 3.10, pandas 1.4.3
import pandas as pd
from ast import literal_eval
# setup the test data: this data is lists
# data = {'c1': [['abc', 'bcd', 'dog'], ['cat', 'bcd', 'def']], 'c2': [[[.4], [.5], [.9]], [[.9], [.5], [.4]]]}
# setup the test data: this data is strings
data = {'c1': ["['abc', 'bcd', 'dog', 'cat']", "['cat', 'bcd', 'def']"], 'c2': ["[[.4], [.5], [.9], [1.0]]", "[[.9], [.5], [.4]]"]}
# create the dataframe
df = pd.DataFrame(data)
# the description leads me to think the data is columns of strings, not lists
# convert the columns from string type to list type
# the following line is only required if the columns are strings
df = df.applymap(literal_eval)
# explode the lists in each column, and then explode the remaining lists in 'c2'
df_new = df.explode(['c1', 'c2'], ignore_index=True).explode('c2')
# use Boolean Indexing to select the desired data
df_new = df_new[df_new['c2'] >= 0.9]
# display(df_new)
    c1   c2
2  dog  0.9
3  cat  1.0
4  cat  0.9
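The join mentioned above only lines up by row if the exploded frame keeps the original row labels. The following is a rough sketch of one way to do that, not part of the tested code above: it assumes you drop ignore_index=True, and it adds a groupby step (my addition) to collapse the matches back to one list per original row. The names grouped and result are only illustrative.

```python
import pandas as pd
from ast import literal_eval

# same string-typed test data as in the answer
data = {'c1': ["['abc', 'bcd', 'dog', 'cat']", "['cat', 'bcd', 'def']"],
        'c2': ["[[.4], [.5], [.9], [1.0]]", "[[.9], [.5], [.4]]"]}
df = pd.DataFrame(data)

# convert one column at a time from str to list
for col in ['c1', 'c2']:
    df[col] = df[col].apply(literal_eval)

# assumption: no ignore_index, so every exploded row keeps its source row label
df_new = df.explode(['c1', 'c2']).explode('c2')
df_new = df_new[df_new['c2'] >= 0.9]

# collapse the matches back to one list per original row (illustrative step)
grouped = df_new.groupby(level=0).agg(list)

# overlapping column names from the right frame get the '_extracted' suffix
result = df.join(grouped, rsuffix='_extracted')
```

With this data, result has the columns c1, c2, c1_extracted and c2_extracted, one row per original row. A row with no score at or above the threshold would get NaN in the extracted columns.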
Upvotes: 2
Reputation: 5183
You can use list comprehensions to populate new columns with your criteria.
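# col3: keep the col1 values whose matching score is >= 0.9
df['col3'] = [
    [value for value, score in zip(c1, c2) if score[0] >= 0.9]
    for c1, c2 in zip(df['col1'], df['col2'])
]
# col4: keep the scores themselves, unwrapped from their one-element lists
df['col4'] = [
    [score[0] for score in c2 if score[0] >= 0.9]
    for c2 in df['col2']
]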
Output
col1 col2 col3 col4
0 [abc, bcd, dog] [[0.4], [0.5], [0.9]] [dog] [0.9]
1 [cat, bcd, def] [[0.9], [0.5], [0.4]] [cat] [0.9]
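The question asks for col4 as a plain number (.9) rather than a list. Here is a possible variant of the comprehensions above, as a sketch only: it assumes each inner list holds exactly one score, and it takes the highest qualifying score per row, which is my own choice for rows with more than one match. The names threshold and flat are illustrative.

```python
import pandas as pd

# the list-typed data from the question
df = pd.DataFrame({
    'col1': [['abc', 'bcd', 'dog'], ['cat', 'bcd', 'def']],
    'col2': [[[.4], [.5], [.9]], [[.9], [.5], [.4]]],
})

threshold = 0.9

# flatten the one-element inner lists: [[.4], [.5], [.9]] -> [.4, .5, .9]
flat = [[score[0] for score in scores] for scores in df['col2']]

# col3: values from col1 whose score at the same position meets the threshold
df['col3'] = [
    [value for value, score in zip(values, scores) if score >= threshold]
    for values, scores in zip(df['col1'], flat)
]

# col4: the highest qualifying score as a scalar, or None if nothing qualifies
df['col4'] = [
    max((score for score in scores if score >= threshold), default=None)
    for scores in flat
]
```

Flattening first also removes the inner brackets that the attempts in the question were trying to strip; .str methods don't apply there because the cells hold lists, not strings.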
Upvotes: 1