Reputation: 35
I am using apply on a pandas dataframe to call a function that queries a REST API, and the API returns multiple results per request. The function seems to run correctly, but because the for loop overwrites the same row on every iteration, I only get the last result.
What I want to do is return multiple rows for each row I pass to the function.
This is the function I'm using:
def get_entity_rec(row):
    try:
        documents = row.content
        textcon = row.content[0:2000]
        doclang = [textcon]
        outputs = []
        result = client.recognize_entities(documents = doclang)[0]
        entitylength = len(result)
        for entity in result.entities:
            row['text'] = entity.text
            row['category'] = entity.category
            row['subcategory'] = entity.subcategory
        return row
    except Exception as err:
        print("Encountered exception. {}".format(err))
And my code where I apply it:
apandas3 = apandas2.apply(get_entity_rec, axis=1)
I get (what I think is) the last result, like this:
path | text | category | subcategory |
---|---|---|---|
path of file | i am text | i am the category returned | i am the subcategory returned |
I want to return a dataframe in which the original columns are repeated for each "entity" returned by the function, like this:
path | text | category | subcategory |
---|---|---|---|
path of file | i am text | i am the category returned | i am the subcategory returned |
path of file | i am text | i am the first category returned | i am the first subcategory returned |
path of file | i am text | i am the second category returned | i am the second subcategory returned |
Upvotes: 2
Views: 1114
Reputation: 8219
`apply`, when applied to rows (`axis=1`), can only use a function that returns one row per input row.
To achieve what you want, you can put your categories/subcategories into a list that is stored in the row, and then `explode`. Let me demonstrate. Since your code is not self-contained to run (please review this before your next post!), here is an example that hopefully explains the idea:
import pandas as pd

# create an example df
df = pd.DataFrame({'path':['A','B','C'], 'text' : ['cat1A subcat1A cat2A subcat2A','cat1B subcat1B cat2B subcat2B', 'cat1C subcat1C cat2C subcat2C']})
# define our processing function
def get_entity_rec(row):
    text = row['text']
    tokens = text.split() # simulated processing
    categories = [tokens[0], tokens[2]] # note how we stick them in a list
    subcategories = [tokens[1], tokens[3]] # note how we stick them in a list
    row['cat'] = categories
    row['subcat'] = subcategories
    return row
Here we create a simple df and a processing function that needs to return multiple rows for each input row. Since `apply` does not allow that, the function returns one row in which the multiple values are stored as lists.
When we apply this function to the df
df.apply(get_entity_rec, axis=1)
we obtain this:
  path                           text             cat                subcat
0    A  cat1A subcat1A cat2A subcat2A  [cat1A, cat2A]  [subcat1A, subcat2A]
1    B  cat1B subcat1B cat2B subcat2B  [cat1B, cat2B]  [subcat1B, subcat2B]
2    C  cat1C subcat1C cat2C subcat2C  [cat1C, cat2C]  [subcat1C, subcat2C]
Note how the categories and subcategories are stored as lists inside the df.
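As a quick sanity check on that intermediate frame, the sketch below rebuilds a two-row version of the example and confirms that each new cell is an ordinary Python list and that the two list columns have the same length in every row. The names `result`, `cell_is_list` and `same_length` are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Two-row version of the example frame from the answer.
df = pd.DataFrame({
    "path": ["A", "B"],
    "text": ["cat1A subcat1A cat2A subcat2A", "cat1B subcat1B cat2B subcat2B"],
})

def get_entity_rec(row):
    tokens = row["text"].split()          # simulated processing
    row["cat"] = [tokens[0], tokens[2]]   # list stored in a single cell
    row["subcat"] = [tokens[1], tokens[3]]
    return row

result = df.apply(get_entity_rec, axis=1)

# Every cell in the new column is a plain Python list.
cell_is_list = result["cat"].map(lambda v: isinstance(v, list)).all()
# Both list columns have the same number of elements in each row.
same_length = (result["cat"].map(len) == result["subcat"].map(len)).all()
print(cell_is_list, same_length)
```

Matching lengths matter because the list columns are later expanded side by side, so element i of one list has to line up with element i of the other.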
Now we can `explode` our columns. Since we want to explode `cat` and `subcat` in parallel, here is how we do it:
df.apply(get_entity_rec, axis=1).apply(pd.Series.explode)
to obtain
  path                           text    cat    subcat
0    A  cat1A subcat1A cat2A subcat2A  cat1A  subcat1A
0    A  cat1A subcat1A cat2A subcat2A  cat2A  subcat2A
1    B  cat1B subcat1B cat2B subcat2B  cat1B  subcat1B
1    B  cat1B subcat1B cat2B subcat2B  cat2B  subcat2B
2    C  cat1C subcat1C cat2C subcat2C  cat1C  subcat1C
2    C  cat1C subcat1C cat2C subcat2C  cat2C  subcat2C
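Here is a rough sketch of how the same idea might be carried over to the function in the question. The real API client is not available here, so `FakeClient`, `FakeResult` and `FakeEntity` are hypothetical stand-ins that only mimic the attributes the question uses (`recognize_entities`, `.entities`, `.text`, `.category`, `.subcategory`), and the sample data is made up. It also swaps `.apply(pd.Series.explode)` for `DataFrame.explode` with a list of columns, which assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical stand-ins for the real API client and its response objects.
class FakeEntity:
    def __init__(self, text, category, subcategory=None):
        self.text = text
        self.category = category
        self.subcategory = subcategory

class FakeResult:
    def __init__(self, entities):
        self.entities = entities

class FakeClient:
    def recognize_entities(self, documents):
        # Pretend that every capitalised word is an entity.
        results = []
        for doc in documents:
            words = [w for w in doc.split() if w[:1].isupper()]
            results.append(
                FakeResult([FakeEntity(w, "Word", "Capitalised") for w in words])
            )
        return results

client = FakeClient()

def get_entity_rec(row):
    result = client.recognize_entities(documents=[row["content"][:2000]])[0]
    # Collect one list per output column instead of overwriting the row.
    row["text"] = [e.text for e in result.entities]
    row["category"] = [e.category for e in result.entities]
    row["subcategory"] = [e.subcategory for e in result.entities]
    return row

# Made-up input data for illustration.
apandas2 = pd.DataFrame({
    "path": ["a.txt", "b.txt"],
    "content": ["Alice met Bob", "Carol went home"],
})

with_lists = apandas2.apply(get_entity_rec, axis=1)
# Explode the three list columns in parallel (list argument needs pandas >= 1.3).
apandas3 = with_lists.explode(["text", "category", "subcategory"], ignore_index=True)
print(apandas3)
```

Exploding the columns together keeps each entity's text, category and subcategory on the same row. Exploding them one after another would instead pair every element of one list with every element of the others.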
Upvotes: 3