Jay T

Reputation: 35

How can I return multiple rows from a python function to a pandas dataframe using apply?

I am using apply on a pandas dataframe to call a function that queries a REST API, and the API returns multiple rows per request. The function seems to run correctly, but because the for loop overwrites the same columns on every iteration, I only get the last result.

What I want to do is return multiple rows for each row I pass to the function.

This is the function I'm using:

def get_entity_rec(row):
    try:
        documents = row.content
        textcon = row.content[0:2000]
        doclang = [textcon]
        outputs = []
        result = client.recognize_entities(documents = doclang)[0]
        entitylength = len(result)
        for entity in result.entities:
            row['text'] = entity.text
            row['category'] = entity.category
            row['subcategory'] = entity.subcategory
        return row
    except Exception as err:
        print("Encountered exception. {}".format(err))

And my code where I apply it:

apandas3 = apandas2.apply(get_entity_rec, axis=1)

I get (what I think is) the last result, like this:

path          text       category                   subcategory
path of file  i am text  i am the category returned i am the subcategory returned

I want to return a dataframe with the original columns repeated for each "entity" returned by the function, like this:

path          text       category                          subcategory
path of file  i am text  i am the category returned        i am the subcategory returned
path of file  i am text  i am the first category returned  i am the first subcategory returned
path of file  i am text  i am the second category returned i am the second subcategory returned

Upvotes: 2

Views: 1114

Answers (1)

piterbarg

Reputation: 8219

When apply is used row-wise (axis=1), the function you pass can only return one row per input row.

To achieve what you want, you can put your categories/subcategories into lists stored in a single row, and then explode them. Let me demonstrate. Since your code is not self-contained enough to run (please check this before your next post!), here is an example that hopefully explains the idea:

import pandas as pd

# create an example df
df = pd.DataFrame({'path':['A','B','C'], 'text' : ['cat1A subcat1A cat2A subcat2A','cat1B subcat1B cat2B subcat2B', 'cat1C subcat1C cat2C subcat2C']})

# define our processing function
def get_entity_rec(row):
    text = row['text']
    tokens = text.split() # simulated processing
    categories = [tokens[0], tokens[2]] # note how we stick them in a list
    subcategories = [tokens[1], tokens[3]] # note how we stick them in a list
    row['cat'] = categories
    row['subcat'] = subcategories
    return row

Here we create a simple df and a processing function that needs to return multiple rows for each input row. Since apply does not allow that, the function instead returns one row in which the multiple values are stored as lists.

When we apply this function to the df

df.apply(get_entity_rec, axis=1)

we obtain this

    path    text                            cat             subcat
0   A       cat1A subcat1A cat2A subcat2A   [cat1A, cat2A]  [subcat1A, subcat2A]
1   B       cat1B subcat1B cat2B subcat2B   [cat1B, cat2B]  [subcat1B, subcat2B]
2   C       cat1C subcat1C cat2C subcat2C   [cat1C, cat2C]  [subcat1C, subcat2C]

Note how the categories and subcategories are held in lists inside the df.

Now we can explode our columns -- since we want to explode cat and subcat in parallel, here is how we do it:

df.apply(get_entity_rec, axis=1).apply(pd.Series.explode)

to obtain

    path    text                            cat     subcat
0   A       cat1A subcat1A cat2A subcat2A   cat1A   subcat1A
0   A       cat1A subcat1A cat2A subcat2A   cat2A   subcat2A
1   B       cat1B subcat1B cat2B subcat2B   cat1B   subcat1B
1   B       cat1B subcat1B cat2B subcat2B   cat2B   subcat2B
2   C       cat1C subcat1C cat2C subcat2C   cat1C   subcat1C
2   C       cat1C subcat1C cat2C subcat2C   cat2C   subcat2C
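
As a possible alternative to the column-wise apply(pd.Series.explode) step, newer pandas versions (1.3 and later) let DataFrame.explode take a list of columns and explode them in parallel. The sketch below reuses the example df and processing function from this answer. The row.copy() call, the with_lists and exploded names, and the final reset_index are my own additions for illustration, and the approach assumes the lists in each row have the same length.

```python
import pandas as pd

# same example df as in the answer
df = pd.DataFrame({
    'path': ['A', 'B', 'C'],
    'text': ['cat1A subcat1A cat2A subcat2A',
             'cat1B subcat1B cat2B subcat2B',
             'cat1C subcat1C cat2C subcat2C'],
})

def get_entity_rec(row):
    row = row.copy()                         # assumption: avoid touching the input row
    tokens = row['text'].split()             # simulated processing
    row['cat'] = [tokens[0], tokens[2]]      # store multiple values as a list
    row['subcat'] = [tokens[1], tokens[3]]   # same length as 'cat'
    return row

# one row per input row, with list-valued 'cat' and 'subcat' columns
with_lists = df.apply(get_entity_rec, axis=1)

# explode both list columns together (needs pandas >= 1.3)
exploded = with_lists.explode(['cat', 'subcat'])

# optional: replace the repeated 0, 0, 1, 1, 2, 2 index with a fresh one
exploded = exploded.reset_index(drop=True)
print(exploded)
```

This should give the same six rows as the table above, only with a fresh 0 to 5 index. If the lists in a row can differ in length, explode raises an error, so in that case it may be simpler to pad them or to store one list of (category, subcategory) tuples instead.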

Upvotes: 3
