dakro
dakro

Reputation: 126

Separate values in a pandas series and stack them all into a list?

I have a column of objects in a pandas dataframe. Each row in this column has multiple values as a string. I want to create a new column of a nice list with one code on each row. I know of a few really sloppy ways to do this, but none of them feel right. Here we go:

The Data

  data = [['A01.001 - Apple; R02.049 - Banana; B32:111 - Candy'],
            ['C30.086 - Deer; V83.038 - Ears; U23.133 - Race Car'], 
            ['H14.200 - Silver; B32.111 - Candy']]
    my_series = pd.DataFrame(data, columns=['Column'])

Table

 Column
______
0    A01.001 - Apple; R02.049 - Banana; B32.111 - Candy
1    C30.086 - Deer; V83.038 - Ears; U23.133 - Race Car
2    H14.200 - Silver; B32.111 - Candy

Desired Result (I'm ok with a duplicates, can get rid of those later)

New Column
___________

A01.001 - Apple
R02.049 - Banana
B32.111 - Candy
C30.086 - Deer
V83.038 - Ears
U23.133 - Race Car
H14.200 - Silver
B32.111 - Candy

My Crappy Code

a_list = []
def get_data(data):
    for i in data:
        pattern = r'[^;]+'
        regex = re.findall(pattern, i)
        for match in regex:
            a_list.append(match)
get_data(my_series)
print(a_list)

(I'm not concerned with turning this list back to a Series at the moment, there will be a few destinations for this list. This code returns about 50,000 results, and dropping duplicates reduces it to about 10,000, if that matters to anyone.)

Despite the fact that this seemed to 'work', I know it's crap and here's why I think it's crap. I declared a variable and my function secretly writes to it. My gut tells me this is really stupid, but I've been at this for way longer than it should have taken. I know I'm doing something wrong, but I can't figure out what it is. I haven't had any luck finding an example of this scenario on the internet, so I appreciate anyone who can take a swing at it, maybe smack me in the head. Thank you for your time!

Upvotes: 1

Views: 86

Answers (1)

BENY
BENY

Reputation: 323316

You just need split the columns and re-create the dataframe

df=pd.DataFrame(my_series.Column.str.split(';').sum(),columns=['columns'])
df
               columns
0      A01.001 - Apple
1     R02.049 - Banana
2      B32:111 - Candy
3       C30.086 - Deer
4       V83.038 - Ears
5   U23.133 - Race Car
6     H14.200 - Silver
7      B32.111 - Candy

Upvotes: 4

Related Questions