user16799
user16799

Reputation: 13

How to extract specific items from a data frame column and use them as labels for the remaining items?

I have a data frame with only one column which I would like to extract certain items and turn them into a separate column to be used as a label for the other remaining items. It's a bit hard to explain, if for example what I have is:

pd.DataFrame({'Fruits': ['Apple', 'Gala', 'Fuji', 'Grannysmith', 'Honeycrisp', 'Golden', 'pink', 'Orange', 'blood orange', 'Mandrin', 'Tangerine', 'Clementine', 'Banana', 'baby', 'manzano', 'burro']})

          Fruits
0          Apple
1           Gala
2           Fuji
3    Grannysmith
4     Honeycrisp
5         Golden
6           pink
7         Orange
8   blood orange
9        Mandrin
10     Tangerine
11    Clementine
12        Banana
13          baby
14       manzano
15         burro

But what I'm trying to convert it into is:

    Fruits  Types
0   Apple   Gala
1   Apple   Fuji
2   Apple   Grannysmith
3   Apple   Honeycrisp
4   Apple   Golden
5   Apple   pink
6   Orange  blood orange
7   Orange  Mandrin
8   Orange  Tangerine
9   Orange  Clementine
10  Banana  baby
11  Banana  manzano
12  Banana  burro

How would I transform the 1st data frame into the 2nd? I feel stump by this, especially when there can be many categories of fruit and their respective types.

Upvotes: 1

Views: 99

Answers (2)

piRSquared
piRSquared

Reputation: 294488

I'd build a dictionary mapping with some standard logic then use it with Pandas operations

fruit_classes = ['Apple', 'Orange', 'Banana']

last_class = None
fruit_map = {}

for fruit in df.Fruits:
  if fruit in fruit_classes:
    last_class = fruit
  elif last_class is not None:
    fruit_map[fruit] = last_class

df.assign(Types=df.Fruits, Fruits=df.Fruits.map(fruit_map)).dropna()

    Fruits         Types
1    Apple          Gala
2    Apple          Fuji
3    Apple   Grannysmith
4    Apple    Honeycrisp
5    Apple        Golden
6    Apple          pink
8   Orange  blood orange
9   Orange       Mandrin
10  Orange     Tangerine
11  Orange    Clementine
13  Banana          baby
14  Banana       manzano
15  Banana         burro

Alternatively with a comprehension

fruit_classes = ['Apple', 'Orange', 'Banana']

fruit_classes = ['Apple', 'Orange', 'Banana']

pd.DataFrame(
    [[x, None][::(x in fruit_classes) * 2 - 1] for x in df.Fruits],
    columns=['Fruits', 'Types']
).assign(Fruits=lambda d: d.Fruits.ffill()).dropna()

    Fruits         Types
1    Apple          Gala
2    Apple          Fuji
3    Apple   Grannysmith
4    Apple    Honeycrisp
5    Apple        Golden
6    Apple          pink
8   Orange  blood orange
9   Orange       Mandrin
10  Orange     Tangerine
11  Orange    Clementine
13  Banana          baby
14  Banana       manzano
15  Banana         burro

Upvotes: 1

jezrael
jezrael

Reputation: 863156

First is necessary define fruits in list, then create new column with repeating fruits by where with isin for missing values and forward filling, then remove same values in both columns by boolean indexing and last set new columns names:

L = ['Apple','Orange','Banana']

df['a'] = df['Fruits'].where(df['Fruits'].isin(L)).ffill()
df = df.loc[df['a'] != df['Fruits'], ['a','Fruits']]
df.columns = ['Fruits','Types']
print (df)

    Fruits         Types
1    Apple          Gala
2    Apple          Fuji
3    Apple   Grannysmith
4    Apple    Honeycrisp
5    Apple        Golden
6    Apple          pink
8   Orange  blood orange
9   Orange       Mandrin
10  Orange     Tangerine
11  Orange    Clementine
13  Banana          baby
14  Banana       manzano
15  Banana         burro

Upvotes: 1

Related Questions