Reputation: 91
I have a DataFrame with species and their occurence. An excerpt follows:
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Family | Genus | Species | Species code | Nearctic Mexico | Mesoamerica | Caribbean | Los Llanos | Amazonia |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra castaneifolia Britton ex Rusby | Aphecast | | | | | |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra pilosa Leonard | Aphpilo | | + | | + | + |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra pulcherrima (Jacq.) Kunth | Aphepulc | | + | + | + | + |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra sinclairiana Nees | Aphesinc | | + | | | |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
I need to analyze genera only, but preserve their occurences as follows. I'm sorry if that's a real rookie question, but I'm having trouble putting it in words what I intend to do. This is the expected result:
+-------------+------------+-----------------+-------------+-----------+------------+----------+
| Family | Genus | Nearctic Mexico | Mesoamerica | Caribbean | Los Llanos | Amazonia |
+-------------+------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | | + | + | + | + |
+-------------+------------+-----------------+-------------+-----------+------------+----------+
I have found similar problems, but dealing with numeric data and I couldn't manage to adapt the solution.
Upvotes: 1
Views: 451
Reputation: 862441
Use GroupBy.first
for first non misisng values for 5th+ columns:
#if necessary replace empty space to missing values
df = df.replace('',np.nan)
cols = df.columns[4:].tolist()
df = df.groupby(['Family','Genus'], as_index=False)[cols].first()
print (df)
Family Genus Nearctic Mexico Mesoamerica Caribbean Los Llanos \
0 Acanthaceae Aphelandra NaN + + +
Amazonia
0 +
Upvotes: 1