Reputation: 3950
I'm assembling data from multiple sources, specifically reactions and reaction formulas.
Some sources have both the reaction name and the formula, while others may only have the formula (see rows 2 and 3 in the example below).
Say I have a DataFrame with the following:
│ Row │ reaction │ formula │
├─────┼──────────┼─────────┤
│ 1 │ "a" │ 1 │
│ 2 │ "b" │ 2 │
│ 3 │ "" │ 2 │
│ 4 │ "c" │ 3 │
As the table suggests, rows 2 and 3 have the same reaction formula, but only row 2 has the reaction name. What I'd like to do is remove rows that have a formula but no name, when the same formula already exists elsewhere together with a reaction name.
I.e., among rows that are duplicates w.r.t. column 2 (formula), keep the one whose reaction name is not empty, so as to get
│ Row │ reaction │ formula │
├─────┼──────────┼─────────┤
│ 1 │ "a" │ 1 │
│ 2 │ "b" │ 2 │
│ 3 │ "c" │ 3 │
Upvotes: 1
Views: 125
Reputation: 945
Let's say you have:
using DataFrames
df = DataFrame(reaction = ["a", "b", "", "c"], formula = [1, 2, 2, 3]);
What you can do is the following:
# This Boolean mask is true for every row whose reaction name is not empty:
ind = df.reaction .!= "";
# Then filter the DataFrame to keep only the rows that have a name:
df2 = df[ind, :];
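Edit: Note that this drops every row with an empty name, including rows whose formula does not appear anywhere else. You can make the selector more specific by refining the definition of ind according to your needs.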
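For instance, here is a minimal sketch of a more specific selector. It assumes a recent DataFrames.jl and that a missing name is always stored as the empty string "". The names named and keep are made up for illustration. A row is kept if it has a name, or if no named row shares its formula:

```julia
using DataFrames

df = DataFrame(reaction = ["a", "b", "", "c"], formula = [1, 2, 2, 3])

# Formulas that occur at least once together with a non-empty reaction name.
named = Set(df.formula[df.reaction .!= ""])

# Keep a row if it has a name, or if its formula never appears with a name.
keep = [r != "" || !(f in named) for (r, f) in zip(df.reaction, df.formula)]

df2 = df[keep, :]
```

On the example data this removes only row 3. An unnamed row whose formula occurs nowhere else would be kept. Several unnamed rows sharing the same formula, with no named counterpart, would all survive, so a further deduplication step may be needed in that case.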
Upvotes: 1