Reputation: 2296
I have a data frame similar to the following:
virus_RNA specie id date
XXYGHS, ZZYRSC, Mk4RRE human 1 04.08.2010
XXYGHS, KRSTYC, ZZQERT human 2 02.06.2007
Mk4RRE, TTYCY3, WEQ478 bat 3 03.04.2002
Mk4RRE, XXYGHS, ZZQ478, 23RTYB rat 4 01.01.2001
VYsr67, XXYGHS, ZZQ478, 23RTYB rat 5 01.01.2001
XXYRTC, RTyy7u, MZrgTY human 6 01.03.2004
Mk4RRE, SfjB23, ZrtY6V, XXYGHS dog 7 01.12.1993
XXRSHS, KFK22C, ZZYRSC human 8 02.06.2003
I want to achieve two things. First: I want to count how many times each of the virus RNA sequences has been observed in each species up to the given date, and create a new dataframe. For instance, if we take the first row as the benchmark:
virus_RNA date human bat rat dog
XXYGHS 04.08.2010 1 0 2 1
ZZYRSC 04.08.2010 1 0 0 0
Mk4RRE 04.08.2010 0 1 1 1
Second: I need to go back to the original dataframe and add the sum of all previous occurrences of each virus RNA, both in the same species as the focal species and in species different from the focal species. Following the previous example:
Occurrence in the human species (the focal species): XXYGHS and ZZYRSC have each been observed in humans while Mk4RRE has not (1 + 1 + 0 = 2).
Occurrence in different species (other than human): bat (0 + 0 + 1 = 1); rat (2 + 0 + 1 = 3); dog (1 + 0 + 1 = 2), giving a sum of 6.
If we add it back to the original dataframe:
virus_RNA specie date same_specie different_specie id
XXYGHS, ZZYRSC, Mk4RRE human 04.08.2010 2 6 1
Upvotes: 0
Views: 71
Reputation: 81
[Edited to answer your question better, I hope I didn't introduce any bugs!]
I think I may be able to help. Let's first generate the DataFrame:
import pandas as pd

df = pd.DataFrame({
    "virus_RNA": [
        "XXYGHS, ZZYRSC, Mk4RRE",
        "XXYGHS, KRSTYC, ZZQERT",
        "Mk4RRE, TTYCY3, WEQ478",
        "Mk4RRE, XXYGHS, ZZQ478, 23RTYB",
        "VYsr67, XXYGHS, ZZQ478, 23RTYB",
        "XXYRTC, RTyy7u, MZrgTY",
        "Mk4RRE, SfjB23, ZrtY6V, XXYGHS",
        "XXRSHS, KFK22C, ZZYRSC",
    ],
    "specie": ["human", "human", "bat", "rat", "rat", "human", "dog", "human"],
    "date": [
        "04.08.2010", "02.06.2007", "03.04.2002", "01.01.2001",
        "01.01.2001", "01.03.2004", "01.12.1993", "02.06.2003",
    ],
})
The first thing I would do is create one row per virus RNA, since my answer is based on groupby operations:
# Start by creating a list from the strings after having removed the spaces
df.loc[:, "virus_RNA"] = df.loc[:, "virus_RNA"].str.replace(" ", "").str.split(",")
# Explode: one row per element of the list (I changed the previous code to avoid the memory error here; please confirm this works)
df = df.explode("virus_RNA")
You then get:
>>> df.head()
date specie virus_RNA
0 04.08.2010 human XXYGHS
1 04.08.2010 human ZZYRSC
2 04.08.2010 human Mk4RRE
3 02.06.2007 human XXYGHS
4 02.06.2007 human KRSTYC
Then, since you need the focal species later, I would add a column with the focal species name, and then use get_dummies from pandas to get a one-hot encoded version of the specie column. This way we can do a groupby sum.
df["focal"] = df.loc[:, "specie"]
df = pd.get_dummies(df, columns=["specie"], prefix="", prefix_sep="")
This gives us:
>>> df.head()
date virus_RNA focal bat dog human rat
0 04.08.2010 XXYGHS human 0 0 1 0
1 04.08.2010 ZZYRSC human 0 0 1 0
2 04.08.2010 Mk4RRE human 0 0 1 0
3 02.06.2007 XXYGHS human 0 0 1 0
4 02.06.2007 KRSTYC human 0 0 1 0
Now we can do several groupby operations. For the first question, you can group by virus_RNA and date and sum; this gives you the sum of the dummies created before:
# numeric_only=True keeps only the dummy columns (the focal strings are dropped)
df.groupby(["virus_RNA", "date"]).sum(numeric_only=True)
With this output:
>>> df.groupby(["virus_RNA", "date"]).sum(numeric_only=True).head()
bat dog human rat
virus_RNA date
23RTYB 01.01.2001 0 0 0 2
KFK22C 02.06.2003 0 0 1 0
KRSTYC 02.06.2007 0 0 1 0
MZrgTY 01.03.2004 0 0 1 0
Mk4RRE 01.01.2001 0 0 0 1
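As a side note, your first question asks for counts up to the given date rather than per date. A possible sketch for that, assuming the exploded, one-hot-encoded df from above (the counts and species_cols names are just placeholders): parse and sort the dates, take a cumulative sum per virus, then subtract each row's own dummies so the focal observation itself is not counted.
# Prior observations per species for each virus, strictly before the focal row
# (assumes day.month.year dates; rows sharing the same date are counted in
# whatever order the sort leaves them)
counts = df.drop(columns=["focal"]).copy()
counts["date"] = pd.to_datetime(counts["date"], format="%d.%m.%Y")
counts = counts.sort_values("date").reset_index(drop=True)
species_cols = ["bat", "dog", "human", "rat"]
counts[species_cols] = counts.groupby("virus_RNA")[species_cols].cumsum() - counts[species_cols]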
For the second question, I would calculate the total cases per species for each virus RNA. For this you can get the sum with a groupby and merge it back into the first dataframe:
# numeric_only=True again keeps only the dummy columns in the sum
df_grouped = df.groupby(["virus_RNA"]).sum(numeric_only=True).reset_index()
df = df.drop(columns=["bat", "dog", "human", "rat"]).merge(df_grouped, on="virus_RNA").drop_duplicates()
Giving the result:
>>> df.head()
date virus_RNA focal bat dog human rat
0 04.08.2010 XXYGHS human 0 1 2 2
1 02.06.2007 XXYGHS human 0 1 2 2
2 01.01.2001 XXYGHS rat 0 1 2 2
4 01.12.1993 XXYGHS dog 0 1 2 2
5 04.08.2010 ZZYRSC human 0 0 2 0
We could reduce the data by removing the date and dropping duplicates, but we need the date to re-create the initial dataframe later.
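If you only needed those per-virus totals, that could look like this (a sketch; totals is just a placeholder name):
# Per-virus totals without the date, kept in a separate variable so the rest of
# the walkthrough can keep using df, which still has the date column
totals = df.drop(columns=["date"]).drop_duplicates()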
We can now create a column for "different_specie" and one for "same_specie" using apply on the dataframe:
df["different_specie"] = df.apply(lambda row: sum([val for key, val in row.items()
if key not in ["virus_RNA", "date", "focal"]
and row["focal"] != key]),
axis=1)
df["same_specie"] = df.apply(lambda row: row[row["focal"]],
axis=1)
df = df.loc[:, ["virus_RNA", "focal", "different_specie", "same_specie", "date"]]
This gives us:
>>> df.head()
virus_RNA focal different_specie same_specie date
0 XXYGHS human 3 2 04.08.2010
1 XXYGHS human 3 2 02.06.2007
2 XXYGHS rat 3 2 01.01.2001
4 XXYGHS dog 4 1 01.12.1993
5 ZZYRSC human 0 2 04.08.2010
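As a side note, different_specie could also be computed without apply, before the species columns are dropped, since it is just the row total across species minus the focal count (a sketch, assuming the four dummy columns from get_dummies above):
# Equivalent to the first apply above, but vectorized: total observations across
# all species minus the observations in the focal species
species_cols = ["bat", "dog", "human", "rat"]
df["different_specie"] = df[species_cols].sum(axis=1) - df["same_specie"]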
Then we need to re-create the initial dataframe. We can group by date here, although the date is not a perfectly safe key (two of your example rows share 01.01.2001, so the id column from your original data would be more robust). We concatenate the virus_RNA, which gives us a temporary dataframe lacking a few columns, and then merge it back with the first dataframe and drop the duplicates.
temp = df.groupby(["date"])["virus_RNA"].apply(','.join).reset_index()
df = df.merge(temp, on="date").drop("virus_RNA_x", axis=1).drop_duplicates()
Which gives us:
>>> df.head()
focal different_specie same_specie date virus_RNA_y
0 human 3 2 04.08.2010 XXYGHS,ZZYRSC,Mk4RRE
1 human 0 2 04.08.2010 XXYGHS,ZZYRSC,Mk4RRE
2 human 3 1 04.08.2010 XXYGHS,ZZYRSC,Mk4RRE
3 human 3 2 02.06.2007 XXYGHS,KRSTYC,ZZQERT
4 human 0 1 02.06.2007 XXYGHS,KRSTYC,ZZQERT
This may, however, shuffle the order of the codes inside virus_RNA, which you can overcome by sorting them, or by creating an id at the start and using it as the merge key. You also need to clean up a few things, like the virus_RNA_y column name.
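For the sorting part, the re-merge above could for instance be written like this instead (a sketch; it still uses the date as the key, so carrying an id column from your original data through the steps would be more robust):
# Variant of the re-merge: sort the RNA codes so the joined string has a stable
# order, and let the merged column keep the name virus_RNA directly
temp = df.groupby(["date"])["virus_RNA"].apply(lambda s: ",".join(sorted(s))).reset_index()
df = df.merge(temp, on="date", suffixes=("_x", "")).drop("virus_RNA_x", axis=1).drop_duplicates()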
To get a cumulative sum you can order by date and group by focal and virus RNA:
df[["cumsum_different", "cumsum_same"]] = df.groupby(["virus_RNA_y", "focal"]).cumsum()
I think you didn't actually want the dataframe re-merged given the result?
I hope this helps. Also, this is my first answer here; I hope the formatting and other requirements of this site are OK.
Upvotes: 2