edyvedy13
edyvedy13

Reputation: 2296

Counting number the occurrences of specific string at particular time in Pandas dataframe

I have a data frame similar to the following:

virus_RNA                             specie   id       date
XXYGHS, ZZYRSC, Mk4RRE                human     1      04.08.2010
XXYGHS, KRSTYC, ZZQERT                human     2      02.06.2007
Mk4RRE, TTYCY3, WEQ478                bat       3      03.04.2002
Mk4RRE, XXYGHS, ZZQ478, 23RTYB        rat       4      01.01.2001  
VYsr67, XXYGHS, ZZQ478, 23RTYB        rat       5      01.01.2001  
XXYRTC, RTyy7u, MZrgTY                human     6      01.03.2004
Mk4RRE, SfjB23, ZrtY6V, XXYGHS        dog       7     01.12.1993  
XXRSHS, KFK22C, ZZYRSC                human     8      02.06.2003

I want to achieve two things: First: I want to count how many times all of the RNA sequences of viruses have been observed in species till the given date and create a new dataframe. For instance if we take the first row as the benchmark:

virus_RNA  date           human    bat   rat dog
XXYGHS     04.08.2010       1       0    2    1
ZZYRSC     04.08.2010       1       0    0    0
Mk4RRE     04.08.2010       0       1    1    1

Second: I need to get back to the original dataframe and add the sum of all previous occurrences of each virus RNA in species the same as the focal specie and different than the focal specie. Following the previous example:

Occurrence in human specie (as the focal specie) : XXYGHS and ZZYRSC have been observed in human while Mk4RRE has not been observed (1 + 1 + 0 = 2).  

Occurrence in different species (different than the human) : bat (0 + 0 + 1 = 1); rat (2 + 0 + 1 = 3); dog (1 + 0 + 1= 2) and sum = 6

If we add it back to the original dataframe:

virus_RNA              specie    date       same_specie different_specie id
XXYGHS, ZZYRSC, Mk4RRE human     04.08.2010         2              6      1

Upvotes: 0

Views: 71

Answers (1)

Gabriel Trégoat
Gabriel Trégoat

Reputation: 81

[Edited to answer your question better, I hope I didn't introduce any bugs !]

I think I may be able to help. Let's first generate the DataFrame:

df = pd.DataFrame({"virus_RNA": 
                   [
                       "XXYGHS, ZZYRSC, Mk4RRE"         , 
                       "XXYGHS, KRSTYC, ZZQERT",  
                       "Mk4RRE, TTYCY3, WEQ478",  
                       "Mk4RRE, XXYGHS, ZZQ478, 23RTYB",  
                       "VYsr67, XXYGHS, ZZQ478, 23RTYB ",  
                       "XXYRTC, RTyy7u, MZrgTY",  
                       "Mk4RRE, SfjB23, ZrtY6V, XXYGHS",  
                       "XXRSHS, KFK22C, ZZYRSC", 
                   ],
                  "specie":
                   [ 
                       "human",
                       "human",
                       "bat",   
                       "rat",   
                       "rat",   
                       "human", 
                       "dog",   
                       "human",
                   ],
                   "date":
                   [
                       "04.08.2010",
                       "02.06.2007",
                       "03.04.2002",
                       "01.01.2001", 
                       "01.01.2001", 
                       "01.03.2004",
                       "01.12.1993", 
                       "02.06.2003",
                   ]
                  })

The first thing I would do is create one row per virus RNA, as my answer will be based on group by:

# Start by creating a list from the strings after having removed the spaces
df.loc[:, "virus_RNA"] = df.loc[:, "virus_RNA"].str.replace(" ", "").str.split(",")
# Explode: one row per element of the list (I changed the previous code to solve the memory error here please confirm this works)
df = df.explode("virus_RNA")

You then get:

>>> df.head()

    date        specie  virus_RNA
0   04.08.2010  human   XXYGHS
1   04.08.2010  human   ZZYRSC
2   04.08.2010  human   Mk4RRE
3   02.06.2007  human   XXYGHS
4   02.06.2007  human   KRSTYC

Then given you need later the focal I'd add a column with the focal's specie name, and then use get_dummies from pandas to get a one-hot encoded version of the specie variable. This way we can do a group by sum.

df["focal"] = df.loc[:, "specie"]
df = pd.get_dummies(df, columns=["specie"], prefix="", prefix_sep="")

This gives us:

>>> df.head()

    date        virus_RNA   focal   bat dog human   rat
0   04.08.2010  XXYGHS      human   0   0   1       0
1   04.08.2010  ZZYRSC      human   0   0   1       0
2   04.08.2010  Mk4RRE      human   0   0   1       0
3   02.06.2007  XXYGHS      human   0   0   1       0
4   02.06.2007  KRSTYC      human   0   0   1       0

Now we can do several group by. For the first question, you could group by date and sum, this will give you the sum of the dummies created before:

df.groupby(["virus_RNA", "date"]).sum()

With this output:

>>> df.groupby(["virus_RNA", "date"]).sum().head()

                    bat dog human   rat
virus_RNA   date                
23RTYB  01.01.2001  0   0   0       2
KFK22C  02.06.2003  0   0   1       0
KRSTYC  02.06.2007  0   0   1       0
MZrgTY  01.03.2004  0   0   1       0
Mk4RRE  01.01.2001  0   0   0       1

And for the second question, I would calculate the cases per specie for each virus RNA. For this you can get the sum with a group by and put it back into the first dataframe:

df_grouped = df.groupby(["virus_RNA"]).sum().reset_index()
df = df.drop(columns=["bat", "dog", "human", "rat"]).merge(df_grouped, on="virus_RNA").drop_duplicates()

Giving the result:

>>> df.head()

    date        virus_RNA   focal   bat dog human   rat
0   04.08.2010  XXYGHS      human   0   1   2       2
1   02.06.2007  XXYGHS      human   0   1   2       2
2   01.01.2001  XXYGHS      rat     0   1   2       2
4   01.12.1993  XXYGHS      dog     0   1   2       2
5   04.08.2010  ZZYRSC      human   0   0   2       0

We could have less data by removing the date and dropping duplicates, but we need the date to re-create the initial dataframe.

We can now create column for "different_specie" and one for "same_specie" using apply on the dataframe:

df["different_specie"] = df.apply(lambda row: sum([val for key, val in row.items() 
                                   if key not in ["virus_RNA", "date", "focal"] 
                                   and row["focal"] != key]),
                  axis=1)
df["same_specie"] = df.apply(lambda row: row[row["focal"]],
                  axis=1)
df = df.loc[:, ["virus_RNA", "focal", "different_specie", "same_specie", "date"]]

This gives us:

>>> df.head()

    virus_RNA   focal   different_specie    same_specie date
0   XXYGHS      human   3                   2           04.08.2010
1   XXYGHS      human   3                   2           02.06.2007
2   XXYGHS      rat     3                   2           01.01.2001
4   XXYGHS      dog     4                   1           01.12.1993
5   ZZYRSC      human   0                   2           04.08.2010

Then we need to re-create the initial dataframe. We can group by date here as it seems your dates are unique, but then we need to concatenate the virus_RNA which will give us a temporary dataframe that lacks a few columns. We can then merge it back with the first dataframe and drop the duplicates.

temp = df.groupby(["date"])["virus_RNA"].apply(','.join).reset_index()
df = df.merge(temp, on="date").drop("virus_RNA_x", axis=1).drop_duplicates()

Which gives us:

>>> df.head()


    focal   different_specie    same_specie date        virus_RNA_y
0   human   3                   2           04.08.2010  XXYGHS,ZZYRSC,Mk4RRE
1   human   0                   2           04.08.2010  XXYGHS,ZZYRSC,Mk4RRE
2   human   3                   1           04.08.2010  XXYGHS,ZZYRSC,Mk4RRE
3   human   3                   2           02.06.2007  XXYGHS,KRSTYC,ZZQERT
4   human   0                   1           02.06.2007  XXYGHS,KRSTYC,ZZQERT

This may however have different orders in the virus_RNA, which you can overcome by sorting and creating an id at the start. You also need to clean some things like the virus_RNA_y name.

To get a cumulative sum you can order by date and group by focal and virus rna:

df[["cumsum_different", "cumsum_same"]] = df.groupby(["virus_RNA_y", "focal"]).cumsum()

I think you didn't actually want the dataframe re-merged given the result?

I hope this helps. Also, this is my first answer here, I hope the formatting and other requirements of this website are ok in my answer.

Upvotes: 2

Related Questions