Finding and Counting String Values in Pandas DataFrame

I have a pandas data frame and in it are string values I want to count. The strings I want to count are "SYNONYMOUS_CODING" and "NON_SYNONYMOUS_CODING". I've found that these strings are located in columns 23, 24, 25, 29 and 31.

Columns 23 looks like this:

15392                                               OAnc=C
15393                                                  114
15394    EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc...
15395                                       0/0:30:90.29:0
15396                                            pSC=0.441
15397                                            pSC=0.030
15398                                              bSC=884
...

Column 24 looks like this:

3092    EXON(MODIFIER||||870|RSPH10B|protein_coding|CO...
3093    NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT...
3094                      INTERGENIC(MODIFIER||||||||||1)
3095                      INTERGENIC(MODIFIER||||||||||1)
3096    DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003...
3097    DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003...
3098                                                DP=12
...

Column 25 looks like:

13062                                                      C
13063                                                      C
13064    EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13065    EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13066                                                 CAnc=G
13067                                                      C
13068                                                      G

Column 29 looks like:

15688                                                  0:0
15689                                                  0:0
15690                                                  NaN
15691    EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|...
15692                                                  0:0
15693                                                  NaN
15694                                                  0:1

and Column 31 looks like:

3081                                                   45
3082                                               1432:0
3083                                                  0:0
3084    SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T473|482|...
3085                                                    9
3086                                                  0:0
3087                                                  0:0

I wanted to know how can I go through the five columns and count the number of times the strings "SYNONYMOUS_CODING" or "NON_SYNONYMOUS_CODING" appears without double counting. Because there might be rows where these strings appear in two or more different columns.

Thank you.

Rodrigo

Upvotes: 1

Views: 496

Answers (2)

OYRM
OYRM

Reputation: 1415

Here's something that I worked through, I include the code used to create the dataframe. You can see the algorithm by focusing on the main() method

def create_df():
    grid = (
        {'A': ["EXON(MODIFIER||||870|RSPH10B|protein_coding|CO)",
               "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
               "INTERGENIC(MODIFIER||||||||||1)",
               "DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003)",
               "DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003)"],
         'B': ["FOO",
               "EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc",
               "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
               "pSC=0.441",
               "bSC=884"],
         'C': ["BAR",
               "BAR",
               "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
               "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
         'D': ["EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
               "0:0",
               "0:0",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
        }
    )
    return pd.DataFrame(grid)

def get_masks(df):
    non_syn = pd.DataFrame(index=df.index, columns=df.columns)
    synonymous = pd.DataFrame(index=df.index, columns=df.columns)

    for i in df:
        non_syn[i] = df[i].str.contains("NON_SYNONYMOUS_CODING")
        synonymous[i] = df[i][~non_syn[i]].str.contains("SYNONYMOUS_CODING")

    return non_syn, synonymous.dropna()

def count_unique_truths(df):
    # make unique across rows, and then restore to regular
    df = df.transpose().drop_duplicates().transpose()
    return np.sum(df).sum()

def main():
    df = create_df()
    non_syn, synonymous = get_masks(df)
    non_syn_count = count_unique_truths(non_syn)
    synonymous_count = count_unique_truths(synonymous)
    print(df)
    print("Synonymous Count = {:d}\nNon_Synonymous Count = {:d}".format(int(synonymous_count), int(non_syn_count)))
    df.groupby()

if __name__ == '__main__':
    main()

Upvotes: 1

I can get the counts for the times the strings, "SYNONOMOUS_CODING" and "NON_SYNONOMOUS_CODING" appear in each column by:

column23 = str(df_test[23])
column24 = str(df_test[24])
column25 = str(df_test[25])
column29 = str(df_test[29])
column31 = str(df_test[31])

count = 0

if "SYNONYMOUS_CODING" in column23:
    print "YES Syn in Column 23"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column24:
    print "YES Syn in Column 24"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column25:
    print "YES Syn in Column 25"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column29:
    print "YES Syn in Column 29"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column31:
    print "YES Syn in Column 31"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column23:
    print "YES Non_Syn in Column 23"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column24:
    print "YES Non_Syn in Column 24"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column25:
    print "YES Non_Syn in Column 25"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column29:
    print "YES Non_Syn in Column 29"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column31:
    print "YES Non_Syn in Column 31"
    count += 1

    print "Count value:"
    print count

But this is highly repetitive and non-pythonic like I wanted...

Upvotes: 0

Related Questions