Reputation: 361
I have a pandas data frame and in it are string values I want to count. The strings I want to count are "SYNONYMOUS_CODING" and "NON_SYNONYMOUS_CODING". I've found that these strings are located in columns 23, 24, 25, 29 and 31.
Columns 23 looks like this:
15392 OAnc=C
15393 114
15394 EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc...
15395 0/0:30:90.29:0
15396 pSC=0.441
15397 pSC=0.030
15398 bSC=884
...
Column 24 looks like this:
3092 EXON(MODIFIER||||870|RSPH10B|protein_coding|CO...
3093 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT...
3094 INTERGENIC(MODIFIER||||||||||1)
3095 INTERGENIC(MODIFIER||||||||||1)
3096 DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003...
3097 DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003...
3098 DP=12
...
Column 25 looks like:
13062 C
13063 C
13064 EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13065 EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13066 CAnc=G
13067 C
13068 G
Column 29 looks like:
15688 0:0
15689 0:0
15690 NaN
15691 EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|...
15692 0:0
15693 NaN
15694 0:1
and Column 31 looks like:
3081 45
3082 1432:0
3083 0:0
3084 SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T473|482|...
3085 9
3086 0:0
3087 0:0
I wanted to know how can I go through the five columns and count the number of times the strings "SYNONYMOUS_CODING" or "NON_SYNONYMOUS_CODING" appears without double counting. Because there might be rows where these strings appear in two or more different columns.
Thank you.
Rodrigo
Upvotes: 1
Views: 496
Reputation: 1415
Here's something that I worked through, I include the code used to create the dataframe. You can see the algorithm by focusing on the main() method
def create_df():
grid = (
{'A': ["EXON(MODIFIER||||870|RSPH10B|protein_coding|CO)",
"NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
"INTERGENIC(MODIFIER||||||||||1)",
"DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003)",
"DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003)"],
'B': ["FOO",
"EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc",
"NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
"pSC=0.441",
"bSC=884"],
'C': ["BAR",
"BAR",
"EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
"EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
"EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
'D': ["EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
"0:0",
"0:0",
"EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
"EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
}
)
return pd.DataFrame(grid)
def get_masks(df):
non_syn = pd.DataFrame(index=df.index, columns=df.columns)
synonymous = pd.DataFrame(index=df.index, columns=df.columns)
for i in df:
non_syn[i] = df[i].str.contains("NON_SYNONYMOUS_CODING")
synonymous[i] = df[i][~non_syn[i]].str.contains("SYNONYMOUS_CODING")
return non_syn, synonymous.dropna()
def count_unique_truths(df):
# make unique across rows, and then restore to regular
df = df.transpose().drop_duplicates().transpose()
return np.sum(df).sum()
def main():
df = create_df()
non_syn, synonymous = get_masks(df)
non_syn_count = count_unique_truths(non_syn)
synonymous_count = count_unique_truths(synonymous)
print(df)
print("Synonymous Count = {:d}\nNon_Synonymous Count = {:d}".format(int(synonymous_count), int(non_syn_count)))
df.groupby()
if __name__ == '__main__':
main()
Upvotes: 1
Reputation: 361
I can get the counts for the times the strings, "SYNONOMOUS_CODING" and "NON_SYNONOMOUS_CODING" appear in each column by:
column23 = str(df_test[23])
column24 = str(df_test[24])
column25 = str(df_test[25])
column29 = str(df_test[29])
column31 = str(df_test[31])
count = 0
if "SYNONYMOUS_CODING" in column23:
print "YES Syn in Column 23"
count += 1
print "Count value:"
print count
if "SYNONYMOUS_CODING" in column24:
print "YES Syn in Column 24"
count += 1
print "Count value:"
print count
if "SYNONYMOUS_CODING" in column25:
print "YES Syn in Column 25"
count += 1
print "Count value:"
print count
if "SYNONYMOUS_CODING" in column29:
print "YES Syn in Column 29"
count += 1
print "Count value:"
print count
if "SYNONYMOUS_CODING" in column31:
print "YES Syn in Column 31"
count += 1
print "Count value:"
print count
if "NON_SYNONYMOUS_CODING" in column23:
print "YES Non_Syn in Column 23"
count += 1
print "Count value:"
print count
if "NON_SYNONYMOUS_CODING" in column24:
print "YES Non_Syn in Column 24"
count += 1
print "Count value:"
print count
if "NON_SYNONYMOUS_CODING" in column25:
print "YES Non_Syn in Column 25"
count += 1
print "Count value:"
print count
if "NON_SYNONYMOUS_CODING" in column29:
print "YES Non_Syn in Column 29"
count += 1
print "Count value:"
print count
if "NON_SYNONYMOUS_CODING" in column31:
print "YES Non_Syn in Column 31"
count += 1
print "Count value:"
print count
But this is highly repetitive and non-pythonic like I wanted...
Upvotes: 0