Reputation: 69
I am working with a csv file (100s of rows) containing data as follows. I would like to get counts per each gene for each element in csv/tab format.
Input
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
Expected output
Gene G-Box MYC
---------- ------- -----
STBZIP1 2 3
STBZIP10 4 3
Can someone please help me to come up with a bash script (or python) in this regard?
Update
I am trying the following and stuck for the time being :| ;
import pandas as pd
df = pd.read_csv("Promoter_Element_Distribution.csv")
print (df)
df.groupby(['Gene', 'Element']).size().unstack(fill_value=0)
Upvotes: 1
Views: 90
Reputation: 6414
Since you asked also for a bash version, here is a use of awk
1. It's commented, and also the output is "well" formatted, so the code is a little huge (about 20 lines without the comments).
awk '# First record line:
# Storing all column names into elements, including
# the first column name
NR == 1 {firstcol=$1;element[$1]++}
# Each line starting with the second one are datas
# Occurrences are counted with an indexed array
# count[x][y] contains the count of Element y for the Gene x
NR > 2 {element[$2]++;count[$1][$2]++}
# Done, time for displaying the results
END {
# Let us display the first line, column names
## Left-justify the first col, because it is text
printf "%-10s ", firstcol
## Other are counts, so we right-justify
for (i in element) if (i != firstcol) printf "%10s ", i
printf "\n"
# Now an horizontal bar
for (i in element) {
c = 0
while (c++ < 10) { printf "-"}
printf " ";
}
printf "\n"
# Now, loop through the count records
for (i in count) {
# Left justification for the column name
printf "%-10s ", i ;
for(j in element)
# For each counted element (ie except the first one),
# print it right-justified
if (j in count[i]) printf "%10s", count[i][j]
printf "\n"
}
}' tab-separated-input.txt
Result:
Gene G-box MYC
---------- ---------- ----------
STBZIP10 4 3
STBZIP1 2 3
1 This solution requires GNU awk
for arrays of arrays (count[$1][$2]
syntax) - Thanks to Ed Morton
Upvotes: 2
Reputation: 11351
With the file in the form (named input.csv
here):
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
this
import pandas as pd
df = pd.read_csv('input.csv', delim_whitespace=True, skiprows=1)
df.columns = ['Gene', 'Element']
df['Count'] = 1
df = df.pivot_table(index='Gene', columns='Element', aggfunc=sum)
print(df)
gives you
Count
Element G-box MYC
Gene
STBZIP1 2 3
STBZIP10 4 3
Upvotes: 2