Reputation: 11
I have 2 files: file 1 (below) has bp start and stop coordinates
Chrom | Gene start (bp) | Gene end (bp) |
---|---|---|
1 | 50902700 | 50902978 |
1 | 103817769 | 103828355 |
file 2 has mean coverage values per base pair position:
Chrom | pos | mean |
---|---|---|
1 | 12141 | 0.029005 |
1 | 12142 | 0.029216 |
what I need: I need to match chrom, start and end from file 1 (taking start and end as a range) to chrom, pos; and calculate average of means (mean column in file 2) within that range of coordinates of file 1.
desired output: Chromosome/scaffold name Gene start (bp) Gene end (bp) average coverage per gene
Chrom | Gene start (bp) | Gene end (bp) | average of means |
---|---|---|---|
1 | 50902700 | 50902978 | (mean coverage for this gene) |
1 | 103817769 | 103828355 | (mean coverage for this gene) |
I have tried using dictionaries and nested for loops:
Code:
`# importing gene start/end files
df_gene = pd.read_csv('gene_list.csv')
# importing exome data file
df_data = pd.read_csv('exomes.coverage.summary.tsv', sep = '\t')
# Creating a Dictionary to store mean values
dict_mean=df_data.set_index('pos')['mean'].to_dict()
start = df_gene['Gene start (bp)'].to_list()
end = df_gene['Gene end (bp)'].to_list()
list_mean=[]
x=0
df_mean = pd.DataFrame(columns=['start','end','mean coverage'])
### looping:
for s,e in zip(start,end):
for key,val in dict_mean.items():
if key>=s and key<=e:
list_mean.append(val)
x=np.mean(list_mean) #calculating average of means
my_series = pd.Series(data=[s, e, x], index=['start', 'end', 'mean coverage'])
df_mean=df_mean.append(my_series,ignore_index=True)
### Add mean coverage to gene dataframe
df_gene['mean coverage'] = df_mean['mean coverage']
df_gene.to_csv('gene_out.csv', index=False)
`
The code works, but it doesn't account for the chrom number. How can I find the average of means within the range of start and stop?
Upvotes: 0
Views: 157
Reputation: 3133
This should solve your problem. It is very dependent on having consistent data, meaning that every entry (chrom, pos)
in file 2 has a corresponding (chrom, s, e)
in file 1. If it is not the case, you'll have to perform additional checks in the inner while
loop.
# importing gene start/end files
df_gene = pd.read_csv('gene_list.csv')
# importing exome data file
df_data = pd.read_csv('exomes.coverage.summary.tsv', sep = '\t')
# Creating a Dictionary to store mean values
chroms_f2 = df_data.['Chrom'].to_list()
positions = df_data.['pos'].to_list()
means = df_data.['mean'].to_list()
f2_as_list = sorted(zip(chroms_f2, zip(positions, means))
starts = df_gene['Gene start (bp)'].to_list()
ends = df_gene['Gene end (bp)'].to_list()
chroms_f1 = df_gene['Chrom'].to_list()
f1_as_list = sorted(zip(chroms_f1, zip(starts, ends)))
df_mean = pd.DataFrame(columns=['chrom', 'start','end','mean coverage'])
### looping:
i1 = 0
c1, (s, e) = f1_as_list[i1]
list_mean = []
for c2, (p, m) in f2_as_list:
if not (c1 == c2 and s <= p <= e):
my_series = pd.Series(
data=[c, s, e, np.mean(list_mean)],
index=['chrom', 'start', 'end', 'mean coverage']
)
df_mean=df_mean.append(my_series,ignore_index=True)
list_mean = []
while not (c1 == c2 and s <= p <= e):
i1 += 1
c1, (s, e) = f1_as_list[i1]
list_mean.append(m)
### Add mean coverage to gene dataframe
df_gene['mean coverage'] = df_mean['mean coverage']
df_gene.to_csv('gene_out.csv', index=False)
Upvotes: 0