Reputation: 21
I have two data frames as follows:
df1
chr_number start end strand
0 chr1 111478338 111478339 +
1 chr1 111478370 111478371 +
2 chr1 111478372 111478373 +
3 chr1 157123306 157123307 -
4 chr1 157123307 157123308 -
5 chr1 212619741 212619742 +
6 chr1 212619742 212619743 +
df2
Chromosome Start End Log2 Fold Change Strand Gene \
0 chr1 111478330 111478444 3.036912 + C1orf162
1 chr1 157123300 157123338 3.293174 - ETV3
2 chr1 207079296 207079412 3.916122 + PFKFB2
3 chr1 212619736 212619771 3.880546 + ATF3
Ensembl ID Feature
0 ENSG00000143110.11 3' UTR
1 ENSG00000117036.12 3' UTR
2 ENSG00000123836.15 3' UTR
3 ENSG00000162772.17 3' UTR
I need to look if start from df1 is located between Start and End in df2. If so, I'd like to have a new data frame which contains start value from df1 with corresponding row in df2.
Here is the example of what I need for each start value from df1:
CrossLink Chromosome Start End Log2 Fold Change Strand \
1 111478338 chr1 111478330.0 111478444.0 3.036912 +
Gene Ensembl ID Feature
1 C1orf162 ENSG00000143110.11 3' UTR
I wrote this code:
df3 = pd.DataFrame([])
df3["CrossLink"] = np.nan
for v in df1["start"]:
df4 = df2[(df2["Start"] <= v) & (df2["End"] > v)]
df3 = df3.append(df4)
df3["CrossLink"] = df1["start"]
And I get this output:
CrossLink Chromosome Start End Log2 Fold Change Strand \
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
1 111478370 chr1 157123300.0 157123338.0 3.293174 -
1 111478370 chr1 157123300.0 157123338.0 3.293174 -
3 157123306 chr1 212619736.0 212619771.0 3.880546 +
3 157123306 chr1 212619736.0 212619771.0 3.880546 +
Gene Ensembl ID Feature
0 C1orf162 ENSG00000143110.11 3' UTR
0 C1orf162 ENSG00000143110.11 3' UTR
0 C1orf162 ENSG00000143110.11 3' UTR
1 ETV3 ENSG00000117036.12 3' UTR
1 ETV3 ENSG00000117036.12 3' UTR
3 ATF3 ENSG00000162772.17 3' UTR
3 ATF3 ENSG00000162772.17 3' UTR
It does not contain all my start values from df1 and it gives me duplicates. I am quite new in python and pandas and I searched a lot but I couldn't figure it out.
Thanks a lot in advance for your help!
Upvotes: 2
Views: 1350
Reputation: 1008
A solution using a two step process:
Let's say we have
df = pd.DataFrame({'chr_number':['chr1', 'chr2'], 'start':[3, 5],})
df2 = pd.DataFrame({'index': ['chr1', 'chr3'], 'col': ['a', 'b'], 'start': [1, 2], 'end':[4, 5]})
print(df)
print(df2)
chr_number start
0 chr1 3
1 chr2 5
index col start end
0 chr1 a 1 4
1 chr3 b 2 5
We can then apply aggregation and explode to get the desired output.
df2.start = df2.apply(lambda x: df.loc[(x['start'] <= df.start) & (df.start <= x['end'])].start.agg(list), axis=1)
print(df2.explode('start'))
index col start end
0 chr1 a 3 4
1 chr3 b 3 5
1 chr3 b 5 5
Edit: I realized that I was doing the incorrect operation comparing df2
values instead of df
. The edited code now replaces df2.start
with all df.start
values that fall between df2.start
and df2.end
for rows of df2
.
Upvotes: 0