Reputation: 305
This is the chromosome data that I have(3.2GB). For every 2 lines, there is specific id which I took column 3 (the part after # is unnecessary). I wrote those Ids in a text file. Now it's size is 1.2 GB. I need to take one id from that Ids.txt file and find two pairs from chromosome file and get their 0th and 1st column with their id. Eg: Assume i am doing it for
ID = HWI-ST1113_0139:6:1309:6672:91333 .
The result has to be
HWI-ST1113_0139:6:1309:6672 chr1 807003 chr1 805329
I could not create proper algorithm for that also i am new with pandas and numpy. If you can help I will appriciate that guys. Thank you.
0 . 1 . 2 . 3 .
chr1 42559 42609 HWI-ST216_0359:4:1106:4167:41680#AT.AAG/1 26 -
chr1 96644 96694 HWI-ST1113_0139:5:2205:21258:88747#..GGT./1 25 -
chr1 228403 228452 HWI-ST1113_0139:5:1115:14469:39910#....../1 19 +
chr1 532565 532615 HWI-ST216_0359:4:1313:3216:51668#C.G.C./1 23 -
chr1 557412 557462 HWI-ST1113_0139:5:1205:9402:77620#..G.../1 30 -
chr1 807003 807051 HWI-ST1113_0139:6:1309:6672:91333#.G...A/1 42 -
chr1 805329 805379 HWI-ST1113_0139:6:1309:6672:91333#.G...A/2 26 +
Upvotes: 1
Views: 447
Reputation: 863651
Use boolean indexing
for filtering:
ID = 'HWI-ST1113_0139:6:1309:6672:91333'
mask = df[3].str.startswith(ID)
df1 = df.loc[mask, [0, 1, 3]]
print (df1)
0 1 3
5 chr1 807003 HWI-ST1113_0139:6:1309:6672:91333#.G...
6 chr1 805329 HWI-ST1113_0139:6:1309:6672:91333#.G...A
#split by `#` and then remove last 6 chars
df1[3] = df1[3].str.split('#').str[0].str.rsplit(':', 1).str[0]
print (df1)
0 1 3
5 chr1 807003 HWI-ST1113_0139:6:1309:6672
6 chr1 805329 HWI-ST1113_0139:6:1309:6672
If want also reshape output to one row use unstack
:
df1 = (df1.set_index([3, np.arange(len(df1.index))])
.unstack()
.sort_index(axis=1, level=1)
.reset_index())
df1.columns = range(df1.shape[1])
print (df1)
0 1 2 3 4
0 HWI-ST1113_0139:6:1309:6672 chr1 807003 chr1 805329
Upvotes: 3