alkanschtein
alkanschtein

Reputation: 305

How to get data from dataframe with pandas

This is the chromosome data that I have(3.2GB). For every 2 lines, there is specific id which I took column 3 (the part after # is unnecessary). I wrote those Ids in a text file. Now it's size is 1.2 GB. I need to take one id from that Ids.txt file and find two pairs from chromosome file and get their 0th and 1st column with their id. Eg: Assume i am doing it for

ID = HWI-ST1113_0139:6:1309:6672:91333 .

The result has to be

HWI-ST1113_0139:6:1309:6672 chr1 807003 chr1 805329

I could not create proper algorithm for that also i am new with pandas and numpy. If you can help I will appriciate that guys. Thank you.

    0 .    1 .     2 .      3 .                                     
    chr1   42559   42609    HWI-ST216_0359:4:1106:4167:41680#AT.AAG/1  26  -
    chr1   96644   96694    HWI-ST1113_0139:5:2205:21258:88747#..GGT./1  25  -
    chr1  228403  228452    HWI-ST1113_0139:5:1115:14469:39910#....../1  19  +
    chr1  532565  532615    HWI-ST216_0359:4:1313:3216:51668#C.G.C./1  23  -
    chr1  557412  557462    HWI-ST1113_0139:5:1205:9402:77620#..G.../1  30  -
    chr1  807003  807051    HWI-ST1113_0139:6:1309:6672:91333#.G...A/1  42  -
    chr1  805329  805379    HWI-ST1113_0139:6:1309:6672:91333#.G...A/2  26  +

Upvotes: 1

Views: 447

Answers (1)

jezrael
jezrael

Reputation: 863651

Use boolean indexing for filtering:

ID = 'HWI-ST1113_0139:6:1309:6672:91333'
mask = df[3].str.startswith(ID)

df1 = df.loc[mask, [0, 1, 3]]
print (df1)
      0       1                                         3
5  chr1  807003   HWI-ST1113_0139:6:1309:6672:91333#.G...
6  chr1  805329  HWI-ST1113_0139:6:1309:6672:91333#.G...A

#split by `#` and then remove last 6 chars
df1[3] = df1[3].str.split('#').str[0].str.rsplit(':', 1).str[0]
print (df1)
      0       1                            3
5  chr1  807003  HWI-ST1113_0139:6:1309:6672
6  chr1  805329  HWI-ST1113_0139:6:1309:6672

If want also reshape output to one row use unstack:

df1 = (df1.set_index([3, np.arange(len(df1.index))])
          .unstack()
          .sort_index(axis=1, level=1)
          .reset_index())
df1.columns = range(df1.shape[1])
print (df1)
                             0     1       2     3       4
0  HWI-ST1113_0139:6:1309:6672  chr1  807003  chr1  805329

Upvotes: 3

Related Questions