Phlya

Reputation: 5976

Fill pandas dataframe with values in between

I am new to pandas and want to learn it better. I have a DataFrame that looks like this:

        0    1    2
0   chr2L    1    4
1   chr2L    9   12
2   chr2L   17   20
3   chr2L   23   23
4   chr2L   26   27
5   chr2L   30   40
6   chr2L   45   47
7   chr2L   52   53
8   chr2L   56   56
9   chr2L   61   62
10  chr2L   66   80

I want to get something like this:

        0    1    2    3
0   chr2L    0    1    0
1   chr2L    1    2    1
2   chr2L    2    3    1
3   chr2L    3    4    1
4   chr2L    4    5    0
5   chr2L    5    6    0
6   chr2L    6    7    0
7   chr2L    7    8    0
8   chr2L    8    9    0
9   chr2L    9   10    1
10  chr2L   10   11    1
11  chr2L   11   12    1
12  chr2L   12   13    0
And so on...

So I want to split all the data into 1-length intervals, filling the missing intervals with zeros and marking the present intervals as ones. If there is an easy way to mark the "boundary" positions (the borders of the intervals in the initial data) as 0.5 at the same time, that might also be helpful.

Column 0 contains multiple string values, and this should be done for each of them separately. Each one needs a different final length (the last position that should get a 0 or a 1 differs). I would appreciate help with doing this in pandas.

Upvotes: 0

Views: 1943

Answers (1)

cphlewis

Reputation: 16249

This works for most of your first paragraph and some of the second. Left as an exercise: finish inserting insideness=0 rows (see end):

import pandas as pd
# dummied-up version of your data, but with column headers for readability:
df = pd.DataFrame({'n':['a']*4 + ['b']*2, 'a':[1,6,8,5,1,5],'b':[4,7,10,5,3,7]})



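# splitting up a range, translated into df row terms:
def onebyone(dfrow):
    # dfrow is an (index, Series) pair as yielded by df.iterrows()
    row = dfrow[1]
    a = int(row['a']); b = int(row['b']); n = row['n']
    count = b - a
    if count >= 2:
        # the two boundary steps get 0.5, the steps between them get 1
        interior = [0.5] + [1] * (count - 2) + [0.5]
    elif count == 1:
        interior = [0.5]
    else:
        # zero-length range: produces no rows
        interior = []

    return {'n': [n] * count, 'a': list(range(a, a + count)),
            'b': list(range(a + 1, a + count + 1)),
            'insideness': interior}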

Edited to use pd.concat() to combine the intermediate results:

# Into a new dataframe:
intermediate = []

for label in set(df.n):
    for row in df[df.n == label].iterrows():
        intermediate.append(pd.DataFrame(onebyone(row)))

df_onebyone = pd.concat(intermediate)
df_onebyone.index = range(len(df_onebyone))
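
The re-indexing line can probably be folded into the concat call itself. A minimal sketch, using two small made-up frames (the `pieces` list is only a stand-in for `intermediate`, not the real data):

```python
import pandas as pd

# Two small hypothetical frames standing in for the per-row results
pieces = [
    pd.DataFrame({'n': ['a', 'a'], 'a': [1, 2], 'b': [2, 3]}),
    pd.DataFrame({'n': ['b'], 'a': [5], 'b': [6]}),
]

# ignore_index=True drops each piece's own index and renumbers the rows from 0
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Without `ignore_index=True` each piece keeps its own 0-based index, so the combined frame would have repeated index labels.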

And finally a sketch of identifying the missing rows, which you can edit to match the above for-loop in adding rows to a final dataframe:

# for times in the overall range describing 'a'
rows_a = df_onebyone[df_onebyone.n == 'a']
for i in range(int(rows_a.a.min()), int(rows_a.a.max())):
    # if a time isn't in an existing 0.5-1-0.5 range:
    if i not in rows_a.a.values:
        # these are the values to fill in a 0-row
        print('%d, %d, 0' % (i, i + 1))

Or, if you know the a column will be sorted for each n, you could keep track of the last end-value handled by onebyone() and insert some extra rows to catch up to the next start value you're going to pass to onebyone().
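
A rough sketch of that last idea, in case it helps. It writes plain 0/1 values rather than the 0.5 boundaries, and it assumes integer positions and that every label starts at position 0, as in the desired output in the question. The function name `fill_gaps`, the column names `chrom`/`start`/`end`/`value` and the optional `lengths` argument are made up for illustration:

```python
import pandas as pd

def fill_gaps(df, lengths=None):
    """Expand intervals into 1-length rows: 1 inside an interval, 0 in the gaps.

    lengths is an optional (hypothetical) dict mapping label -> final length,
    used to pad trailing zeros after the last interval of that label.
    """
    lengths = lengths or {}
    rows = []
    for label, group in df.groupby('chrom', sort=False):
        group = group.sort_values('start')
        last_end = 0  # assumption: each label starts at position 0
        for start, end in zip(group['start'], group['end']):
            start, end = int(start), int(end)
            # catch up from the last handled position with 0-rows
            rows.extend((label, i, i + 1, 0) for i in range(last_end, start))
            # then the interval itself as 1-rows (zero-length intervals add nothing)
            rows.extend((label, i, i + 1, 1)
                        for i in range(max(start, last_end), end))
            last_end = max(last_end, end)
        # optional padding with 0-rows up to the requested final length
        rows.extend((label, i, i + 1, 0)
                    for i in range(last_end, lengths.get(label, last_end)))
    return pd.DataFrame(rows, columns=['chrom', 'start', 'end', 'value'])

# First three intervals from the question
example = pd.DataFrame({'chrom': ['chr2L'] * 3,
                        'start': [1, 9, 17],
                        'end': [4, 12, 20]})
filled = fill_gaps(example)
print(filled.head(13))
```

Building a plain list of tuples and creating the DataFrame once at the end avoids making one small DataFrame per input row, which may matter if the real data is large.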

Upvotes: 1
