Reputation: 481
I have as input a dataset like the following:
labels = ['chrom', 'start', 'end', 'read']
my_data = [['chr1', 784344, 800125, 'read1'],
           ['chr1', 784344, 800124, 'read2'],
           ['chr1', 784344, 800124, 'read3']]
Which I convert to a pandas dataframe using:
import pandas as pd
my_data_pd = pd.DataFrame.from_records(my_data, columns=labels)
and that looks like this:
  chrom   start     end   read
0  chr1  784344  800125  read1
1  chr1  784344  800124  read2
2  chr1  784344  800124  read3
What I want to do is the following: merge the rows that have identical chrom, start, and end values, and count the number of distinct values in the 'read' column among the rows that were merged. Finally, I want to convert that output to a list of tuples, as in this example (note that the last element of each tuple holds the count):
[('chr1', 784344, 800125,1), ('chr1', 784344, 800124,2)]
What I have been able to do:
Using pandas groupby and nunique() with the command:
sortedd = my_data_pd.groupby(['chrom','start','end'],sort=False).read.nunique()
I arrive at a pandas Series object that looks like what I want:
chrom  start   end
chr1   784344  800125    1
               800124    2
Name: read, dtype: int64
However, when I convert it to a list/tuple using:
sortedd.index.tolist()
the last column gets excluded, leading to the resulting output:
[('chr1', 784344, 800125), ('chr1', 784344, 800124)]
Any idea how I can get around this problem?
For anyone who might come up with a solution: I am doing this inside a big program thousands of times, so speed is a big issue. That's the reason I am avoiding other tools like BedTools and pybedtools.
Thanks!
Upvotes: 3
Views: 1529
Reputation: 862871
First call reset_index, then convert the rows to tuples in a list comprehension:
L = [tuple(x) for x in sortedd.reset_index().values.tolist()]
print(L)
[('chr1', 784344, 800125, 1), ('chr1', 784344, 800124, 2)]
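For reference, here is a minimal end-to-end sketch built from the question's data. It assumes sortedd is the Series returned by the groupby in the question, and it uses itertuples(index=False, name=None) as a variant of the list comprehension above; that variant is a suggestion, not something the question or this answer originally used.

```python
import pandas as pd

labels = ['chrom', 'start', 'end', 'read']
my_data = [['chr1', 784344, 800125, 'read1'],
           ['chr1', 784344, 800124, 'read2'],
           ['chr1', 784344, 800124, 'read3']]
my_data_pd = pd.DataFrame.from_records(my_data, columns=labels)

# Assumption: sortedd is the grouped Series from the question.
sortedd = my_data_pd.groupby(['chrom', 'start', 'end'], sort=False)['read'].nunique()

# reset_index moves the group keys out of the index into ordinary columns,
# so the count ends up next to them in the same row.
flat = sortedd.reset_index()

# name=None makes itertuples yield plain tuples instead of namedtuples.
L = list(flat.itertuples(index=False, name=None))
print(L)
```

The key point is that index.tolist() only sees the index levels, while the counts live in the Series values; any approach that puts both side by side before converting should give the four-element tuples.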
Upvotes: 3
Reputation: 30605
You can use a MultiIndex, i.e.:
idx = pd.MultiIndex.from_arrays(sortedd.reset_index().values.T)
idx.tolist()
[('chr1', 784344, 800125, 1), ('chr1', 784344, 800124, 2)]
Upvotes: 3
Reputation: 323306
You can use set_index with append=True:
sortedd.to_frame('val').set_index('val',append=True).index.tolist()
[('chr1', 784344, 800125, 1), ('chr1', 784344, 800124, 2)]
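Since the question says speed matters, it may be worth timing the three suggestions on your own data. The following is a rough sketch only: the function names, the candidates dict, and the repeat count of 200 are made up for illustration, and the tiny three-row input will not reflect how the approaches scale, so swap in a realistic frame before drawing conclusions.

```python
import timeit
import pandas as pd

labels = ['chrom', 'start', 'end', 'read']
my_data = [['chr1', 784344, 800125, 'read1'],
           ['chr1', 784344, 800124, 'read2'],
           ['chr1', 784344, 800124, 'read3']]
my_data_pd = pd.DataFrame.from_records(my_data, columns=labels)
sortedd = my_data_pd.groupby(['chrom', 'start', 'end'], sort=False)['read'].nunique()

# Hypothetical wrapper names; the bodies are the three answers in this thread.
def via_reset_index(s):
    return [tuple(x) for x in s.reset_index().values.tolist()]

def via_multiindex(s):
    return pd.MultiIndex.from_arrays(s.reset_index().values.T).tolist()

def via_set_index(s):
    return s.to_frame('val').set_index('val', append=True).index.tolist()

candidates = {
    'reset_index': via_reset_index,
    'MultiIndex.from_arrays': via_multiindex,
    'set_index(append=True)': via_set_index,
}

# Run each candidate once to confirm they all agree before timing them.
results = {name: fn(sortedd) for name, fn in candidates.items()}

# Time 200 calls of each candidate; the absolute numbers depend on the machine.
timings = {name: timeit.timeit(lambda fn=fn: fn(sortedd), number=200)
           for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 200 calls")
```

No timings are quoted here because they vary with the pandas version and the size of the grouped result.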
Upvotes: 3