Reputation: 7255
I want to make a pandas DataFrame with the following columns:
my_cols = ['chrom', 'len_of_PIs']
and the following values in those columns:
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
[np.random.randint(18, 55, 92)],
[np.random.randint(25, 61, 98)]])
I am expecting the output simply like:
chrom len_PIs
chr1 49, 32, 30, 27, 52, 52,.....
chr2 27, 20, 40, 41, 44, 50,.....
chr3 35, 45, 56, 42, 58, 50,.....
where len_PIs can be a list or a str, so that I can do easy downstream analyses. But I am not getting the data as expected when I do:
new_df = pd.DataFrame()
new_df['chrom'] = chrom
# this code is giving me an output like
new_df['len_PIs'] = len_of_PIs.astype(str)
chrom len_PIs
0 chr1 [array([49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1 chr2 [array([27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2 chr3 [array([35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...
# and each one of these below codes are giving me an output like
new_df['len_PIs'] = len_of_PIs.as_matrix()
new_df.insert(loc=1, value=len_of_PIs.astype(list) , column='len_PIs')
new_df['len_PIs'] = pd.DataFrame(len_of_PIs, columns=['len_PIs'], index=len_of_PIs.index)
chrom len_PIs
0 chr1 [[49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1 chr2 [[27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2 chr3 [[35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...
How can I fix this? If there is an alternative, more comprehensive method that starts from preparing the columns and data, that would be nice too.
Upvotes: 1
Views: 87
Reputation: 5109
Note that 49, 32, 30 is not a proper type in Python. If it is a list/tuple, it should have brackets/parentheses, like [49, 32, 30]; and if it is a string, it should have quotes, like "49, 32, 30". The latter, however, can be printed without quotes and give you exactly what you want, but it would be very hard to work with later on. The following modification of jpp's code gives a result that looks exactly like your desired outcome; but given that you will keep working on this DataFrame, you should stick with his answer.
import pandas as pd, numpy as np
my_cols = ['chrom', 'len_of_PIs']
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([", ".join(np.random.randint(15, 59, 86).astype(str)),
", ".join(np.random.randint(18, 55, 92).astype(str)),
", ".join(np.random.randint(25, 61, 98).astype(str))])
df = pd.DataFrame({'chrom': chrom,
'len_of_PIs': len_of_PIs},
columns=my_cols)
print(df) returns:
chrom len_of_PIs
0 chr1 17, 37, 38, 25, 51, 39, 26, 24, 38, 44, 51, 21...
1 chr2 23, 33, 20, 48, 22, 45, 51, 45, 20, 39, 29, 25...
2 chr3 49, 42, 35, 46, 25, 52, 57, 39, 26, 29, 58, 26...
The difficulty of working with this result is as follows. Take the first row of the len_of_PIs column as an example. It has to be processed before it can be used as a collection of numbers:
[float(e) for e in df.len_of_PIs[0].split(", ")]
which is a pain. So, yeah, there you go.
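If the string form is kept anyway, the same conversion can be applied to every row rather than one at a time. A minimal sketch, assuming a small hand-written frame in place of the random data above (the name parsed is only illustrative, and int is used here instead of float because the values are whole numbers):

```python
import pandas as pd

# Hypothetical stand-in for the string-valued DataFrame built above.
df = pd.DataFrame({'chrom': ['chr1', 'chr2'],
                   'len_of_PIs': ['17, 37, 38', '23, 33, 20, 48']})

# Split each string on ", " and convert the pieces to int, row by row.
parsed = df['len_of_PIs'].map(lambda s: [int(e) for e in s.split(', ')])

print(parsed[0])       # [17, 37, 38]
print(sum(parsed[1]))  # 124
```

Every later analysis would need this step first, which is the overhead the numeric version avoids.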
Upvotes: 1
Reputation: 862651
If you want strings, use a list comprehension that extracts the inner array, casts it to string, and finally joins the values:
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
[np.random.randint(18, 55, 92)],
[np.random.randint(25, 61, 98)]])
a = [', '.join(x[0].astype(str)) for x in len_of_PIs]
df1 = pd.DataFrame({'len_PIs':a, 'chrom':chrom})
print (df1)
chrom len_PIs
0 chr1 57, 32, 44, 29, 38, 40, 19, 34, 24, 38, 42, 46...
1 chr2 19, 32, 36, 21, 44, 33, 53, 36, 21, 18, 43, 30...
2 chr3 27, 58, 60, 39, 54, 53, 32, 43, 33, 36, 60, 39...
And if you want lists instead, unwrap the nested lists with a list comprehension or str[0]:
df1 = pd.DataFrame({'len_PIs':[x[0] for x in len_of_PIs], 'chrom':chrom})
#alternative solution
#df1 = pd.DataFrame({'len_PIs':len_of_PIs.str[0], 'chrom':chrom})
print (df1)
chrom len_PIs
0 chr1 [18, 42, 34, 31, 57, 49, 56, 28, 56, 40, 19, 5...
1 chr2 [48, 29, 23, 21, 54, 28, 23, 27, 44, 51, 18, 3...
2 chr3 [47, 53, 57, 26, 49, 39, 37, 41, 29, 36, 36, 5...
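The double brackets in the question's output come from each cell being a one-element list that wraps an array, which is why x[0] is needed. A small sketch to check this, using the question's construction with two rows (the names nested and flat are only illustrative):

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question: every cell is [array].
nested = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)]])

print(type(nested[0]), len(nested[0]))  # a list holding a single element

# Unwrap the single inner array from each cell.
flat = pd.Series([x[0] for x in nested])

print(type(flat[0]), flat[0].shape)     # a 1-D ndarray with 86 values
```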
Upvotes: 1
Reputation: 164673
I don't believe you need the inner lists in your len_of_PIs series. You may also find it convenient to instantiate your pd.DataFrame from a dictionary. The below produces your desired output.
It's generally not good practice to convert numeric data to strings, unless you absolutely must, so I have kept your array data as numeric.
import pandas as pd, numpy as np
my_cols = ['chrom', 'len_of_PIs']
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([np.random.randint(15, 59, 86),
np.random.randint(18, 55, 92),
np.random.randint(25, 61, 98)])
df = pd.DataFrame({'chrom': chrom,
'len_of_PIs': len_of_PIs},
columns=my_cols)
# chrom len_of_PIs
# 0 chr1 [17, 52, 48, 22, 27, 49, 26, 18, 46, 16, 22, 1...
# 1 chr2 [39, 52, 53, 29, 38, 51, 30, 44, 47, 49, 28, 4...
# 2 chr3 [46, 37, 46, 29, 49, 39, 56, 48, 29, 46, 28, 2...
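As a rough illustration of why numeric arrays are convenient for downstream analysis, per-row summaries can be computed without any parsing. This is only a sketch: the columns n_PIs and mean_len and the name long_df are made up for the example, and DataFrame.explode assumes pandas 0.25 or newer.

```python
import numpy as np
import pandas as pd

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([np.random.randint(15, 59, 86),
                        np.random.randint(18, 55, 92),
                        np.random.randint(25, 61, 98)])
df = pd.DataFrame({'chrom': chrom, 'len_of_PIs': len_of_PIs})

# Per-chromosome summaries taken directly from the arrays.
df['n_PIs'] = df['len_of_PIs'].map(len)
df['mean_len'] = df['len_of_PIs'].map(lambda a: a.mean())

# Long format: one row per individual value.
long_df = df[['chrom', 'len_of_PIs']].explode('len_of_PIs')

print(df[['chrom', 'n_PIs', 'mean_len']])
print(len(long_df))  # 86 + 92 + 98 = 276 rows
```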
Upvotes: 2