everestial

Reputation: 7255

How to merge multiple pandas Series into a DataFrame when one Series holds lists of values

I want to make a pandas DataFrame with the following columns:

my_cols = ['chrom', 'len_of_PIs']

and the following values in those columns:

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

I expect the output to look like this:

chrom    len_PIs
chr1     49, 32, 30, 27, 52, 52,.....
chr2     27, 20, 40, 41, 44, 50,.....
chr3     35, 45, 56, 42, 58, 50,.....

Here len_PIs can be a list or a str, whichever makes downstream analysis easy. However, I am not getting the data as expected when I do:

new_df = pd.DataFrame()
new_df['chrom'] = chrom

# this code is giving me an output like
new_df['len_PIs'] = len_of_PIs.astype(str)

  chrom                                                  len_PIs
0  chr1  [array([49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1  chr2  [array([27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2  chr3  [array([35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...

# and each of the lines below gives me an output like
new_df['len_PIs'] = len_of_PIs.as_matrix()
new_df.insert(loc=1, value=len_of_PIs.astype(list) , column='len_PIs')
new_df['len_PIs'] = pd.DataFrame(len_of_PIs, columns=['len_PIs'], index=len_of_PIs.index)

  chrom                                            len_PIs
0  chr1  [[49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1  chr2  [[27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2  chr3  [[35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...

How can I fix this approach? An alternative, more complete method that covers the column and data preparation from the start would be welcome too.

Upvotes: 1

Views: 87

Answers (3)

FatihAkici

Reputation: 5109

Note that a cell cannot be displayed as a bare 49, 32, 30 and still be a list or tuple. A list or tuple is shown with brackets or parentheses, like [49, 32, 30], and a string is written with quotes, like "49, 32, 30". A string, however, is printed without its quotes, which gives exactly the display you want, but it is much harder to work with later on. The following modification of jpp's code gives a result that looks exactly like your desired outcome; since you plan to keep working with this DataFrame, though, you should stick with his answer.

import pandas as pd, numpy as np

my_cols = ['chrom', 'len_of_PIs']

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([", ".join(np.random.randint(15, 59, 86).astype(str)),
                        ", ".join(np.random.randint(18, 55, 92).astype(str)),
                        ", ".join(np.random.randint(25, 61, 98).astype(str))])

df = pd.DataFrame({'chrom': chrom,
                   'len_of_PIs': len_of_PIs},
                  columns=my_cols)

print(df) returns:
  chrom                                         len_of_PIs
0  chr1  17, 37, 38, 25, 51, 39, 26, 24, 38, 44, 51, 21...
1  chr2  23, 33, 20, 48, 22, 45, 51, 45, 20, 39, 29, 25...
2  chr3  49, 42, 35, 46, 25, 52, 57, 39, 26, 29, 58, 26...

The difficulty of working with this result is as follows. Take the first row of the len_of_PIs column as an example. It has to be processed before it can be used as a collection of numbers:

[float(e) for e in df.len_of_PIs[0].split(", ")]

which is a pain. So, yeah, there you go.
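
To make that cost concrete, here is a minimal sketch of converting the whole string column back to numbers. The two short strings are made-up stand-ins for the random data above, and the names parsed and row_means are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical small example standing in for the string column above.
df = pd.DataFrame({'chrom': ['chr1', 'chr2'],
                   'len_of_PIs': ['17, 37, 38', '23, 33, 20, 48']})

# Split each string on ", " and convert every piece back to an int.
parsed = df['len_of_PIs'].apply(lambda s: [int(e) for e in s.split(', ')])

# Numeric work only becomes possible after that conversion step.
row_means = parsed.apply(lambda values: float(np.mean(values)))
print(row_means)
```

Every analysis on the string column would need this parsing step first, which is the extra work a numeric column avoids.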

Upvotes: 1

jezrael

Reputation: 862651

If you want strings, use a list comprehension that extracts the inner array, casts its values to strings, and joins them:

import pandas as pd, numpy as np

chrom = pd.Series(['chr1', 'chr2', 'chr3'])

len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

a = [', '.join(x[0].astype(str)) for x in len_of_PIs]
df1 = pd.DataFrame({'len_PIs':a, 'chrom':chrom})
print (df1)
  chrom                                            len_PIs
0  chr1  57, 32, 44, 29, 38, 40, 19, 34, 24, 38, 42, 46...
1  chr2  19, 32, 36, 21, 44, 33, 53, 36, 21, 18, 43, 30...
2  chr3  27, 58, 60, 39, 54, 53, 32, 43, 33, 36, 60, 39...

And if you want the inner arrays instead of the nested lists, use a list comprehension or str[0]:

df1 = pd.DataFrame({'len_PIs':[x[0] for x in len_of_PIs], 'chrom':chrom})
#alternative solution
#df1 = pd.DataFrame({'len_PIs':len_of_PIs.str[0], 'chrom':chrom})
print (df1)
  chrom                                            len_PIs
0  chr1  [18, 42, 34, 31, 57, 49, 56, 28, 56, 40, 19, 5...
1  chr2  [48, 29, 23, 21, 54, 28, 23, 27, 44, 51, 18, 3...
2  chr3  [47, 53, 57, 26, 49, 39, 37, 41, 29, 36, 36, 5...
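
As a quick check that the two alternatives are interchangeable, the sketch below compares them on small fixed arrays. The values are made up for illustration and the variable names are hypothetical; it assumes each element is a one-item list holding a NumPy array, as in the question.

```python
import numpy as np
import pandas as pd

# Small fixed stand-ins for the random arrays used above (made-up values).
len_of_PIs = pd.Series([[np.array([49, 32, 30])],
                        [np.array([27, 20, 40, 41])]])

# Both approaches pull the single array out of each one-element list.
via_comprehension = [x[0] for x in len_of_PIs]
via_str_accessor = len_of_PIs.str[0]

# Compare the results row by row.
same = all(np.array_equal(a, b)
           for a, b in zip(via_comprehension, via_str_accessor))
print(same)
```

The list comprehension returns a plain Python list, while str[0] returns a Series that keeps the original index, which can matter when the index is not the default 0, 1, 2.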

Upvotes: 1

jpp

Reputation: 164673

I don't believe you need the inner lists in your len_of_PIs series. You may also find it convenient to instantiate your pd.DataFrame from a dictionary. The code below produces your desired output.

It's generally not good practice to convert numeric data to strings, unless you absolutely must, so I have kept your array data as numeric.

import pandas as pd, numpy as np

my_cols = ['chrom', 'len_of_PIs']

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([np.random.randint(15, 59, 86),
                        np.random.randint(18, 55, 92),
                        np.random.randint(25, 61, 98)])

df = pd.DataFrame({'chrom': chrom,
                   'len_of_PIs': len_of_PIs},
                  columns=my_cols)

#   chrom                                         len_of_PIs
# 0  chr1  [17, 52, 48, 22, 27, 49, 26, 18, 46, 16, 22, 1...
# 1  chr2  [39, 52, 53, 29, 38, 51, 30, 44, 47, 49, 28, 4...
# 2  chr3  [46, 37, 46, 29, 49, 39, 56, 48, 29, 46, 28, 2...
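
To illustrate why numeric arrays help with downstream analysis, here is a rough sketch using small fixed arrays in place of the random ones. The column names n and mean_len and the variables long_df and per_chrom_max are illustrative only, and DataFrame.explode assumes pandas 0.25 or later.

```python
import numpy as np
import pandas as pd

# Small fixed arrays in place of the random ones above (made-up values).
df = pd.DataFrame({'chrom': ['chr1', 'chr2'],
                   'len_of_PIs': [np.array([17, 52, 48]),
                                  np.array([39, 52, 53, 29])]})

# Per-row summaries work directly on the numeric arrays.
df['n'] = df['len_of_PIs'].apply(len)
df['mean_len'] = df['len_of_PIs'].apply(lambda a: a.mean())

# Long format: one row per value, convenient for groupby-style analysis.
long_df = df[['chrom', 'len_of_PIs']].explode('len_of_PIs')
long_df['len_of_PIs'] = long_df['len_of_PIs'].astype(int)
per_chrom_max = long_df.groupby('chrom')['len_of_PIs'].max()
print(per_chrom_max)
```

None of these steps needs any string parsing, which is the practical benefit of keeping the data numeric.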

Upvotes: 2
