J-H

Reputation: 1869

Efficient Pandas Dataframe insert

I'm trying to add float values like [[(1,0.44),(2,0.5),(3,0.1)],[(2,0.63),(1,0.85),(3,0.11)],[...]] to a Pandas dataframe which looks like a matrix built from the first value of each tuple:

    df =
          1     2     3
    1  0.44  0.50  0.10
    2  0.85  0.63  0.11
    3   ...   ...   ...

I tried this:

    for key, value in enumerate(outer_list):
      for tuplevalue in value:
        df.ix[key][tuplevalue[0]] = tuplevalue[1]

The problem is that my NxN matrix contains about 10000x10000 elements, so my approach takes a very long time. Is there a faster way to do this?

(Unfortunately the values in the list are not ordered by the first tuple element)

Upvotes: 3

Views: 889

Answers (2)

Kevin

Reputation: 8207

This works using a dictionary, which helps if you need to preserve your column order or if the column names are strings. Maybe Alexander will update his answer to account for that; I'm nearly certain he'll have a better solution than my proposed one :)

Here's an example:

    from collections import defaultdict

    import pandas as pd

    a = [[(1,0.44),(2,0.5),(3,0.1)],[(2,0.63),(1,0.85),(3,0.11)]]
    b = [[('A',0.44),('B',0.5),('C',0.1)],[('B',0.63),('A',0.85),('C',0.11)]]

First on a:

    # One dict per row, mapping column name -> value.
    row_to_dic = [{str(y[0]): y[1] for y in x} for x in a]

    # Collect the values column by column.
    dd = defaultdict(list)
    for d in row_to_dic:
        for key, value in d.items():  # .iteritems() on Python 2
            dd[key].append(value)

    pd.DataFrame.from_dict(dd)

          1     2     3
    0  0.44  0.50  0.10
    1  0.85  0.63  0.11

and b:

    # One dict per row, mapping column name -> value.
    row_to_dic = [{str(y[0]): y[1] for y in x} for x in b]

    # Collect the values column by column.
    dd = defaultdict(list)
    for d in row_to_dic:
        for key, value in d.items():  # .iteritems() on Python 2
            dd[key].append(value)

    pd.DataFrame.from_dict(dd)

          A     B     C
    0  0.44  0.50  0.10
    1  0.85  0.63  0.11
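As a possible shortcut, the dictionary idea above can probably be collapsed into a single constructor call, since pandas accepts a list of dicts and aligns them by key. This is only a minimal sketch on the small example list `b`; the name `df_b` is made up for illustration, and it has not been timed on a 10000x10000 input.

```python
import pandas as pd

b = [[('A', 0.44), ('B', 0.5), ('C', 0.1)],
     [('B', 0.63), ('A', 0.85), ('C', 0.11)]]

# Each row of (label, value) tuples becomes a dict. pandas aligns the
# dicts by key, so the order of the tuples within a row does not matter.
df_b = pd.DataFrame([dict(row) for row in b])

# Sort the columns explicitly instead of relying on key insertion order.
df_b = df_b.sort_index(axis=1)

print(df_b)
```

A label missing from a row would simply show up as NaN in that row.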

Upvotes: 1

Alexander

Reputation: 109526

Use list comprehensions to first sort and extract your data. Then create your dataframe from the sorted and cleaned data.

    import pandas as pd

    data = [[(1, 0.44), (2, 0.50), (3, 0.10)],
            [(2, 0.63), (1, 0.85), (3, 0.11)]]

    # First, sort each row by the first tuple element.
    data = [sorted(row) for row in data]

    # Then extract the second element of each tuple.
    new_data = [[t[1] for t in row] for row in data]

    # Now create a dataframe from your data.
    >>> pd.DataFrame(new_data)
          0     1     2
    0  0.44  0.50  0.10
    1  0.85  0.63  0.11
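If some rows might be missing a key, sorting alone would shift values into the wrong column. One alternative, sketched here under assumptions, is to fill a preallocated NumPy array by column position and wrap it in a dataframe once at the end. The names `labels`, `col_pos` and `values` are hypothetical, and the sketch assumes the column labels are known up front and that rows are numbered from 1 as in the question.

```python
import numpy as np
import pandas as pd

outer_list = [[(1, 0.44), (2, 0.50), (3, 0.10)],
              [(2, 0.63), (1, 0.85), (3, 0.11)]]

labels = [1, 2, 3]  # assumed: the full set of column labels is known
col_pos = {label: i for i, label in enumerate(labels)}

# Preallocate the whole matrix; cells that are never assigned stay NaN.
values = np.full((len(outer_list), len(labels)), np.nan)

for r, row in enumerate(outer_list):
    # Map each tuple's label to its column position, then assign the
    # entire row in one vectorized write instead of cell by cell.
    cols = [col_pos[key] for key, _ in row]
    values[r, cols] = [val for _, val in row]

# Build the dataframe once, with a 1-based row index as in the question.
df = pd.DataFrame(values, index=range(1, len(outer_list) + 1), columns=labels)

print(df)
```

The point of this layout is that the dataframe is touched only once. The per-cell `df.ix[...]` assignments in the question go through pandas indexing on every write, which is likely where most of the time is spent.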

Upvotes: 2
