Filipe Aleixo

Reputation: 4244

Python - make dataframe's columns consistent with list elements

From what I've read, it's easy to add and delete columns from a DataFrame, but I was wondering if there's already a method to do what I'm trying to achieve, in order to avoid reinventing the wheel.

Suppose I have the DataFrame x:

   a  b   c
0  1  5   8
1  2  6   9
2  3  7  10

I want to verify whether the column names correspond exactly to the elements contained in a list l. If there are fewer elements in l than columns in x, I want the columns that are not in l to be deleted.

For instance, if l = ["a", "b"], x would become:

   a  b
0  1  5
1  2  6
2  3  7

On the other hand, if there are more elements in l than columns in x, I want to create new, correspondingly named columns, with all the values in those columns set to 0.

For instance, if l = ["a", "b", "c", "d"], x would become:

   a  b   c  d
0  1  5   8  0
1  2  6   9  0
2  3  7  10  0

I could do a loop to check consistency between column names in x and elements in l, but is there anything more efficient than that?

Upvotes: 3

Views: 645

Answers (4)

Quickbeam2k1

Reputation: 5437

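just use reindex (the .astype(int) was added thanks to @Bill, if needed; note that it converts the whole dataframe to ints). On recent pandas versions, selecting with df.loc[:, l] raises a KeyError when l contains labels that are not columns, and np.int has been removed from NumPy, so the spelling below avoids both:

df.reindex(columns=l).fillna(0).astype(int)

Case 1:

l = ["a", "b"]
df.reindex(columns=l).fillna(0).astype(int)

    a   b
0   1   5
1   2   6
2   3   7

Case 2:

l = ["a", "b", "c", "d"]
df.reindex(columns=l).fillna(0).astype(int)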

    a   b   c   d
0   1   5   8   0
1   2   6   9   0
2   3   7   10  0

Upvotes: 4

Vaishali

Reputation: 38415

Again a function, but less complicated:

def df_from_list(df, l):
    for i in l:
        if i not in df.columns:
            df[i]=0
    return df[l]

Now call the function

l = ["a", "b", "z"]
df_from_list(df, l)

You get

    a   b   z
0   1   5   0
1   2   6   0
2   3   7   0
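One thing to be aware of: the function above adds the missing columns to the df that is passed in, so the caller's DataFrame gains a z column as a side effect. If that is not wanted, a possible variant is sketched below. The name df_from_list_copy is hypothetical, and the sketch assumes every element of l is a string, since assign takes the new column names as keyword arguments.

```python
import pandas as pd


def df_from_list_copy(df, l):
    # Hypothetical helper name; assumes every label in l is a string.
    # Collect the names in l that are not columns of df yet.
    missing = {name: 0 for name in l if name not in df.columns}
    # assign() returns a new frame, so the caller's df is left untouched.
    return df.assign(**missing)[l]


df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7], "c": [8, 9, 10]})
out = df_from_list_copy(df, ["a", "b", "z"])

print(out)
print(list(df.columns))  # the original frame still has only a, b, c
```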

Upvotes: 1

Simon

Reputation: 333

I wrote a simple function that gets what you're looking for. The identification is done using set operations, but then it loops to create the new columns using insert. Perhaps there is a better way to do this than with a loop?

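import pandas as pd


def func_df(df, l):

    # First find the intersection. Sort it so the column order is
    # deterministic, and copy so insert() does not touch the original.
    intersect = set(df.columns).intersection(l)
    df = df.loc[:, sorted(intersect)].copy()

    # Now find the list elements that are not columns yet.
    additions = set(l).difference(intersect)
    for i in sorted(additions):
        df.insert(0, i, 0)

    return df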


df = pd.DataFrame(
        [[1, 5, 8],
         [2, 6, 9],
         [3, 7, 10]], columns=['a', 'b', 'c'])


out = func_df(df, ['a', 'b', 'd', 'k'])

print(out)
   k  d  a  b
0  0  0  1  5
1  0  0  2  6
2  0  0  3  7
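For the identification step on its own, pandas Index objects offer the same set operations, which may be handy if you only want to check whether the columns and the list agree. This is a small sketch under that assumption; the variable names wanted, to_drop, to_add and is_consistent are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 5, 8],
     [2, 6, 9],
     [3, 7, 10]], columns=["a", "b", "c"])
l = ["a", "b", "d", "k"]

wanted = pd.Index(l)
to_drop = df.columns.difference(wanted)  # columns that are not named in l
to_add = wanted.difference(df.columns)   # names in l that are not columns yet

# The frame is consistent with l only when nothing has to be dropped or added.
is_consistent = to_drop.empty and to_add.empty

print(list(to_drop), list(to_add), is_consistent)
```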

Upvotes: 1

Bill

Reputation: 11613

I think pd.concat might be a way to achieve this.

In [47]: import pandas as pd

In [48]: data = {
    ...: 'a': [1, 2, 3],
    ...: 'b': [5, 6, 7],
    ...: 'c': [8, 9, 10]
    ...: }

In [49]: x = pd.DataFrame(data)

In [50]: x
Out[50]: 
   a  b   c
0  1  5   8
1  2  6   9
2  3  7  10

In [51]: l = ["a", "b"]

In [52]: x[l]
Out[52]: 
   a  b
0  1  5
1  2  6
2  3  7

In [53]: l = ["a", "b", "c", "d"]

In [55]: y = pd.DataFrame(columns=l)

In [56]: y
Out[56]: 
Empty DataFrame
Columns: [a, b, c, d]
Index: []

In [57]: pd.concat((x, y))
Out[57]: 
     a    b     c    d
0  1.0  5.0   8.0  NaN
1  2.0  6.0   9.0  NaN
2  3.0  7.0  10.0  NaN

In [58]: pd.concat((x, y)).fillna(0)
Out[58]: 
     a    b     c  d
0  1.0  5.0   8.0  0
1  2.0  6.0   9.0  0
2  3.0  7.0  10.0  0

Upvotes: 1
