ubuntu_noob

Reputation: 2365

Concat dataframe having duplicate columns

I have a series of dataframes that look like this:

   a   b    r
1  43  630  587

   d   b   c
1  34  30  87

I want to create a new dataframe which looks like:

   a   b    r    d   c
   43  630  587  0   0
   0   30   0    34  87

I have used the code:

appended_data = pd.concat(appended_data, axis=0)

where the list appended_data contains the individual dataframes as elements. Earlier, when I used it with another dataset, it didn't throw any error, but with the new dataset it shows ValueError: Plan shapes are not aligned.

Note: the earlier dataset also had duplicate columns and it worked fine then, and I have also updated pandas. Those were the only solutions I found online.

full code:

dir_list = [benign_freq_dir, malign_freq_dir]

appended_data = []

for l in dir_list:
    for root, dirs, files in os.walk(l):
        #print(root)
        for name in files:

            file = open(root + "/" + name, 'r')
            print(name)
            print("\n")
            df = pd.read_csv(file, header=None, error_bad_lines=False)   # In windows and python3 always pass the file object, not the path, to pd.read_csv
            #print(df)
            df = df.rename(columns={0: 'col'})
            #print(df)
            df = pd.DataFrame(df.col.str.split(' ', 1).tolist(), columns=['col1', 'col2']).T.reset_index(drop=True)
            df = df.rename(columns=df.iloc[0]).drop(df.index[0])
            print(df)

            appended_data.append(df)
            if l == benign_freq_dir:
                df['class'] = 0
            else:
                df['class'] = 1

#for l in appended_data:
#    print(l)
#    print(type(l))
appended_data = pd.concat(appended_data, axis=0, sort=False)
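
A quick check I could run (just a sketch, on the appended_data list right before the failing concat call) to see which frames carry duplicate column labels, since duplicate columns are what I suspect pd.concat is tripping over:

for i, dfx in enumerate(appended_data):
    # columns.duplicated() marks labels that appear more than once in a single frame
    dup_cols = dfx.columns[dfx.columns.duplicated()].tolist()
    if dup_cols:
        print(i, dup_cols)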


edit:

output for:

for dfx in appended_data:
    print(dfx.head(2).to_dict())

(screenshot of the printed dictionaries)

Upvotes: 0

Views: 11753

Answers (2)

Tom Wojcik

Reputation: 6179

You will need an outer join for that.

import pandas as pd

df1 = pd.DataFrame({
    'a': [43],
    'b': [630],
    'r': [587]
})

df2 = pd.DataFrame({
    'd': [34],
    'b': [30],
    'c': [87]
})

df3 = df1.merge(df2, how='outer').fillna(0)
print(df3)

Yields what you need.

      a    b      r     d     c
0  43.0  630  587.0   0.0   0.0
1   0.0   30    0.0  34.0  87.0

Docs on pd.merge
Docs on outer join

EDIT: OP, pd.concat should work as expected and Anton has proven that.

Since pd.merge was my answer, I have to stick with that.

Some pseudocode if you want to merge a list of dataframes.

def merge(lst, df=None):
    # Recursively outer-merge every frame in lst.
    # Note: lst is consumed in place by pop(), so pass a copy if you need to keep it.
    if df is None:
        df = lst.pop()
    to_be_merged = lst.pop()
    merged = df.merge(to_be_merged, how='outer')
    if lst:
        return merge(lst, merged)
    return merged.fillna(0)

df = merge(list_of_dfs)
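
For example, with the df1 and df2 from above (passing a throwaway list literal, since the helper pops from it):

print(merge([df1, df2]))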

That way you will know instantly which df is at fault because clearly there's a problem with your data. Catch the exception and use .describe() and .info() to debug this issue.
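A rough sketch of that idea (hypothetical merge_debug name, otherwise the same recursion as above): wrap the merge call, print .info() and .describe() for the frame that blows up, and re-raise.

def merge_debug(lst, df=None):
    # Same recursive outer merge as above, but reports the frame that fails.
    if df is None:
        df = lst.pop()
    to_be_merged = lst.pop()
    try:
        merged = df.merge(to_be_merged, how='outer')
    except ValueError:
        print("merge failed on this frame:")
        to_be_merged.info()
        print(to_be_merged.describe())
        raise
    if lst:
        return merge_debug(lst, merged)
    return merged.fillna(0)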

Upvotes: 3

Anton vBR

Reputation: 18906

You can use pd.concat. You should, however, pass both dataframes.

pd.concat([df1,df2], axis=0, sort=False).fillna(0) #.astype(int) for ints

#      a    b      r     d     c
#0  43.0  630  587.0   0.0   0.0
#0   0.0   30    0.0  34.0  87.0

Sample data from Tom Wojcik.
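
If the repeated 0 index from the inputs is unwanted, ignore_index=True (a standard pd.concat argument) gives a fresh 0..n-1 index; the call is otherwise the same:

pd.concat([df1, df2], axis=0, sort=False, ignore_index=True).fillna(0)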

Upvotes: 2
