Kumar P
Kumar P

Reputation: 300

Adding Columns to Dataframe having different Row Counts

I am trying to read a file and have the distinct values of each column in a dataframe having similar column names.

File has say 3 columns

EMP ID  DEPT    Salary
=============================
100 Sales   10000
200 MFG 10000
300 IT  10000
400 Sales   10000
500 MFG 10000
600 IT  10000

Expected Output

EMP ID  DEPT    Salary
========================
100 Sales   10000
200 MFG 
300 IT  
400     
500     
600     

I have read the file, got the list of unique values as below

df=pd.read_csv('C:/Users/jaiveeru/Downloads/run_test1.csv')
cols=df.columns.tolist()
df1=pd.DataFrame()
df2=pd.DataFrame()
for i in cols:
    lst=df[i].unique().tolist()
    str1 = ','.join(lst)
    lst2=[str1]
    df1[i]=lst2
    df2=pd.concat([df2,df1])

However as each column can have different number of unique values I am getting the below error

ValueError: Length of values does not match length of index

Upvotes: 1

Views: 44

Answers (1)

perl
perl

Reputation: 9941

There may be two different versions, depending on how exactly you want to treat unique values. I've set Salary of EMPID=400 to 20,000 to illustrate this.

  1. Using the duplicated method on columns:
df[df.apply(lambda x: x.duplicated())] = np.nan
df = df.fillna('')

print(df)

Output:

   EMPID   DEPT Salary
0    100  Sales  10000
1    200    MFG       
2    300     IT       
3    400         20000
4    500              
5    600              
  1. Alternatively, if you don't want to preserve the indices of these unique values, but instead want to collect them at the top of each column:
df = pd.DataFrame(
    [df[x].unique() for x in df.columns],
    index=df.columns).T.fillna('')

print(df)

Output:

   EMPID   DEPT Salary
0  100.0  Sales  10000
1  200.0    MFG  20000
2  300.0     IT       
3  400.0              
4  500.0              
5  600.0              

Upvotes: 2

Related Questions