AmyV
AmyV

Reputation: 73

Combine first row and header with pandas

I'm trying to replicate code I made in R in Python, but I ran into difficulties when I tried to fix my header. I want to merge the header and the first row, but I can't seem to make it work.

My data currently looks like this:

        Acronym  Project Number    Title     Dates         Unnamed: 4   Unnamed: 5 
0         NaN            NaN        NaN      Start date    Duration     Number of Participants
1         YoYo           5678       blabla   01-01-2020    36           4
2         SPY            1452       blabla   06-03-2018    12           6         
3         NoNo           6280       blabla   02-05-2019    48           8       
4         MaGiK          2749       blabla   01-05-2016    12           9    

The header of my data is spread across two rows. The first three headers are as they should be. From column 4 onward, however, my column names are on the second row instead. Simply combining the header and the first row won't work, as some columns (like the 4th column) have values in both the headers and the first row.

Ideally, what I'd like to do is keep the header in the first three columns and use the value in the first row as the header from column 4 onward, so that it becomes:

Acronym | Project Number | Title | Start date  | Duration | Number of Participants

In R, this would be accomplished with

Projects <- Projects %>% 
  set_names(c(names(Projects)[1:3], as.character(.[1,])[4:6])) %>% 
  rownames_to_column("index") %>% 
  filter(index != 1 ) %>% 
  select(-index)

Based on this question I tried this in Python

Projects_clean.columns = np.concatenate([Projects_clean.iloc[0, :2], Projects_clean.columns[0:3]])

and

Projects_clean.columns = np.append(Projects_clean.iloc[0,:2], Projects_clean.columns[0:3])

But this returns

 "values have {new} elements".format(old=old_len, new=new_len)

Length mismatch: Expected axis has 24 elements, new values have 10 elements

(On my actual data). Apparently, my new data does not have the same amount of columns as my old data. What am I doing wrong?

Upvotes: 2

Views: 5478

Answers (1)

jezrael
jezrael

Reputation: 862661

Convert first 3 columns names and all data of first row after 3th value:

Projects_clean.columns = (Projects_clean.columns[:3].tolist() +  
                          Projects_clean.iloc[0, 3:].tolist())

Or:

Projects_clean.columns = np.concatenate([Projects_clean.columns[:3], 
                                         Projects_clean.iloc[0, 3:]])

Anf then:

Projects_clean = Projects_clean.iloc[1:]
print (Projects_clean)
  Acronym  Project Number   Title  Start date Duration Number of Participants
1    YoYo          5678.0  blabla  01-01-2020       36                      4
2     SPY          1452.0  blabla  06-03-2018       12                      6
3    NoNo          6280.0  blabla  02-05-2019       48                      8
4   MaGiK          2749.0  blabla  01-05-2016       12                      9

If possible create MultiIndex by first 2 rows by parameter header=[0,1] in read_csv then use:

Projects_clean = pd.read_csv(file, header=[0,1])


Projects_clean.columns = (Projects_clean.columns.get_level_values(0)[:3].tolist() +  
                          Projects_clean.columns.get_level_values(1)[3:].tolist())
print (Projects_clean)
  Acronym  Project Number   Title  Start date  Duration  \
1    YoYo            5678  blabla  01-01-2020        36   
2     SPY            1452  blabla  06-03-2018        12   
3    NoNo            6280  blabla  02-05-2019        48   
4   MaGiK            2749  blabla  01-05-2016        12   

   Number of Participants  
1                       4  
2                       6  
3                       8  
4                       9  

Upvotes: 5

Related Questions