Reputation: 73
I'm trying to replicate code I made in R in Python, but I ran into difficulties when I tried to fix my header. I want to merge the header and the first row, but I can't seem to make it work.
My data currently looks like this:
Acronym Project Number Title Dates Unnamed: 4 Unnamed: 5
0 NaN NaN NaN Start date Duration Number of Participants
1 YoYo 5678 blabla 01-01-2020 36 4
2 SPY 1452 blabla 06-03-2018 12 6
3 NoNo 6280 blabla 02-05-2019 48 8
4 MaGiK 2749 blabla 01-05-2016 12 9
The header of my data is spread across two rows. The first three headers are as they should be. From column 4 onward, however, my column names are on the second row instead. Simply combining the header and the first row won't work, as some columns (like the 4th column) have values in both the headers and the first row.
Ideally, what I'd like to do is keep the header in the first three columns and use the value in the first row as the header from column 4 onward, so that it becomes:
Acronym | Project Number | Title | Start date | Duration | Number of Participants
In R, this would be accomplished with
Projects <- Projects %>%
set_names(c(names(Projects)[1:3], as.character(.[1,])[4:6])) %>%
rownames_to_column("index") %>%
filter(index != 1 ) %>%
select(-index)
Based on this question I tried this in Python
Projects_clean.columns = np.concatenate([Projects_clean.iloc[0, :2], Projects_clean.columns[0:3]])
and
Projects_clean.columns = np.append(Projects_clean.iloc[0,:2], Projects_clean.columns[0:3])
But this returns
"values have {new} elements".format(old=old_len, new=new_len)
Length mismatch: Expected axis has 24 elements, new values have 10 elements
(On my actual data). Apparently, my new data does not have the same amount of columns as my old data. What am I doing wrong?
Upvotes: 2
Views: 5478
Reputation: 862661
Convert first 3 columns names and all data of first row after 3th value:
Projects_clean.columns = (Projects_clean.columns[:3].tolist() +
Projects_clean.iloc[0, 3:].tolist())
Or:
Projects_clean.columns = np.concatenate([Projects_clean.columns[:3],
Projects_clean.iloc[0, 3:]])
Anf then:
Projects_clean = Projects_clean.iloc[1:]
print (Projects_clean)
Acronym Project Number Title Start date Duration Number of Participants
1 YoYo 5678.0 blabla 01-01-2020 36 4
2 SPY 1452.0 blabla 06-03-2018 12 6
3 NoNo 6280.0 blabla 02-05-2019 48 8
4 MaGiK 2749.0 blabla 01-05-2016 12 9
If possible create MultiIndex
by first 2 rows by parameter header=[0,1]
in read_csv
then use:
Projects_clean = pd.read_csv(file, header=[0,1])
Projects_clean.columns = (Projects_clean.columns.get_level_values(0)[:3].tolist() +
Projects_clean.columns.get_level_values(1)[3:].tolist())
print (Projects_clean)
Acronym Project Number Title Start date Duration \
1 YoYo 5678 blabla 01-01-2020 36
2 SPY 1452 blabla 06-03-2018 12
3 NoNo 6280 blabla 02-05-2019 48
4 MaGiK 2749 blabla 01-05-2016 12
Number of Participants
1 4
2 6
3 8
4 9
Upvotes: 5