Reputation: 25
I am a beginner when it comes to Pandas and I recently had a problem which I could not solve with MS Excel. I have a DataFrame df like:
Jobtype JobCount Degree
0 BI Analyst 3 Winf
1 BI Analyst 2 Wing
2 BI Analyst 5 Wiwi
3 Marketing 1 Winf
4 Marketing 2 Wing
5 Marketing 6 Wiwi
(Side note: these are German abbreviations for the degree)
What I want to do is create a new, empty DataFrame which has the columns "JobType" and "Degree" in it with the same values BUT instead of having a column "JobCount" I want the same row to be JobCount-times added to the new, empty DataFrame. For example, I want to have a new DataFrame looking like this:
Jobtype Studiengang
0 BI Analyst Winf
1 BI Analyst Winf
2 BI Analyst Winf
3 BI Analyst Wing
4 BI Analyst Wing
...
So copy a row JobCount-times to the new DataFrame, without the JobCount column being in it (JobCount-times means the value for JobCount in that specific row). I actually solved this issue but it took me too long and I don't think that my solution is the right way, there has to be a more simple way to do so and therefore I ask for help since I couldn't find a proper solution.
My Code (which works but is way too complicated):
df = pd.read_csv("tete.csv", sep = ";")
cols = ["Jobtype", "Degree"]
new = pd.DataFrame(columns = cols)
df_row_count = range(0, len(df.index))
new_row_count = 0
for i in df_row_count:
job_count = 0
while job_count < df.iloc[i]["JobCount"]:
new.loc[new_row_count] = df.loc[I]
job_count += 1
new_row_count += 1
print(new)
Upvotes: 1
Views: 64
Reputation: 26676
Use numpy repeat to create new datframe with the same columns names as original and drop the column JobCount
import numpy as np
newdf = pd.DataFrame(np.repeat(df.values,df.JobCount,axis=0), columns=df.columns).drop(columns=['JobCount'])#Replicate rows as per condition
Jobtype Degree
0 BIAnalyst Winf
1 BIAnalyst Winf
2 BIAnalyst Winf
3 BIAnalyst Wing
4 BIAnalyst Wing
5 BIAnalyst Wiwi
6 BIAnalyst Wiwi
7 BIAnalyst Wiwi
8 BIAnalyst Wiwi
9 BIAnalyst Wiwi
10 Marketing Winf
11 Marketing Wing
12 Marketing Wing
13 Marketing Wiwi
14 Marketing Wiwi
15 Marketing Wiwi
16 Marketing Wiwi
17 Marketing Wiwi
18 Marketing Wiwi
Upvotes: 1
Reputation: 403128
Let's try repeat
and reindex
:
(df.drop('JobCount', 1)
.reindex(df.index.repeat(df['JobCount']))
.reset_index(drop=True))
Or, if you want something a bit shorter, you can pop while repeating:
(df.reindex(df.index.repeat(df.pop('JobCount')))
.reset_index(drop=True))
Jobtype Degree
0 BI Analyst Winf
1 BI Analyst Winf
2 BI Analyst Winf
3 BI Analyst Wing
4 BI Analyst Wing
5 BI Analyst Wiwi
6 BI Analyst Wiwi
7 BI Analyst Wiwi
8 BI Analyst Wiwi
9 BI Analyst Wiwi
10 Marketing Winf
11 Marketing Wing
12 Marketing Wing
13 Marketing Wiwi
14 Marketing Wiwi
15 Marketing Wiwi
16 Marketing Wiwi
17 Marketing Wiwi
18 Marketing Wiwi
Upvotes: 1