Reputation: 411
I have a CSV file of movies that I'm trying to clean up. I'm using Jupyter notebook.
It has 10,000 rows and 5 columns. Below are some sample data:
Movie Name | Genre | Date Released | Length | Rating |
The Godfather | Crime | March 24, 1972 | 175 | R |
The Avengers | Action | May 5, 2012 | 143 | PG-13 |
The Dark Knight | Action | Crime | July 18, 2008 | 152 | PG-13
Notice that for "The Dark Knight", since there are 2 genres, the values get shifted one column to the right. I want to clean the data such that the row becomes:
The Dark Knight | Action, Crime | July 18, 2008 | 152 | PG-13
Here is what I did (in a Jupyter notebook):
import pandas as pd
path = 'movies.csv'
# 'Extra' catches the value that overflows when a row has two genres
df = pd.read_csv(path, header=0, names=['Movie Name', 'Genre', 'Date Released', 'Length', 'Rating', 'Extra'])
ctrCheck = 0
months = ["January","February","March","April","May","June","July","August","September","October","November","December"]
while ctrCheck < len(df.index):
    check = str(df['Date Released'][ctrCheck])
    # A shifted row holds the second genre, not a date, in 'Date Released'
    if not any(month in check for month in months):
        replaceStr = df.loc[ctrCheck, 'Genre'] + ", " + df.loc[ctrCheck, 'Date Released']
        df.loc[ctrCheck, 'Genre'] = replaceStr
        df.loc[ctrCheck, 'Date Released'] = df.loc[ctrCheck, 'Length']
        df.loc[ctrCheck, 'Length'] = df.loc[ctrCheck, 'Rating']
        df.loc[ctrCheck, 'Rating'] = df.loc[ctrCheck, 'Extra']
    ctrCheck = ctrCheck + 1
df.drop(labels='Extra', inplace=True, axis='columns')
Is there a better way to do this, other than iterate through the 10,000 rows?
Thanks!
Upvotes: 0
Views: 1391
Reputation: 524
If I understand correctly, you're looking for a method that avoids an explicit loop and instead uses vectorized pandas methods.
First, notice that the rows that need fixing are exactly the rows that have a value other than NaN in the last column.
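As a quick illustration of that observation, here is a minimal sketch on a three-row sample modeled on the question's data. The inline CSV text, the quoting of the date field, and the column name 'Extra' are assumptions made for the example, not details confirmed by the question.

```python
import io
import pandas as pd

# Hypothetical sample mirroring the question's rows. The sixth column
# name 'Extra' and the quoted date fields are assumptions.
raw = io.StringIO(
    'Movie Name,Genre,Date Released,Length,Rating,Extra\n'
    'The Godfather,Crime,"March 24, 1972",175,R,\n'
    'The Avengers,Action,"May 5, 2012",143,PG-13,\n'
    'The Dark Knight,Action,Crime,"July 18, 2008",152,PG-13\n'
)
df = pd.read_csv(raw)

# Rows with a non-null 'Extra' value are the ones shifted one column right.
shifted = df['Extra'].notna()
print(df.loc[shifted, 'Movie Name'].tolist())
```

Only "The Dark Knight" has something in the sixth column, so the mask selects just that row.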
Therefore, I can suggest the following code:
import pandas as pd
# Name the last unnamed column
df = df.rename(columns={'Unnamed: 5': 'Extra'})
# Save the valid lines in a different dataframe
mask = (df['Extra'].isnull())
df_valid = df[mask]
# Fix the invalid lines
# Fix the Genre
df['Genre'] = df['Genre'] + ' ' + df['Date Released']
# Shift left the columns after 'Genre'
cols = df.columns[:-1]
df.drop('Date Released', axis=1, inplace=True)
df.columns = cols
# Restore valid lines
df.loc[mask, :] = df_valid
The resulting dataframe:
        Movie Name         Genre  Date Released  Length Rating
0    The Godfather         Crime  March 24 1972     175      R
1     The Avengers        Action     May 5 2012     143  PG-13
2  The Dark Knight  Action Crime   July 18 2008     152  PG-13
Note: this method only works if each movie has at most 2 genres, which is the case if I understand correctly :)
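If some movies turn out to have more than 2 genres, one possible workaround is to parse the file with Python's csv module before building the dataframe. This is a different technique from the masking approach above, and it is only a sketch: it assumes the date is a single quoted CSV field, that the first field is always the movie name, and that the last three fields are always date, length and rating. The inline sample, including the row "Example Movie", is hypothetical.

```python
import csv
import io
import pandas as pd

# Hypothetical sample; assumes the date is one quoted CSV field and that
# only the genre spills into extra columns. "Example Movie" is made up.
raw = io.StringIO(
    'Movie Name,Genre,Date Released,Length,Rating\n'
    'The Godfather,Crime,"March 24, 1972",175,R\n'
    'The Dark Knight,Action,Crime,"July 18, 2008",152,PG-13\n'
    'Example Movie,Action,Crime,Drama,"January 1, 2000",120,R\n'
)

reader = csv.reader(raw)
header = next(reader)

# First field is the name, the last three are date, length and rating;
# everything in between is treated as a genre and joined with ", ".
records = [
    [row[0], ', '.join(row[1:-3]), *row[-3:]]
    for row in reader
    if row  # skip blank lines
]

df = pd.DataFrame(records, columns=header)
df['Length'] = pd.to_numeric(df['Length'])
print(df)
```

For the real file, the `io.StringIO` object would be replaced by something like `open('movies.csv', newline='')`. This still reads each line once, but it avoids the per-row `df.loc` assignments, which are usually the slow part.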
Upvotes: 2