python_enthusiast

Reputation: 346

pandas dataset transformation to normalize the data

I have a CSV file like this: [image: Input DataFrame]

I want to transform it into a pandas DataFrame like this: [image: Output DataFrame]

Basically, I'm trying to normalize the dataset so I can populate a SQL table.

I have used json_normalize to create a separate dataset from the genres column, but I'm at a loss as to how to transform both columns as shown in the pictures above.

Some suggestions would be highly appreciated.

Upvotes: 1

Views: 254

Answers (1)

ManojK

Reputation: 1640

If genre_id is the only numeric value in the genres column (as shown in the picture), you can use the following:

# Find every run of digits in the genres column and join each row's matches into one comma-separated string.
df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)

# Split that string back into a list, then use pandas.DataFrame.explode (pandas 0.25+) to give each genre_id its own row.
df = df.assign(genre_id=df['genre_id'].str.split(',')).explode('genre_id')

# Finally, strip the leading space left over from the ', ' separator.
df['genre_id'] = df['genre_id'].str.lstrip()

# If required, keep only these two columns.
df = df[['id', 'genre_id']]
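
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The sample rows are hypothetical: the original input is only available as a picture, so the layout of the genres column (a string holding a list of dicts with 'id' and 'name' keys) is an assumption. Since str.findall already returns a list per row, this variant passes that list straight to explode and skips the join/split round trip. As in the answer above, it only works if the genre ids are the only digits in the column.

```python
import pandas as pd

# Hypothetical sample data; the real CSV layout is assumed, not known.
df = pd.DataFrame({
    "id": [862, 8844],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 12, 'name': 'Adventure'}]",
    ],
})

# str.findall returns a list of digit strings per row;
# explode then gives each list element its own row.
pairs = (
    df.assign(genre_id=df["genres"].str.findall(r"\d+"))
      .explode("genre_id")[["id", "genre_id"]]
      .reset_index(drop=True)
      # The int cast assumes every row has at least one genre;
      # a row with no match would become NaN and make the cast fail.
      .astype({"genre_id": int})
)

print(pairs)
```

The resulting frame has one row per (id, genre_id) pair, which is the shape a link table in SQL would normally expect.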

Upvotes: 2
