Paul Stoner

Reputation: 1512

pandas how to eliminate duplicate rows before they occur

I have a dataframe consisting of the state name and city name. However, the city names are not simply Pittsburgh, Philadelphia, etc. A city name may also contain what I call prestige names. Here is a small sample:

State            RegionName
Pennsylvania     California (California Uni...
Pennsylvania     Carlisle (Dickinson College)
Pennsylvania     Cecil B. Moore, Philadelphia, also...
...
Pennsylvania     University City, Philadelphia (Drexel Universi...

I need to clean up this data by removing the parenthetical information and similar extras. But my question is this: both Cecil B. Moore and University City are parts of Philadelphia. If I rename these values, then I have two rows of Pennsylvania, Philadelphia in my data set. I don't want that.

So from a data science perspective, is it acceptable for me to simply delete one of these rows and rename the RegionName value in the other? Or is there some way, in pandas, to "combine" these rows after cleanup and renaming?

This data will eventually be married to housing values by state and region name (city).

Thank you

Upvotes: 0

Views: 68

Answers (1)

James

Reputation: 36691

Just ingest all of the rows, then use .drop_duplicates() to remove the duplicate rows from the data frame.
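A minimal sketch of that approach, under some assumptions: the sample rows are made up for illustration (the values in the question are truncated, so they are not reproduced here), and the cleanup rule assumes the city is whatever follows the last comma once any parenthetical is removed. Adjust the rule to match the real data.

```python
import pandas as pd

# Hypothetical sample shaped like the data in the question;
# "Example University" is a placeholder, not a value from the real data set.
df = pd.DataFrame({
    "State": ["Pennsylvania"] * 3,
    "RegionName": [
        "Carlisle (Dickinson College)",
        "Cecil B. Moore, Philadelphia",
        "University City, Philadelphia (Example University)",
    ],
})

cleaned = df.copy()
cleaned["RegionName"] = (
    cleaned["RegionName"]
    # Drop the opening parenthesis and everything after it.
    .str.replace(r"\s*\(.*$", "", regex=True)
    # Assumption: the city is the last comma-separated part.
    .str.split(",").str[-1]
    .str.strip()
)

# Keep one row per (State, RegionName) pair.
deduped = (
    cleaned.drop_duplicates(subset=["State", "RegionName"])
    .reset_index(drop=True)
)
print(deduped)
```

Since the frame only holds State and RegionName at this stage, dropping duplicates loses no information. If the rows carried other columns that needed to be combined rather than discarded, a groupby on the two key columns with an aggregation would likely be the better fit.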

Upvotes: 4
