pandas how to eliminate duplicate rows before they occur

Question

I have a dataframe consisting of the State name and City name. However, the city names are not simply Pittsburgh, Philadelphia, etc. The city name may contain what I call prestige names. Here is a small sample:

State            RegionName
Pennsylvania     California (California Uni...
Pennsylvania     Carlisle (Dickinson College)
Pennsylvania     Cecil B. Moore, Philadelphia, also...
...
Pennsylvania     University City, Philadelphia (Drexel Universi...

I need to clean up this data by removing the parenthetical information and such. But my question is this: both Cecil B. Moore and University City are parts of Philadelphia. If I rename these values, then I have two rows of Pennsylvania Philadelphia in my data set. I don't want that.

So from a data science perspective, is it acceptable for me to simply delete one of these rows and rename the RegionName value in the other? Or is there some way, in pandas, to "combine" these rows after cleanup and renaming?

This data will eventually be married to housing values by state and region name (city).

Thank you

James · Accepted Answer

Just ingest all of the rows, clean up the RegionName values, then use .drop_duplicates() to remove the resulting duplicate rows from the data frame.
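
Here is a minimal sketch of that approach. The sample strings are simplified stand-ins for the truncated values in the question, and the cleanup rule (drop everything from the first parenthesis onward, then keep the part after the last comma) is an assumption about how the RegionName values are formatted, so adjust it to your real data.

```python
import pandas as pd

# Simplified stand-in rows; the real values in the question are truncated,
# so these strings are illustrative assumptions.
df = pd.DataFrame({
    "State": ["Pennsylvania", "Pennsylvania", "Pennsylvania"],
    "RegionName": [
        "Carlisle (Dickinson College)",
        "Cecil B. Moore, Philadelphia",
        "University City, Philadelphia (some university)",
    ],
})

# Assumed cleanup rule:
# 1. remove the parenthetical and anything after it
# 2. keep only the text after the last comma ("neighborhood, city" -> "city")
region = (
    df["RegionName"]
    .str.replace(r"\s*\(.*$", "", regex=True)
    .str.split(",")
    .str[-1]
    .str.strip()
)

# After cleanup, both Philadelphia rows are identical, so drop_duplicates
# keeps only the first of them.
cleaned = (
    df.assign(RegionName=region)
    .drop_duplicates(subset=["State", "RegionName"])
    .reset_index(drop=True)
)

print(cleaned)
```

By default drop_duplicates() keeps the first occurrence of each duplicate. Since the frame only holds State and RegionName, nothing is lost by dropping the extra row, and the result has one row per (State, RegionName) pair to join against the housing data.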
