Merging several string columns with possible duplicates in a pandas dataframe

Question

I'm trying to migrate data between two of our systems and one system has descriptions split across multiple columns while the destination system only has one column. So I need to merge these 5 columns into a single column while removing possible duplicates.

Here is what I have so far, which works, but is there a way to make it faster? Right now it takes a rather long time to iterate over the 13,000 records I'm working on. (Once I add in more data from our other systems in the future the data could easily reach 30,000 records so every second counts)

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]
df[columns] = df[columns].replace(np.nan, "")
description_col = []
for i, r in df.iterrows():
    descriptions = []
    for col in columns:
        if r[col] not in descriptions:
            descriptions.append(r[col])
    description = ""
    for d in descriptions:
        description += "
" + d
    description = description.strip()
    description_col.append(description)
df["Description"] = description_col

So I guess my question really boils down to, is there a better way to do this?

Edit: To clarify, I have to make sure the data is maintained in both systems however the order of the records is not important so long as the data for each record is kept together.

Also, the order of merging the description columns does not matter since most records aren't going to have any data in more than 3 of them at a time. (Most have data in exactly 1 however there are quite a few that have data in 2 or 3 of the columns)

Edit 2: As requested here is some sample data:

columns = [
    "Item.Name",
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
    "Other Data"
]
df = pd.DataFrame([
    ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
    ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
    ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
    ["Name", "", "And here.", "", "", "And here.", "Other Data"]
], columns=columns)

Branden Ciranni · Accepted Answer

You can use pandas.unique(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html

Here is my implementation, I replaced your character with a space so the table would print out without large gaps, just replace that in your code.

import pandas as pd
import numpy as np

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]

rows = [
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['little', 'lamb', 'little', 'lamb', np.nan],
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['whose', 'fleece', 'was', 'white', 'as'],
    ['snow', np.nan, np.nan, np.nan, np.nan]
]

df = pd.DataFrame(data=rows, columns=columns).fillna('')

def merge_row(row):
    return ' '.join(pd.unique(row)).strip()

df['Description'] = list(map(merge_row, df.loc[:,columns].values))

	Item.Asset Description	Item.Fixed Asset Sales Description	Item.Item Description	Item.Purchase Description	Item.Sales Description	Description
0	mary	had	a	little	lamb	mary had a little lamb
1	little	lamb	little	lamb		little lamb
2	mary	had	a	little	lamb	mary had a little lamb
3	whose	fleece	was	white	as	whose fleece was white as
4	snow					snow

Merging several string columns with possible duplicates in a pandas dataframe

Answers (1)

Related Questions