tansonnhut
tansonnhut

Reputation: 3

Merging several string columns with possible duplicates in a pandas dataframe

I'm trying to migrate data between two of our systems and one system has descriptions split across multiple columns while the destination system only has one column. So I need to merge these 5 columns into a single column while removing possible duplicates.

Here is what I have so far, which works, but is there a way to make it faster? Right now it takes a rather long time to iterate over the 13,000 records I'm working on. (Once I add in more data from our other systems in the future the data could easily reach 30,000 records so every second counts)

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]
df[columns] = df[columns].replace(np.nan, "")
description_col = []
for i, r in df.iterrows():
    descriptions = []
    for col in columns:
        if r[col] not in descriptions:
            descriptions.append(r[col])
    description = ""
    for d in descriptions:
        description += "\n" + d
    description = description.strip()
    description_col.append(description)
df["Description"] = description_col

So I guess my question really boils down to, is there a better way to do this?

Edit: To clarify, I have to make sure the data is maintained in both systems however the order of the records is not important so long as the data for each record is kept together.

Also, the order of merging the description columns does not matter since most records aren't going to have any data in more than 3 of them at a time. (Most have data in exactly 1 however there are quite a few that have data in 2 or 3 of the columns)

Edit 2: As requested here is some sample data:

columns = [
    "Item.Name",
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
    "Other Data"
]
df = pd.DataFrame([
    ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
    ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
    ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
    ["Name", "", "And here.", "", "", "And here.", "Other Data"]
], columns=columns)

Upvotes: 0

Views: 129

Answers (1)

Branden Ciranni
Branden Ciranni

Reputation: 492

You can use pandas.unique(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html

Here is my implementation, I replaced your \n character with a space so the table would print out without large gaps, just replace that in your code.

import pandas as pd
import numpy as np

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]

rows = [
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['little', 'lamb', 'little', 'lamb', np.nan],
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['whose', 'fleece', 'was', 'white', 'as'],
    ['snow', np.nan, np.nan, np.nan, np.nan]
]

df = pd.DataFrame(data=rows, columns=columns).fillna('')

def merge_row(row):
    return ' '.join(pd.unique(row)).strip()

df['Description'] = list(map(merge_row, df.loc[:,columns].values))
Item.Asset Description Item.Fixed Asset Sales Description Item.Item Description Item.Purchase Description Item.Sales Description Description
0 mary had a little lamb mary had a little lamb
1 little lamb little lamb little lamb
2 mary had a little lamb mary had a little lamb
3 whose fleece was white as whose fleece was white as
4 snow snow

Upvotes: 1

Related Questions