Reputation: 23
I am trying to remove duplicates on each row in a column, comparing with the correspondent row in the second column (same df) For example, if column initial_data contains value present in additional_data, then remove that value from initial_data row and create a new column with that data.
my df
initial_data additional_data
HJC, casco helmets integral de moto HJC Helmets
Silverstone Technology Concentrador de Datos SilverStone Technology
Urban Leather UR-52 Chaqueta de Cuero URBAN 5884
expected output
new_data
casco integral de moto
Concentrador de Datos
Leather UR-52 Chaqueta de Cuero
df = [['HJC, casco helmets integral de moto', 'HJC Helmets'], ['Silverstone Technology Concentrador de Datos ', 'SilverStone Technology'], ['Urban Leather UR-52 Chaqueta de Cuero', 'URBAN 5884']]
df = pd.DataFrame(df, columns = ['initial_data', 'additional_data'])
A few mentions, I need to keep the same order, same casing, basically, just to remove matching words, without changing anything else.
Thank you so much for any suggestion you may share. I have tried multiple versions before publishing the question but nothing worked (zip, lists, etc).
Upvotes: 2
Views: 1119
Reputation: 3621
This problem is a bit more complex than it might first appear. Generally, I think you need two puzzle pieces to solve it. First, you need a way to iterate through rows in a pandas DataFrame and generate a column value from the contents to two columns, which is where the apply()
function is handy. Use the parameter axis=1
to specify that you're going row by row, not column by column.
Second, you need to break the strings into tokens so you can compare lists. This is sometimes called text alignment. In my example, I'm performing a one-way alignment, checking to see if any terms from the left column are absent from the right but, in theory, there could also be terms in the right column that aren't in the left column. To make sure the words will match, my function compares lowercase versions of all words and omits any punctuation, such as the comma in your example (though maybe you want to keep that?).
import string
mydict = {'initial_data':['HJC, casco helmets integral de moto', 'Silverstone Technology Concentrador de Datos', 'Urban Leather UR-52 Chaqueta de Cuero'],
'additional_data':['HJC Helmets', 'SilverStone Technology', 'URBAN 5884'] }
df = pd.DataFrame(mydict)
def align_columns(row):
left = row['initial_data'].split()
right = row['additional_data'].split()
unmatched = []
for i in left:
word = "".join([z for z in i.lower() if z not in string.punctuation])
if word not in [r.lower() for r in right]:
unmatched.append(i)
return " ".join(unmatched)
df['new_data'] = df.apply(align_columns, axis=1)
df
Upvotes: 2
Reputation: 2479
not much of a comprehensive answer, but enough syntax to get you started:
a = df['initial_data'].str.lower().str.split().explode().reset_index()
b = df['additional_data'].str.lower().str.split().explode().reset_index()
'''
[df] a:
+----+---------+----------------+
| | index | initial_data |
|----+---------+----------------|
| 0 | 0 | hjc |
| 1 | 0 | casco |
| 2 | 0 | helmets |
| 3 | 0 | integral |
| 4 | 0 | de |
| 5 | 0 | moto |
| 6 | 1 | silverstone |
| 7 | 1 | technology |
| 8 | 1 | concentrador |
| 9 | 1 | de |
| 10 | 1 | datos |
| 11 | 2 | urban |
| 12 | 2 | leather |
| 13 | 2 | ur-52 |
| 14 | 2 | chaqueta |
| 15 | 2 | de |
| 16 | 2 | cuero |
+----+---------+----------------+
'''
a.columns=['index', 'new_data']
b.columns=['index', 'new_data']
b = b.loc[b['new_data'].isin(a['new_data'])]
'''
[df] b:
+----+---------+-------------+
| | index | new_data |
|----+---------+-------------|
| 0 | 0 | hjc |
| 1 | 0 | helmets |
| 2 | 1 | silverstone |
| 3 | 1 | technology |
| 4 | 2 | urban |
+----+---------+-------------+
'''
c = pd.concat([a, b], axis=0).drop_duplicates(keep=False) # << KEY IDEA/SYNTAX
'''
[df] c:
+----+---------+--------------+
| | index | new_data |
|----+---------+--------------|
| 1 | 0 | casco |
| 3 | 0 | integral |
| 4 | 0 | de |
| 5 | 0 | moto |
| 8 | 1 | concentrador |
| 9 | 1 | de |
| 10 | 1 | datos |
| 12 | 2 | leather |
| 13 | 2 | ur-52 |
| 14 | 2 | chaqueta |
| 15 | 2 | de |
| 16 | 2 | cuero |
+----+---------+--------------+
'''
c.groupby('index')['new_data'].agg(lambda x: ' '.join(x))
output:
index
0 casco integral de moto
1 concentrador de datos
2 leather ur-52 chaqueta de cuero
Name: new_data, dtype: object
Upvotes: 1