Reputation: 234
I have two datasets. The first contains raw values that must be replaced with the acceptable values listed in the second. If no matching acceptable value is found in the second dataset, the raw value should be left unchanged.
The first dataset looks like this:
SOURCE_ID | TITLE |
---|---|
1 | Emaar Beachfront |
2 | EmaarBeachfront |
3 | emaar beachfront |
4 | dubai hills estate |
5 | Dubai Hills |
6 | Nad Al Sheba |
7 | Nadalsheba |
8 | dubai hills residences |
9 | The Cove Ru |
10 | Homes |
The second dataset looks like this:
ID | TITLE |
---|---|
1 | Emaar Beachfront |
2 | Dubai Hills |
3 | Nad Al Sheba |
4 | The Cove |
In the end, my dataset should look like this:
SOURCE_ID | TITLE |
---|---|
1 | Emaar Beachfront |
2 | Emaar Beachfront |
3 | Emaar Beachfront |
4 | Dubai Hills |
5 | Dubai Hills |
6 | Nad Al Sheba |
7 | Nad Al Sheba |
8 | Dubai Hills |
9 | The Cove |
10 | Homes |
I thought this might be possible with regex, but I am not sure.
Upvotes: 0
Views: 54
Reputation:
One solution could be this:
first = [
    "Emaar Beachfront",
    "EmaarBeachfront",
    "emaar beachfront",
    "dubai hills estate",
    "Dubai Hills",
    "Nad Al Sheba",
    "Nadalsheba",
    "dubai hills residences",
    "The Cove Ru",
    "Homes",
]
second = [
    "Emaar Beachfront",
    "Dubai Hills",
    "Nad Al Sheba",
    "The Cove",
]

# Normalize the acceptable values once: drop spaces and lowercase.
second_transformed = [item.replace(" ", "").lower() for item in second]

out = []
for item in first:
    # Normalize the raw value the same way before comparing.
    item_transformed = item.replace(" ", "").lower()
    item_found = False
    for second_item, second_item_transformed in zip(second, second_transformed):
        # A substring match means the raw value contains the acceptable one.
        if second_item_transformed in item_transformed:
            out.append(second_item)
            item_found = True
            break
    # No acceptable value matched: keep the raw value as it is.
    if not item_found:
        out.append(item)

print(out)
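Since the question mentions regex, here is a rough sketch of the same idea using the `re` module. It assumes that a raw title should match when it contains the words of an acceptable title in order, ignoring case and any whitespace between them. The helper names `build_pattern` and `normalize_titles` are made up for illustration.

```python
import re


def build_pattern(title):
    # Escape each word, then allow any amount of whitespace (including none)
    # between consecutive words; matching is case-insensitive.
    words = [re.escape(word) for word in title.split()]
    return re.compile(r"\s*".join(words), re.IGNORECASE)


def normalize_titles(raw_titles, accepted_titles):
    # Compile one pattern per acceptable title up front.
    patterns = [(build_pattern(title), title) for title in accepted_titles]
    result = []
    for raw in raw_titles:
        for pattern, accepted in patterns:
            if pattern.search(raw):
                result.append(accepted)
                break
        else:
            # The inner loop finished without a match: keep the raw value.
            result.append(raw)
    return result


print(normalize_titles(["Nadalsheba", "dubai hills estate", "Homes"],
                       ["Nad Al Sheba", "Dubai Hills"]))
# ['Nad Al Sheba', 'Dubai Hills', 'Homes']
```

As with the loop above, the first acceptable title that matches wins, so if one acceptable title is contained in another, the order of the second list matters.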
Upvotes: 1