Reputation: 3805
I am working on a dirty dataset where two columnns that I need to match are not properly formatted :
I want to find for which rows "id" is the same as "parent_id". However, I cannot manage to match it like this :
df["is_the_same"] = (df["id"]==df["parent_id"])
cause some of them would not match (for example, the id "01004" has "1004" as parent_id, and would not match in this case)
How can I select columns that have "id" equal to "parent_id" once potential zeroes have been removed ?
I also tried :
df["is_the_same"] = df["id"].str.endswith(df["parent_id"])
But it seems .str.endswith only work with constant strings (another column)
Upvotes: 2
Views: 1170
Reputation: 34076
Using Series.lstrip
, you can strip off leading zeros from id
column and then match it with parent_id
column, like this:
df["id"].str.lstrip('0') == df["parent_id"]
Assume this to be df
:
In [68]: df
Out[68]:
id parent_id
0 01004 1004
1 1004 1004
2 546 100
In [70]: df["is_the_same"] = df["id"].str.lstrip('0') == df["parent_id"]
In [71]: df
Out[71]:
id parent_id is_the_same
0 01004 1004 True
1 1004 1004 True
2 546 100 False
Upvotes: 1
Reputation: 862841
Use list comprehension with endswith
:
df = pd.DataFrame({'id':['01004','1004','54620'], 'parent_id':['1004','203','20']})
df["is_the_same"] = [x.endswith(y) for x, y in df[["id","parent_id"]].values]
#alternative
#df["is_the_same"] = df.apply(lambda x: x["id"].endswith(x["parent_id"]), axis=1)
print (df)
id parent_id is_the_same
0 01004 1004 True
1 1004 203 False
2 54620 20 True
If difference only leading zeros and numbers compare converted values to integers:
df["is_the_same"] = df["id"].astype(int) == df["parent_id"].astype(int)
print (df)
id parent_id is_the_same
0 01004 1004 True
1 1004 203 False
2 54620 20 False
Upvotes: 2