Reputation: 183
I have a column in a df that is full of strings like like ["1120", "2230", "1120", "1234" ...]
where not every value in this column is unique.
I want to convert these strings into SEQUENTIAL NUMBERS from 0 to N where N is the number of unique values in that column so that I can make a scatterplot with this data. Simply changing the type of the column is not sufficient for this task. Any guidance is much appreciated. I tried using dummy variables, but don't really know where to start.
Upvotes: 0
Views: 1002
Reputation: 1573
Just use .unique()
and .reset_index()
to get a lookup table from the strings to sequential ID and then .join()
the tables:
df = pd.DataFrame(["1120", "2230", "1120", "1234"], columns=["num"])
sequential = pd.Series(df["num"].unique()).reset_index().rename(columns={0: "num"})
df.merge(sequential, on="num")
EDIT:
If you want to sort the number strings by their integer values first, you can add this line of code to sort (before you create the sequential Series):
df["num"] = df["num"].astype("int").sort_values().reset_index(drop=True)
Upvotes: 3
Reputation: 51643
You can hack this - I am not good enough to give you a pure pandas implementation, but this might work:
import pandas as pd
# prepare demo df
df = pd.DataFrame({"nums" : ["1120", "2230", "1120", "1234","9999","9"]})
# map(int, ...) needed for numeric sorting instead of lexicographical sorting
s = {str(v):i for i,v in enumerate(sorted(set(map(int, df["nums"]))))}
df["ReIndex"] = [s[key] for key in df["nums"]]
print(df)
Output:
nums ReIndex
0 1120 1
1 2230 3
2 1120 1
3 1234 2
4 9999 4
5 9 0
You can del df["nums"]
and df.rename(columns = {'ReIndex':'nums'}, inplace = True)
to get the numbers to be what you need them to be.
Upvotes: 0