bjwphneb

Reputation: 57

Pandas - Keep the first n characters where n is defined in the column N

I have a DataFrame from a source where the names are repeated back to back without a delimiter to split upon.

Example:

In [1] 
import pandas as pd

data = {"Names": ["JakeJake", "ThomasThomas", "HarryHarry"],
        "Scores": [70, 81, 23]}
df = pd.DataFrame(data)

Out [1]

    Names       Scores
0   JakeJake        70
1   ThomasThomas    81
2   HarryHarry      23

I would like a method to keep just the first half of the 'Names' column. My initial thought was to do the following:

In [2]
df["N"] = df["Names"].str.len()//2
df["X"] = df["Names"].str[:df["N"]]

However this gives the output

Out [2]

          Names  Scores  N    X
0      JakeJake      70  4  nan
1  ThomasThomas      81  6  nan
2    HarryHarry      23  5  nan

The desired output would be

Out [2]

          Names  Scores  N       X
0      JakeJake      70  4    Jake
1  ThomasThomas      81  6  Thomas
2    HarryHarry      23  5   Harry

I'm sure the answer will be something simple but I can't get my head around it. Cheers

Upvotes: 1

Views: 2203

Answers (4)

SeaBean

Reputation: 23217

You can use .map() on column Names, as follows:

df['X'] = df['Names'].map(lambda x: x[:len(x)//2])

Result:

print(df)

          Names  Scores       X
0      JakeJake      70    Jake
1  ThomasThomas      81  Thomas
2    HarryHarry      23   Harry
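
If the cut-off has to come from an existing N column, as in the question's attempt, a row-wise comprehension is one option. As far as I can tell, .str[...] applies one and the same slice to every row, so it cannot take a per-row stop from another Series, which would explain the NaN values. The following is only a minimal sketch under that assumption, reusing the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({"Names": ["JakeJake", "ThomasThomas", "HarryHarry"],
                   "Scores": [70, 81, 23]})

# N holds the number of characters to keep for each row.
df["N"] = df["Names"].str.len() // 2

# Pair every name with its own N and slice row by row.
df["X"] = [name[:n] for name, n in zip(df["Names"], df["N"])]

print(df)
```

This keeps N as an explicit column, which may be useful if the lengths come from somewhere other than len(x)//2.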

Upvotes: 2

Umar.H

Reputation: 23099

Use a regex to split the camel case: the pattern splits at every position that is preceded by a lowercase letter and followed by an uppercase letter.

n = df['Names'].str.split(r'(?<=[a-z])(?=[A-Z])', expand=True)[0]
df['X'], df['N'] = n, n.str.len()

print(df)

          Names  Scores       X  N
0      JakeJake      70    Jake  4
1  ThomasThomas      81  Thomas  6
2    HarryHarry      23   Harry  5
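
One caveat with the camel-case split: it assumes each name contains exactly one capital letter. The short check below uses the standard library re module with the same pattern; the name "McCoyMcCoy" is a made-up example, not from the question's data.

```python
import re

# Zero-width split point: lowercase letter on the left, uppercase on the right.
pattern = r"(?<=[a-z])(?=[A-Z])"

simple = re.split(pattern, "JakeJake")
tricky = re.split(pattern, "McCoyMcCoy")  # hypothetical name with an inner capital

print(simple)  # ['Jake', 'Jake']
print(tricky)  # ['Mc', 'Coy', 'Mc', 'Coy']
```

With a name like that, taking element [0] would return only "Mc", so the half-length slice is probably the safer choice when the names are not guaranteed to be simple.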

Upvotes: 0

Mustafa Aydın

Reputation: 18315

With a regex to extract names and str.len for the lengths:

df["X"] = df.Names.str.extract(r"^(.+)\1$")
df["N"] = df.X.str.len()

Here the regex matches only when the whole string is some text repeated exactly twice (\1 is a backreference to the first capturing group).

>>> df

          Names  Scores       X  N
0      JakeJake      70    Jake  4
1  ThomasThomas      81  Thomas  6
2    HarryHarry      23   Harry  5
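
A side effect worth knowing: rows that are not an exact repetition do not match and come back as NaN rather than being cut in half. A small sketch with a made-up Series (the values below are illustrative only), using expand=False so that a Series is returned:

```python
import pandas as pd

# Hypothetical sample: the middle entry is not a doubled name.
names = pd.Series(["JakeJake", "Thomas", "AnnaAnna"])

# expand=False returns a Series for a single capturing group.
halves = names.str.extract(r"^(.+)\1$", expand=False)

print(halves)
```

That can be handy for spotting rows where the doubling assumption does not hold.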

Upvotes: 2

ThePyGuy

Reputation: 18426

You can use apply on the Names column to slice each string down to its first half.

>>> df.assign(x=df['Names'].apply(lambda x: x[:len(x)//2]))

          Names  Scores       x
0      JakeJake      70    Jake
1  ThomasThomas      81  Thomas
2    HarryHarry      23   Harry

Upvotes: 2
