hello kee
hello kee

Reputation: 319

Applying Regex across entire column of a Dataframe

I have a Dataframe with 3 columns:

id,name,team 
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n

I am trying to apply a regex function such that I remove the unnecessary spaces. I have got the code that removes these spaces how ever I am unable loop it through the entire Dataframe.

This is what I have tried thus far:

df['team'] = re.sub(r'[\n\r]*','',df['team'])

But this throws an error AttributeError: 'Series' object has no attribute 're'

Could anyone advice how could I loop this regex through the entire Dataframe df['team'] column

Upvotes: 10

Views: 59778

Answers (4)

Chadee Fouad
Chadee Fouad

Reputation: 2948

Here's a powerful technique to replace multiple words in a pandas column in one step without loops. In my code I wanted to eliminate things like 'CORPORATION', 'LLC' etc. (all of them is in the RemoveDB.csv file) from my column without using a loop. In this scenario I'm removing 40 words from the entire column in one step.

RemoveDB = pd.read_csv('RemoveDBcsv')
RemoveDB = RemoveDB['REMOVE'].tolist()
RemoveDB = '|'.join(RemoveDB)
pattern = re.compile(RemoveDB)    
df['NAME']= df['NAME'].str.replace(pattern,'', regex = True)

Upvotes: 1

josem8f
josem8f

Reputation: 339

As long it's a dataframe check replace https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

df['team'].replace( { r"[\n\r]+" : '' }, inplace= True, regex = True)

Regarding the regex, '*' means 0 or more, you should need '+' which is 1 or more

Upvotes: 13

user1966723
user1966723

Reputation: 95

Another example (but without regex) but maybe still usefull for someone.

id = pd.Series(['101','102','103'])
name = pd.Series(['kevin','scott','peter'])
team = pd.Series(['     marketing','admin\n', 'finance\n'])

testsO = pd.DataFrame({'id': id, 'name': name, 'team': team})
print(testsO)
testsO['team'] = testsO['team'].str.strip()
print(testsO)

Upvotes: 0

YOLO
YOLO

Reputation: 21759

You are almost there, there are two simple ways of doing this:

# option 1 - faster way
df['team'] =  [re.sub(r'[\n\r]*','', str(x)) for x in df['team']]

# option 2
df['team'] =  df['team'].apply(lambda x: re.sub(r'[\n\r]*','', str(x)))

Upvotes: 23

Related Questions