will2609

Reputation: 33

How can I delete specified words that occur in a list?

I have a data-frame that has text in the first column named 'original_column'.

I have successfully been able to pick specific words out of the text column 'original_column' with a list and have them appended to another column and deleted from the original column with the following code:

list1 = {'text', 'and', 'example'}

finder = lambda x: next(iter([y for y in x.split() if y in list1]), None)

df['list1'] = df.original_column.apply(finder)

df['original_column'] = df['original_column'].replace(regex=r'(?i)' + df['list1'], value="")

I would now like to build on this code so that it deletes ONLY THE FIRST instance of the matched word from 'original_column' after appending that word to the new column.

The data-frame currently looks like this:

|   original column  |
__________________________
|   text text word   | 
--------------------------
|    and other and   | 

My current code outputs this:

|   original column   | list1
______________________________
|        word         | text
------------------------------
|        other        |  and

My desired output is this:

|   original column   | list1
_______________________________
|      text word      | text
-------------------------------
|      other and      |  and

Upvotes: 2

Views: 52

Answers (2)

Shubham Sharma

Reputation: 71689

Assuming the given dataframe is:

df = pd.DataFrame({"original_column": ["text text word", "text and text"]})

Use:

import re

pattern = '|'.join(rf"\s*{item}\s*" for item in list1)  # raw f-string so \s is not treated as an invalid escape
regex = re.compile(pattern)

def extract_words(s):
    s['list1'] = ' '.join(map(str.strip, regex.findall(s['original_column'])))
    s['original_column'] = regex.sub(' ', s['original_column']).strip()
    return s

df = df.apply(extract_words, axis=1)
print(df)

This results in the following dataframe df:

  original_column list1
0       text text  word
1       text text   and
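
If the goal is to remove only the first occurrence, as the question asks, one option is re.sub with count=1. The following is a minimal sketch, not a definitive implementation. It assumes the sample rows from the question, and split_first_match is a hypothetical helper name introduced here for illustration:

```python
import re
import pandas as pd

# Assumed sample data mirroring the question's example rows.
df = pd.DataFrame({"original_column": ["text text word", "and other and"]})
list1 = {"text", "and", "example"}

def split_first_match(s, words=list1):
    """Hypothetical helper: return (remaining_text, matched_word).

    matched_word is None when no token of s is in words.
    """
    # First whitespace-separated token found in the word set (case-insensitive).
    word = next((w for w in s.split() if w.lower() in words), None)
    if word is None:
        return s, None
    # count=1 limits the substitution to the first whole-word occurrence.
    remaining = re.sub(rf"\b{re.escape(word)}\b", "", s, count=1)
    # Collapse the whitespace left behind by the removal.
    return " ".join(remaining.split()), word

pairs = [split_first_match(s) for s in df["original_column"]]
df["original_column"] = [p[0] for p in pairs]
df["list1"] = [p[1] for p in pairs]
print(df)
```

Here the word boundaries (\b) keep a listed word from being removed out of the middle of a longer word, and rows without any listed word are left unchanged with None in list1.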

Upvotes: 0

BENY

Reputation: 323226

Let us do replace

df['original column']=df['original column'].replace(regex=r'(?i)'+ df['list1'],value="")
df
Out[101]: 
  original column list1
0      text text   word
1      text  text   and
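
Series.replace has no argument that limits the number of substitutions, so to delete only the first occurrence a vectorized alternative is Series.str.replace with n=1. This is a sketch under assumptions: it uses the sample rows from the question, the column name original_column from the question text, and a single combined pattern built from list1:

```python
import re
import pandas as pd

# Assumed sample data mirroring the question's example rows.
df = pd.DataFrame({"original_column": ["text text word", "and other and"]})
list1 = {"text", "and", "example"}

# One case-insensitive alternation with word boundaries; sorted() only makes
# the pattern deterministic, since sets have no fixed order.
pattern = r"(?i)\b(" + "|".join(map(re.escape, sorted(list1))) + r")\b"

# Capture the first listed word found in each row (NaN when none is found).
df["list1"] = df["original_column"].str.extract(pattern, expand=False)

# n=1 removes only the first match per row; then tidy the leftover whitespace.
df["original_column"] = (
    df["original_column"]
    .str.replace(pattern, "", n=1, regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(df)
```

Because the extraction and the removal use the same pattern, the word stored in list1 is the same occurrence that is deleted from original_column.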

Upvotes: 1
