Pavi

Reputation: 1

Using regex to search for patterns in a list stored in a DataFrame and placing the matches in a new column in pandas

I have a CSV file with a text column. Sample data is shown below:

text
['Hello world', 'Welcome to the universe.']
['Hey Hello world', 'I am learning Pandas Welcome to the universe.']
['Hello world how are you', 'Good Morning', 'I am learning Pandas.']
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']

I want to iterate over each row and check every sentence in its list for these patterns:

If a sentence contains the pattern Hello or world, I want to place that sentence in a new column, text1.
If a sentence contains the pattern Welcome or universe, I want to place that sentence in a new column, text2.

After searching for the patterns and placing the matches in the new columns, my output should look like this:

text,text1,text2
['Hello world', 'Welcome to the universe.'],Hello world,Welcome to the universe.
['Hey Hello world', 'I am learning Pandas Welcome to the universe.'],Hey Hello world,I am learning Pandas Welcome to the universe.
['Hello world how are you', 'Good Morning', 'I am learning Pandas.'],Hello world how are you,None
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.'],None,Iam version 3.6 Welcome

Can anyone please guide me on how to do this?

Upvotes: 0

Views: 73

Answers (1)

tlentali

Reputation: 3455

Starting from your DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'text': ["['Hello world', 'Welcome to the universe.']",
...                             "['Hey Hello world', 'I am learning Pandas Welcome to the universe.']",
...                             "['Hello world how are you', 'Good Morning', 'I am learning Pandas.']",
...                             "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']"]}, 
...                   index = [0, 1, 2, 3])
>>> df
    text
0   ['Hello world', 'Welcome to the universe.']
1   ['Hey Hello world', 'I am learning Pandas Welc...
2   ['Hello world how are you', 'Good Morning', 'I...
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...

We can apply two functions, find_substring_text1 and find_substring_text2, to the text column after each cell has been converted from a string to a list with eval:

>>> def find_substring_text1(row):
...     return [s for s in row if any(k in s for k in ['Hello', 'world'])]
    
>>> def find_substring_text2(row):
...     return [s for s in row if any(k in s for k in ['Welcome', 'universe'])]

>>> df['text1'] = df['text'].apply(eval).apply(find_substring_text1)
>>> df['text2'] = df['text'].apply(eval).apply(find_substring_text2)

This gives the expected result:

>>> df
    text                                                text1                       text2
0   ['Hello world', 'Welcome to the universe.']         [Hello world]               [Welcome to the universe.]
1   ['Hey Hello world', 'I am learning Pandas Welc...   [Hey Hello world]           [I am learning Pandas Welcome to the universe.]
2   ['Hello world how are you', 'Good Morning', 'I...   [Hello world how are you]   []
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...   []                          [Iam version 3.6 Welcome]
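
Since the question asks for regex, here is a minimal sketch of the same filtering done with the re module instead of substring checks. It also parses each cell with ast.literal_eval, which only accepts Python literals, whereas eval would execute any code found in the CSV. The names PATTERN_TEXT1, PATTERN_TEXT2, parse_cell and find_matches are hypothetical, and the \b word boundaries are an assumption about the matching you want (drop them to match inside longer words too).

```python
import ast
import re

# Hypothetical pattern names; \b keeps "world" from matching inside "worldwide".
PATTERN_TEXT1 = re.compile(r'\b(?:Hello|world)\b')
PATTERN_TEXT2 = re.compile(r'\b(?:Welcome|universe)\b')


def parse_cell(cell):
    """Turn a string such as "['a', 'b']" into a real list of strings."""
    # literal_eval only accepts Python literals, so it cannot run arbitrary code.
    return ast.literal_eval(cell)


def find_matches(sentences, pattern):
    """Return the sentences in which the compiled pattern is found."""
    return [s for s in sentences if pattern.search(s)]


# Two cells copied from the sample data, still in their string form.
cells = [
    "['Hello world', 'Welcome to the universe.']",
    "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']",
]
for cell in cells:
    sentences = parse_cell(cell)
    print(find_matches(sentences, PATTERN_TEXT1),
          find_matches(sentences, PATTERN_TEXT2))
```

On the DataFrame, the same helpers could be used as df['text'].apply(parse_cell).apply(lambda row: find_matches(row, PATTERN_TEXT1)), and likewise with PATTERN_TEXT2 for text2.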

If needed, the lists can then be converted to plain strings like so:

>>> df['text1'] = [','.join(map(str, l)) for l in df['text1']]
>>> df['text2'] = [','.join(map(str, l)) for l in df['text2']]
>>> df
    text                                                text1                    text2
0   ['Hello world', 'Welcome to the universe.']         Hello world              Welcome to the universe.
1   ['Hey Hello world', 'I am learning Pandas Welc...   Hey Hello world          I am learning Pandas Welcome to the universe.
2   ['Hello world how are you', 'Good Morning', 'I...   Hello world how are you 
3   ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...                            Iam version 3.6 Welcome
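
Note that joining an empty list yields an empty string, while the expected output in the question shows None for rows without a match. A minimal sketch of one way to get that, assuming the columns still hold lists (that is, used in place of the join step above); join_or_none is a hypothetical helper name:

```python
def join_or_none(matches, sep=','):
    """Join matched sentences into one string, or return None when there are none."""
    # An empty list is falsy, so rows without a match become None.
    return sep.join(matches) if matches else None


print(join_or_none(['Hello world how are you']))  # Hello world how are you
print(join_or_none([]))                           # None
```

It could be applied per column with df['text1'] = df['text1'].apply(join_or_none), and the same for text2.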

Upvotes: 1
