Pandas - Extract Text from Rows

Let's say I have a dataframe that looks like this:

import pandas as pd

df2 = pd.DataFrame({
    'A': ['Apple, 10/01/2016, 31/10/18, david/kate', 'orange', 'pear', 'Apple', '10/01/2016', '02/20/2017'],
    'file_name': ['a.txt', 'a.txt', 'b.txt', 'a.txt', 'd.txt', 'e.txt'],
})

>>> df2

                                         A       file_name
0  Apple, 10/01/2016, 31/10/18, david/kate          a.txt
1                                   orange          a.txt
2                                     pear          b.txt
3                                    Apple          a.txt
4                               10/01/2016          d.txt
5                               02/20/2017          e.txt

I would like to extract just the dates from column A, so the output would look like this:

                        A        file_name
0    10/01/2016, 31/10/18           a.txt
1    Nothing to return              a.txt
2    Nothing to return              b.txt
3    Nothing to return              a.txt
4    10/01/2016                     d.txt
5    02/20/2017                     e.txt

Does anyone have any suggestions on how to do this? I am not sure where to begin.

Edit #1:

I edited my original dataframe and output results to better reflect what I am looking for.

Upvotes: 1

Views: 623

Answers (3)

mad_

Reputation: 8273

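import re

# Date-like token: digits/digits/digits (requires at least one digit per part).
DATE_RE = re.compile(r'\d+/\d+/\d+')

def my_func(row):
    # Split the cell on commas and keep every piece that starts with a date-like token.
    dates = []
    for d in row.split(","):
        match = DATE_RE.match(d.strip())
        if match:
            dates.append(match.group(0))
    if dates:
        return ', '.join(dates)
    return "Nothing to return"

df2['A'] = df2['A'].apply(my_func)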

Output:

                        A        file_name
0    10/01/2016, 31/10/18           a.txt
1    Nothing to return              a.txt
2    Nothing to return              b.txt
3    Nothing to return              a.txt
4    10/01/2016                     d.txt
5    02/20/2017                     e.txt

Upvotes: 1

BENY

Reputation: 323316

Use str.extractall to pull out the dates, then add reindex(df2.index).fillna('Nothing to return') to restore the rows that had no match:

df2.A.str.extractall(r'(((?:\d+[/-])?\d+[/-]\d+))')[0].groupby(level=0).apply(','.join)
Out[459]: 
0    10/01/2016,31/10/18
4             10/01/2016
5             02/20/2017
Name: 0, dtype: object

Update

df2.A.str.extractall(r'(((?:\d+[/-])?\d+[/-]\d+))')[0].groupby(level=0).apply(','.join).reindex(df2.index).fillna('Nothing to return')
Out[463]: 
0    10/01/2016,31/10/18
1      Nothing to return
2      Nothing to return
3      Nothing to return
4             10/01/2016
5             02/20/2017
Name: 0, dtype: object
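
To end up with the frame from the question rather than a standalone Series, the result can be assigned back to column A; pandas aligns on the index, so file_name is left untouched. A minimal sketch, assuming the file_name values shown in the question and a simplified single-group variant of the pattern above (not the exact regex used in this answer):

```python
import pandas as pd

df2 = pd.DataFrame({
    'A': ['Apple, 10/01/2016, 31/10/18, david/kate', 'orange', 'pear',
          'Apple', '10/01/2016', '02/20/2017'],
    'file_name': ['a.txt', 'a.txt', 'b.txt', 'a.txt', 'd.txt', 'e.txt'],
})

# Assumed simplification: one capture group with an optional leading
# "number/" or "number-" part, followed by number<sep>number.
pattern = r'((?:\d+[/-])?\d+[/-]\d+)'

dates = (
    df2['A'].str.extractall(pattern)[0]   # one row per match, MultiIndex (row, match)
    .groupby(level=0).agg(', '.join)      # back to one string per original row
    .reindex(df2.index)                   # rows with no match come back as NaN
    .fillna('Nothing to return')
)

# Assignment aligns on the index, so the other columns are unaffected.
df2['A'] = dates
print(df2)
```

Joining with ', ' instead of ',' reproduces the spacing in the desired output.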

Upvotes: 1

Denziloe

Reputation: 8132

This doesn't exactly match your desired output, but the structure (one row per date) is probably better and can easily be converted into what you want.

Basically this is a job for regex. This code should find anything of the form number/number/number:

s = df2["A"]
result = s.str.extractall(r"(\d+/\d+/\d+)")[0]
print(result)

   match
0  0        10/01/2016
   1          31/10/18
4  0        10/01/2016
5  0        02/20/2017
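
To collapse this long-format result into the one-string-per-row layout from the question, the matches can be grouped on the outer index level. A rough sketch, assuming df2 is built as in the question (only column A is reconstructed here); the names per_row and desired are illustrative, not part of the original answer:

```python
import pandas as pd

df2 = pd.DataFrame({
    'A': ['Apple, 10/01/2016, 31/10/18, david/kate', 'orange', 'pear',
          'Apple', '10/01/2016', '02/20/2017'],
})

result = df2['A'].str.extractall(r'(\d+/\d+/\d+)')[0]

# Level 0 of the MultiIndex is the original row label;
# level 1 ("match") numbers the hits within that row.
per_row = result.groupby(level=0).agg(', '.join)

# Rows without any date are absent from per_row, so add them back
# with the placeholder text from the question.
desired = per_row.reindex(df2.index, fill_value='Nothing to return')
print(desired)
```

Keeping the long format until the last step leaves each date as its own value, which is convenient if the dates need further processing before being joined.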

Upvotes: 2
