Reputation: 38542

pandas find all exact 4 consecutive digits from string

I have a large text file of strings, which I load into a pandas DataFrame like the one below. From the pattern column I want to extract every run of consecutive digits whose length is exactly 4.

For example, a1234bc5678 is accepted and yields two 4-digit patterns, 1234 and 5678, but a12345 is rejected because its run of consecutive digits has length 5, not 4.

print (df.sample(20))

              pattern
13457358     187019980
9892646         920204
2258941        dong998
5792706     diao511001
9144372       a2805938
15519502       YUEH008
15831448     752099429
15659305     469919209
13769825      majunsui
3446320       sishenD2
12970622     woaini123
11633295      guswjddl
12708217  342423198706
2079106     zj87755202
12551254   mxt19950626
4572063        1985625
7805173     theend0512
484820      jzm5583385
15017582       1981122
10868176      30061984

What I have tried: the code below works on a single string, but it produces the wrong output because \d{4} also matches the first four digits of a longer run, and I only want runs of exactly 4 digits. I also need to do this on a pandas DataFrame, if possible.

import re

text = '1234sunwei198734'
# \d{4} also matches inside longer digit runs (here it picks up 1987 from 198734)
postcodes = re.findall(r'\d{4}', text)
print(postcodes)

In the end I did it this way:

df2['pins'] = df2['pattern'].apply(lambda x: re.findall(r'(?<!\d)\d{4}(?!\d)', x))
df3 = df2[df2['pins'].apply(len) > 0]
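
To make the difference between the two patterns concrete, here is a minimal sketch using only the standard re module on the example strings from above. The helper name exact_four is hypothetical and only serves the illustration.

```python
import re

# The lookarounds reject any 4-digit window that has a digit on either side.
EXACT_FOUR = re.compile(r'(?<!\d)\d{4}(?!\d)')

def exact_four(text):
    """Return every run of exactly four consecutive digits in text."""
    return EXACT_FOUR.findall(text)

accepted = exact_four('a1234bc5678')     # two separate 4-digit runs
rejected = exact_four('a12345')          # a 5-digit run, so nothing matches
naive = re.findall(r'\d{4}', 'a12345')   # plain \d{4} still grabs '1234'

print(accepted, rejected, naive)
```

The negative lookbehind (?<!\d) and negative lookahead (?!\d) consume no characters, so two valid runs separated by letters are both found.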

Upvotes: 1

Answers (3)

Karn Kumar

Reputation: 8826

The regex has already been given, so this answers your comment about applying it to a pandas column.

Sample dataset from your post:

>>> df1
         pattern
0      187019980
1         920204
2        dong998
3     diao511001
4       a2805938
5        YUEH008
6      752099429
7      469919209
8       majunsui
9       sishenD2
10     woaini123
11      guswjddl
12  342423198706
13    zj87755202
14   mxt19950626
15       1985625
16    theend0512
17    jzm5583385
18       1981122
19      30061984

To apply the regex to a pandas column (pattern here), you can use Series.str.findall, which returns a list of matches for each row:

>>> df1['pattern'].str.findall(r'(?<!\d)\d{4}(?!\d)')
0         []
1         []
2         []
3         []
4         []
5         []
6         []
7         []
8         []
9         []
10        []
11        []
12        []
13        []
14        []
15        []
16    [0512]    <-- this is your matched pattern
17        []
18        []
19        []
Name: pattern, dtype: object

Taking the first element of each list with .str[0] turns the empty lists into NaN, so you can then drop them with dropna() and keep only the matching values. Note that .str[0] keeps only the first match in each row.

>>> df1['pattern'].str.findall(r'(?<!\d)\d{4}(?!\d)').str[0].dropna()
# Alternative (returns a DataFrame): df1['pattern'].str.extract(r'(?<!\d)(\d{4})(?!\d)').dropna()
16    0512

A better approach for the last part of your post:

You don't need to import the re module for this; pandas string methods do the same job, as shown below. The choice is yours, though.

>>> df1['pins'] = df1['pattern'].str.findall(r'(?<!\d)\d{4}(?!\d)')
>>> df1[df1['pins'].apply(lambda x: len(x)) > 0]
       pattern    pins
16  theend0512  [0512]
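
As a small variation, the lambda can probably be avoided as well, since .str.len() also works on a column of lists. A sketch under that assumption, again on an illustrative sample and not the full dataset:

```python
import pandas as pd

# Illustrative sample built from strings that appear in the question.
df = pd.DataFrame({'pattern': ['theend0512', 'woaini123', '30061984', 'a1234bc5678']})

df['pins'] = df['pattern'].str.findall(r'(?<!\d)\d{4}(?!\d)')

# .str.len() gives the number of matches per row; keep rows with at least one.
matched = df[df['pins'].str.len() > 0]
print(matched)
```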

Upvotes: 2