Reputation: 38542
I have a large text file of text patterns, which I load into a pandas DataFrame as shown below. From the pattern column I want to select the entries that contain a run of consecutive digits whose length is exactly 4.
For example, a1234bc5678 is accepted, and it yields two 4-digit patterns, 1234 and 5678. But a12345 is not accepted, because its run of consecutive digits has length 5, not 4.
print (df.sample(20))
pattern
13457358 187019980
9892646 920204
2258941 dong998
5792706 diao511001
9144372 a2805938
15519502 YUEH008
15831448 752099429
15659305 469919209
13769825 majunsui
3446320 sishenD2
12970622 woaini123
11633295 guswjddl
12708217 342423198706
2079106 zj87755202
12551254 mxt19950626
4572063 1985625
7805173 theend0512
484820 jzm5583385
15017582 1981122
10868176 30061984
What I have tried: the snippet below works on a single string, but it produces the wrong output, because I only want runs of exactly 4 consecutive digits. Also, I would like to do this on the pandas DataFrame if possible.
import re

text = '1234sunwei198734'
postcodes = re.findall(r'\d{4}', text)
print(postcodes)  # ['1234', '1987'], but 1987 is only part of the 6-digit run 198734
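To make the problem concrete, here is a small sketch comparing the plain pattern with one that uses a negative lookbehind and a negative lookahead, so that a match can neither start nor end in the middle of a longer run of digits. The variable names `plain` and `exact` are illustrative only.

```python
import re

text = '1234sunwei198734'

# Plain \d{4} happily takes the first four digits of any longer run.
plain = re.findall(r'\d{4}', text)

# (?<!\d) forbids a digit just before the match, (?!\d) forbids one just after.
exact = re.findall(r'(?<!\d)\d{4}(?!\d)', text)

print(plain)  # ['1234', '1987']
print(exact)  # ['1234']
```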
In the end, I did it this way:
df2['pins'] = df2['pattern'].apply(lambda x: re.findall(r'(?<!\d)\d{4}(?!\d)', x))
df3 = df2[df2['pins'].apply(lambda x: len(x)) > 0]
Upvotes: 1
Views: 1213
Reputation: 8826
The regex has already been given, but to answer your comment, here is how to apply it to a pandas column.
>>> df1
pattern
0 187019980
1 920204
2 dong998
3 diao511001
4 a2805938
5 YUEH008
6 752099429
7 469919209
8 majunsui
9 sishenD2
10 woaini123
11 guswjddl
12 342423198706
13 zj87755202
14 mxt19950626
15 1985625
16 theend0512
17 jzm5583385
18 1981122
19 30061984
Applying the regex to the pandas column, i.e. pattern:
You can use the syntax below, which returns a list of matches for each row:
>>> df1['pattern'].str.findall(r'(?<!\d)\d{4}(?!\d)')
0 []
1 []
2 []
3 []
4 []
5 []
6 []
7 []
8 []
9 []
10 []
11 []
12 []
13 []
14 []
15 []
16 [0512] <-- this is your matched pattern
17 []
18 []
19 []
Name: pattern, dtype: object
So you can take the first element of each list with .str[0], which becomes NaN for the empty lists,
and then drop those rows, since you only need the matching values:
>>> df1['pattern'].str.findall(r'(?<!\d)\d{4}(?!\d)').str[0].dropna()
# Alternative: df1['pattern'].str.extract(r'(?<!\d)(\d{4})(?!\d)').dropna()
16 0512
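Note that `.str[0]` keeps only the first match in each row, so a value such as a1234bc5678 from the question would lose its second match. One possible way to keep every match, assuming pandas 0.25 or later (where `Series.explode` is available), is sketched below. The small frame and the name `pins` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, reusing the examples from the question.
df = pd.DataFrame({'pattern': ['a1234bc5678', 'a12345', 'theend0512']})

pins = (
    df['pattern']
    .str.findall(r'(?<!\d)\d{4}(?!\d)')  # list of matches per row
    .explode()                           # one row per match; empty lists become NaN
    .dropna()                            # drop rows that had no match
)

print(pins.tolist())        # ['1234', '5678', '0512']
print(pins.index.tolist())  # [0, 0, 2]
```

The index of the result still points back to the original rows, so the matches can be joined to the source frame if needed.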
As for your own approach: importing the re module
is not required. You can simply do it as follows, though the choice is yours :-)
>>> df1['pins'] = df1['pattern'].str.findall(r'(?<!\d)\d{4}(?!\d)')
>>> df1[df1['pins'].apply(lambda x: len(x)) > 0]
pattern pins
16 theend0512 [0512]
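If only the filtered rows are needed and not the list of matches, a boolean mask from `Series.str.contains` may be enough. This is a sketch on a small made-up frame, not the original data. The names `mask` and `matched` are illustrative.

```python
import pandas as pd

# Hypothetical sample data; a few values are taken from the question.
df = pd.DataFrame({'pattern': ['dong998', 'theend0512', '30061984', 'a1234bc5678']})

# True where the string contains at least one run of exactly four digits.
mask = df['pattern'].str.contains(r'(?<!\d)\d{4}(?!\d)', regex=True)
matched = df[mask]

print(matched['pattern'].tolist())  # ['theend0512', 'a1234bc5678']
```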
Upvotes: 2
Reputation: 2271
I think the best regex for the task would be
(?<!\d)(1234|2345|3456|4567|5678|6789|7890|8901|9012|0123)(?!\d)
Of course, you could remove the alternatives you do not want.
If you do not want to list the 4 consecutive digit patterns explicitly, you can use a less effective pattern like:
\D(\d{4})\D
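One reason this pattern is less effective is that each \D has to consume a real non-digit character. A run at the very start or end of the string is therefore missed, and so is a second run that shares its separator with the first. The following sketch compares it with the lookaround version on a few made-up strings. The names `consuming` and `lookaround` are illustrative.

```python
import re

consuming = re.compile(r'\D(\d{4})\D')
lookaround = re.compile(r'(?<!\d)(\d{4})(?!\d)')

samples = ['a1234bc', '1234abc', 'a1234b5678c']

# Map each sample to the matches found by each pattern.
results = {s: (consuming.findall(s), lookaround.findall(s)) for s in samples}

for s, (c, l) in results.items():
    print(s, c, l)
# a1234bc ['1234'] ['1234']
# 1234abc [] ['1234']
# a1234b5678c ['1234'] ['1234', '5678']
```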
Upvotes: 2
Reputation: 43169
Following your description, you might be looking for
(?<!\d)\d{4}(?!\d)
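As a rough illustration of how the three parts of this pattern work together, here is the same regex written in verbose mode and checked against the two examples from the question. The names `FOUR_DIGITS` and `four_digit_runs` are hypothetical helpers, not part of any library.

```python
import re

FOUR_DIGITS = re.compile(r"""
    (?<!\d)   # the match must not be preceded by a digit
    \d{4}     # four digits
    (?!\d)    # the match must not be followed by a digit
""", re.VERBOSE)

def four_digit_runs(text):
    """Return every run of exactly four consecutive digits in text."""
    return FOUR_DIGITS.findall(text)

print(four_digit_runs('a1234bc5678'))  # ['1234', '5678']
print(four_digit_runs('a12345'))       # []
```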
Upvotes: 4