logziii

Reputation: 93

Extract pattern from text

I want to extract a pattern that is a bit complex.

I want to extract alphanumeric strings of at least 5 and at most 9 characters from the Text column and put them in a new column. If a row contains several matches, they should be joined in comma-separated format.

A match may start with either a letter or a digit, but I don't want matches that start with D or DF_.

df = data.frame(Text=c(("in which some columns are 1A265T up for some rows."),
                    ("It's too large to 12345AB MB eyeball in order to identify D12345AB"),
                    ("some data to the axis A6651F correct columns for these rows"),
                    ("Any output that would allow me to identify that AJ_DF125AA12."),
                    ("how do I find some locations 564789.")))`enter code here`  

Desired output is:

   Text                                                                 Pattern
1  in which some columns are 1A265T , SDFG123 up for some rows.         1A265T , SDFG123
2  It's too large to 12345AB MB eyeball in order to identify D12345AB   12345AB
3  some data to the axis A6651F correct columns for these rows          A6651F
4  Any output that would allow me to identify that AJ_DF125AA12.        NA
5  how do I find some locations 564789.                                 564789

I have tried the str_detect function:

df %>% 
  filter(str_detect(Text, ".+[A-Z0-9,]+"))

Does anybody know the correct way to do this?

Upvotes: 2

Views: 81

Answers (2)

Wiktor Stribiżew

Reputation: 626794

You may use

df = data.frame(Text=c(("in which some columns are 1A265T , SDFG123 up for some rows."),
                     ("It's too large to 12345AB MB eyeball in order to identify D12345AB"),
                     ("some data to the axis A6651F correct columns for these rows"),
                     ("Any output that would allow me to identify that AJ_DF125AA12."),
                     ("how do I find some locations 564789.")))

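library(stringr)

# Extract every match per row and join multiple matches with commas
df$Pattern <- sapply(str_extract_all(df$Text, "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"), paste, collapse=",")
# Rows without any match yield an empty string; turn those into NA
df[df==''] <- NA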

Output:

                                                            Text        Pattern
1       in which some columns are 1A265T , SDFG123 up for some rows. 1A265T,SDFG123
2 It's too large to 12345AB MB eyeball in order to identify D12345AB        12345AB
3        some data to the axis A6651F correct columns for these rows         A6651F
4      Any output that would allow me to identify that AJ_DF125AA12.             NA
5                               how do I find some locations 564789.         564789

The regex matches

  • \b - a word boundary
  • [A-CE-Z0-9] - ASCII digits or uppercase letters other than D
  • [A-Z0-9]{4,8} - four to eight ASCII digits or uppercase letters
  • \b - a word boundary.
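To see how the pieces listed above interact, here is a minimal sketch that checks the pattern against a few hand-made strings. The names `pattern`, `samples` and `matched` are only illustrative, and the sketch uses base R's `grepl` with `perl = TRUE` instead of stringr so that it runs without extra packages.

```r
# Same regex as above, written as an R string (backslashes doubled)
pattern <- "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"

# Hand-made test strings (assumed for illustration, not from the question)
samples <- c(
  "AB12",          # 4 characters: too short
  "AB123",         # 5 characters: shortest allowed
  "ABCDE1234",     # 9 characters: longest allowed
  "ABCDE12345",    # 10 characters: too long, no word boundary after 9
  "D12345AB",      # starts with D: excluded by the first character class
  "AJ_DF125AA12"   # "_" is a word character, so no boundary before "DF..."
)

# TRUE where the string contains at least one match
matched <- grepl(pattern, samples, perl = TRUE)
print(data.frame(samples, matched))
```

Only "AB123" and "ABCDE1234" should come back as matches: the first character class accounts for one character and the `{4,8}` quantifier for the rest, which gives the 5 to 9 character range.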

See the regex demo.

Note you may "simplify" the pattern by means of a negative lookahead:

\b(?!D)[A-Z0-9]{5,9}\b

See this regex demo, where (?!D) is a negative lookahead that fails the match if the next character is D.
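As a rough check that the lookahead variant behaves like the character-class variant on the sample data, the sketch below runs both over the five texts. The helper `extract_all` and the variable names are hypothetical, and base R's `gregexpr` with `perl = TRUE` is used here because lookaheads need a PCRE-style engine in base R (stringr's ICU engine also supports them).

```r
texts <- c(
  "in which some columns are 1A265T , SDFG123 up for some rows.",
  "It's too large to 12345AB MB eyeball in order to identify D12345AB",
  "some data to the axis A6651F correct columns for these rows",
  "Any output that would allow me to identify that AJ_DF125AA12.",
  "how do I find some locations 564789."
)

class_pattern     <- "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"
lookahead_pattern <- "\\b(?!D)[A-Z0-9]{5,9}\\b"

# Hypothetical helper: all matches per string, joined with commas
extract_all <- function(pattern, x) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, paste, character(1), collapse = ",")
}

from_class     <- extract_all(class_pattern, texts)
from_lookahead <- extract_all(lookahead_pattern, texts)

# Strings without a match come back as "", so map those to NA
from_lookahead[from_lookahead == ""] <- NA
from_class[from_class == ""] <- NA

print(from_lookahead)
print(identical(from_class, from_lookahead))
```

On these five rows both patterns should give the same result, with NA for the fourth row.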

Upvotes: 3

Daniel O

Reputation: 4358

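In base R:

# Find alphanumeric runs that contain at least one digit
# (note: [A-z] also matches the punctuation between "Z" and "a", including "_")
AllNumbers <- regmatches(df$Text, gregexpr("[A-z0-9]+\\d+[A-z0-9]+", df$Text))
# Blank out candidates that start with D
AllNumbers <- sapply(AllNumbers, function(x) gsub("^D[A-z0-9]+","",x) )
AllLengths <- sapply(AllNumbers, nchar)

# Keep only candidates that are 5 to 9 characters long
df$Pattern <- sapply(seq_along(AllNumbers), function(x)  AllNumbers[[x]][AllLengths[[x]]>=5 & AllLengths[[x]]<=9])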

Output:

> df
                                                                Text Pattern
1                 in which some columns are 1A265T up for some rows.  1A265T
2 It's too large to 12345AB MB eyeball in order to identify D12345AB 12345AB
3        some data to the axis A6651F correct columns for these rows  A6651F
4      Any output that would allow me to identify that AJ_DF125AA12.        
5                               how do I find some locations 564789.  564789

Upvotes: 0
