Wilcar
Wilcar

Reputation: 2513

removing strings from a vector

Work on raw textual data from a scanned catalog.
I only want to keep 2 types of strings:
- begining with a number (artists works)
- containing 2 juxtaposed uppercases letters **with accents **(artists names)

I want easily to remove everything else (with true -false?)

my datas

ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string begining with a upper case letter)
7 in commodo enim in laoreet gravida.

expected results

with accents DFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB 
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.

The data is imported into R with:

readlines ("clipboard")

I am able to identify lines including artist names in capital letters with different regex

e.g.

[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']

I am able to identify lines including artworks

^[0-9]+[\s]

Any help would be greatly appreciated.

Upvotes: 4

Views: 132

Answers (3)

hwnd
hwnd

Reputation: 70722

You can use POSIX character classes if you want. However, their interpretation depends on the current locale and if it's not set properly, it could alter the behavior of the POSIX class.

I'd recommend turning on Perl regular expressions and use Unicode properties.

x <- readLines('clipboard')
r <- x[grepl("^\\pN+|\\p{Lu}[\\p{Lu}']", x, perl=TRUE)]

Another interesting way would be to match the accented letters, dissuading from POSIX.

r <- x[grepl("^\\d+|(?![×Þß÷þø])[A-ZÀ-ÿ][A-ZÀ-ÿ']", x, perl=TRUE)]

You can view the compiled demo of both regular expressions be used.

Upvotes: 1

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626748

Just a side-note: [:upper:] matches uppercase letters in the current locale (see source). Thus, this solution is good if you work with one locale:

ll <- readLines(textConnection("ÁÀDFDS (artist 1)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
...gdgdgdg (bad string)
7 in commodo enim in laoreet gravida."))
ll[grep("^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]", ll)]

See the IDEONE demo

The regex breakdown:

  • ^ - start of string
  • [[:digit:]]+ - 1 or more digits
  • [[:blank:]] - 1 space or tab
  • | - or
  • [[:upper:]]['[:upper:]] - an uppercase letter followed by ' or another uppercase letter.

And here is a way to achieve what you need with a Perl-like regex:

ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=T)]

The regex matches:

  • ^ - start of string
  • \\d+\\s - 1 or more digits and then a whitespace
  • | - or...
  • \\p{Lu}['\\p{Lu}] - an uppercase Unicode letter followed by either an apostrophe or another uppercase Unicode letter.

The output of the sample demo:

[1] "ÁÀDFDS (artist 1)"                                                     
[2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."            
[3] "AB (artist 2)"                                                         
[4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
[5] "BBDDED (artist 3)"                                                     
[6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."              
[7] "4 Mauris condimentum velit eu consequat feugiat."                      
[8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."            
[9] "ÉÈDFSF (artist 4)"                                                     
[10] "6 Sed cursus augue in tempus scelerisque."                             
[11] "7 in commodo enim in laoreet gravida."    

To clean up the beginning of strings, you can use

ll <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", ll, perl=T)

The regex ^[\\P{L}\\D]*?([\\p{L}\\d]) matches any non-letters and non-digits as few as possible before a letter or a digit (that are placed into a capturing group), and then restores the captured alphanumeric using the \1 backreference with gsub call. Use it before grepping.

See IDEONE demo

Upvotes: 4

jeremycg
jeremycg

Reputation: 24945

You can use grep:

z<-readlines ("clipboard")
z[grep("^[0-9]|[[:upper:]]{2,}", z)]
 [1] "AADFDS (artist 1)"                                                     
 [2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."            
 [3] "AB (artist 2)"                                                         
 [4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
 [5] "BBDDED (artist 3)"                                                     
 [6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."              
 [7] "4 Mauris condimentum velit eu consequat feugiat."                      
 [8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."            
 [9] "CCDDFSF (artist 4)"                                                    
[10] "6 Sed cursus augue in tempus scelerisque."                             
[11] "7 in commodo enim in laoreet gravida."  

Upvotes: 1

Related Questions