Reputation: 2513
I am working on raw text data from a scanned catalog.
I only want to keep 2 types of strings:
- beginning with a number (artists' works)
- containing 2 adjacent uppercase letters, **including accented ones** (artists' names)
I want an easy way to remove everything else (with a TRUE/FALSE test?).
My data:
ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string begining with a upper case letter)
7 in commodo enim in laoreet gravida.
Expected results:
ÁÀDFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.
The data is imported into R with:
readLines("clipboard")
I can identify lines containing artist names in capital letters with various regexes,
e.g.
[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']
I can identify lines containing artworks with:
^[0-9]+[\s]
Any help would be greatly appreciated.
Upvotes: 4
Views: 132
Reputation: 70722
You can use POSIX character classes if you want. However, their interpretation depends on the current locale, and if the locale is not set properly, the classes may not match what you expect.
I'd recommend turning on Perl regular expressions and using Unicode properties.
x <- readLines('clipboard')
r <- x[grepl("^\\pN+|\\p{Lu}[\\p{Lu}']", x, perl=TRUE)]
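To see the TRUE/FALSE mask the question asks about, here is a minimal sketch on a few of the sample lines. The vector contents and the names pattern and keep are illustrative assumptions, and the \u escapes only stand in for the accented letters so the snippet does not depend on file encoding.

```r
# Small stand-in for the clipboard data (assumed sample, not the full catalog).
# "\u00c1\u00c0" is "ÁÀ" and "\u00f9" is "ù".
x <- c("\u00c1\u00c0DFDS (artist 1)",
       "1 Lorem ipsum dolor sit amet.",
       "B'BDDED (artist 3)",
       "az*\u00f9*\u00f9*\u00f9 (bad string)",
       "A..gdgdgdg (bad string)")

# Same pattern as above: a leading number, or an uppercase letter
# followed by another uppercase letter or an apostrophe.
pattern <- "^\\pN+|\\p{Lu}[\\p{Lu}']"

# grepl() returns one logical value per line
keep <- grepl(pattern, x, perl = TRUE)
print(keep)

# Subsetting with the mask keeps only the wanted lines
r <- x[keep]
print(r)
```

The first three lines are kept and the two "bad string" lines are dropped.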
Another option, which avoids POSIX classes altogether, is to match the accented letters with an explicit range.
r <- x[grepl("^\\d+|(?![×Þß÷þø])[A-ZÀ-ÿ][A-ZÀ-ÿ']", x, perl=TRUE)]
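One thing worth checking before relying on the explicit range: À-ÿ spans U+00C0 to U+00FF, which also covers lowercase accented letters, so the two patterns can disagree on mixed-case words. The sketch below compares them on made-up input. The test strings and variable names are assumptions for illustration, and the \u escapes again stand in for the literal accented characters.

```r
# Hypothetical test lines: "Né en 1900" and "ÁÀDFDS (artist 1)"
x <- c("N\u00e9 en 1900", "\u00c1\u00c0DFDS (artist 1)")

# Unicode-property version (uppercase letters only)
prop_pat <- "^\\d+|\\p{Lu}[\\p{Lu}']"

# Explicit-range version, same as above but written with escapes:
# the lookahead excludes × Þ ß ÷ þ ø, and the range is À-ÿ
range_pat <- paste0("^\\d+|(?![\u00d7\u00de\u00df\u00f7\u00fe\u00f8])",
                    "[A-Z\u00c0-\u00ff][A-Z\u00c0-\u00ff']")

by_prop  <- grepl(prop_pat,  x, perl = TRUE)
by_range <- grepl(range_pat, x, perl = TRUE)

# Rows are the two patterns, columns are the two test lines
print(rbind(by_prop, by_range))
```

Here the range version also keeps the "Né" line, because é falls inside À-ÿ, while the \p{Lu} version keeps only the all-uppercase name. Whether that matters depends on what the scanned catalog actually contains.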
You can view a demo of both regular expressions in use.
Upvotes: 1
Reputation: 626748
Just a side-note: [:upper:] matches uppercase letters in the current locale (see source). Thus, this solution is good if you work with one locale:
ll <- readLines(textConnection("ÁÀDFDS (artist 1)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
...gdgdgdg (bad string)
7 in commodo enim in laoreet gravida."))
ll[grep("^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]", ll)]
See the IDEONE demo
The regex breakdown:
^ - start of string
[[:digit:]]+ - 1 or more digits
[[:blank:]] - 1 space or tab
| - or
[[:upper:]]['[:upper:]] - an uppercase letter followed by ' or another uppercase letter.
And here is a way to achieve what you need with a Perl-like regex:
ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=TRUE)]
The regex matches:
^ - start of string
\\d+\\s - 1 or more digits and then a whitespace
| - or
\\p{Lu}['\\p{Lu}] - an uppercase Unicode letter followed by either an apostrophe or another uppercase Unicode letter.
The output of the sample demo:
[1] "ÁÀDFDS (artist 1)"
[2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."
[3] "AB (artist 2)"
[4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
[5] "BBDDED (artist 3)"
[6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."
[7] "4 Mauris condimentum velit eu consequat feugiat."
[8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."
[9] "ÉÈDFSF (artist 4)"
[10] "6 Sed cursus augue in tempus scelerisque."
[11] "7 in commodo enim in laoreet gravida."
To clean up the beginning of strings, you can use
ll <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", ll, perl=TRUE)
The regex ^[\\P{L}\\D]*?([\\p{L}\\d]) lazily matches as few leading characters as possible up to the first letter or digit (which is captured in a group), and gsub then puts the captured alphanumeric back using the \1 backreference, so anything before it is dropped. Use it before grepping.
See IDEONE demo
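As a rough sketch of how the clean-up and the filter fit together, the snippet below runs both steps on a few made-up lines with leading junk. The input vector and the names cleaned and kept are assumptions for illustration; "\u00ab" is « and "\u00c9\u00c8" is ÉÈ.

```r
# Hypothetical lines with stray characters in front of the real content
ll <- c("...gdgdgdg (bad string)",
        "* 3 Nunc et eros eget turpis.",
        "\u00ab \u00c9\u00c8DFSF (artist 4)",
        "AB (artist 2)")

# Step 1: drop everything before the first letter or digit
cleaned <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", ll, perl = TRUE)
print(cleaned)

# Step 2: keep numbered lines and lines with two adjacent uppercase letters
kept <- cleaned[grepl("^\\d+\\s|\\p{Lu}['\\p{Lu}]", cleaned, perl = TRUE)]
print(kept)
```

Cleaning first means a work line such as "* 3 Nunc ..." starts with its number by the time the filter runs, so it is kept. The first line is cleaned too but still fails the filter.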
Upvotes: 4
Reputation: 24945
You can use grep:
z <- readLines("clipboard")
z[grep("^[0-9]|[[:upper:]]{2,}", z)]
[1] "AADFDS (artist 1)"
[2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."
[3] "AB (artist 2)"
[4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
[5] "BBDDED (artist 3)"
[6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."
[7] "4 Mauris condimentum velit eu consequat feugiat."
[8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."
[9] "CCDDFSF (artist 4)"
[10] "6 Sed cursus augue in tempus scelerisque."
[11] "7 in commodo enim in laoreet gravida."
Upvotes: 1