Reputation: 4055
I would like to remove any text that appears after a certain match, either THE END or FINIS. I know this is very similar to this other topic, but I am just not skilled enough in regex to make this work for me.
My text is Shakespeare books taken from Project Gutenberg. They typically look something like
txt <- "... thou hast tam'd a curst shrow. LUCENTIO. 'Tis a wonder,
by your leave, she will be tam'd so. Exeunt THE END <<THIS ELECTRONIC VERSION OF THE
COMPLETE WORKS OF WILLIAM ..."
or
txt <- "... thou hast tam'd a curst shrow. LUCENTIO. 'Tis a wonder,
by your leave, she will be tam'd so. Exeunt FINIS <<THIS ELECTRONIC VERSION OF THE
COMPLETE WORKS OF WILLIAM ..."
My ideal would look something like gsub("^[THE END]*|^[FINIS]*", "", txt)
returning "... thou hast tam'd a curst shrow. LUCENTIO. 'Tis a wonder, by your leave, she will be tam'd so. Exeunt"
Upvotes: 1
Views: 199
Reputation: 30985
You are pretty close. You can use:
gsub("(THE END|FINIS).*", "", txt)
By the way, as thelatemail pointed out in his comment, sub would be enough here, since only one replacement is needed.
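For context on why the pattern in the question does not work: square brackets define a character class, so [THE END] matches any single one of those letters or a space rather than the phrase, and ^ anchors the match to the start of the string. Below is a minimal sketch of the sub variant on shortened sample strings. The perl = TRUE argument with the (?s) flag and the leading \\s* are my own additions, not part of the answer above: they make "." span line breaks explicitly and drop the whitespace just before the marker.

```r
# Shortened sample strings modelled on the question; each contains a line break.
txt <- c(
  "she will be tam'd so. Exeunt THE END <<THIS ELECTRONIC VERSION OF THE\nCOMPLETE WORKS",
  "she will be tam'd so. Exeunt FINIS <<THIS ELECTRONIC VERSION OF THE\nCOMPLETE WORKS"
)

# (THE END|FINIS) matches either marker as a whole phrase.
# .* then consumes everything after it; (?s) lets "." cross newlines under PCRE.
# \\s* (an addition for illustration) also removes whitespace before the marker.
cleaned <- sub("(?s)\\s*(THE END|FINIS).*", "", txt, perl = TRUE)

print(cleaned)
```

Because sub replaces only the first match and .* already runs to the end of the string, a single replacement per element is all that is needed.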
Upvotes: 3