Reputation: 123
I am battling with regex and I can't figure it out.
I have a big database extracted from last.fm (www.lastfm.com). It is a .txt file of over 1.7 GB in which the columns on each line are delimited by "," (comma), and some characters are messing up the import into R. So far I have worked out where things go wrong: the main problem comes from " (quotation marks) nested inside other quotation marks.
To elucidate, here is an example of the .txt file when readLines is applied.
[1] "user,\"Method Man & Redman\",\"Da Rockwilder\",0,2012,2,10,8,0,41"
[2] "user,\"Method Man & Redman\",\"Y.O.U.\",0,2012,2,10,7,56,25"
[3] "user,\"Method Man & Redman\",\"Blackout\",0,2012,2,10,7,51,53"
[4] "user,\"Chuckie\",\"Who Is Ready To Jump (Club Mix)\",0,2012,2,10,7,40,12"
[5] "user,\"Opgezwolle\",\"Volle Kracht\",0,2012,2,10,7,36,31"
[6] "user,\"Opgezwolle\",\"Ut Is Wat Het Is\",0,2012,2,10,7,33,25"
Basically this becomes a data frame with 10 columns: username, "Artist", "Track", loved (0/1), year, month, day, hour, minute, second
The above example can easily be read without any problems but I get problems when something like this happens:
[1] "user,\"Fall Out Boy\",\"\"The Take Over, The Breaks Over\"\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12\" Remix\",0,2011,6,12,19,32,33"
In the first case, because of the doubled quotation marks, the comma in the track name splits it into two columns, so I get 11 columns instead of 10. In the second case, the 12" leaves the string "open", and it does not close until a similar case is found. When this happens, I lose several lines of the data frame.
What do I want as a solution? I want to remove all the " (quotation marks) except the ones that surround the name of the Artist and the name of the Track.
Output: The output would have in total four (4) " (quotation marks) in each line. "Artist" and "Track Name". So the output for those 2 lines that give me problem would be:
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"
I tried to use Regex with gsub and gstring but I can't get it to extract only the " marks that are in excess.
If this is too complicated, something that would extract all the " except the first 3 (quotation marks around Artist name and first quotation mark around Track name) and the last one (quotation mark at the end of Track name), might work for most of the cases (and I would do the rest manually). I am assuming here that no Artist name contains quotation marks.
Any help would be appreciated and if you need any further explanation or data please let me know.
Upvotes: 0
Views: 2899
Reputation: 174706
Use negative lookarounds to remove every \" that is neither preceded nor followed by a comma:
(?<!,)\\"(?!,)
> x <- c('user,\"Fall Out Boy\",\"\"The Take Over, The Breaks Over\"\",0,2010,4,17,7,11,37', 'user,\"Gare du Nord\",\"I Want Love 12\" Remix\",0,2011,6,12,19,32,33')
> gsub("(?<!,)\\\"(?!,)", "", x, perl=TRUE)
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"
Notice the extra backslash in the pattern argument: it is needed because the backslash is an escape character both in R strings and in the regex engine.
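As a rough end-to-end check, the cleaned lines can be parsed to confirm that each one now yields 10 columns. This is only a sketch: the names `cleaned`, `col_names` and `df` are hypothetical, the column labels are my own, and it assumes the two problem lines from the question are representative of the file. The strings are single-quoted so the double quotes need no escaping.

```r
# Two problem lines from the question, as readLines() would return them.
x <- c(
  'user,"Fall Out Boy",""The Take Over, The Breaks Over"",0,2010,4,17,7,11,37',
  'user,"Gare du Nord","I Want Love 12" Remix",0,2011,6,12,19,32,33'
)

# Drop every double quote that has no comma directly before or after it.
cleaned <- gsub('(?<!,)"(?!,)', "", x, perl = TRUE)

# Hypothetical column labels, following the order described in the question.
col_names <- c("user", "artist", "track", "loved", "year",
               "month", "day", "hour", "minute", "second")

# Parse the cleaned lines; the remaining quotes protect commas inside fields.
df <- read.csv(text = cleaned, header = FALSE, col.names = col_names,
               stringsAsFactors = FALSE)

print(dim(df))
print(df$track)
```

One caveat: a stray quote that sits directly next to a comma inside a track name would be kept by this pattern, so checking the column count after parsing is still worthwhile.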
Upvotes: 4
Reputation: 263352
Character classes containing alphanumerics and the double quote, combined with backreferences, can do it (test holds the two problem lines):
gsub("([ 0-9a-zA-Z\"])(\\\")([ 0-9a-zA-Z\"])", "\\1\\3", test)
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"
Could also consider:
gsub("([ [:alpha:][:digit:]\"])(\\\")([ [:alpha:][:digit:]\"])",
     "\\1\\3", test)
Basically, this removes any double quote that is flanked on both sides by a character from a class that does not contain a comma. It would break down if there were spaces between your quoting marks and the actual separating commas. The ?regex page describes your options for using character classes. The parentheses delimit the capture groups used for backreferences: the first backreference is '\\1', which refers to the characters matched by the character class inside the first pair of parentheses, ([ [:alpha:][:digit:]\"]). Omitting the middle backreference from the replacement argument eliminates the matched double quotes.
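To make the backreference mechanics concrete, here is a small sketch on toy input. The variable names `demo`, `stray` and `fixed` are hypothetical, and the second call applies the same alphanumeric class as above to a single track name rather than a whole line.

```r
# Three capture groups; the replacement keeps groups 1 and 3 and drops group 2.
demo <- gsub("(a)(b)(c)", "\\1\\3", "xabcx")
print(demo)

# Same idea on a stray quote: group 2 is the quote, groups 1 and 3 its neighbours.
stray <- 'I Want Love 12" Remix'
fixed <- gsub('([ 0-9a-zA-Z"])(")([ 0-9a-zA-Z"])', "\\1\\3", stray)
print(fixed)
```

Because the neighbours must belong to the class, a quote next to any other character (punctuation, for instance) is left alone, so the class may need widening for real data.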
Upvotes: 2