Extract characters in the middle of a string (maybe with regex?) in R

Question

I am battling with regex and I can't figure it out.

I have a big database extracted from last.fm (www.lastfm.com). It is a .txt file of over 1.7 GB in which the columns of each line are delimited by "," (comma), and some characters break the import into R. So far I have worked out where things go wrong: the main problem is " (quotation marks) inside other quotation marks.

To elucidate, here is an example of the .txt file when readLines is applied.

[1] "user,\"Method Man & Redman\",\"Da Rockwilder\",0,2012,2,10,8,0,41"       
[2] "user,\"Method Man & Redman\",\"Y.O.U.\",0,2012,2,10,7,56,25"             
[3] "user,\"Method Man & Redman\",\"Blackout\",0,2012,2,10,7,51,53"           
[4] "user,\"Chuckie\",\"Who Is Ready To Jump (Club Mix)\",0,2012,2,10,7,40,12"
[5] "user,\"Opgezwolle\",\"Volle Kracht\",0,2012,2,10,7,36,31"                
[6] "user,\"Opgezwolle\",\"Ut Is Wat Het Is\",0,2012,2,10,7,33,25"

Basically this becomes a data frame with 10 columns: username, "Artist", "Track", loved (0/1), year, month, day, hour, minute, second

The above example can easily be read without any problems but I get problems when something like this happens:

[1] "user,\"Fall Out Boy\",\"\"The Take Over, The Breaks Over\"\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12\" Remix\",0,2011,6,12,19,32,33"

In the first case, because of the doubled quotation marks, the comma in the track name splits the field into two columns, so I get 11 columns instead of 10. In the second case, the 12" leaves the string "open", and it does not close until the parser finds a similar case. When this happens, I lose several lines of the data frame.

What do I want as a solution? I want to remove all the " (quotation marks) except the ones that surround the Artist name and the Track name.

Output: The output would have a total of four (4) " (quotation marks) in each line: "Artist" and "Track Name". So the output for the 2 lines that give me problems would be:

[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"

I tried to use Regex with gsub and gstring but I can't get it to extract only the " marks that are in excess.

If this is too complicated, something that would extract all the " except the first 3 (quotation marks around Artist name and first quotation mark around Track name) and the last one (quotation mark at the end of Track name), might work for most of the cases (and I would do the rest manually). I am assuming here that no Artist name contains quotation marks.

Any help would be appreciated and if you need any further explanation or data please let me know.

Avinash Raj · Accepted Answer

Use negative lookarounds to remove all the " which are neither preceded nor followed by commas.

(?<!,)"(?!,)

DEMO

> x <- c('user,"Fall Out Boy",""The Take Over, The Breaks Over"",0,2010,4,17,7,11,37', 'user,"Gare du Nord","I Want Love 12" Remix",0,2011,6,12,19,32,33')
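> gsub("(?<!,)\"(?!,)", "", x, perl=TRUE)
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"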

Notice that there needs to be an extra backslash in the pattern argument, because backslashes are escape operators in both R and the regex-engine.
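
To show how the cleaned lines could then be turned into the 10-column data frame from the question, here is a minimal sketch. It rests on assumptions that are not in the original answer: the temporary file stands in for the real export, the column names are made up from the question's description, and the pattern is written in single quotes so the " needs no backslash. The final check flags lines that still do not contain exactly four quotation marks (for example a track name with a quote directly next to a comma), which this approach cannot repair and which would need manual attention.

```r
# Two problem lines from the question, written to a temporary file that
# stands in for the real last.fm export (hypothetical path).
raw <- c(
  'user,"Fall Out Boy",""The Take Over, The Breaks Over"",0,2010,4,17,7,11,37',
  'user,"Gare du Nord","I Want Love 12" Remix",0,2011,6,12,19,32,33'
)
path <- tempfile(fileext = ".txt")
writeLines(raw, path)

# Read the raw lines and drop every quote that is neither preceded
# nor followed by a comma (lookarounds require perl = TRUE).
lines <- readLines(path)
clean <- gsub('(?<!,)"(?!,)', "", lines, perl = TRUE)

# Column names are assumptions based on the question's description.
cols <- c("user", "artist", "track", "loved",
          "year", "month", "day", "hour", "minute", "second")

# Parse the cleaned lines as CSV; the remaining quotes protect the
# comma inside the first track name.
df <- read.csv(text = clean, header = FALSE, col.names = cols,
               stringsAsFactors = FALSE)

# Sanity check: indices of lines that do not have exactly four quotes.
n_quotes <- lengths(regmatches(clean, gregexpr('"', clean, fixed = TRUE)))
suspect <- which(n_quotes != 4)
```

For a file of 1.7 GB, reading everything with readLines at once may not fit in memory; processing the file in chunks through a connection would be one way around that, though it is not shown here.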
