syre
syre

Reputation: 982

Replace rogue double-quotes in vector in R

I have a broken CSV file with long text fields containing both double quotes and commas. I've been able to clean it up to some extent and now have tab-separated fields as a vector of whole lines (each value is a line).

head(temp, 2)
[1] "\"org_order\"\t\"organizations.api_path\"\t\"permalink\"\t\"api_path\"\t\"web_path\"\t\"name\"\t\"also_known_as\"\t\"short_description\"\t\"description\"\t\"profile_image_url\"\t\"primary_role\"\t\"role_company\"\t\"role_investor\"\t\"role_group\"\t\"role_school\"\t\"founded_on\"\t\"founded_on_trust_code\"\t\"is_closed\"\t\"closed_on\"\t\"closed_on_trust_code\"\t\"num_employees_min\"\t\"num_employees_max\"\t\"stock_exchange\"\t\"stock_symbol\"\t\"total_funding_usd\"\t\"number_of_investments\"\t\"homepage_url\"\t\"created_at\"\t\"updated_at\""                                                                                                                                                                                                                                                                                                                                                                                                                                                               
[2] "1\t\"organizations/care1st-health-plan-arizona\"\t\"care1st-health-plan-arizona\"\t\"organizations/care1st-health-plan-arizona\"\t\"organization/care1st-health-plan-arizona\"\t\"Care1st Health Plan Arizona\"\t\"\"\t\"Care1st Health Plan Arizona provides high quality health care services.\"\t\"Care1st is a health plan providing support and services to meet the health care needs of eligible members enrolled in KidsCare, AHCCCS, and DDD.\"\t\"http://public.crunchbase.com/t_api_images/v1475743278/m2teurxnhkwacygzdn2m.png\"\t\"company\"\t\"\"\t\"\"\t\"\"\t\"\"\t\"2003-01-01\"\t\"4\"\t\"FALSE\"\t\"\"\t\"0\"\t\"251\"\t\"500\"\t\"\"\t\"\"\t\"0\"\t\"0\"\t\"\"\t\"1475743348\"\t\"1475899305\""  

I then write temp as a file and read it back (which I've found much faster than textConnection). However, read.table("temp", sep = "\t", quote = "\"", encoding = "UTF-8", colClasses = "character") chokes on certain lines and gives me messages such as:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 66951 did not have 29 elements

I think this is due to rogue double quotes, as in the following line (rogue quote can be found immediately after "TripAdvisor de la sant?").

temp[66951]
[1] "67654\t\"organizations/docotop\"\t\"docotop\"\t\"organizations/docotop\"\t\"organization/docotop\"\t\"DOCOTOP\"\t\"\"\t\"Le 'TripAdvisor de la sant?\" est arriv?. Docotop permet de trouver le meilleur professionnel de sant?gr?e ?la communaut?de patients\"\t\"\"\t\"http://public.crunchbase.com/t_api_images/v1455271104/ry9lhcfezcmemoifp92h.png\"\t\"company\"\t\"TRUE\"\t\"\"\t\"\"\t\"\"\t\"2015-11-17\"\t\"7\"\t\"\"\t\"\"\t\"0\"\t\"1\"\t\"10\"\t\"EURONEXT\"\t\"\"\t\"0\"\t\"0\"\t\"http://docotop.com/\"\t\"1455271299\"\t\"1473443321\""

I propose to replace rogue double quotes by single quotes, but I have to leave expected quotes in place. Quotes are expected right before or after a separator (tab) and at the beginning (first line only) and the end of a line. I've written the following attempt at regex with lookarounds for tab and line start and end, but it doesn't work:

temp <- gsub("(?<![^\t])\"(?![\t$])", "'", temp, perl = T)

EDIT: I tried @akrun's solution, but get:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 181 did not have 29 elements

The line in question (which didn't cause an error before):

temp[181]
[1] "198\torganizations/playfusion\tplayfusion\torganizations/playfusion\torganization/playfusion\tPlayFusion\t\tPlayFusion is a developer of computer games.\tPlayFusion is pioneering the next generation of connected interactive entertainment. PlayFusion's proprietary technology platform fuses video games, robotics, toys, and trans-media entertainment. The company is currently working on its own original IP to trail-blaze its vision ahead of opening its platform to others.    PlayFusion is an independent, employee-owned company with offices in Cambridge and Derby in the UK, Douglas in the Isle of Man, and New York and San Francisco in the USA.\thttp://public.crunchbase.com/t_api_images/v1475688372/xnhrd4t254pxj6yxegzt.png\tcompany\t\t\t\t\t2015-01-01\t4\tFALSE\t\t0\t11\t50\t\t\t0\t0\thttp://playfusion.com/#intro\t1475688521\t1475899292"

Upvotes: 2

Views: 246

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626691

Your (?<![^\t])"(?![\t$]) regex matches a " that is not preceded with a char other than a tab (so, there must be a tab or start of string before the "), and that is not followed with a tab or $ symbol.

So, the ^ and $ inside character classes lose their anchor meaning.

Replace the character classes with alternation groups:

gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl=TRUE)

The (?<!\t|^) lookbehind requires that the " is not at the start of the string and is not preceded with a tab.

The (?!\t|$) lookahead requires that the " is not at the end of the string ($) and is not followed with a tab char.

Upvotes: 1

Related Questions