Reputation: 141
I have data in excel which I am trying to analyze in R. How do I split the column in my dataframe which contains information in key-value pair into multiple columns.
for example:
Id col1
1 Name: Joe Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua UID: 23456 Location:NY CTN: ************************************
2 CustName: Kim Putok Ctn: ************************************ Location: CA
3 Customer Name: GLORIA ; CTN: ************************************ Affected CTN: ************************************ Location: Egypt
4 Customer Name: ; ;Leahi ; Reason for the Call: ; ; ; Location: CA discription: xyz
5 auto pop / suspended line -cx speaking spanish -transfered call
Please note here data in col1
is not in specific format and it contains lot of delimiters, data in the next line in same cell, data not in key-value pair (such type of data needs to be removed).
How do I tackle or clean such kind of data? How do I split such data in key-value pair across multiple columns? Any suggestion will be great help.
--EDIT-- My original data
> dput(head(asummy, 10))
structure(list(`arrange_df[, 12]` = c("Customer Name: JOE\r\nReason for the Call: BP Set Up\r\nResolution: \r\nvisual audit on the account\r\nset expectations in BP\r\noffered call back\r\nCenter Location: Clark\r\nCTN (Number Calling About): ************************************\r\nAVAYA (Number Calling From): ************************************\r\nAgent UID: cm***************************u\r\n",
"Name: Kim Putok\r\nCtn: ************************************\r\nReason for calling: lost phone/ ************************************/ follow up on insurance claim/ routed to asurion\r\nResolution:\r\nLocation: Clark\r\nattuid: gb***************************r\r\n",
"Customer Name: Heather \r\nReason for the Call: got suspended / card issue / supposedly paid / complains she just updated online her card then got susp\r\nResolution: \r\nCenter Location: Clark\r\nCTN: ************************************\r\nAlt No: ************************************\r\nCredit Reason: ********* courtesy \r\nAgent UID: rc***************************w/I-EQR******************G\r\n",
"Customer Name: GLORIA ; CTN: ************************************ Affected CTN: ************************************ Alternate Number: none Reason for the Call: change rate plan Resolution: set exp on changing rate plan Agent UID: gt***************************g",
"Customer Name: BRANDY \r\nReason for the Call: payment\r\nRecommendations/Troubleshooting Steps:\r\nResolution: explained card errors\r\nasked for alt # but no good\r\noffered to use different card or refill card\r\nsent qp link\r\nshe will go to bank to check as per cx\r\nCenter Location:\r\nCTN: ************************************\r\nCredit Reason (If credit was applied to account):\r\nAgent UID: jq***************************n\r\n",
"Customer Name: ; ;Leahi ; Reason for the Call: ; ; ;billing issue / ******************.****************** dollars only orig. bill / due date shld be *********th / why ****************** now? / ctn change questions ; Resolution: ; ;explained bill / ctn change info provided ; Center Location: ;Clark CTN: ; ;************************************ Alt No: Credit Reason: Agent UID: rc***************************w/I-K*********GMDR",
"Customer Name: MICHAEL\r\nReason for the Call: ACCOUNT BAL INQ\r\n\r\nRecommendations/Troubleshooting Steps:\r\n*Account verified and provided information\r\n\r\nResolution: \r\n*calling about the account status\r\n*suspended due to BBP\r\n*adv about BP policy\r\n*Provided payment options\r\n\r\nCenter Location:Clark\r\nCTN************************************:\r\nCredit Reason (If credit was applied to account):\r\nAgent UID:rc*********\r\n",
"Customer Name: Kimberly\r\nCTN: ************************************\r\nAffected CTN:************************************\r\nAlternate Number: none\r\nReason for the Call: add mhs otc $******************\r\nResolution: added mhs otc $******************\r\nAgent UID: gt***************************g\r\n",
"auto pop / suspended line\r\n-cx speaking spanish\r\n-transfered call\r\n",
"Customer Name: TERRELL ; Reason for the Call: payment Recommendations/Troubleshooting Steps: Resolution: payment success test and validated CTN: ************************************ Credit Reason (If credit was applied to account): Agent UID: jq***************************n"
)), row.names = c(NA, 10L), class = "data.frame")
Upvotes: 1
Views: 200
Reputation: 59345
My advice: rather than treat this dataset as a given, find out more about how it was generated. For instance, the sample you provided seems to have two types of row, one where the name:value pairs are separated by \r\n (these are easy to parse), and one where they are not (these are a lot harder to parse). I suspect there is another column in your dataset that distinguishes these, but you did not provide that. For instance, I notice you provided column 12 of arrange_df
. What's in the other columns?
For the first type, in rows 1,2,3,5,7, and 8,
library(data.table)
library(stringi)
#
df <- structure(list(`x` = c("Customer Name: JOE\r\nReason for the Call: BP Set Up\r\nResolution: \r\nvisual audit on the account\r\nset expectations in BP\r\noffered call back\r\nCenter Location: Clark\r\nCTN (Number Calling About): ************************************\r\nAVAYA (Number Calling From): ************************************\r\nAgent UID: cm***************************u\r\n",
"Name: Kim Putok\r\nCtn: ************************************\r\nReason for calling: lost phone/ ************************************/ follow up on insurance claim/ routed to asurion\r\nResolution:\r\nLocation: Clark\r\nattuid: gb***************************r\r\n",
"Customer Name: Heather \r\nReason for the Call: got suspended / card issue / supposedly paid / complains she just updated online her card then got susp\r\nResolution: \r\nCenter Location: Clark\r\nCTN: ************************************\r\nAlt No: ************************************\r\nCredit Reason: ********* courtesy \r\nAgent UID: rc***************************w/I-EQR******************G\r\n",
"Customer Name: GLORIA ; CTN: ************************************ Affected CTN: ************************************ Alternate Number: none Reason for the Call: change rate plan Resolution: set exp on changing rate plan Agent UID: gt***************************g",
"Customer Name: BRANDY \r\nReason for the Call: payment\r\nRecommendations/Troubleshooting Steps:\r\nResolution: explained card errors\r\nasked for alt # but no good\r\noffered to use different card or refill card\r\nsent qp link\r\nshe will go to bank to check as per cx\r\nCenter Location:\r\nCTN: ************************************\r\nCredit Reason (If credit was applied to account):\r\nAgent UID: jq***************************n\r\n",
"Customer Name: ; ;Leahi ; Reason for the Call: ; ; ;billing issue / ******************.****************** dollars only orig. bill / due date shld be *********th / why ****************** now? / ctn change questions ; Resolution: ; ;explained bill / ctn change info provided ; Center Location: ;Clark CTN: ; ;************************************ Alt No: Credit Reason: Agent UID: rc***************************w/I-K*********GMDR",
"Customer Name: MICHAEL\r\nReason for the Call: ACCOUNT BAL INQ\r\n\r\nRecommendations/Troubleshooting Steps:\r\n*Account verified and provided information\r\n\r\nResolution: \r\n*calling about the account status\r\n*suspended due to BBP\r\n*adv about BP policy\r\n*Provided payment options\r\n\r\nCenter Location:Clark\r\nCTN************************************:\r\nCredit Reason (If credit was applied to account):\r\nAgent UID:rc*********\r\n",
"Customer Name: Kimberly\r\nCTN: ************************************\r\nAffected CTN:************************************\r\nAlternate Number: none\r\nReason for the Call: add mhs otc $******************\r\nResolution: added mhs otc $******************\r\nAgent UID: gt***************************g\r\n",
"auto pop / suspended line\r\n-cx speaking spanish\r\n-transfered call\r\n",
"Customer Name: TERRELL ; Reason for the Call: payment Recommendations/Troubleshooting Steps: Resolution: payment success test and validated CTN: ************************************ Credit Reason (If credit was applied to account): Agent UID: jq***************************n"
)), row.names = c(NA, 10L), class = "data.frame")
setDT(df)[, ID:=seq(.N)]
f <- \(str) {stri_match_all_regex(str, '(.+)\\:(.+)')[[1]]}
parsed <- df[c(1,3,5,7,8), as.data.table(f(x)[, 2:3]), by=.(ID)]
Upvotes: 1
Reputation: 1695
The first step is to find a regex that matches the "separators" of your text. For example ': ' seems to be one of them. Here is a possible regex, but it does not work fully
library(tidyverse)
df = tribble(~ Id, ~ col1,
"1", "Name: Joe Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua UID: 23456 Location:NY CTN: ************************************",
"2", "CustName: Kim Putok Ctn: ************************************ Location: CA",
"3", "Customer Name: GLORIA ; CTN: ************************************ Affected CTN: ************************************ Location: Egypt",
"4", "Customer Name: ; ;Leahi ; Reason for the Call: ; ; ; Location: CA discription: xyz",
"5", "auto pop / suspended line -cx speaking spanish -transfered call")
regex <- ': |;' # TO BE MODIFIED TO SATISFY THE PROBLEM
lapply(1:nrow(df), \(i) strsplit(df$col1[i], regex))
To find the right regex, use this tool : https://www.autoregex.xyz/
Once the good regex has been found, one of these functions may be useful : strsplit
, extract
, separate
(tidyr
package)
It's not over yet, but it's the beginning of a solution
Upvotes: 1