Reputation: 17631
I have an R data.table with a column of strangely formatted data which I need to parse. Each row has a column identity in the following format:
identity
cat:211:93|dog:616:58|bird:1270:46|fish:2068:31|horse:614:1|cow:3719:1012
The format is name:total_number:count_number, with entries separated by |
An example of the data.table is as follows:
library(data.table)
foo = data.table(name = c('Luna', 'Bob', 'Melissa'),
                 number = c(23, 37, 33),
                 identity = c('cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12',
                              'bird:1270:35|fish:2068:11|horse:614:44|cow:319:21',
                              'fish:72:41'))
print(foo)
name number identity
'Luna' 23 cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12
'Bob' 37 bird:1270:35|fish:2068:11|horse:614:44|cow:319:21
'Melissa' 33 fish:72:41
My problem is how to parse these lines such that each name becomes a new column, and the numbers are calculated as a fraction, count_number/total_number.
The desired output is as follows:
name      number cat       dog       bird       fish        horse       cow
'Luna'    23     0.2990354 0.1124031 0.02026432 0.02444795  0.001945525 0.03761755
'Bob'     37     NA        NA        0.02755906 0.005319149 0.07166124  0.06583072
'Melissa' 33     NA        NA        NA         0.5694444   NA          NA
How could I parse these rows, given I know the 'names' of the columns beforehand?
I think there should be some way to use data.table::tstrsplit(), e.g.
tstrsplit(foo$identity, "|", fixed=TRUE)
(I'm happy to use a data.frame or dplyr as well.)
Upvotes: 1
Views: 205
Reputation: 25225
You can probably split by |, melt, then split by : again before calculating ratio and reshaping to your desired format.
library(data.table)
# step 4: reshape into the desired wide format
dcast(
    # step 1: split by | and get the elements into a column
    foo[, melt(tstrsplit(identity, "\\|")), by=.(name, number)][,
        # step 2: split by : to get count_number and total_number
        tstrsplit(value, ":"), by=.(name, number)][,
            # step 3: calculate ratio
            ratio := as.numeric(V3) / as.numeric(V2)],
    name + number ~ V1, value.var="ratio")
output:
      name number       bird       cat        cow       dog        fish       horse
1:     Bob     37 0.02755906        NA 0.06583072        NA 0.005319149 0.071661238
2:    Luna     23 0.02026432 0.2990354 0.03761755 0.1124031 0.024447950 0.001945525
3: Melissa     33         NA        NA         NA        NA 0.569444444          NA
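If the chained call is hard to read, the same four steps can be written one at a time. The following is only a sketch: it uses unlist() instead of melt(), and the intermediate names (long, entry, animal, total, count, wide, known_names) are my own, not anything from your data. Since you know the column names beforehand, it also reorders the result with setcolorder(), which assumes every known name occurs at least once in the data.

```r
library(data.table)

foo <- data.table(
  name = c("Luna", "Bob", "Melissa"),
  number = c(23, 37, 33),
  identity = c(
    "cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12",
    "bird:1270:35|fish:2068:11|horse:614:44|cow:319:21",
    "fish:72:41"
  )
)

# Column names known beforehand (assumed to be exactly these six)
known_names <- c("cat", "dog", "bird", "fish", "horse", "cow")

# step 1: one row per "name:total_number:count_number" entry
long <- foo[, .(entry = unlist(strsplit(identity, "|", fixed = TRUE))),
            by = .(name, number)]

# step 2: split each entry by : into three named columns
long[, c("animal", "total", "count") := tstrsplit(entry, ":", fixed = TRUE)]

# step 3: calculate ratio = count_number / total_number
long[, ratio := as.numeric(count) / as.numeric(total)]

# step 4: reshape into wide format, then put the columns in the known order
wide <- dcast(long, name + number ~ animal, value.var = "ratio")
setcolorder(wide, c("name", "number", known_names))

print(wide)
```

Rows still come out sorted by name and number, as in the output above; only the column order changes.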
Addressing OP's comment in a more general way: You have to design a solution to your problem first before coding. Picture in your mind what kind of output you are expecting in each step of your solution. Then let the console be your TA and documentation be your lecturer.
For example, in the first step of your solution you split by |, so you run the following in the console:
foo[, tstrsplit(identity, "|", fixed=TRUE)]
What are you expecting? What do you see? Missing name and number? Add them in by=.
foo[, tstrsplit(identity, "|", fixed=TRUE), by=.(name, number)]
Then, what do you get? An error? Can you fix it? Maybe read the documentation again? If you are still unable to solve it, maybe search for it online? Remember what you are trying to achieve with this step: how do you get the pieces into a single column? Maybe you find something like the following:
foo[, unlist(tstrsplit(identity, "|", fixed=TRUE)), by=.(name, number)]
Then, move on to the next step.
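As a rough illustration of what "moving on" can look like, here is a sketch that stores each intermediate result, checks it against what you expected, and only then feeds it into the next step. The names step1 and step2 are placeholders of mine, and the checks assume the three-row foo from the question.

```r
library(data.table)

foo <- data.table(
  name = c("Luna", "Bob", "Melissa"),
  number = c(23, 37, 33),
  identity = c(
    "cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12",
    "bird:1270:35|fish:2068:11|horse:614:44|cow:319:21",
    "fish:72:41"
  )
)

# step 1: split by | and flatten into a single column (auto-named V1)
step1 <- foo[, unlist(tstrsplit(identity, "|", fixed = TRUE)), by = .(name, number)]
print(step1)

# Expectation: one row per entry, i.e. 6 (Luna) + 4 (Bob) + 1 (Melissa)
stopifnot(nrow(step1) == 11L)

# step 2: split the single column by : and look at the result again
step2 <- step1[, tstrsplit(V1, ":", fixed = TRUE), by = .(name, number)]
print(step2)

# Expectation: the two grouping columns plus three pieces per entry
stopifnot(identical(names(step2), c("name", "number", "V1", "V2", "V3")))
```

If a check fails, that is the step to go back to before adding anything else on top.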
Upvotes: 3