Reputation: 7147
I have the following data which looks like the following;
COLUPC UPC
19638 715643501208 00-01-82000-72608
11783 15230000022 00-01-01820-00198
17140 18769210012 07-01-93271-20600
19624 711925600019 00-01-86243-11167
18706 28382012393 27-01-15503-01673
8716 13410025574 00-01-78250-00236
The columns COLUPC
and UPC
come from two different data frames and a random sample of each was taken. The COLUPC
and UPC
are the same unique identifiers. It is just that COLUPC
has lost its format.
Taking the first row value for COLUPC
- 715643501208
. It should look like 07-01-56435-01208
following the same structure as in column UPC
.
It gets a little more tricky with observation 15230000022
It should look like 01-05-23000-00022
where there was an extra 0
added in the 4th component 00022
along with component 1 and 2.
The documentation that comes along with the data file states that COLUPC
has no leading zeros shown.
So;
Component 1 is 2 digits
Component 2 is 2 digits (1 digit with a 0 at the begining)
Component 3 is 5 digits
Component 4 is 5 digits
Taking this number 00-01-01820-00198
from the UPC
column. In the COLUPC
column this would collapse to 11820198
where all the leading zeros have been removed from component 1, 2, 3 and 4.
If I am a little unclear let me know.
Data:
structure(list(COLUPC = c(715643501208, 15230000022, 18769210012,
711925600019, 28382012393, 13410025574, 716708427455, 12210000157,
13410025602, 17143945712, 17336011341, 13410062505, 11820053218,
11820002989, 13410059505, 11820043550, 18417331130, 711820053025,
18982650001, 18248813000, 17199077603, 18834510005, 13410048602,
18417310545, 11820000987, 13410000217, 17089701331, 13410017256,
618516510302, 23410030602, 17336080301, 613498630003, 17825000257,
16206705155, 17031000013, 13410000555, 16206704972, 37031030703,
27336082301, 17031023882, 18769227102, 18382056793, 31820000697,
13410048574, 17199048004, 11820041550, 11820000884, 18572511509,
14182700012, 13410010505, 18700050898, 18248812910, 18015761147,
18549873837, 16821309117, 618516510402, 17231163009, 611969900006,
11820011349, 11820000769, 13410046805, 13410000162, 13410000579,
18374110012, 21820000801, 17231163011, 23410068505, 17199030030,
11820000771, 17031000009, 28549873736, 11820000784, 17199048016,
13410057256, 11820054028, 11820001991, 13410057602, 17336080341,
21820029031, 715643501210, 17231163012, 17199047004, 18769201103,
13410000623, 15230000540, 618516510704, 17336011348, 15230001301,
618516510707, 17143934806, 18572511510, 28248800006, 13410057536,
611969900002, 11820041047, 13410057340, 18769241103, 17199000025,
21820000803, 14182700712), UPC = c("00-01-82000-72608", "00-01-01820-00198",
"07-01-93271-20600", "00-01-86243-11167", "27-01-15503-01673",
"00-01-78250-00236", "00-01-18200-96664", "00-01-99232-00055",
"27-01-15504-02696", "00-01-34100-15306", "00-01-95301-13152",
"00-03-22405-55555", "00-02-76398-50000", "00-01-82000-72750",
"00-01-88573-50001", "00-01-72890-00122", "00-02-18200-00406",
"00-03-18200-01168", "00-01-83820-73500", "27-01-00007-13019",
"00-01-82548-20702", "07-01-19283-60388", "07-01-14834-00121",
"07-01-02596-10001", "27-01-15503-03400", "06-01-30279-33706",
"27-01-04200-18188", "07-01-80369-61992", "00-01-82153-19309",
"00-01-34100-59205", "27-01-15504-00652", "00-02-18200-86030",
"07-01-08820-49939", "27-01-04203-24019", "27-01-15503-00627",
"27-01-15504-01210", "27-01-15504-01305", "00-02-18200-00006",
"00-01-85498-66662", "00-01-50939-12201", "00-01-95374-70000",
"00-01-81986-00187", "00-01-72620-20101", "07-02-19256-00043",
"00-01-68213-05117", "07-01-86243-01106", "00-03-21242-00004",
"27-01-15503-00969", "00-01-62067-38055", "07-01-36040-00657",
"00-01-78250-00388", "00-01-62067-04971", "00-01-18200-15047",
"27-01-00001-60539", "00-01-70378-54652", "00-01-52300-00005",
"27-01-15504-01620", "00-01-70310-35738", "07-03-36920-11141",
"27-01-15503-00049", "00-01-80480-90593", "00-01-18200-27168",
"00-01-34100-14574", "07-01-92098-07234", "27-01-00007-05978",
"06-01-05748-02000", "00-01-34100-92528", "06-01-43775-10306",
"27-01-15503-01146", "27-01-00007-06142", "07-03-02770-00802",
"00-04-62067-00900", "27-01-15502-01229", "00-01-73832-06340",
"00-01-18200-00180", "07-01-94028-20030", "27-01-15504-02559",
"07-01-90586-00122", "00-01-92212-11102", "07-01-80369-91993",
"00-01-18200-18500", "00-01-18200-00997", "00-01-18200-41372",
"00-01-34100-50505", "00-01-92212-12111", "27-01-15503-00078",
"27-01-15503-07766", "00-01-80069-00221", "27-01-15504-03101",
"00-02-77940-00002", "00-01-02770-01302", "07-01-27658-12302",
"00-01-34100-62152", "00-01-72620-20128", "00-01-71990-77952",
"00-01-85725-21004", "07-01-08820-58624", "07-02-88946-10141",
"00-01-81986-00144", "00-01-18200-00242")), class = "data.frame", row.names = c(19638L,
11783L, 17140L, 19624L, 18706L, 8716L, 19653L, 5259L, 8738L,
13844L, 15570L, 11543L, 3919L, 1345L, 11412L, 3122L, 16899L,
19583L, 17617L, 16753L, 15011L, 17582L, 9364L, 16889L, 1310L,
5602L, 13696L, 8276L, 19528L, 18189L, 15840L, 19217L, 15982L,
13285L, 13568L, 5737L, 13263L, 19052L, 18282L, 13613L, 17300L,
16882L, 18785L, 9214L, 14925L, 2868L, 1009L, 17053L, 11674L,
7364L, 17077L, 16752L, 15990L, 17038L, 13449L, 19530L, 15166L,
19167L, 2539L, 542L, 9054L, 5597L, 5757L, 16754L, 17682L, 15167L,
18203L, 14441L, 657L, 13551L, 18748L, 747L, 14953L, 9903L, 4182L,
1330L, 11094L, 15936L, 17768L, 19642L, 15173L, 14921L, 17090L,
5759L, 12765L, 19579L, 15588L, 12885L, 19581L, 13814L, 17054L,
18697L, 11085L, 19136L, 2764L, 10523L, 17465L, 13960L, 17724L,
11687L))
Upvotes: 0
Views: 88
Reputation: 7312
I think you can do this with two gsub
statements, based on nchar
. I'm making the assumption here that for an 11 digit value in COLUPC
, you are always missing a zero on the fourth group. And I called your data frame df1
in my sapply
call. If you named yours something different you'll need to swap in your name.
We'll define a helper function to pass to sapply
:
fixUPC <- function(x){
if (nchar(x) == 11) {
# we catch each Component in a separate group, then reconstruct in
# our substitution adding zeros and dashes as necessary
gsub("^(\\d)(\\d)(\\d{5})(\\d{4})","0\\1-0\\2-\\3-0\\4",x)
} else if (nchar(x) == 12){
gsub("^(\\d)(\\d)(\\d{5})(\\d{5})","0\\1-0\\2-\\3-\\4",x)
}
}
df1$fixedUPC <- sapply(df1$COLUPC, fixUPC)
head(df1$fixedUPC)
[1] "07-01-56435-01208" "01-05-23000-00022" "01-08-76921-00012"
[4] "07-01-19256-00019" "02-08-38201-02393" "01-03-41002-05574"
You could also extend this to situations where you lost more than 3 characters, i.e. nchar == 10
, however, as you lose more numbers, your ability to accurately fill in zeros will depend on how consistent the UPCs were to begin with. For your last example of 11820198
, it becomes ambiguous whether the original UPC was 00-01-01820-00198
or 01-01-82019-00008
. If you can say with certainty that the zeros always go one place or another for a certain nchar
value, you're good to go, but that may not be true in your full data set.
Upvotes: 1