user113156
user113156

Reputation: 7147

Adding back trailing zeros in an identification column

I have the following data which looks like the following;

            COLUPC               UPC
19638 715643501208 00-01-82000-72608
11783  15230000022 00-01-01820-00198
17140  18769210012 07-01-93271-20600
19624 711925600019 00-01-86243-11167
18706  28382012393 27-01-15503-01673
8716   13410025574 00-01-78250-00236

The columns COLUPC and UPC come from two different data frames and a random sample of each was taken. The COLUPC and UPC are the same unique identifiers. It is just that COLUPC has lost its format.

Taking the first row value for COLUPC - 715643501208. It should look like 07-01-56435-01208 following the same structure as in column UPC.

It gets a little more tricky with observation 15230000022 It should look like 01-05-23000-00022 where there was an extra 0 added in the 4th component 00022 along with component 1 and 2.

The documentation that comes along with the data file states that COLUPC has no leading zeros shown.

So;

Component 1 is 2 digits

Component 2 is 2 digits (1 digit with a 0 at the begining)

Component 3 is 5 digits

Component 4 is 5 digits

Taking this number 00-01-01820-00198 from the UPC column. In the COLUPC column this would collapse to 11820198 where all the leading zeros have been removed from component 1, 2, 3 and 4.

If I am a little unclear let me know.

Data:

structure(list(COLUPC = c(715643501208, 15230000022, 18769210012, 
711925600019, 28382012393, 13410025574, 716708427455, 12210000157, 
13410025602, 17143945712, 17336011341, 13410062505, 11820053218, 
11820002989, 13410059505, 11820043550, 18417331130, 711820053025, 
18982650001, 18248813000, 17199077603, 18834510005, 13410048602, 
18417310545, 11820000987, 13410000217, 17089701331, 13410017256, 
618516510302, 23410030602, 17336080301, 613498630003, 17825000257, 
16206705155, 17031000013, 13410000555, 16206704972, 37031030703, 
27336082301, 17031023882, 18769227102, 18382056793, 31820000697, 
13410048574, 17199048004, 11820041550, 11820000884, 18572511509, 
14182700012, 13410010505, 18700050898, 18248812910, 18015761147, 
18549873837, 16821309117, 618516510402, 17231163009, 611969900006, 
11820011349, 11820000769, 13410046805, 13410000162, 13410000579, 
18374110012, 21820000801, 17231163011, 23410068505, 17199030030, 
11820000771, 17031000009, 28549873736, 11820000784, 17199048016, 
13410057256, 11820054028, 11820001991, 13410057602, 17336080341, 
21820029031, 715643501210, 17231163012, 17199047004, 18769201103, 
13410000623, 15230000540, 618516510704, 17336011348, 15230001301, 
618516510707, 17143934806, 18572511510, 28248800006, 13410057536, 
611969900002, 11820041047, 13410057340, 18769241103, 17199000025, 
21820000803, 14182700712), UPC = c("00-01-82000-72608", "00-01-01820-00198", 
"07-01-93271-20600", "00-01-86243-11167", "27-01-15503-01673", 
"00-01-78250-00236", "00-01-18200-96664", "00-01-99232-00055", 
"27-01-15504-02696", "00-01-34100-15306", "00-01-95301-13152", 
"00-03-22405-55555", "00-02-76398-50000", "00-01-82000-72750", 
"00-01-88573-50001", "00-01-72890-00122", "00-02-18200-00406", 
"00-03-18200-01168", "00-01-83820-73500", "27-01-00007-13019", 
"00-01-82548-20702", "07-01-19283-60388", "07-01-14834-00121", 
"07-01-02596-10001", "27-01-15503-03400", "06-01-30279-33706", 
"27-01-04200-18188", "07-01-80369-61992", "00-01-82153-19309", 
"00-01-34100-59205", "27-01-15504-00652", "00-02-18200-86030", 
"07-01-08820-49939", "27-01-04203-24019", "27-01-15503-00627", 
"27-01-15504-01210", "27-01-15504-01305", "00-02-18200-00006", 
"00-01-85498-66662", "00-01-50939-12201", "00-01-95374-70000", 
"00-01-81986-00187", "00-01-72620-20101", "07-02-19256-00043", 
"00-01-68213-05117", "07-01-86243-01106", "00-03-21242-00004", 
"27-01-15503-00969", "00-01-62067-38055", "07-01-36040-00657", 
"00-01-78250-00388", "00-01-62067-04971", "00-01-18200-15047", 
"27-01-00001-60539", "00-01-70378-54652", "00-01-52300-00005", 
"27-01-15504-01620", "00-01-70310-35738", "07-03-36920-11141", 
"27-01-15503-00049", "00-01-80480-90593", "00-01-18200-27168", 
"00-01-34100-14574", "07-01-92098-07234", "27-01-00007-05978", 
"06-01-05748-02000", "00-01-34100-92528", "06-01-43775-10306", 
"27-01-15503-01146", "27-01-00007-06142", "07-03-02770-00802", 
"00-04-62067-00900", "27-01-15502-01229", "00-01-73832-06340", 
"00-01-18200-00180", "07-01-94028-20030", "27-01-15504-02559", 
"07-01-90586-00122", "00-01-92212-11102", "07-01-80369-91993", 
"00-01-18200-18500", "00-01-18200-00997", "00-01-18200-41372", 
"00-01-34100-50505", "00-01-92212-12111", "27-01-15503-00078", 
"27-01-15503-07766", "00-01-80069-00221", "27-01-15504-03101", 
"00-02-77940-00002", "00-01-02770-01302", "07-01-27658-12302", 
"00-01-34100-62152", "00-01-72620-20128", "00-01-71990-77952", 
"00-01-85725-21004", "07-01-08820-58624", "07-02-88946-10141", 
"00-01-81986-00144", "00-01-18200-00242")), class = "data.frame", row.names = c(19638L, 
11783L, 17140L, 19624L, 18706L, 8716L, 19653L, 5259L, 8738L, 
13844L, 15570L, 11543L, 3919L, 1345L, 11412L, 3122L, 16899L, 
19583L, 17617L, 16753L, 15011L, 17582L, 9364L, 16889L, 1310L, 
5602L, 13696L, 8276L, 19528L, 18189L, 15840L, 19217L, 15982L, 
13285L, 13568L, 5737L, 13263L, 19052L, 18282L, 13613L, 17300L, 
16882L, 18785L, 9214L, 14925L, 2868L, 1009L, 17053L, 11674L, 
7364L, 17077L, 16752L, 15990L, 17038L, 13449L, 19530L, 15166L, 
19167L, 2539L, 542L, 9054L, 5597L, 5757L, 16754L, 17682L, 15167L, 
18203L, 14441L, 657L, 13551L, 18748L, 747L, 14953L, 9903L, 4182L, 
1330L, 11094L, 15936L, 17768L, 19642L, 15173L, 14921L, 17090L, 
5759L, 12765L, 19579L, 15588L, 12885L, 19581L, 13814L, 17054L, 
18697L, 11085L, 19136L, 2764L, 10523L, 17465L, 13960L, 17724L, 
11687L))

Upvotes: 0

Views: 88

Answers (1)

Mako212
Mako212

Reputation: 7312

I think you can do this with two gsub statements, based on nchar. I'm making the assumption here that for an 11 digit value in COLUPC, you are always missing a zero on the fourth group. And I called your data frame df1 in my sapply call. If you named yours something different you'll need to swap in your name.

We'll define a helper function to pass to sapply:

fixUPC <- function(x){
    if (nchar(x) == 11) {
        # we catch each Component in a separate group, then reconstruct in 
        # our substitution adding zeros and dashes as necessary
        gsub("^(\\d)(\\d)(\\d{5})(\\d{4})","0\\1-0\\2-\\3-0\\4",x)
    } else if (nchar(x) == 12){
        gsub("^(\\d)(\\d)(\\d{5})(\\d{5})","0\\1-0\\2-\\3-\\4",x)
    }
}

df1$fixedUPC <- sapply(df1$COLUPC, fixUPC)

head(df1$fixedUPC)

[1] "07-01-56435-01208" "01-05-23000-00022" "01-08-76921-00012"
[4] "07-01-19256-00019" "02-08-38201-02393" "01-03-41002-05574"

You could also extend this to situations where you lost more than 3 characters, i.e. nchar == 10, however, as you lose more numbers, your ability to accurately fill in zeros will depend on how consistent the UPCs were to begin with. For your last example of 11820198, it becomes ambiguous whether the original UPC was 00-01-01820-00198 or 01-01-82019-00008. If you can say with certainty that the zeros always go one place or another for a certain nchar value, you're good to go, but that may not be true in your full data set.

Upvotes: 1

Related Questions