Michael Owen
Michael Owen

Reputation: 375

How to correct a set of values in a column using REGEXP by removing spaces in MySQL?

I am looking for some help in order to cleanse a list of reference numbers that I have in my table. A correct reference number is either made up of 8 or 13 numbers so I have been looking at different patterns in my data table to workout what errors I need to fix. The following is a sample of the errors that I have come across and was looking for advice to take on best practices rather than repetitive coding.

The following is my create table and insert statements and my desired output.

Create Table

CREATE TABLE “BU_REF_NO" 
   (           "REF_NO" VARCHAR2(255 BYTE)
   );

Insert Statement

INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('12 103 501');
INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('10 1234 5678 901');
INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('12 3456 7890 123');
INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('123456 7890 123');
INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('987654 3210 321');
INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('12345678 1234567890123');
INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('1234567890123 12345678');
INSERT INTO "BU_REF_NO" (REF_NO) VALUES ('1234567890 123');

Final Output

REF_NO
12103501
1012345678901
1234567890123
1234567890123
9876543210321
12345678 1234567890123
1234567890123 12345678
1234567890123

My Query to pick out the sequences of rows beginning with a number and after 2 characters there is a space.

SELECT
    *
FROM
    BU_REF_NO
WHERE
    REF_NO REGEXP '^[0-9]{2} '
    ;

This gives me the following output:

REF_NO
12 103 501
10 1234 5678 901
12 3456 7890 123

Now where I am stuck is trying to workout the right way to remove the spaces for all the different sequences i.e. 2 characters then a space, or 6 characters and a space. I was looking at running the following code for the sequences I have in my table but I am sure there is an easier way to implement this hence looking for some advice.

What I currently have is the following (not tested yet):

UPDATE BU_REF_NO SET REF_NO = '' WHERE REF_NO REGEXP '^[0-9]{2} ';
UPDATE BU_REF_NO SET REF_NO = '' WHERE REF_NO REGEXP '^[0-9]{6} ';
UPDATE BU_REF_NO SET REF_NO = '' WHERE REF_NO REGEXP '^[0-9]{10} ';

The other thing that I probably need to include is to check for the total number of values excluding the spaces as I only need to run some code when I either have a total of 8 or 13 values, so this probably needs to be included at the start of my code. Lastly if I have less than 8 or more than 13 values than the REF_NO just stays as it is.

Thanks in advance.


UPDATE:

Hi, thanks to everyone for your responses. To make it clear from the sample data, I am showing which columns should get amended and which stay the same (BU_REF_NO_AMENDED is not a column that exists, just showing after code what it should look like):

BU_REF_NO | BU_REF_NO_AMENDED
12 103 501 | 12103501
10 1234 5678 901 | 1012345678901
12 3456 7890 123 | 1234567890123
123456 7890 123 | 1234567890123
987654 3210 321 | 9876543210321
12345678 1234567890123 | 12345678 1234567890123
1234567890123 12345678 | 1234567890123 12345678
1234567890 123 | 1234567890123

So all the rows above in the amended column remove the spaces (BU_REF_AMENDED) apart from 2 (12345678 1234567890123, 1234567890123 12345678). The reason for this is because these 2 rows have got 8 digits, a space and then 13 digits or 13 digits, a space and then 8 digits.

The reason these 2 rows are not amended is because these rows have correct ref_numbers and I need to leave these rows as they are.

Hope that clarifies my final output.


Hi, I have wrote the following query to work out the length of numbers in the rows that I want as I thought that might help me to write an inner query to select only the rows that need the spaces removing to make it easier.

SELECT
  REF_NO, 
  LENGTH(REF_NO) AS LENGTH_WITH_SPACE,
  LENGTH(REF_NO) - LENGTH(REPLACE(REF_NO, ' ', '')) AS SPACES,
  LENGTH(REF_NO) - (LENGTH(REF_NO) - LENGTH(REPLACE(REF_NO, ' ', ''))) AS LENGTH_WITHOUT_SPACES
FROM
  BU_REF_NO

;

This gives me the output:

REF_NO | LENGTH_WITH_SPACE | SPACES | LENGTH_WITHOUT_SPACES
12 103 501 | 10 | 2 | 8
10 1234 5678 901 | 16 | 3 | 13
12 3456 7890 123 | 16 | 3 | 13
123456 7890 123 | 15 | 2 | 13
987654 3210 321 | 15 | 2 | 13
12345678 1234567890123 | 22 | 1 | 21
1234567890123 12345678 | 22 | 1 | 21
1234567890 123 | 14 | 1 | 13 

Now I am thinking of writing an outer query that will select the LENGTH_WITHOUT_SPACES column for all values that equal 13 and also where the spaces are greater than 1 in the SPACES column. I can then run the REGEXP REPLACE query to replace spaces.

Will try that and post my results.

Here is what I attempted and it works!

SELECT
  REF_NO,
  CASE
    WHEN LENGTH_WITHOUT_SPACES = 8 AND SPACES > 0 THEN REGEXP_REPLACE(REF_NO,' ','') 
    WHEN LENGTH_WITHOUT_SPACES = 13 AND SPACES > 0 THEN REGEXP_REPLACE(REF_NO,' ','')
    ELSE REF_NO 
  END AS REF_AMENDED
FROM
  (
SELECT
  REF_NO, 
  LENGTH(REF_NO) AS LENGTH_WITH_SPACE,
  LENGTH(REF_NO) - LENGTH(REPLACE(REF_NO, ' ', '')) AS SPACES,
  LENGTH(REF_NO) - (LENGTH(REF_NO) - LENGTH(REPLACE(REF_NO, ' ', ''))) AS LENGTH_WITHOUT_SPACES
FROM
  BU_REF_NO
;

Just need to have this update the exist column but getting there.

Upvotes: 0

Views: 59

Answers (2)

user10703542
user10703542

Reputation:

the regex "^[0-9]+" without quantifier should suffice

Upvotes: 1

nbk
nbk

Reputation: 49373

See if a simple Replace is enough, you example is flawed.

CREATE TABLE BU_REF_NO 
   (           END_DATE VARCHAR(255)
   );
INSERT INTO BU_REF_NO (END_DATE) VALUES ('12 103 501');
INSERT INTO BU_REF_NO (END_DATE) VALUES ('10 1234 5678 901');
INSERT INTO BU_REF_NO (END_DATE) VALUES ('12 3456 7890 123');
INSERT INTO BU_REF_NO (END_DATE) VALUES ('123456 7890 123');
INSERT INTO BU_REF_NO (END_DATE) VALUES ('987654 3210 321');
INSERT INTO BU_REF_NO (END_DATE) VALUES ('12345678 1234567890123');
INSERT INTO BU_REF_NO (END_DATE) VALUES ('1234567890123 12345678');
INSERT INTO BU_REF_NO (END_DATE) VALUES ('1234567890 123');
✓

✓

✓

✓

✓

✓

✓

✓
UPDATE BU_REF_NO SET END_DATE = REPLACE(END_DATE,' ','');
SELECT * FROM BU_REF_NO;
| END_DATE              |
| :-------------------- |
| 12103501              |
| 1012345678901         |
| 1234567890123         |
| 1234567890123         |
| 9876543210321         |
| 123456781234567890123 |
| 123456789012312345678 |
| 1234567890123         |

db<>fiddle here

Upvotes: 1

Related Questions