Reputation: 37

Parse the data into separate individual columns

I have a column called Physical-address that I need to parse into it's own columns so that I need import it into separate address fields. The data in the column is like this:

Here are some examples:

1) 6453 W Church Dr, Summitville, CA, 46741, Brown

2) 2456 North Street, Greensboro, NC 27403

3) 5847 East Wood Dr, VA

4) 530 Alhambra Dr. Apt. A1 Anderson IN 46012

5) 1511 W 9th St, Anderson, IN, 46016

6) 2900 W 22nd St, Anderson, IN, 46013, Madison

7) Community Hospital, 1515 N Madison Ave, Anderson, IN, 46011

Result:

1) 46741

2) 27403

3) NULL

4) 46012

5) 46016

6) 46013

7) 46011

Thanks in advance!

Upvotes: 2

Answers (2)

Gottfried Lesigang

Reputation: 67331

You might try this (thx, gotqn for the sample data!)

DECLARE @DataSource TABLE
(
    ID INT IDENTITY,
    [value] NVARCHAR(MAX)
);

INSERT INTO @DataSource ([value])
VALUES ('6453 W Church Dr, Summitville, CA, 46741, Brown')
      ,('2456 North Street, Greensboro, NC 27403')
      ,('5847 East Wood Dr, VA')
      ,('530 Alhambra Dr. Apt. A1 Anderson IN 46012')
      ,('1511 W 9th St, Anderson, IN, 46016')
      ,('2900 W 22nd St, Anderson, IN, 46013, Madison')
      ,('Community Hospital, 1515 N Madison Ave, Anderson, IN, 46011');

--The query will find all integers in the strings and return those in the intervall needed

SELECT ID,[value]
      ,fragment.value('text()[1]','int')
FROM @DataSource ds
CROSS APPLY(SELECT CAST('<x>' + REPLACE(REPLACE((SELECT [value] AS [*] FOR XML PATH('')),',',' '),' ','</x><x>') + '</x>' AS XML)) A(Casted)
CROSS APPLY A.Casted.nodes('/x[not(empty(. cast as xs:int?))]') B(fragment)
WHERE fragment.value('text()[1]','int') BETWEEN 10000 AND 99999;

Some explanation:

the first CROSS APPLY will replace all commas to blanks and then split the string at all blanks (by transforming a b c to <x>a</x><x>b</x><x>c</x>).

The magic happens in the second CROSS APPLY:
.nodes('/x[not(empty(. cast as xs:int?))]') will return fragments, where the cast to xs:int? does not return empty. In other words: All values which are numbers.

As we return only numbers, we can savely use int in the .value()-method and can therefore simply check the interval using BETWEEN.

UPDATE

Thx to HABO you might change this slightly:

SELECT ID,[value]
      ,fragment.value('text()[1]','nvarchar(max)')
FROM @DataSource ds
CROSS APPLY(SELECT CAST('<x>' + REPLACE(REPLACE((SELECT [value] AS [*] FOR XML PATH('')),',',' '),' ','</x><x>') + '</x>' AS XML)) A(Casted)
CROSS APPLY A.Casted.nodes('/x[not(empty(. cast as xs:int?))]') B(fragment)
WHERE LEN(fragment.value('text()[1]','nvarchar(max)'))=5;

the idea is still the same: Find pure numbers. But the result is handled as string (not omiting leading zeros) and the filter is checking for a length of 5.

Upvotes: 4

gotqn

Reputation: 43666

Assuming you are using at least SQL Server 2012, the following is full working example. If you are using earlier version, replace the IIF with CASE WHEN and the TRY_CONVERT with other technique for checking if a string is a number (you can use pathindex to check if there are any not numeric characters, for example).

DECLARE @DataSource TABLE
(
    [value] NVARCHAR(MAX)
);

INSERT INTO @DataSource ([value])
VALUES ('6453 W Church Dr, Summitville, CA, 46741, Brown')
      ,('2456 North Street, Greensboro, NC 27403')
      ,('5847 East Wood Dr, VA')
      ,('530 Alhambra Dr. Apt. A1 Anderson IN 46012')
      ,('1511 W 9th St, Anderson, IN, 46016')
      ,('2900 W 22nd St, Anderson, IN, 46013, Madison')
      ,('Community Hospital, 1515 N Madison Ave, Anderson, IN, 46011');

WITH DataSource ([value], [valueXML]) AS
(
    SELECT [value]  
          ,CAST(('<X>'+REPLACE(REPLACE([value], ' ', ','),',' ,'</X><X>')+'</X>') AS XML)
    FROM @DataSource
), 
DataSourceFinal ([value], [number], [rank], [numbers_count]) AS
(
SELECT [value]  
      ,T.c.value('.', 'nvarchar(128)')
      ,ROW_NUMBER() OVER (PARTITION BY DS.[value] ORDER BY T.c DESC)
      ,COUNT(T.c.value('.', 'nvarchar(128)')) OVER (PARTITION BY [value])
FROM DataSource DS
CROSS APPLY DS.[valueXML].nodes('X') T(c)
WHERE TRY_CONVERT(BIGINT, T.c.value('.', 'nvarchar(128)')) IS NOT NULL
    AND T.c.value('.', 'nvarchar(128)') <> ''
)
SELECT [value]
      ,IIF([numbers_count] > 1, [number], NULL)
FROM DataSourceFinal
WHERE [rank] = 1;

The idea is simple - replace all empty strings with commas. Then split the newly created CSVs and look for the last number.

This can be solve really easy if you have regex support in your T-SQL. For, example:

WITH DataSource ([value], [number], [rank], [numbers_count]) AS
(
    SELECT DS.[value]
          ,REVERSE(RM.[CaptureValue])
          ,RM.[MatchID]
          ,COUNT(RM.[MatchID]) OVER (PARTITION BY DS.[value])
    FROM @DataSource DS
    CROSS APPLY [dbo].[fn_Utils_RegexMatches] (REVERSE([value]), '\d+') RM
)
SELECT [value]
      ,IIF([numbers_count] > 1, [number], NULL)
FROM DataSource
WHERE [rank] = 0;

Upvotes: 1

Parse the data into separate individual columns

Answers (2)

Some explanation:

UPDATE

Related Questions