RandyMcKay
RandyMcKay

Reputation: 326

PostgreSQL parse countries in array against the countries table

We have content and country tables. Country is pretty simple: country_name column defined as string: Albania, Belgium, China, Denmark etc...

Content is a table with half a million of rows with various data with countries column defined as array text[]. Each value there has a number of countries concatenated like: {"denmark,finland,france,germany,ireland,gb,italy,netherlands,poland,russia,spain,sweden,australia,brazil,canada,china,india,indonesia,japan,malaysia,vietnam,mexico,"south korea",thailand,usa,singapore,uae"}

The update from internal team is for a thousand of records and we are not sure if countries are all spelled correctly. So the task is to reconcile against the country_name in country table.

I am doing replace(replace(country_array::text,'{',''),'}','') as country_text and think about doing UNPIVOT to check each column against country table. Is there any other easier way to make sure countries array in Content table has valid country names from country table?

Thank you

Upvotes: 1

Views: 998

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269803

If you have doubts about some countries not being spelled correctly, then no doubt there are such examples.

Start by getting the list of countries that are not in the reference table:

select c_country, count(*)
from content c cross join lateral
     unnnest(c.countries) c_country left join
     countries co
     on co.country_name = c_country
where co.country_name is not null
group by c_country
order by count(*) desc;

Then, you can go in and fix the data.

There is nothing wrong a priori with storing values in arrays. However, if you were designing the database from scratch, I would probably recommend a contentCountries table and a countryId. That would ensure unambiguous relationships.

In your case, you should probably fix the ingestion process so the values are known to be correct when input. That might be sufficient, given that you already have a lot of data and just need to fix it.

Upvotes: 1

GMB
GMB

Reputation: 222472

You can unnest() each array to a set of rows, and ensure that all values occur in the country table. The following query gives you the array elements that are missing in the reference table:

select *
from 
    content c
    cross join lateral unnest(c.countries) as t(country_name)
    left join country y on y.country_name = t.country_name
where y.country_name is null

Demo on DB Fiddle

country table:

id | country_name
-: | :-----------
 1 | albania     
 2 | denmark     

content table:

id | countries        
-: | :----------------
 1 | {albania,denmark}
 1 | {albania,france} 

query results:

id | countries        | country_name
-: | :--------------- | :-----------
 1 | {albania,france} | france      

Upvotes: 3

Related Questions