Saurabh Kumar Singh

Reputation: 45

Regular expression to identify text between semicolons that contains commas and spaces

I am trying to identify fields that contain both a comma (,) and whitespace (\s+) in a CSV file that is semicolon (;) separated. Sample entries are as follows:

09/03/2023;13;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(UNSC RESOLUTION 1483);;;;;;;;;;;;;;;;;;;;;;;;;;;14;13;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;;;;;;;;;;;;;;;EU.27.28
09/03/2023;20;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(Saddam's second son);26;20;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;Hussein Al-Tikriti;Qusay;Saddam;Qusay Saddam Hussein Al-Tikriti;M;;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;EU.39.56

In the sample data, I am trying to extract the following fields:

al-Awja, near Tikrit
Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard

Both target fields contain commas (,), which causes a problem when I convert the semicolon-separated file into a comma-separated file: each embedded comma produces an extra column.

So far I have the following regular expression, which finds the relevant fields. However, it matches only the part around the first comma, not the entire field.

Regex: ([A-Za-z0-9-]+)([,])(\s+)([A-Za-z0-9-]+)

Please help.

Upvotes: 0

Views: 75

Answers (1)

mklement0

Reputation: 439487

It is simpler to split your input lines into fields and then use a relatively simple regex to keep only the fields that contain the characters of interest:

# Assume that $lines contains the lines of the input file, such as
# obtained via Get-Content
$lines -split ';' -match ' .*,|,.* '

This outputs those fields that contain both a space and a comma (,), yielding the output shown in your question.

If you only care about commas, -match ',' will do.
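As a self-contained illustration of the split-then-filter approach, here is a minimal sketch. The two sample lines are shortened, hypothetical versions of the rows in the question (most fields are dropped for readability), and the variable name $fields is introduced here only for the example:

```powershell
# Shortened, hypothetical stand-ins for the rows in the question.
$lines = @(
    '09/03/2023;13;P;IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;EU.27.28'
    '09/03/2023;20;P;IRQ;Qusay;Saddam;M;;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;;EU.39.56'
)

# -split with an array on the left splits every line and returns one flat
# array of fields; -match with an array on the left then acts as a filter.
# The regex matches a space followed later by a comma, or a comma followed
# later by a space, i.e. fields containing both characters in either order.
$fields = $lines -split ';' -match ' .*,|,.* '

$fields
```

Because both operators work on arrays, no explicit loop is needed; the result is simply the subset of fields that contain both characters.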


Taking a step back:

You can read your file with Import-Csv -Delimiter ';' and export it to a regular, comma-separated CSV with Export-Csv. This requires no special handling of fields with embedded commas, because Export-Csv automatically encloses the field values in "..." (double quotes), which allows the fields to contain commas.

If your input file happens to lack a header row (a first line that contains column names), you'll have to supply one yourself, via the -Header parameter.
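A minimal sketch of that conversion, assuming a headerless input file. The temporary file paths, the one-line sample content, and the column names Col1 to Col3 are all hypothetical placeholders; substitute your real paths and one name per column in your data:

```powershell
# Hypothetical input and output paths in the temp directory.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'sample-semicolon.csv'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'sample-comma.csv'

# A tiny headerless sample: three fields, the second with an embedded comma.
Set-Content -LiteralPath $inFile -Value '1937-04-28;al-Awja, near Tikrit;IRQ'

# Read with ';' as the delimiter, supplying placeholder column names,
# then write a regular comma-separated file.
Import-Csv -LiteralPath $inFile -Delimiter ';' -Header Col1, Col2, Col3 |
    Export-Csv -LiteralPath $outFile -NoTypeInformation

# Show the result; the field with the embedded comma stays a single column.
$result = Get-Content -LiteralPath $outFile
$result
```

-NoTypeInformation suppresses the "#TYPE" line that Windows PowerShell would otherwise write at the top of the output; PowerShell (Core) accepts the switch but no longer needs it.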

Upvotes: 1
