myro

Reputation: 1196

Parse out Wikipedia's IPAc template

I'd like to parse out the content of the IPAc template from Wikipedia markup, e.g.:

'''Konjac''' ({{IPAc-en|lang|pron|ˈ|k|oʊ|n|j|æ|k}})

Konjac (English pronunciation: /ˈkoʊnjæk/)

'''Konjac''' ({{IPAc-en|lang|pron|ˈ|k|oʊ|n|j|æ|k}} {{respell|KOHN|yak}})

Konjac (English pronunciation: /ˈkoʊnjæk/ kohn-yak)

''Konjac'' is pronounced {{IPAc-en|ˈ|k|oʊ|n|j|æ|k}} in English.

Konjac is pronounced /ˈkoʊnjæk/ in English.

What regex would I need to extract this content: |k|oʊ|n|j|æ|k? I don't know how to match something that may or may not be there (lang|pron).

Thank you

Upvotes: 0

Views: 98

Answers (1)

Joanna Derks

Reputation: 4063

I would give this a try:

IPAc-en(?:\w|[|])+.(?:[|]|([^}]))+(?:}}\s*\{\{respell(?:[|]|([^}]))+)?

It should match the main pronunciation as well as the optional 'respell' part.

The matches of both pronunciations will be in the capturing groups, so you should be able to access them from Java.
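One caveat when reading these groups from Java: a capturing group that sits inside a repetition keeps only the text of its last iteration, not every character it matched along the way. The sketch below is only an illustration of that behaviour. The class name `RepeatedGroupDemo`, the helper `firstGroup`, and the `OUTER` variant (the same pattern with one extra group wrapped around the whole repetition) are hypothetical names introduced here, not part of the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // First two parts of the pattern above: the capturing group is inside the "+" repetition,
    // so after a match it holds only the last character it captured.
    static final Pattern INNER =
            Pattern.compile("IPAc-en(?:\\w|[|])+.(?:[|]|([^}]))+");

    // Hypothetical variant: an outer group wraps the whole repetition,
    // so group 1 holds the full run of pipes and characters.
    static final Pattern OUTER =
            Pattern.compile("IPAc-en(?:\\w|[|])+.((?:[|]|[^}])+)");

    /** Returns group 1 of the first match, or null when the pattern does not match. */
    static String firstGroup(Pattern pattern, String markup) {
        Matcher m = pattern.matcher(markup);
        return m.find() ? m.group(1) : null;
    }
}
```

For the first example in the question, `firstGroup(INNER, ...)` gives `k`, while `firstGroup(OUTER, ...)` gives `|k|oʊ|n|j|æ|k`, which can then be cleaned up with `replace("|", "")`.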

Explanation:

  • IPAc-en(?:\w|[|])+. - match the beginning and then word characters or the pipe as many times as you can. Then match one other character (it's the funny one where the pronunciation starts). Don't capture anything.

  • (?:[|]|([^}]))+ - match a pipe (don't capture) or anything else that's not a closing curly bracket (capture it; those are the characters you want). Repeat until the end of the string or until you find a }.

  • (?:}}\s*\{\{respell(?:[|]|([^}]))+)? - then optionally match the brackets and the respell text, using the same logic as above to capture the letters.
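If picking the pieces out of one large pattern gets awkward, a different route is to match each template as a whole and split its body on the pipe. The following is a rough Java sketch of that alternative, not the pattern explained above. The class name `IpacExtractor`, the method `extract`, and the assumption that `lang` and `pron` are the only optional flags are all made up for illustration, based only on the examples in the question. The braces are escaped because java.util.regex rejects an unescaped `{` that does not start a quantifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IpacExtractor {
    // Group 1: template name; group 2: everything between the first pipe and "}}".
    private static final Pattern TEMPLATE =
            Pattern.compile("\\{\\{(IPAc-en|respell)\\|([^}]*)\\}\\}");

    // Optional leading flags seen in the question's examples (assumed to be the only ones).
    private static final List<String> FLAGS = Arrays.asList("lang", "pron");

    /** Returns, for each template found, its pipe-separated segments with the flags removed. */
    static List<List<String>> extract(String markup) {
        List<List<String>> templates = new ArrayList<>();
        Matcher m = TEMPLATE.matcher(markup);
        while (m.find()) {
            boolean isIpac = m.group(1).equals("IPAc-en");
            List<String> segments = new ArrayList<>();
            for (String part : m.group(2).split("\\|")) {
                boolean isFlag = isIpac && FLAGS.contains(part);
                if (!isFlag && !part.isEmpty()) {
                    segments.add(part);
                }
            }
            templates.add(segments);
        }
        return templates;
    }
}
```

Joining the first list with `String.join("", ...)` gives the pronunciation, and joining the respell list with `String.join("-", ...)` gives the respelling. Unlike the pattern above, this keeps the stress mark as its own segment, so drop it afterwards if it is not wanted.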

Upvotes: 1
