Reputation: 51
I want to extract only the states from the XML file below.
<Table>
<State>Florida</State>
<id>123</id>
</Table>
<Table>
<State>Texas</State>
<id>456</id>
</Table>
Expected output :
(Florida)
(Texas)
But with the Pig statements below I get this output instead:
()
()
A = LOAD 'hdfs:/user.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Table') AS (x:chararray);
B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(x,
'<Table>\\n\\s*<State>(.*)</State>\\n\\s*\\n\\s*</Table>'))
as (state:chararray);
Please help me understand where I have gone wrong, or how I can skip over a particular tag line.
Upvotes: 0
Views: 429
Reputation: 111
That looks like a buggy regex. After the closing </State> you are using \\n\\s*\\n\\s*</Table>, which makes no allowance for the <id>...</id> element, so the pattern cannot match the record. Have you looked at using an XML parsing library in a UDF? It might be easier than trying to build a bunch of regexes by hand.
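To make the mismatch concrete, here is a small Java sketch (Pig regexes are Java regexes). It assumes each record handed over by XMLLoader looks like the sample in the question, and it assumes, as far as I know, that REGEX_EXTRACT_ALL requires the whole string to match. The class name, the RELAXED pattern, and the stateViaDom helper are my own illustrations, not anything from Pig or from the question.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StateRegexDemo {
    // Pattern from the question: only whitespace is allowed between </State> and </Table>.
    static final Pattern ORIGINAL = Pattern.compile(
        "<Table>\\n\\s*<State>(.*)</State>\\n\\s*\\n\\s*</Table>");

    // Hypothetical fix: accept anything (including <id>...</id>) after </State>.
    static final Pattern RELAXED = Pattern.compile(
        "<Table>\\s*<State>(.*?)</State>[\\s\\S]*</Table>");

    // Whole-string match, returning the first capture group or null.
    static String extractState(Pattern p, String record) {
        Matcher m = p.matcher(record);
        return m.matches() ? m.group(1) : null;
    }

    // Alternative without regexes: parse the record and read the <State> element.
    static String stateViaDom(String record) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(record)));
            NodeList states = doc.getElementsByTagName("State");
            return states.getLength() > 0 ? states.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String record = "<Table>\n<State>Florida</State>\n<id>123</id>\n</Table>";
        System.out.println(extractState(ORIGINAL, record)); // null
        System.out.println(extractState(RELAXED, record));  // Florida
        System.out.println(stateViaDom(record));            // Florida
    }
}
```

If the relaxed pattern suits your data, the same string should be usable as the second argument of REGEX_EXTRACT_ALL in the Pig script, though I have not run it against Pig. The DOM version is roughly what the body of a UDF could look like.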
EDIT: One other suggestion. Are you sure the line separators in your file are just \n? You may have \r\n as the separator, in which case [\r\n]+ should help. See this post for more details.
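A quick way to see the difference, again as a Java sketch since that is the regex engine Pig uses. The class name and both patterns are made up for illustration, and the sample strings are assumptions about what the file might contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSeparatorDemo {
    // Expects a bare \n between the two elements.
    static final Pattern UNIX_ONLY = Pattern.compile("<State>(.*)</State>\\n<id>");

    // Accepts \n, \r\n, or any run of CR/LF characters.
    static final Pattern ANY_NEWLINE = Pattern.compile("<State>(.*)</State>[\\r\\n]+<id>");

    // Returns the first captured state, or null when the pattern is not found.
    static String firstState(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String windows = "<State>Texas</State>\r\n<id>456</id>";
        String unix = "<State>Texas</State>\n<id>456</id>";
        System.out.println(firstState(UNIX_ONLY, windows));   // null
        System.out.println(firstState(ANY_NEWLINE, windows)); // Texas
        System.out.println(firstState(ANY_NEWLINE, unix));    // Texas
    }
}
```

So if the file came from a Windows machine, a pattern that hard-codes \n can fail even when everything else is right.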
Upvotes: 0