Reputation: 329
I use this grep command on OS X:
grep -E 'Title|Amount|AwardID|FirstName|LastName' *.xml
and the result is:
<Title>ABC System</Title>
<Amount>50000</Amount>
<AwardID>1000</AwardID>
<FirstName>Name</FirstName>
<LastName>Thanks</LastName>
Now I am trying to use sed to strip the tags and produce the result below, but my attempt does not work.
What options should I use to get it?
sed -i "" 's/Title//g'
Desired result, as a text file:
ABC System, 50000, 1000, Name, Thanks
I can do it for a single file:
$ grep -E 'AwardID|AwardAmount|FirstName|LastName' 1433501.xml > test
$ sed -E '/AwardID|AwardAmount|FirstName|LastName/s/.*>([^<]+)<.*/\1/' test
43856
1433501
Faisal
Hossain
$ sed -E '/AwardID|AwardAmount|FirstName|LastName/s/.*>([^<]+)<.*/\1/' test | paste -sd',' -
43856,1433501,Faisal,Hossain
But when I replace the single file name with *.xml, I need a newline after each file's record. What should I add?
AwardTable
xml sel -t -v //AwardID -o , -v //AwardAmount -nl *.xml > AwardTable.csv
InvestigatorTable
xml sel -t -v //AwardID -m '//Investigator[RoleCode = "Principal Investigator"]' -o , -v FirstName -o , -v LastName -b -o [PI] -m '//Investigator[RoleCode = "Co-Principal Investigator"]' -o , -v FirstName -o , -v LastName -b -o [CoPI] -nl *.xml
How should I get the data for InvestigatorTable? How can I produce the following format?
ID, Firstname, Lastname, Role
12345, FirstName, LastName, PI
12345, FirstName, LastName, Co-PI
12345, FirstName, LastName, Former-PI
xml sel -t -v //AwardID -o , -v //AwardAmount -m '//Investigator[RoleCode = "Principal Investigator"]' -o , -v FirstName -o , -v LastName -o [PI] -b -m '//Investigator[RoleCode = "Former Principal Investigator"]' -o , -v FirstName -o , -v LastName -o [FoPI] -b -m '//Investigator[RoleCode = "Co-Principal Investigator"]' -o , -v FirstName -o , -v LastName -o [CoPI] -b -nl *.xml
I can get output like this:
1417948,93147,M. Lee,Allison[PI],Jennifer,Arrigo[CoPI],Cynthia,Chandler[CoPI],Kerstin,Lehnert[CoPI]
1417966,574209,Robb,Lindgren[PI]
1418062,253000,Julia,Coonrod[PI],Gary,Harrison[FoPI]
I can do it manually for now, but I would like to automate it.
Please help me get the results in this structure:
AwardID, FirstName, LastName, Role
Upvotes: 0
Views: 255
Reputation: 246807
awk would do it:
awk -v ORS=", " -F '[<>]' '
/Title|Amount|AwardID|FirstName|LastName/ {print $3}
END {printf "\b\b \n"}
' << EOF
<Title>ABC System</Title>
<Amount>50000</Amount>
<AwardID>1000</AwardID>
<FirstName>Name</FirstName>
<LastName>Thanks</LastName>
EOF
ABC System, 50000, 1000, Name, Thanks
With multiple files, I assume you want a newline for each file. GNU awk v4 has an extension: ENDFILE
gawk -v ORS=", " -F '[<>]' '
/Title|Amount|AwardID|FirstName|LastName/ {print $3}
ENDFILE {printf "\b\b \n"}
' *.xml
otherwise it's a bit more work:
awk -v ORS=", " -F '[<>]' '
/Title|Amount|AwardID|FirstName|LastName/ {print $3}
FNR == 1 && FILENAME != ARGV[1] {printf "\b\b \n"}
END {printf "\b\b \n"}
' *.xml
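Note that the "\b\b" trick writes literal backspace characters into the output, which matters if the result is redirected to a file. A minimal sketch that avoids this by collecting each file's fields in a variable and printing them when the next file starts might look like this. The function name xml_fields_csv is made up for illustration, and it assumes each wanted element sits on its own line, as in the sample.

```shell
# Hypothetical helper: print one ", "-separated line per input file.
# Assumes each wanted element is on its own line, e.g. <Amount>50000</Amount>.
xml_fields_csv() {
    awk -F '[<>]' '
        # Print the line collected for the previous file, if any.
        function emit() { if (n) print line; line = ""; n = 0 }

        # First line of each file: flush what the previous file produced.
        FNR == 1 { emit() }

        /Title|Amount|AwardID|FirstName|LastName/ {
            if (n++) line = line ", " $3
            else     line = $3
        }

        # Flush the last file.
        END { emit() }
    ' "$@"
}
```

Usage would be something like: xml_fields_csv *.xml > result.txt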
For robustness, you should be using an XML parser or XSLT transformation.
Given your sample XML files, here's a solution using xmlstarlet, an XML processing tool I like:
xmlstarlet sel -t -v //AwardTitle -o , -v //AwardAmount -o , -v //AwardID -m //Investigator -o , -v FirstName -o , -v LastName -b -nl 1419538.xml 1424234.xml
IBDR: Workshop on Successful Approaches for Development and Dissemination of Instrumentation for Biological Research - May 1-2, 2014; Rosslyn, VA,49990,1419538,Sameer,Sonkusale,Valencia,Koomson,Eduardo,Rosa-Molinar
RAPID: Role of Physical, Chemical and Diffusion Properties of 4-Methyl-cyclohexane methanol in Remediating Contaminated Water and Water Pipes,49999,1424234,Daniel,Gallagher,Andrea,Dietrich,Paolo,Scardina
If you want to use another XSLT tool, here's the generated stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:exslt="http://exslt.org/common" version="1.0" extension-element-prefixes="exslt">
<xsl:output omit-xml-declaration="yes" indent="no"/>
<xsl:template match="/">
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="//AwardTitle"/>
</xsl:call-template>
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="//AwardAmount"/>
</xsl:call-template>
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="//AwardID"/>
</xsl:call-template>
<xsl:for-each select="//Investigator">
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="FirstName"/>
</xsl:call-template>
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="LastName"/>
</xsl:call-template>
</xsl:for-each>
<xsl:value-of select="'&#10;'"/>
</xsl:template>
<xsl:template name="value-of-template">
<xsl:param name="select"/>
<xsl:value-of select="$select"/>
<xsl:for-each select="exslt:node-set($select)[position()>1]">
<xsl:value-of select="'&#10;'"/>
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The schema is not great. Specifically, it's not flexible: what if there are more than 5 investigators? You need something like this, perhaps simpler:
Award table: id, title, amount
AwardInvestigators table: award_id, firstname, lastname, role
BTW, I read the question more carefully. I've amended my xmlstarlet command a bit to ensure the Principal Investigator's name is first:
xmlstarlet sel -t \
-v //AwardID -o , -v //AwardAmount \
-m '//Investigator[RoleCode = "Principal Investigator"]' -o , -v FirstName -o , -v LastName -b \
-m '//Investigator[RoleCode = "Co-Principal Investigator"]' -o , -v FirstName -o , -v LastName -b \
-nl \
*.xml
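For the one-row-per-investigator layout (award_id, firstname, lastname, role), here is a rough line-oriented awk sketch in the same spirit as the awk commands earlier. It is only a sketch under assumptions: every element sits on its own line, each file holds a single AwardID, and the role strings are the three quoted in the question. The function name investigator_rows and the short labels PI, Co-PI and Former-PI are illustrative choices. An XML parser remains the more robust route.

```shell
# Hypothetical helper: one "AwardID,FirstName,LastName,Role" row per investigator.
# Assumes one element per line and one AwardID per file.
investigator_rows() {
    awk -F '[<>]' '
        # Print the rows buffered for the file that just ended.
        function emit(   i) {
            for (i = 1; i <= n; i++) print id "," row[i]
            n = 0; id = ""
        }

        FNR == 1 && NR > 1 { emit() }

        $2 == "AwardID"   { id = $3 }
        $2 == "FirstName" { first = $3 }
        $2 == "LastName"  { last = $3 }
        $2 == "RoleCode"  {
            role = $3
            # Shorten "... Principal Investigator" to PI, Co-PI or Former-PI.
            if (sub(/Principal Investigator$/, "PI", role)) sub(/ /, "-", role)
        }

        # End of an investigator block: buffer the row, because AwardID
        # may only appear later in the file.
        $2 == "/Investigator" {
            row[++n] = first "," last "," role
            first = last = role = ""
        }

        END { emit() }
    ' "$@"
}
```

Usage would be something like: investigator_rows *.xml > InvestigatorTable.csv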
Upvotes: 1
Reputation: 77095
Here is another way to do it:
sed -nE '/Title|Amount|AwardID|FirstName|LastName/s/.*>([^<]+)<.*/\1/p' *.xml | paste -sd',' -
With your sample data, it gave the following output:
$ sed -nE '/Title|Amount|AwardID|FirstName|LastName/s/.*>([^<]+)<.*/\1/p' xmlfile | paste -sd',' -
Collaborative Research: Using the Rurutu hotspot to evaluate mantle motion and absolute plate motion models,137715,1433097,Jasper,Konter
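With *.xml, the pipeline above joins the fields of all files into one long line, because paste only sees a single stream. If you want one line per file, a small loop that runs the same sed and paste once per file should work. This is a sketch; the function name per_file_csv is made up for illustration.

```shell
# Hypothetical helper: run the sed | paste pipeline once per file,
# so every file produces its own comma-joined line.
per_file_csv() {
    for f in "$@"; do
        sed -nE '/Title|Amount|AwardID|FirstName|LastName/s/.*>([^<]+)<.*/\1/p' "$f" |
            paste -sd',' -
    done
}
```

Usage would be something like: per_file_csv *.xml > result.txt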
Upvotes: 2