Reputation: 113
I am attempting to parse email files stored on my local workstation. Each file contains a list of hardware orders. Some files contain multiple lists of hardware, each in a block starting with Processor: and ending with ExtraIp:. My current script works without issue if the email contains only a single block. The problems arise when an email file contains multiple blocks.
Example issue email:
Processor: Intel Xeon E3-1270 V2 3.5GHZ, Quad Core
RAM: 16GB DDR3 SDRAM
HD1: 2 x SATA Hardware RAID 1 (7,200 rpm)
(+1TB 7200 RPM SATA hard drive)
SSD: No SSD Drive
HD2: SATA Backup Drive
(+1 TB SATA (7,200 rpm))
HD3: No Additional Storage Array
ExtraIp: Public IP Addresses
Processor: Intel Xeon E3-1220 V2 3.1GHZ, Quad Core
RAM: 8GB DDR3 SDRAM
HD1: 2 x SATA Hardware RAID 1 (7,200 rpm)
(+1TB 7200 RPM SATA hard drive)
SSD: No SSD Drive
HD2: No Backup Drive
HD3: No Additional Storage Array
ExtraIp: Public IP Addresses
My script:
#!/bin/bash
find ./email -print0 | while read -d $'\0' file
do
#### Sed and while loop here, with modification to the below lines to read data from the while loop instead of directly from each file ####
#### Example sed command: sed -n "/Processor:/,/ExtraIp:/p" $file ####
order_date=$(echo $file | awk '{print $11}')
grep "Processor:" "$file" | cut -d : -f2 | cut -d , -f1 | while read cpu_type
do
if [ "$cpu_type" != "" ]; then
echo $order_date
echo $cpu_type
ram_size=$(grep "RAM:" "$file" | cut -d : -f2)
if [ "$ram_size" != "" ]; then
echo $ram_size
fi
hd1_type=$(grep "HD1:" "$file" | cut -d : -f2)
if [ "$hd1_type" != "" ]; then
echo $hd1_type
fi
hd1_size=$(grep -A1 "HD1:" "$file" | tail -n1)
if [ "$hd1_size" != "" ]; then
echo $hd1_size
fi
ssd_type=$(grep "SSD:" "$file" | cut -d : -f2)
ssd_type1=$(grep "SSD:" "$file" | cut -d : -f2 | awk '{print $1}')
if [ "$ssd_type" != "" ]; then
echo $ssd_type
fi
if [[ "$ssd_type1" != "No" && "$ssd_type1" != "" ]]; then
ssd_size=$(grep -A1 "SSD:" "$file" | tail -n1)
echo $ssd_size
else
ssd_size="No SSD"
echo $ssd_size
fi
hd2_type=$(grep "HD2:" "$file" | cut -d : -f2)
hd2_type1=$(grep "HD2:" "$file" | cut -d : -f2 | awk '{print $1}')
if [ "$hd2_type" != "" ]; then
echo $hd2_type
fi
if [[ "$hd2_type1" != "No" && "$hd2_type1" != "" ]]; then
hd2_size=$(grep -A1 "HD2:" "$file" | tail -n1)
echo $hd2_size
else
hd2_size="No HD2"
echo $hd2_size
fi
hd3_type=$(grep "HD3:" "$file" | cut -d : -f2)
hd3_type1=$(grep "HD3:" "$file" | cut -d : -f2 | awk '{print $1}')
if [ "$hd3_type" != "" ]; then
echo $hd3_type
fi
if [[ "$hd3_type1" != "No" && "$hd3_type1" != "" ]]; then
hd3_size=$(grep -A1 "HD3:" "$file" | tail -n1)
echo $hd3_size
else
hd3_size="No HD3"
echo $hd3_size
fi
echo "$order_date,$cpu_type,$ram_size,$hd1_type,$hd1_size,$hd2_type,$hd2_size,$hd3_type,$hd3_size" >> order_list.csv
fi
done
done
Expected output:
If the email only contains one block of text I get the correct output:
2014-04-01,Intel Xeon E3-1270 V2 3.5GHZ, 16GB DDR3 SDRAM, 2 x SATA Hardware RAID 1 (7,200 rpm),(+1TB 7200 RPM SATA hard drive), SATA Backup Drive,(+1 TB SATA (7,200 rpm)), No Additional Storage Array,No HD3
If the email contains multiple blocks of text I get the following output:
2014-04-01,Intel Xeon E3-1270 V2 3.5GHZ, 16GB DDR3 SDRAM
8GB DDR3 SDRAM, 2 x SATA Hardware RAID 1 (7,200 rpm)
2 x SATA Hardware RAID 1 (7,200 rpm), (+1TB 7200 RPM SATA hard drive), SATA Backup Drive
No Backup Drive, HD3: No Additional Storage Array, No Additional Storage Array
No Additional Storage Array, ExtraIp: Public IP Addresses
2014-04-01,Intel Xeon E3-1220 V2 3.1GHZ, 16GB DDR3 SDRAM
8GB DDR3 SDRAM, 2 x SATA Hardware RAID 1 (7,200 rpm)
2 x SATA Hardware RAID 1 (7,200 rpm), (+1TB 7200 RPM SATA hard drive), SATA Backup Drive
No Backup Drive, HD3: No Additional Storage Array, No Additional Storage Array
No Additional Storage Array, ExtraIp: Public IP Addresses
In the second output, the data from both blocks of text is duplicated for each CSV value (memory and drives). My plan was to add another while loop fed by a sed command (in place of the comment in my script above) and then modify each of the commands to read its data from that loop.
Example sed command to use:
sed -n "/Processor:/,/ExtraIp:/p" $file
Upvotes: 1
Views: 188
Reputation: 189487
Your parse script uses grep to extract one field, and when the $file contains two of the same field, grep extracts them both at the same time.
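A small sketch that reproduces the effect, assuming an abbreviated two-block sample written to a temporary file (the file and the variable `ram_lines` are made up for illustration):

```shell
#!/bin/bash
# Hypothetical, abbreviated sample holding two order blocks.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Processor: Intel Xeon E3-1270 V2 3.5GHZ, Quad Core
RAM: 16GB DDR3 SDRAM
ExtraIp: Public IP Addresses
Processor: Intel Xeon E3-1220 V2 3.1GHZ, Quad Core
RAM: 8GB DDR3 SDRAM
ExtraIp: Public IP Addresses
EOF

# Same pipeline as in the question: grep sees the whole file,
# so it returns the RAM: line of every block, not only the current one.
ram_size=$(grep "RAM:" "$sample" | cut -d : -f2)

# Count the lines captured in the variable (one per block).
ram_lines=$(printf '%s\n' "$ram_size" | grep -c '')
echo "lines captured in ram_size: $ram_lines"
printf '%s\n' "$ram_size"

rm -f "$sample"
```

Because `ram_size` holds one line per block, every CSV row written inside the loop ends up containing the values of all blocks.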
You would be better off refactoring to do all the parsing in Awk. I am not going to complete it for you, but this should be a good start.
awk 'BEGIN { n = split("Processor:RAM:HD1:SSD:HD2:HD3", f, /:/) } # output field order
/^Processor:/ { delete a } # forget any previous record
/^(Processor|RAM|HD[123]|SSD):/ { i=$1; sub(/:/,"",i);
  $1=""; sub(/^ /,""); a[i]=$0 } # store the value under its field name
i ~ /^(HD[123]|SSD)$/ && $1 == "No" { a[i] = "No " i; i=""; next }
i ~ /^(HD[123]|SSD)$/ && !k { k=i; next } # remember key for two-line entry
k { a[k] = a[k] "," $0; k=i="" } # append the continuation line to that entry
/^ExtraIp: / {s=""; for (i=1; i<=n; i++) {
  printf("%s%s", s, a[f[i]]); s="," } printf "\n" }' "$file"
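One way this could be wired into the existing loop is sketched below. It is a variation on the script above, not a finished solution: the function name `parse_orders` and its second argument (the order date) are assumptions for illustration, the processor value is cut at the first comma as the original `cut -d , -f1` did, and fields are written unquoted, so values such as "(7,200 rpm)" still contain commas.

```shell
#!/bin/bash
# parse_orders FILE DATE
# Prints one CSV line per Processor: ... ExtraIp: block found in FILE.
# Both the function name and the DATE argument are illustrative assumptions.
parse_orders() {
  awk -v date="$2" '
    BEGIN { n = split("Processor:RAM:HD1:SSD:HD2:HD3", f, ":") }

    # A Processor: line starts a new record, so forget the previous one.
    /^Processor:/ { for (x in a) delete a[x]; k = "" }

    /^(Processor|RAM|HD[123]|SSD):/ {
      key = $0; sub(/:.*/, "", key)              # field name before the colon
      val = $0; sub(/^[^:]*:[ \t]*/, "", val)    # text after the colon
      if (key == "Processor") sub(/,.*/, "", val)
      k = ""
      if (key ~ /^(HD[123]|SSD)$/) {
        if (val ~ /^No /) { val = "No " key }    # e.g. "No HD3"
        else              { k = key }            # size follows on the next line
      }
      a[key] = val
      next
    }

    # ExtraIp: closes the block, so emit the collected fields.
    /^ExtraIp:/ {
      line = date
      for (j = 1; j <= n; j++) line = line "," a[f[j]]
      print line
      k = ""
      next
    }

    # Continuation line such as "(+1TB 7200 RPM SATA hard drive)".
    k != "" { a[k] = a[k] "," $0; k = "" }
  ' "$1"
}

# Possible use inside the find loop, with order_date computed as in the question:
#   parse_orders "$file" "$order_date" >> order_list.csv
```

Since the state is reset on each Processor: line and flushed on each ExtraIp: line, a file with several blocks yields one row per block. Lines with leading whitespace or Windows line endings are not handled in this sketch.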
Upvotes: 1