Reputation: 69
I'm trying to parse a simple HTML table of crime statistics from a police station (Garda is Irish for police) from a saved HTML document in a Java project. At the moment I am trying to parse the content of the HTML document and print it to the console. The issue I'm having is that I can only print the numbers in the table (excluding the years), but what I'm trying to achieve is to print the name of each crime from the table followed by its 6 figures.
Screenshot of the html table (Cannot embed the image as my reputation is too low)
HTML TABLE
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Recorded Crime Offences (Number) by Garda Station, Type of Offence and<BR>
Year</title>
</head>
<body>
<table border="">
<tbody><tr align="LEFT">
<th colspan="8">Recorded Crime Offences (Number) by Garda Station, Type of Offence and<br>
Year</th>
</tr>
<tr align="LEFT">
<th colspan="2"> </th>
<th valign="TOP" colspan="1">2011</th>
<th valign="TOP" colspan="1">2012</th>
<th valign="TOP" colspan="1">2013</th>
<th valign="TOP" colspan="1">2014</th>
<th valign="TOP" colspan="1">2015</th>
<th valign="TOP" colspan="1">2016</th>
</tr>
<tr align="RIGHT">
<th align="LEFT" valign="TOP" rowspan="12">Balbriggan, D.M.R. Northern Division</th>
<th align="LEFT">03 ,Attempts/threats to murder, assaults, harassments and related offences</th>
<td>96</td>
<td>89</td>
<td>70</td>
<td>97</td>
<td>103</td>
<td>103</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">04 ,Dangerous or negligent acts</th>
<td>72</td>
<td>67</td>
<td>50</td>
<td>53</td>
<td>45</td>
<td>43</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">05 ,Kidnapping and related offences</th>
<td>0</td>
<td>0</td>
<td>1</td>
<td>3</td>
<td>0</td>
<td>7</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">06 ,Robbery, extortion and hijacking offences</th>
<td>16</td>
<td>19</td>
<td>16</td>
<td>7</td>
<td>11</td>
<td>13</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">07 ,Burglary and related offences</th>
<td>177</td>
<td>190</td>
<td>157</td>
<td>140</td>
<td>151</td>
<td>139</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">08 ,Theft and related offences</th>
<td>510</td>
<td>466</td>
<td>495</td>
<td>542</td>
<td>445</td>
<td>302</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">09 ,Fraud, deception and related offences</th>
<td>66</td>
<td>76</td>
<td>126</td>
<td>114</td>
<td>98</td>
<td>66</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">10 ,Controlled drug offences</th>
<td>113</td>
<td>100</td>
<td>64</td>
<td>55</td>
<td>44</td>
<td>80</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">11 ,Weapons and Explosives Offences</th>
<td>22</td>
<td>18</td>
<td>13</td>
<td>10</td>
<td>19</td>
<td>17</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">12 ,Damage to property and to the environment</th>
<td>257</td>
<td>266</td>
<td>269</td>
<td>203</td>
<td>213</td>
<td>177</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">13 ,Public order and other social code offences</th>
<td>168</td>
<td>115</td>
<td>93</td>
<td>78</td>
<td>79</td>
<td>92</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">15 ,Offences against government, justice procedures and organisation of crime</th>
<td>45</td>
<td>48</td>
<td>39</td>
<td>39</td>
<td>66</td>
<td>50</td>
</tr>
<tr align="LEFT">
<td colspan="8"><a href="http://www.cso.ie/en/methods/crime/recordedcrime/">See Background Notes</a>
</td>
</tr>
</tbody></table>
</body></html>
The code I've come up with so far prints the numbers like so:
Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
... (Figures 11-66 omitted for conciseness)
Figure 67 : 48
Figure 68 : 39
Figure 69 : 39
Figure 70 : 66
Figure 71 : 50
However, I'd like the output to look more like this:
Crime: 03 ,Attempts/threats to murder, assaults, harassments and related offences
Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103
Crime: 04 ,Dangerous or negligent acts
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
etc, etc
I've tried a number of different approaches, such as adding a for loop that accesses the th element containing the crime name and then another that accesses the td elements containing the figures, but this usually results in an error like:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Working Parser Class
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseCrimeStatistics {
    public static void main(String[] args) {
        try {
            int count = 0;
            File input = new File("Balbriggan.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");
            Elements title = doc.select("td");
            for (Element sectd1 : title) {
                Elements ths = sectd1.select("td");
                String result = ths.get(0).text();
                System.out.println("Figure " + count + " : " + result);
                count++;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Would anyone have any suggestions as to how I might approach this problem? Thank you.
Upvotes: 1
Views: 577
Reputation: 1103
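Try this. It selects every th and td element in document order and pairs them up by index. The first nine th elements are headers (the table title, the blank cell, the six years and the station name), so the crime names start at index 9, and each crime is followed by six td figures. The snippet goes inside the existing try block in main.
int count = 0;
File input = new File("Balbriggan.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");
Elements numbers = doc.select("td"); // all figures, in document order
Elements titles = doc.select("th");  // header cells first, then the crime names
// Skip the 9 header th cells; the remaining th cells are the crime names.
for (int i = 9; i < titles.size(); i++) {
    System.out.println("Crime: " + titles.get(i).text());
    // Each crime row contributes 6 consecutive td cells.
    for (int j = 0; j < 6; j++) {
        System.out.println("Figure " + count + " : " + numbers.get((i - 9) * 6 + j).text());
        count++;
    }
}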
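The index arithmetic in the snippet above, (i-9)*6+j, amounts to cutting one flat list of figures into consecutive runs of six and attaching the k-th run to the k-th crime name. As a rough sketch of just that grouping step, here it is in plain Java with no jsoup involved. The class name GroupFigures and the helper groupByLabel are made up for illustration, and the sketch assumes the crime names are unique, since they are used as map keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupFigures {

    // Hypothetical helper: pairs the k-th label with the k-th run of
    // `perRow` consecutive figures. Labels are assumed to be unique.
    static Map<String, List<String>> groupByLabel(List<String> labels,
                                                  List<String> figures,
                                                  int perRow) {
        // LinkedHashMap keeps the labels in their original table order.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (int row = 0; row < labels.size(); row++) {
            // Clamp both ends so a short figures list yields a short
            // (possibly empty) run instead of an IndexOutOfBoundsException.
            int start = Math.min(row * perRow, figures.size());
            int end = Math.min(start + perRow, figures.size());
            grouped.put(labels.get(row), new ArrayList<>(figures.subList(start, end)));
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Two rows copied from the table in the question.
        List<String> labels = Arrays.asList(
                "04 ,Dangerous or negligent acts",
                "05 ,Kidnapping and related offences");
        List<String> figures = Arrays.asList(
                "72", "67", "50", "53", "45", "43",
                "0", "0", "1", "3", "0", "7");

        int count = 0;
        for (Map.Entry<String, List<String>> entry : groupByLabel(labels, figures, 6).entrySet()) {
            System.out.println("Crime: " + entry.getKey());
            for (String figure : entry.getValue()) {
                System.out.println("Figure " + count + " : " + figure);
                count++;
            }
        }
    }
}
```

With jsoup, the two input lists would presumably be filled from the text of the selected th elements (after skipping the header cells) and td elements. Clamping start and end is a design choice here: if the table layout changes and there are fewer figures than expected, the affected rows simply come back shorter rather than throwing.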
Upvotes: 2