Reputation: 1355
I'm needing some help with Sed. I'm using it on Windows and Mac OSX. I need to Sed to add a
</tr>
<tr>
every 4 lines, after the first <tr>
found, and stop doing it on </tr>
i Just can't find a way to doing this. Every file will have up to 20 tables, so i need to do it automatically...
changing from this
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
to this
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
Is it possible with sed
? If not, what tool should i use?
Thanks
Upvotes: 3
Views: 252
Reputation: 241928
Using xsh:
open :F html file ; # Open as html.
while //table/tr[count(td)>4] wrap :U position()=8 tr //table/tr/td ; # Wrap four td's into a tr.
xmove :r //table/tr/tr before .. ; # Unwrap the extra tr.
remove //table/tr[last()] ; # Remove the extra tr.
Upvotes: 0
Reputation: 241928
Perl solution, still using regular expression instead of parsing HTML:
perl -pe '
undef $inside if m{</tr>};
if ($inside and ($. % 4) == $tr_line) {
print "</tr>\n<tr>\n";
}
$inside = 1 if defined $tr_line;
$tr_line = ($. + 1) % 4 if /<tr>/;
' file
Upvotes: 1
Reputation: 2264
you can try with regular expressions. You can test following expression on: http://gskinner.com/RegExr/
Catch expression:
?</td>.<td>.*?</td>.<td>.*?</td>.<td>.*?</td>)(?!.</tr>)
Replace expression:
$1\n</tr>\n<tr>
Flags checked:
global, ignorecase, dotall
Result:
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
You can use editor like Notepad++ for batch replace on many files at once (syntax will be little different).
Upvotes: 1
Reputation: 17634
try awk
awk '{print}; /<td>/ && ++i==4 {print "</tr>\n<tr>"; i=0}' file
<td>
then increase i
i
is 4
print </tr><tr>
and reset i
Testing with given input the desired output is returned,
with the only "problem" that an extra <tr></tr>
appears at the end of the list.
This is fixable but I'm running out of time here.
When I get back I can look into it if you think it is needed.
... part of the end of the result file
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
<tr> <-- extra <tr></tr> here
</tr>
</table>
Upvotes: 1
Reputation: 36272
I don't like the idea of using sed
to handle HTML code. Said that, try with this:
Content of script.sed
:
## For every line between '<tr>' and '</tr>' do ...
/<tr>/,/<\/tr>/ {
## Omit range edges.
/<\/\?tr>/ b;
## Append '<td>...</td>' to Hold Space (HS).
H;
## Get HS to Pattern Space (PS) to work with it.
x;
## If there are at least four newline characters means that exists four
## '<td>' tags too, so add a '<tr>' before them and a '</tr>' after them,
## print, and delete them (already processed).
/\(\n[^\n]*\)\{4\}/ {
s/^\(\n\)/<tr>\1/;
s/$/\n<\/tr>/;
p
s/^.*$//;
}
## Save the '<td>'s to HS again and read next line.
x;
b;
}
## Print all lines out of the range.
p;
Assuming infile
with the data posted in the question, run the script like:
sed -nf script.sed infile
That yields:
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
Upvotes: 3