Mitra Patel

Reputation: 99

Parsing an HTML table with TFHpple in Xcode

I am trying to parse [www.neiu.edu/~neiutemp/PhoneBook/alpha.htm] using the TFHpple parser, and I want the first TD (first column) from every TR (row) in a table. All the TDs have the same attributes, so I can't differentiate them that way.
I am able to get all of the HTML, but I fail to get the first TD from each TR. After step // 3 in the code, tutorialsNodes is empty. The output of

NSLog(@"Nodes are : %@",[tutorialsNodes description]);

is

Practice1[62351:c07] Nodes are : ().

I can't see what's wrong. Any help would be appreciated. My code to parse this URL:

NSURL *tutorialsUrl = [NSURL URLWithString:@"http://www.neiu.edu/~neiutemp/PhoneBook/alpha.htm"];
NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];

// 2
TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];

// 3
NSString *tutorialsXpathQueryString = @"//TR/TD";
NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];
NSLog(@"Nodes are : %@",[tutorialsNodes description]);
// 4
NSMutableArray *newTutorials = [[NSMutableArray alloc] initWithCapacity:0];
for (TFHppleElement *element in tutorialsNodes) {
    // 5
    Tutorial *tutorial = [[Tutorial alloc] init];
    [newTutorials addObject:tutorial];

    // 6
    tutorial.title = [[element firstChild] content];

    // 7
    tutorial.url = [element objectForKey:@"href"];

    NSLog(@"title is: %@",[tutorial.title description]);
}

// 8
_objects = newTutorials;
[self.tableView reloadData];

Upvotes: 1

Views: 3171

Answers (1)

Rob

Reputation: 437682

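This should work if you use @"//tr/td" instead of @"//TR/TD". XPath name tests are case-sensitive, and as far as I know the libxml2 HTML parser that Hpple wraps reports element names in lowercase, so the uppercase query matches nothing.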

Looking at your HTML, though, since the author of that page apparently doesn't know how to spell CSS, you have font tags buried throughout the source. So your next line of code, which is obviously taken from the excellent Hpple tutorial by Matt Galloway on Ray Wenderlich's site, says:

tutorial.title = [[element firstChild] content];

But that won't work here, because for most of your entries, the firstChild is not the text, but rather it's a font tag. So you could check to see if it was a font tag like so:

TFHppleElement *subelement = [element firstChild];
if ([[subelement tagName] isEqualToString:@"font"])
    subelement = [subelement firstChild];
tutorial.title = [subelement content];

Or, you could instead just search for @"//tr/td/font" instead of @"//tr/td". Lots of approaches here. The trick (like all HTML parsing) is going to be to make it reasonably robust so you won't be susceptible to minor cosmetic tweaks of the page.

And obviously, your HTML doesn't have URLs there, so the tutorial.url line (step 7 in your code) isn't applicable here.

Anyway, I hope this is enough to get you going.


You report having issues, so I thought I'd just supply a more complete code sample:

NSURL *tutorialsUrl = [NSURL URLWithString:@"http://www.neiu.edu/~neiutemp/PhoneBook/alpha.htm"];
NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];

TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];

NSString *tutorialsXpathQueryString = @"//tr/td";
NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];

if ([tutorialsNodes count] == 0)
    NSLog(@"nothing there");
else
    NSLog(@"There are %lu nodes", (unsigned long)[tutorialsNodes count]); // count is NSUInteger, so cast for %lu

NSMutableArray *newTutorials = [[NSMutableArray alloc] initWithCapacity:0];
for (TFHppleElement *element in tutorialsNodes) {

    Tutorial *tutorial = [[Tutorial alloc] init];
    [newTutorials addObject:tutorial];

    TFHppleElement *subelement = [element firstChild];
    if ([[subelement tagName] isEqualToString:@"font"])
        subelement = [subelement firstChild];
    tutorial.title = [subelement content];

    NSLog(@"title is: %@", [tutorial.title description]);
}

That yields the following output:

2013-05-10 19:39:42.027 hpple-test[33881:c07] There are 10773 nodes
2013-05-10 19:39:42.028 hpple-test[33881:c07] title is: A
2013-05-10 19:39:46.027 hpple-test[33881:c07] title is: (null)
2013-05-10 19:39:46.698 hpple-test[33881:c07] title is: (null)
2013-05-10 19:39:47.333 hpple-test[33881:c07] title is: (null)
2013-05-10 19:39:47.827 hpple-test[33881:c07] title is: (null)
2013-05-10 19:39:48.358 hpple-test[33881:c07] title is: (null)
2013-05-10 19:39:49.133 hpple-test[33881:c07] title is: (null)
2013-05-10 19:39:49.775 hpple-test[33881:c07] title is: Abay, Hiwet B
2013-05-10 19:39:50.326 hpple-test[33881:c07] title is: H-Abay
2013-05-10 19:39:50.992 hpple-test[33881:c07] title is: 773-442-5140
2013-05-10 19:39:51.597 hpple-test[33881:c07] title is: (null)
2013-05-10 19:39:52.092 hpple-test[33881:c07] title is: Controller
2013-05-10 19:39:52.598 hpple-test[33881:c07] title is: E
2013-05-10 19:39:53.149 hpple-test[33881:c07] title is: 223
2013-05-10 19:39:55.040 hpple-test[33881:c07] title is: Abbruscato, Terence 
2013-05-10 19:39:55.806 hpple-test[33881:c07] title is: T-Abbruscato
2013-05-10 19:39:56.525 hpple-test[33881:c07] title is: 773-442-5339
...
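
The output above lists every cell in every row, whereas the question asks only for the first column. One way to narrow it down is an XPath positional predicate: td[1] matches a td only when it is the first td child of its tr. The following is a minimal sketch, not tested against this particular page; the helper name FirstColumnTitles is hypothetical (it is not part of Hpple), and it assumes the same font-tag handling as above applies to first-column cells.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper (not part of Hpple): returns the text of the first
// <td> in each <tr> of the supplied HTML, skipping cells with no text.
static NSArray *FirstColumnTitles(NSData *htmlData) {
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];

    // td[1] selects a <td> only if it is the first <td> child of its <tr>.
    NSArray *cells = [parser searchWithXPathQuery:@"//tr/td[1]"];

    NSMutableArray *titles = [NSMutableArray array];
    for (TFHppleElement *cell in cells) {
        // Same font-tag handling as above: step into <font> if present.
        TFHppleElement *child = [cell firstChild];
        if ([[child tagName] isEqualToString:@"font"])
            child = [child firstChild];

        NSString *text = [child content];
        if (text != nil)   // empty cells yield nil; leave them out
            [titles addObject:text];
    }
    return titles;
}
```

You could call it as FirstColumnTitles(tutorialsHtmlData) and build your Tutorial objects from the returned strings. Filtering out nil also removes the (null) entries seen in the log above.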

Upvotes: 2
