Revision as of 14:03, 17 July 2013 by hamishwillee (Talk | contribs)

HTML Page parsing using HTMLAgilityPack

From Nokia Developer Wiki
Featured Article
02 Jun
2013

This article explains how to parse the content of any HTML page using HTMLAgilityPack. HTMLAgilityPack is a .NET library that parses an HTML page into a DOM, which can then be queried to fetch the required values from the page's content.

Article Metadata
Tested with
SDK: Windows Phone SDK 8.0/7.1
Device(s): Nokia Lumia 920, Nokia Lumia 800
Article
Created: Vaishali Rawat (05 May 2013)
Last edited: hamishwillee (17 Jul 2013)

Introduction

In this article, we show how to scan and fetch the list of current Nokia Developer Champions from the Meet the Champions web page and display their details using native Windows Phone controls.

The screens in the example app are as shown below.

Prerequisites

To follow this article, the project needs references to HTMLAgilityPack.dll and the Windows Phone Toolkit. Both can be installed with the NuGet package manager: in Solution Explorer, choose Manage NuGet Packages, search for HTMLAgilityPack and wptoolkit, and install each of them.

References Required

The following DLL references are required.

  • HtmlAgilityPack
  • Microsoft.Phone.Controls.Toolkit
  • System.Xml.XPath

Note: The System.Xml.XPath DLL can be found in \Program Files\Microsoft SDKs\Silverlight\v4.0\Libraries\Client\

These can be added via Project | References | Add Reference.

UI

  • MainPage - The title tag of the target HTML page is used as the title of the screen. Below it is a heading named "Region" with a list picker containing the regions/continents available on the HTML page, followed by a heading named "Country" with a list picker whose country names depend on the selected region/continent. Finally, a list shows the names of the Nokia Developer Champions residing in the selected country.
  • DetailPage - The champion's name is used as the title of the screen, and below it we show the champion's details as available on the site.

Note: The detailed code can be found in the attached source code file.

Code Behind

Making the server request

In the constructor of the class in Mainclass.cs, we initialize our variables and start a server request to download the HTML page's content.

// private variables declared
private ObservableCollection<String> _regionsHeadingList, _americanCountries, _europeanCountries, _asiaPacificCountries, _middleEastCountries, _tempChampsList;
Dictionary<String, ObservableCollection<String>> _americanChampsCountry, _europeanChampsCountry, _asiaPacificChampsCountry, _middleEastChampsCountry;
Dictionary<String, String> _champsDetailInfo;
private static int _selContinentIndex;
private static String _selCountryName;

public void startServerRequest()
{
    HttpWebRequest httpReq = (HttpWebRequest)HttpWebRequest.Create(new Uri("https://www.developer.nokia.com/Community/Champions/Meet_the_champions.xhtml"));
    httpReq.BeginGetResponse(HTTPWebRequestCallBack, httpReq);
}

Callback handling

In the callback, we read the response into a stream and then load its text into our HTML parser. The code snippet is shown below.

private void HTTPWebRequestCallBack(IAsyncResult result)
{
    string strResponse = "";
    try
    {
        Deployment.Current.Dispatcher.BeginInvoke(() =>
        {
            try
            {
                HttpWebRequest httpRequest = (HttpWebRequest)result.AsyncState;
                WebResponse response = httpRequest.EndGetResponse(result);

                Stream stream = response.GetResponseStream();
                StreamReader reader = new StreamReader(stream);
                strResponse = reader.ReadToEnd();

                HtmlDocument htmlDocument = new HtmlDocument();
                htmlDocument.OptionFixNestedTags = true;
                htmlDocument.LoadHtml(strResponse);
                // rest of the code here
            }
            catch (Exception)
            {
                // error handling is not shown in this snippet (assumed; see the attached source)
            }
        });
    }
    catch (Exception)
    {
        // error handling is not shown in this snippet (assumed; see the attached source)
    }
}
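The response can also be read with the reader wrapped in a using block, so that the reader and the underlying stream are closed even if reading fails. The helper below is only a sketch and is not part of the original source; the names ResponseReader and ReadAllText are made up for illustration.

```csharp
using System.IO;

public static class ResponseReader
{
    // Reads the whole stream as text.
    // Disposing the StreamReader also closes the stream it wraps.
    public static string ReadAllText(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the callback, such a helper could be called as ResponseReader.ReadAllText(response.GetResponseStream()) before passing the text to LoadHtml().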

Before explaining the rest of the code, here is a brief look at the main HTMLAgilityPack functions we will use. The main function for reading a node's value is SelectSingleNode(), which takes an XPath expression as a string; SelectNodes() takes the same kind of expression and returns every matching node. As per the official docs, "XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps." So we have to provide the steps that lead to the target node. XPath has several expressions for this, e.g. / starts the search from the root node, and // selects matching nodes anywhere in the document. Details of all the expressions can be found in the XPath documentation.
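As a minimal sketch of these calls, the snippet below runs two XPath expressions against a small inline HTML string. The sample markup and the class name XPathDemo are made up for illustration; only HtmlDocument, LoadHtml, SelectSingleNode, SelectNodes and InnerText come from HtmlAgilityPack, and the package must be referenced for the code to compile.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical sample markup, loosely shaped like the page parsed in this article.
    public const string SampleHtml =
        "<html><head><title>Meet the Champions</title></head><body>" +
        "<h2 class='accordeonTitle'><span>Americas</span></h2>" +
        "<h2 class='accordeonTitle'><span>Europe</span></h2>" +
        "<h2 class='other'><span>Ignored</span></h2>" +
        "</body></html>";

    // "//title" matches a title element anywhere in the document.
    public static string ReadTitle(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        return title != null ? title.InnerText : null;
    }

    // [@class='...'] keeps only the h2 elements whose class attribute has that value.
    public static List<string> ReadRegions(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        List<string> regions = new List<string>();
        HtmlNodeCollection spans =
            doc.DocumentNode.SelectNodes("//h2[@class='accordeonTitle']/span");
        if (spans != null) // by default SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode span in spans)
                regions.Add(span.InnerText.Trim());
        }
        return regions;
    }
}
```

Because SelectNodes() can return null rather than an empty collection, the result is checked before it is iterated.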

Parsing the response

To show the HTML page's title in our example, we need to parse the HTML title tag. To parse it, we use the code:

// parsing page's title
HtmlAgilityPack.HtmlNode titleNode = htmlDocument.DocumentNode.SelectSingleNode("//title");
if (titleNode != null)
{
    txtBlockTitleOne.Text = titleNode.InnerText;
}

Then we need to list the available regions/continents. Looking at the HTML page, each continent name is in a span tag under an h2 tag, and the h2 tag has a class attribute with the value accordeonTitle. In XPath, attributes are addressed with the @ syntax. So, to parse the regions, the code snippet used is:

// parsing Regions/Continents types
// selects text of span tag under all h2 tags with class name = accordeonTitle
var regionsHeadings = htmlDocument.DocumentNode.SelectNodes("//h2[@class='accordeonTitle']/span/text()");

if (regionsHeadings != null)
{
    txtBlockRegionHeading.Visibility = System.Windows.Visibility.Visible;
    lstPickerRegion.Visibility = System.Windows.Visibility.Visible;

    for (int i = 0; i < regionsHeadings.Count; i++)
    {
        _regionsHeadingList.Add(regionsHeadings[i].InnerHtml.Trim());
    }
    this.lstPickerRegion.ItemsSource = _regionsHeadingList;
}

After getting the list of regions, we set their values in the List Picker made for regions/continents.

Now we need the list of countries and their champions. On scanning the HTML page, we see that all this information is under div tags whose class attribute has the value accordeonContent. All the country names are in h3 tags. After each h3, each champion's information is inside a p tag whose first child is a strong tag holding the champion's name. So, the outermost div tags are parsed as:

// selects all div tags with class name = accordeonContent
HtmlNodeCollection divContentHeadings = htmlDocument.DocumentNode.SelectNodes("//div[@class='accordeonContent']");
for (int i = 0; i < divContentHeadings.Count; i++)
{
    int countryIndex = -1;
    // start each region with a fresh temporary list
    _tempChampsList = new ObservableCollection<string>();
    // rest of the code here

There are 4 div tags with class = accordeonContent, one for each of the 4 continents/regions. We scan through each of them to get the details of each continent's champions.

HtmlNodeCollection childNodes = divContentHeadings[i].ChildNodes;
for (int j = 0; j < childNodes.Count; j++)
{
    if (childNodes[j].Name.Equals("h3"))
    {
        if (i == 0)
        {
            if (_tempChampsList != null && _tempChampsList.Count > 0)
                _americanChampsCountry.Add(_americanCountries[countryIndex], _tempChampsList);
        }
        else if (i == 1)
        {
            if (_tempChampsList != null && _tempChampsList.Count > 0)
                _europeanChampsCountry.Add(_europeanCountries[countryIndex], _tempChampsList);
        }
        else if (i == 2)
        {
            if (_tempChampsList != null && _tempChampsList.Count > 0)
                _asiaPacificChampsCountry.Add(_asiaPacificCountries[countryIndex], _tempChampsList);
        }
        else if (i == 3)
        {
            if (_tempChampsList != null && _tempChampsList.Count > 0)
                _middleEastChampsCountry.Add(_middleEastCountries[countryIndex], _tempChampsList);
        }

        countryIndex++;
        _tempChampsList = new ObservableCollection<string>();

        if (i == 0)
            _americanCountries.Add(childNodes[j].InnerText);
        else if (i == 1)
            _europeanCountries.Add(childNodes[j].InnerText);
        else if (i == 2)
            _asiaPacificCountries.Add(childNodes[j].InnerText);
        else if (i == 3)
            _middleEastCountries.Add(childNodes[j].InnerText);
    }

In the above code snippet, we check whether the current node is an h3 tag. If it is, we check the current region's/continent's index and insert the temporary champs list (if it exists and is not empty) into the corresponding Dictionary, where each record has the form <country name, list of champ names>. We insert it at this point because an h3 tag marks the start of a new country, so the list collected so far belongs to the previous country and is now complete. For the same reason, we re-initialize the _tempChampsList variable here. Finally, we insert the value of the h3 tag into the corresponding list of country names.

The champ's detail is parsed like:

// HasChildNodes guards against an empty p tag before its first child is read
else if (childNodes[j].Name.Equals("p") && childNodes[j].HasChildNodes && childNodes[j].ChildNodes[0].Name.Equals("strong"))
{
    _tempChampsList.Add(childNodes[j].ChildNodes[0].InnerText);
    _champsDetailInfo.Add(childNodes[j].ChildNodes[0].InnerText, childNodes[j].InnerText);
}

For each country name, we insert its champs' names into a temporary list. At the same time, we maintain a Dictionary variable whose records have the form <champ name, champ detail>.

Now all the champs' details are parsed except those of the last country of each continent. This is because country X's list is only inserted when country X+1 starts. So, to insert the last country's details, the code used is:

else if (childNodes[j].Name.Equals("div"))
{
    if (childNodes[j].GetAttributeValue("class", "default").Equals("buttonGroup"))
    {
        // to add last country's champs names
        if (i == 0)
        {
            if (_tempChampsList != null)
                _americanChampsCountry.Add(_americanCountries[countryIndex], _tempChampsList);
        }
        else if (i == 1)
        {
            if (_tempChampsList != null)
                _europeanChampsCountry.Add(_europeanCountries[countryIndex], _tempChampsList);
        }
        else if (i == 2)
        {
            if (_tempChampsList != null)
                _asiaPacificChampsCountry.Add(_asiaPacificCountries[countryIndex], _tempChampsList);
        }
        else if (i == 3)
        {
            if (_tempChampsList != null)
                _middleEastChampsCountry.Add(_middleEastCountries[countryIndex], _tempChampsList);
        }
    }

This is the same insertion code as before, except that it is triggered by a div node whose class attribute has the value buttonGroup, which follows the last country of each region.
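The grouping idea can also be sketched in isolation, with the HTML parsing stripped away. The version below is an alternative and not the article's implementation: each country's list is registered in the dictionary as soon as its heading is seen, so no special step is needed for the last country. The class name ChampGrouping, the method GroupByCountry and the (tag, text) pairs standing in for child nodes are all hypothetical.

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

public static class ChampGrouping
{
    // Each pair is (tag name, text) and stands in for one child node of a region's div.
    public static Dictionary<string, ObservableCollection<string>> GroupByCountry(
        IEnumerable<KeyValuePair<string, string>> nodes)
    {
        var champsByCountry = new Dictionary<string, ObservableCollection<string>>();
        ObservableCollection<string> current = null;

        foreach (KeyValuePair<string, string> node in nodes)
        {
            if (node.Key == "h3")
            {
                // New country: register its (still empty) list right away.
                current = new ObservableCollection<string>();
                champsByCountry[node.Value.Trim()] = current;
            }
            else if (node.Key == "p" && current != null)
            {
                // Champion entry: it lands in the list already stored in the dictionary.
                current.Add(node.Value.Trim());
            }
        }
        return champsByCountry;
    }
}
```

One difference from the code above is that a country without any champions ends up in the dictionary with an empty list instead of being skipped.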

Integrating response with UI

We have now parsed all the required data and saved it in variables. It's time to integrate it with the UI. The code that runs when the region is changed in the list picker is as below:

private void lstPickerRegion_SelectionChanged(object sender, SelectionChangedEventArgs e)
{
    // check for null before reading SelectedIndex
    if (lstPickerRegion != null && lstPickerRegion.SelectedIndex >= 0)
    {
        int selIndex = lstPickerRegion.SelectedIndex;
        txtBlockCountryHeading.Visibility = System.Windows.Visibility.Visible;
        lstPickerCountry.Visibility = System.Windows.Visibility.Visible;

        switch (selIndex)
        {
            case 0:
                _selContinentIndex = 0;
                if (_americanCountries != null)
                    lstPickerCountry.ItemsSource = _americanCountries;
                break;
            case 1:
                _selContinentIndex = 1;
                if (_europeanCountries != null)
                    lstPickerCountry.ItemsSource = _europeanCountries;
                break;
            case 2:
                _selContinentIndex = 2;
                if (_asiaPacificCountries != null)
                    lstPickerCountry.ItemsSource = _asiaPacificCountries;
                break;
            case 3:
                _selContinentIndex = 3;
                if (_middleEastCountries != null)
                    lstPickerCountry.ItemsSource = _middleEastCountries;
                break;
        }
    }
}

The code that runs when the country is changed in the list picker is as below:

private void lstPickerCountry_SelectionChanged(object sender, SelectionChangedEventArgs e)
{
    // check for null before reading SelectedIndex
    if (lstPickerCountry != null && lstPickerCountry.SelectedIndex >= 0)
    {
        int selIndex = lstPickerCountry.SelectedIndex;
        switch (_selContinentIndex)
        {
            case 0:
                _selCountryName = _americanCountries[selIndex].Trim();
                setChampsList();
                break;
            case 1:
                _selCountryName = _europeanCountries[selIndex].Trim();
                setChampsList();
                break;
            case 2:
                _selCountryName = _asiaPacificCountries[selIndex].Trim();
                setChampsList();
                break;
            case 3:
                _selCountryName = _middleEastCountries[selIndex].Trim();
                setChampsList();
                break;
            default:
                break;
        }
    }
}

setChampsList() is defined as:

private void setChampsList()
{
    ObservableCollection<String> aNamesList = getChampsListByCountry(_selContinentIndex, _selCountryName);
    if (aNamesList != null)
    {
        // wrap each name in the CommonList item type shown by the list
        ObservableCollection<CommonList> aChampsNamesList = new ObservableCollection<CommonList>();
        for (int i = 0; i < aNamesList.Count; i++)
        {
            aChampsNamesList.Add(new CommonList(aNamesList[i]));
        }

        champsList.Visibility = System.Windows.Visibility.Visible;
        champsList.ItemsSource = aChampsNamesList;
    }
}

private ObservableCollection<String> getChampsListByCountry(int aContinentIndex, String aCountryName)
{
    // pick the dictionary of the selected continent
    Dictionary<String, ObservableCollection<String>> aChampsCountry = null;
    switch (aContinentIndex)
    {
        case 0:
            aChampsCountry = _americanChampsCountry;
            break;
        case 1:
            aChampsCountry = _europeanChampsCountry;
            break;
        case 2:
            aChampsCountry = _asiaPacificChampsCountry;
            break;
        case 3:
            aChampsCountry = _middleEastChampsCountry;
            break;
    }

    // a single lookup is enough; the result stays null if the country is not found
    ObservableCollection<String> aChampsList = null;
    if (aChampsCountry != null && aCountryName != null)
        aChampsCountry.TryGetValue(aCountryName, out aChampsList);
    return aChampsList;
}
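The per-region switch statements could be avoided by keeping the region dictionaries in an array indexed by the region picker's index. The class below is a hedged sketch of that design and is not taken from the attached source; RegionLookup, Add and Find are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

public class RegionLookup
{
    // Index i holds the country -> champions map of region i,
    // in the same order as the entries of the region list picker.
    private readonly Dictionary<string, ObservableCollection<string>>[] _champsByRegion;

    public RegionLookup(int regionCount)
    {
        _champsByRegion = new Dictionary<string, ObservableCollection<string>>[regionCount];
        for (int i = 0; i < regionCount; i++)
            _champsByRegion[i] = new Dictionary<string, ObservableCollection<string>>();
    }

    public void Add(int regionIndex, string country, ObservableCollection<string> champs)
    {
        _champsByRegion[regionIndex][country] = champs;
    }

    // Returns null for an unknown region index or country.
    public ObservableCollection<string> Find(int regionIndex, string country)
    {
        if (regionIndex < 0 || regionIndex >= _champsByRegion.Length || country == null)
            return null;
        ObservableCollection<string> champs;
        _champsByRegion[regionIndex].TryGetValue(country, out champs);
        return champs;
    }
}
```

With such a structure, a call like Find(_selContinentIndex, _selCountryName) would do the work of the switch in getChampsListByCountry(), and adding a fifth region would not require new branches.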

On tapping any list box item, the detail page is opened.

Showing Champ's detail

In the ChampDetailPage.cs file, we simply read and store the values sent by the previous page. The code snippet is as below:

protected override void OnNavigatedTo(System.Windows.Navigation.NavigationEventArgs e)
{
    base.OnNavigatedTo(e);
    NavigationContext.QueryString.TryGetValue("champName", out _aChampName);
    NavigationContext.QueryString.TryGetValue("detail", out _champInfo);

    if (_aChampName != null)
        PageTitle.Text = _aChampName;

    if (_champInfo != null)
        txtDetail.Text = _champInfo;
}
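The sending side is not shown in this article. A minimal sketch of how the main page might build the navigation URI is given below; the page path /ChampDetailPage.xaml is an assumption based on the file name above, and DetailNavigation and BuildDetailUri are hypothetical names. The values are escaped because champion names and details can contain spaces, & or other characters that are not safe in a query string.

```csharp
using System;

public static class DetailNavigation
{
    // "/ChampDetailPage.xaml" is an assumed page path; adjust it to the project's layout.
    public static Uri BuildDetailUri(string champName, string detail)
    {
        // The keys match the ones read with QueryString.TryGetValue on the detail page.
        string query = "champName=" + Uri.EscapeDataString(champName) +
                       "&detail=" + Uri.EscapeDataString(detail);
        return new Uri("/ChampDetailPage.xaml?" + query, UriKind.Relative);
    }
}
```

In the list's tap handler, the result could then be passed to NavigationService.Navigate(), for example with the selected name and its entry from _champsDetailInfo.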

Build and Run

Now you may build the app and try to run it.

Summary

In this way, we can parse the contents of any HTML page using the HTMLAgilityPack DLL.

This page was last modified on 17 July 2013, at 14:03.