Simple HTML Page Scraping with Delphi

The idea of this article is to show you the techniques used to download a page from the Internet, do some page scraping, and finally present the information in a more "situation-friendly" manner.

The key to the data extraction methods described in this article is to convert the existing HTML document into a more "situation-friendly" form. These are the steps we'll be discussing:

  • Retrieval of HTML source documents
  • Processing the HTML document, removing the unneeded data
  • Transforming the result to string type variables
  • Displaying the information extracted in a TListView Delphi control

Preparing the Delphi HTML Scraping Project

To keep up with the article, I suggest you start Delphi and create a new project with one blank form. On this form, place one TButton (Standard palette) and one TListView (Win32 palette) component. Leave the default component names as Delphi suggests: Button1 for the button and ListView1 for the list view. You'll use Button1 to get the file from the Internet, do the information retrieval, and show the result in ListView1. Also, make sure to add four columns to ListView1: Title, URL, Description, and When/Where. The ViewStyle of ListView1 should be set to vsReport.
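If you prefer to set the list view up in code rather than through the Object Inspector, a minimal sketch could look like the one below. It assumes the default component names and a FormCreate handler wired to the form's OnCreate event; it is not part of the downloadable project.

procedure TForm1.FormCreate(Sender: TObject);
begin
  // switch the list view to report mode and add the four columns
  ListView1.ViewStyle := vsReport;
  with ListView1.Columns.Add do Caption := 'Title';
  with ListView1.Columns.Add do Caption := 'URL';
  with ListView1.Columns.Add do Caption := 'Description';
  with ListView1.Columns.Add do Caption := 'When/Where';
end;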

Retrieval of HTML source documents

Before we start extracting data from an HTML file, we need to make sure we have one locally.

Your first task is to create a Delphi function used to download a file from the Internet. One way of achieving this is to use the WinInet API calls. Delphi gives us full access to the WinInet API (wininet.pas), which we can use to connect to and retrieve files from any Web site that uses either the Hypertext Transfer Protocol (HTTP) or the File Transfer Protocol (FTP). I've already written an article that describes this technique: Get File From the Net. Another approach is to use the TDownloadURL object. TDownloadURL, defined in the ExtActns unit, is designed to save the contents of a specified URL to a file. Here's the code that uses TDownloadURL to download the "startup page" from this site.

uses ExtActns, ...; // add ExtActns to your unit's uses clause

function Download_HTM(const sURL, sLocalFileName: string): Boolean;
begin
  Result := True;
  with TDownLoadURL.Create(nil) do
  try
    URL := sURL;                // address of the page to download
    Filename := sLocalFileName; // local file to save the page to
    try
      ExecuteTarget(nil);       // perform the actual download
    except
      Result := False;          // any exception means the download failed
    end;
  finally
    Free;
  end;
end;

This function, Download_HTM, downloads the file at the URL given in the sURL parameter and saves it locally under the name given in sLocalFileName. The function returns True if it succeeds, and False otherwise. Of course, this function is to be called from the Button1 OnClick event handler; you can see the code below. Note that, locally, the file is saved as c:\temp_adp.newandhot.

procedure TForm1.Button1Click(Sender: TObject);
const
  ADPNEWHOTURL = 'http://delphi.about.com/cs/newandhot/index.htm';
  TmpFileName = 'c:\temp_adp.newandhot';
begin
  if not Download_HTM(ADPNEWHOTURL, TmpFileName) then
  begin
    ShowMessage('Error in HTML file download');
    Exit;
  end;
  {
  more code to be added
  }
end;

Now that we have the HTM page locally on disk, we can use the standard Delphi techniques for handling plain-text (ASCII) files.

Note: In the process of downloading a file, TDownloadURL periodically generates an OnDownloadProgress event, so that you can provide users with feedback about the process. I'll leave the full implementation to you; a rough sketch of such a handler is shown below.
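This is only a sketch, not the article's code: it assumes you turn Download_HTM into a method of TForm1 so the handler can be assigned (OnDownloadProgress := URLOnDownloadProgress;) before calling ExecuteTarget, and URLOnDownloadProgress is a name picked for this illustration. Double-check the event signature against the ExtActns.pas that ships with your Delphi version.

procedure TForm1.URLOnDownloadProgress(Sender: TDownLoadURL;
  Progress, ProgressMax: Cardinal; StatusCode: TURLDownloadStatus;
  StatusText: string; var Cancel: Boolean);
begin
  // simple user feedback while the page is being fetched
  Caption := Format('Downloading... %d of %d bytes', [Progress, ProgressMax]);
end;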

Processing the HTML document

It's important to keep in mind that the techniques described in this article are somewhat dated compared to newer, more "intelligent" information-retrieval approaches, such as converting HTML to XML and processing it with XSLT. If you do not know what I'm talking about, don't worry.

To successfully extract data from a web page, some kind of pattern matching (regular expressions or plain string searching) is required. In particular, this means you can only do page scraping if you know the structure of the HTML document. This is not a big problem if you are the one who creates the page. Even if you are not the person behind the page, you can still use pattern matching, but be sure to check your code from time to time: a page's structure can change frequently, for example because of banner-ad systems and dynamic server-side scripting.

In situations where pattern matching does not give good results, you can turn to more intelligent solutions, like transforming the HTML document to XML (a standard for marking up structured documents); however, that is not something we'll discuss here.

If you open the downloaded file with Notepad, you should notice that the information we want to extract is placed inside <h2> and <div class="bltxt"> tags (you can see one item below). After you extract that part, you have to make sure any server- or client-side scripting is excluded; such text usually appears between <script> tags. What remains is HTML code, with 10 items formatted like:
<h2><a href="/od/objectpascalide/a/speedsize.htm">Delphi Speed and Size: Top Tips</a></h2> <div class="bltxt"><a href="/od/objectpascalide/"><i>in Delphi Language</i></a> :: In many case the Delphi compiler will take care of the optimization. But that's just limited to aligning the code for pipelines, and some other small tweaks to the code. There is still much to be gained by taking into account how a computer works, and adapting your algorithm to that.
I'm not going to bother you with the project details here; be sure to download the entire code, and you'll have plenty to play with.
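Still, to give you the flavor of the pattern matching involved, here is a minimal sketch of the extraction step. It assumes each item has the structure shown above; ExtractBetween and ParseItem are helpers invented for this illustration, not part of the VCL or of the downloadable project.

// returns the text between StartTag and EndTag, or '' if either marker is missing
function ExtractBetween(const Source, StartTag, EndTag: string): string;
var
  iStart, iEnd: Integer;
begin
  Result := '';
  iStart := Pos(StartTag, Source);
  if iStart = 0 then Exit;
  iStart := iStart + Length(StartTag);
  iEnd := Pos(EndTag, Copy(Source, iStart, MaxInt));
  if iEnd = 0 then Exit;
  Result := Copy(Source, iStart, iEnd - 1);
end;

// pulls the title and the URL out of one item and adds a row to the list view
procedure TForm1.ParseItem(const sItem: string);
var
  sTitle, sURL: string;
begin
  sURL := ExtractBetween(sItem, '<a href="', '"');
  sTitle := ExtractBetween(sItem, '">', '</a>');
  with ListView1.Items.Add do
  begin
    Caption := sTitle;    // Title column
    SubItems.Add(sURL);   // URL column
  end;
end;

In the real project you would load the downloaded file (for example into a TStringList), split its text into the individual items, and call something like ParseItem for each of them, filling the Description and When/Where columns in the same way.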

Be sure to check the article describing how to develop a Delphi IDE add-on (expert) designed to help you get the new-and-hot listing without even leaving the Delphi IDE!
