Simple HTML page Scraping with Delphi

The idea of this article is to show you the techniques used to download a page from the Internet, do some page scraping, and finally present the information in a more "situation-friendly" manner. The key to the data extraction methods described here is converting the existing HTML document into a more "situation-friendly" source. The sections below cover preparing the project, retrieving the HTML source document, and processing it.
Preparing the Delphi HTML Scraping Project

To follow along with the article, I suggest you start Delphi and create a new project with one blank form. On this form place one TButton (Standard palette) and one TListView (Win32 palette) component. Leave the default component names as Delphi suggests: Button1 for the button and ListView1 for the list view. You'll use Button1 to get the file from the Internet, extract the information, and show the result in ListView1. Also, make sure to add four columns to ListView1: Title, URL, Description, When/Where. The ViewStyle of ListView1 should be set to vsReport.

Retrieval of HTML source documents

Before we start extracting data from an HTML file, we need to make sure we have one locally. Your first task is to create a Delphi function that downloads a file from the Internet. One way of achieving this is to use the WinInet API calls. Delphi gives us full access to the WinInet API (wininet.pas), which we can use to connect to and retrieve files from any Web site that uses either Hypertext Transfer Protocol (HTTP) or File Transfer Protocol (FTP). I've already written an article that describes this technique: Get File From the Net.

Another approach is to use the TDownloadURL object. The TDownloadURL object, defined in the ExtActns.pas unit, is designed for saving the contents of a specified URL to a file. To use it to download the "startup page" from this site, add ExtActns to the uses clause and wrap the object in a function named Download_HTM. This function downloads a file from the URL specified in the sURL parameter and saves it locally under the name given in sLocalFileName. The function returns True if it succeeds, False otherwise. Of course, this function is to be called from the Button1 OnClick event handler, TForm1.Button1Click. Note that, locally, the file is saved as c:\temp_adp.newandhot.
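A minimal sketch of a Download_HTM function matching the description above, written as a console program so that it stands alone. It assumes that ExecuteTarget raises an exception when the download fails, which is what the True/False contract described above implies. The PageURL constant is a placeholder, not the address used by the article. In Delphi XE2 and later the unit is named Vcl.ExtActns.

```delphi
program DownloadSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  ExtActns; // declares TDownloadURL (Vcl.ExtActns in newer Delphi versions)

// Downloads sURL and saves it as sLocalFileName.
// Returns True on success, False if the download raised an exception.
function Download_HTM(const sURL, sLocalFileName: string): Boolean;
var
  Downloader: TDownloadURL;
begin
  Result := True;
  Downloader := TDownloadURL.Create(nil);
  try
    Downloader.URL := sURL;
    Downloader.Filename := sLocalFileName;
    try
      Downloader.ExecuteTarget(nil);
    except
      // Assumption: a failed download surfaces as an exception.
      Result := False;
    end;
  finally
    Downloader.Free;
  end;
end;

const
  PageURL = 'http://www.example.com/';   // placeholder URL (assumption)
  TmpFileName = 'c:\temp_adp.newandhot'; // local file name from the article

begin
  if Download_HTM(PageURL, TmpFileName) then
    Writeln('Saved to ', TmpFileName)
  else
    Writeln('Download failed');
end.
```

In the form-based project, the same call would sit inside TForm1.Button1Click, with the parsing step running only when Download_HTM returns True.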
Note: In the process of downloading a file, TDownloadURL periodically generates an OnDownloadProgress event, so that you can provide users with feedback about the process. I'll leave this for you to implement.

Now that we have the HTM page locally on the disk, we can use the techniques for handling ASCII files from Delphi code.

Processing the HTML document

It's important to keep in mind that the techniques described in this article are somewhat outdated compared with "new" intelligent information retrieval techniques, such as converting HTML to XML and processing it with XSLT. If you do not know what I'm talking about, don't worry.

If you open the downloaded file in Notepad, you should notice that the information we want to extract is placed between a pair of marker tags. After you extract that part, you have to make sure any server-side or client-side scripting is excluded; such text usually appears between script tags. What remains is HTML code, with 10 items formatted like:

<h2><a href="/od/objectpascalide/a/speedsize.htm">Delphi Speed and Size: Top Tips</a></h2>
<div class="bltxt"><a href="/od/objectpascalide/"><i>in Delphi Language</i></a> :: In many case the Delphi compiler will take care of the optimization. But that's just limited to aligning the code for pipelines, and some other small tweaks to the code. There is still much to be gained by taking into account how a computer works, and adapting your algorithm to that.

I'm not going to bother you with the project details here; be sure to download the entire code, you'll have plenty to play with.

Be sure to check the article describing how to develop a Delphi IDE add-on, an expert designed to help you get the new and hot listing without even leaving the Delphi IDE!
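The extraction step described above can be sketched with plain string searching. This is only an illustration of the idea, not the code from the downloadable project: TextBetween is a hypothetical helper, and the Sample constant is the start of the item quoted above rather than a whole downloaded page.

```delphi
program ScrapeSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils; // StrUtils provides PosEx

// Hypothetical helper: finds the text between StartMark and EndMark,
// searching Source from FromPos. On success, FromPos is moved just past
// EndMark so that the next call continues with the following item.
function TextBetween(const Source, StartMark, EndMark: string;
  var FromPos: Integer; out Value: string): Boolean;
var
  StartAt, EndAt: Integer;
begin
  Result := False;
  Value := '';
  StartAt := PosEx(StartMark, Source, FromPos);
  if StartAt = 0 then
    Exit;
  Inc(StartAt, Length(StartMark));
  EndAt := PosEx(EndMark, Source, StartAt);
  if EndAt = 0 then
    Exit;
  Value := Copy(Source, StartAt, EndAt - StartAt);
  FromPos := EndAt + Length(EndMark);
  Result := True;
end;

const
  // First part of the sample item quoted in the article.
  Sample =
    '<h2><a href="/od/objectpascalide/a/speedsize.htm">' +
    'Delphi Speed and Size: Top Tips</a></h2> ' +
    '<div class="bltxt"><a href="/od/objectpascalide/">' +
    '<i>in Delphi Language</i></a>';

var
  Cursor: Integer;
  Url, Title: string;
begin
  Cursor := 1;
  // Each item starts with <h2><a href="URL">TITLE</a>; read both parts
  // and repeat until no further item is found.
  while TextBetween(Sample, '<h2><a href="', '"', Cursor, Url) and
        TextBetween(Sample, '>', '</a>', Cursor, Title) do
    Writeln(Title, ' -> ', Url);
end.
```

For the sample above this prints one line, "Delphi Speed and Size: Top Tips -> /od/objectpascalide/a/speedsize.htm". In the real project the source string would come from the downloaded file, for example through TStringList.LoadFromFile and its Text property, and each Title and Url pair would become a row in ListView1 instead of a console line.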
©2008 About, Inc., A part of The New York Times Company. All rights reserved.



