Author
3 Jul 2006 11:48 AM
Aristotelis Pitaridis
I am trying to extract all the links and image URLs from an HTML file. I
tried reading the file byte by byte to detect the URLs, but that did not
work because JavaScript and other content in the HTML file caused problems.
Is there a class that handles this kind of data extraction reliably?



Aristotelis

Author
3 Jul 2006 3:49 PM
Jared Parsons [MSFT]
Hello Aristotelis,

> I am trying to extract all the links and the image URLs from an HTML
> file. I tried to read byte - byte the information in order to detect
> the URLs but it did not work because some JavaScript or other
> information in the HTML file caused problems. Is there any class which
> works ok with this kind of data extraction?

You could load the page into System.Windows.Forms.HtmlDocument.  It has a
GetElementsByTagName method that you can use to get all of the links.

--
Jared Parsons [MSFT]
jared***@online.microsoft.com
All opinions are my own. All content is provided "AS IS" with no warranties,
and confers no rights.
Author
3 Jul 2006 5:47 PM
Aristotelis Pitaridis
I tried it, but the System.Windows.Forms.HtmlDocument class does not have a
public constructor. How can I point it at the URL of the page so that I can
collect this information?



Aristotelis



Author
3 Jul 2006 8:16 PM
Jared Parsons [MSFT]
Hello Aristotelis,

> I tried it but the System.Windows.Forms.HtmlDocument object does not
> have a constructor. How can I set the URL of the page in order to
> collect the various information?

It looks like you'll have to create an instance of the WebBrowser control.
That will give you access to the underlying HtmlDocument which you can then
query. 

--
Jared Parsons [MSFT]
jared***@online.microsoft.com
All opinions are my own. All content is provided "AS IS" with no warranties,
and confers no rights.
Author
4 Jul 2006 6:08 AM
Aristotelis Pitaridis
I think JavaScript will be a problem. If I load a page that shows a
JavaScript alert message box, the whole process stops and the user sees
that window on the screen. Is there a way to disable JavaScript execution
for a WebBrowser control?

Aristotelis

Author
4 Jul 2006 8:48 AM
intolerance
http://www.regular-expressions.net/examples.html

This has a great tutorial about grabbing HTML tags.

-Allen
Author
3 Jul 2006 8:44 PM
Cor Ligthert [MVP]
Aristotelis,

Site owners (me included) use that JavaScript to prevent things such as
spamming. You want us to give you a method to get around it and to post it
on this board. Even if I knew one, the answer would be: no way.

:-)

Cor

Author
4 Jul 2006 12:38 PM
Herfried K. Wagner [MVP]
"Aristotelis Pitaridis" <pitari***@hotmail.com> wrote:
>I am trying to extract all the links and the image URLs from an HTML file.
>I tried to read byte - byte the information in order to detect the URLs but
>it did not work because some JavaScript or other information in the HTML
>file caused problems. Is there any class which works ok with this kind of
>data extraction?

I suggest using an HTML parser instead of regular expressions for this
purpose:

Parsing an HTML file:

MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>

- or -

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

- or -

SgmlReader 1.4
<URL:http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>
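
To illustrate why a parser copes with the JavaScript problem where a
byte-by-byte scan does not, here is a minimal sketch. It uses Python's
standard html.parser module purely to keep the example short and runnable;
it is not one of the .NET libraries listed above, and the names
LinkAndImageCollector and extract_urls are made up for this example. The
parser treats the body of a <script> element as plain text, so markup
written inside JavaScript strings is not reported as a tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageCollector(HTMLParser):
    """Collects href values of <a> tags and src values of <img> tags."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # used to resolve relative URLs
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower-cased.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(urljoin(self.base_url, attributes["href"]))
        elif tag == "img" and attributes.get("src"):
            self.images.append(urljoin(self.base_url, attributes["src"]))


def extract_urls(html_text, base_url=""):
    """Return (links, images) found in html_text (hypothetical helper)."""
    collector = LinkAndImageCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links, collector.images


if __name__ == "__main__":
    page = (
        "<html><head><script>"
        "document.write('<a href=\"fake.html\">');"
        "</script></head><body>"
        "<a href=\"a.html\">A</a><img src=\"/img/logo.png\">"
        "</body></html>"
    )
    links, images = extract_urls(page, "http://example.com/dir/page.html")
    print(links)
    print(images)
```

The link written by document.write is skipped because the script is never
executed; whether that is what you want depends on the pages you process.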

If the file read is in XHTML format, you can use the classes contained in
the 'System.Xml' namespace for reading information from the file.
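
The same idea for the XHTML case, again sketched in Python rather than with
the 'System.Xml' classes: because well-formed XHTML is XML, a plain XML
parser is enough. The function name extract_urls_from_xhtml is hypothetical,
and the sketch assumes the document declares the standard XHTML namespace
and contains no HTML-only entities such as &nbsp;, which a plain XML parser
rejects unless the DTD is available.

```python
import xml.etree.ElementTree as ET

# Elements in an XHTML document live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_urls_from_xhtml(xhtml_text):
    """Return (links, images) from a well-formed XHTML string."""
    root = ET.fromstring(xhtml_text)
    links = [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]
    images = [i.get("src") for i in root.iter(XHTML_NS + "img") if i.get("src")]
    return links, images
```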

--
M S   Herfried K. Wagner
M V P  <URL:http://dotnet.mvps.org/>
V B   <URL:http://classicvb.org/petition/>