Regular Expression to Parse HTML

Does anyone have a regex pattern to parse HTML from a stream?

I have a well structured file, where each line is of the form

<sometag someattribute='attr'>text</sometag>

for example

<SPAN CLASS='myclass'>A bit of text</SPAN>, or
Just some text, without tags

What I would like to be able to do is parse each line so that I get an array like this

SPAN
CLASS
myclass
A bit of text

or

Just some text, without tags

The array bit should follow, but I don't profess to be a regex expert (or any kind of expert for that matter). Can anyone help with a suitable pattern?

TIA

Charles

Is this useful for you?
http://regexplib.com/REDetails.aspx?regexp_id=520

Galin Iliev
MCSD, MCAD.NET

Hi Galin
Thanks for the link. It looks like it ought to work, but when I test it against even a simple tag it returns no matches. I tried verifying the expression with Expresso and it gives the following error:

Reference to undefined group number 5.

Even when I test it using the facility on the web site it fails. Any idea how to correct it?

Charles
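The message Expresso reports is the one the .NET regex parser raises when a pattern contains a numbered backreference such as \5 but has fewer capturing groups than that. The thread does not show how the library pattern was entered, so the cause here can only be guessed at. One way to end up in that state is to switch on ExplicitCapture, which stops plain parentheses from capturing. The sketch below uses a made-up pattern (not the regexplib one) and a hypothetical helper, DescribePatternError, purely to show the effect.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternCheck

    ' Hypothetical helper: returns Nothing when the pattern compiles,
    ' otherwise the message from the regex parser.
    Function DescribePatternError(pattern As String, options As RegexOptions) As String
        Try
            Dim unused As New Regex(pattern, options)
            Return Nothing
        Catch ex As ArgumentException
            Return ex.Message
        End Try
    End Function

    Sub Main()
        ' Illustrative pattern only: \2 refers back to the second numbered group.
        Dim pattern As String = "(<)(\w+)>.*</\2>"

        ' With default options both parentheses capture, so \2 is defined.
        Console.WriteLine(DescribePatternError(pattern, RegexOptions.None) Is Nothing)

        ' With ExplicitCapture plain parentheses no longer capture, so \2
        ' points at a group that does not exist and the constructor throws.
        Console.WriteLine(DescribePatternError(pattern, RegexOptions.ExplicitCapture))
    End Sub

End Module
```

If the library pattern relies on numbered groups, checking which options the test tool has switched on may be a quicker first step than editing the pattern itself.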
"Charles Law" <bl***@nowhere.com> wrote:
> Does anyone have a regex pattern to parse HTML from a stream?
>
> I have a well structured file, where each line is of the form
>
> <sometag someattribute='attr'>text</sometag>
>
> for example
>
> <SPAN CLASS='myclass'>A bit of text</SPAN>, or
> Just some text, without tags
>
> What I would like to be able to do is parse each line so that I get an
> array like this
>
> SPAN
> CLASS
> myclass
> A bit of text

Maybe it's easier to use the HTML Agility Pack:

.NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:
<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

--
M S   Herfried K. Wagner
M V P  <URL:http://dotnet.mvps.org/>
V B   <URL:http://classicvb.org/petition/>

Hi Herfried
It's not my lucky day today for getting things to work. When I try to open the AgilityPack solution I get two errors:

Unable to open project HtmlDomView
Unable to open project GetBinaryRemainder

When I try to run it, it comes up with 12 compile errors, one of which is a cryptographic failure!! It seems that HtmlAgilityPack.snk is missing too.

Charles

There's an example of just that in my article on the new VBRUN site here:
http://msdn.microsoft.com/vbrun/vbfusion/5000classes/

The expression I used is:

("(?<=href\s*=\s*[""']).*?(?=[""'])")

--
Scott Swigart - MVP
http://blog.swigartconsulting.com

Hi Scott
It looks like this would specifically decode hrefs, and so if I wanted to decode another tag I would need to change the expression. To decode many different tags I would need to generate multiple expressions and test against each; please correct me if I have misunderstood.

What I am hoping for is a generic expression that will decode all tags that conform to the general html format. I realise that this would also decode tags that are not valid html, but this would not matter as I have control over the file and what is in it.

Charles

Charles,
Maybe I can point you to a library called MSHTML. It is not the nicest one, but it is very good for filtering tags out of a document by looping through it tag by tag, something like this (pDocuments here is a collection of documents).

\\\
For Each iDocument As mshtml.IHTMLDocument2 In pDocuments
    ' Walk every element in the document
    For i As Integer = 0 To iDocument.all.length - 1
        Dim hrefname As String
        Dim hElm As mshtml.IHTMLElement = DirectCast(iDocument.all.item(i), mshtml.IHTMLElement)
        Dim tagname As String = hElm.tagName.ToLower
        If (tagname = "a") Or (tagname = "chk") Then
            ' Only read the href when the element actually has one
            If Not DirectCast(hElm, mshtml.IHTMLAnchorElement).href Is Nothing Then
                hrefname = DirectCast(hElm, mshtml.IHTMLAnchorElement).href.ToString
            End If
        End If
        etc etc
///

In this newsgroup I mostly leave the answers about this to somebody who, by coincidence, has the same name as you; he has been busy with it much longer and more actively than I have.

Maybe you can search for his answers.

:-)))))

Cor

Now why didn't I think of that ;-) I shall look this fellow up, of whom you speak, and see what he has to say on the matter.

I have now got the Agility Pack working. It is somewhat smaller than mshtml and, I suspect, quicker. It's actually quite good, and may well be better than the regex idea; especially since I don't currently have a regex that works!

I had thought that, for a large file, regex would be quicker than mshtml, but I have no actual evidence of that. Conversely, though, I think that the Agility Pack will be every bit as quick as a regex, if not quicker. Anyway, it works, which is the main thing.

Charles

Charles,

"Charles Law" <bl***@nowhere.com> wrote:
> I have now got the Agility Pack working. It is somewhat smaller than
> mshtml and, I suspect, quicker.

I am glad to hear that you finally got the Agility Pack to work :-).

--
M S   Herfried K. Wagner
M V P  <URL:http://dotnet.mvps.org/>
V B   <URL:http://classicvb.org/petition/>

Charles Law wrote:
> I have a well structured file, where each line is of the form
>
> <sometag someattribute='attr'>text</sometag>
>
> for example
>
> <SPAN CLASS='myclass'>A bit of text</SPAN>, or
> Just some text, without tags
>
> What I would like to be able to do is parse each line so that I get an array
> like this
>
> SPAN
> CLASS
> myclass
> A bit of text
>
> or
>
> Just some text, without tags

Assuming it's always attrib='value', and never attrib="value",

// ExplicitCapture | Multiline | IgnorePatternWhitespace

^
(
  < (?<tag>\w+) \s+
  (?<attribute>\w+) \s* = \s* ' (?<value>[^']*) ' \s* >
  (?<text>.*) </ \k<tag> >
) .*
|
(?<bare_text> .+)
$

--

www.midnightbeach.com

Hi Jon
As with my reply to an earlier response, it looks like the expression you have given is specific to a given tag and attribute (unless I have misunderstood the syntax), whereas I am looking for something to parse _any_ tag and attribute. Although the tags I am parsing are limited in number, it would still be too onerous to create multiple expressions to compare with.

Thanks for the suggestion.

Charles

Charles Law wrote:
> As with my reply to an earlier response, it looks like the expression you
> have given is specific to a given tag and attribute (unless I have
> misunderstood the syntax), whereas I am looking for something to parse _any_
> tag and attribute. Although the tags I am parsing are limited in number, it
> would still be too onerous to create multiple expressions to compare with.

You misread. (?<attribute>\w+) etc. captures whatever it matches into the named group "attribute" - it doesn't match the literal text "attribute".

You should try it. I spent five minutes writing it for you for free.

--

www.midnightbeach.com

Jon
I apologise if I appeared dismissive of your efforts. I have tried it with

<SPAN CLASS='result'>Hello world</SPAN>

and it collects elements perfectly. I tried it with

<SPAN>Hello world</SPAN>

and it collects everything in bare_text. Is there a way to make it still collect in the designated fields?

Thanks again.

Charles

Charles Law wrote:
> I apologise if I appeared dismissive of your efforts. I have tried it with
>
> <SPAN CLASS='result'>Hello world</SPAN>
>
> and it collects elements perfectly. I tried it with
>
> <SPAN>Hello world</SPAN>
>
> and it collects everything in bare_text. Is there a way to make it still
> collect in the designated fields?

Of course. But you said everything would look like

<sometag someattribute='attr'>text</sometag>

or bare text. Try

#[ExplicitCapture|Multiline|IgnorePatternWhitespace]

^
(
  <
  (?<tag>\w+)
  (\s+ (?<attribute>\w+) \s* = \s* ' (?<value>[^']*) ' )? \s* >

  (?<text>.*) </ \k<tag> >
) .*
|
(?<bare_text> .+)
$

--

www.midnightbeach.com

Jon
As we say in these parts, you know stuff.

Thanks muchly.

Charles

Charles,
In addition to the other comments:

Rather than attempt to coerce Regex into parsing HTML, have you considered using an HTML parser/reader such as the SgmlReader?

http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

Hope this helps
Jay

Hi Jay
I have just had a look at the link, and it is similar, I think, to the Agility Pack. Now that I have the Agility Pack working I am going to try and make that work for me, unless a regex comes up. I think the code to use a regex would be shorter/simpler, but of course that does not necessarily equate with speed, and that is my overriding concern (well, that and reliability, of course).

Charles

> I have a well structured file

If you can guarantee that the file will always be well-formed, you can use the System.Xml namespace classes to do the parsing for you, i.e. XmlReader / XmlWriter / XmlDocument or any of the XPath readers/writers/document.
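As a rough illustration of the System.Xml suggestion, here is a minimal sketch that reads one line with XmlReader and collects tag name, attribute name, attribute value and text into a list. It assumes .NET 2.0 or later (XmlReader.Create, List(Of T)); the names LineParser and ParseLine are made up for the example, and the rule that a line not starting with "<" is bare text is an assumption taken from the file format described at the top of the thread.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module LineParser

    ' Hypothetical helper: splits one line into tag, attribute name/value
    ' pairs and text, or returns the line itself when it has no markup.
    Function ParseLine(line As String) As List(Of String)
        Dim parts As New List(Of String)()

        If Not line.TrimStart().StartsWith("<") Then
            parts.Add(line)                  ' bare text, nothing to parse
            Return parts
        End If

        Using reader As XmlReader = XmlReader.Create(New StringReader(line))
            reader.MoveToContent()           ' position on the element itself
            parts.Add(reader.Name)           ' tag name
            While reader.MoveToNextAttribute()
                parts.Add(reader.Name)       ' attribute name
                parts.Add(reader.Value)      ' attribute value
            End While
            reader.MoveToElement()           ' back from the attributes to the element
            parts.Add(reader.ReadElementContentAsString())
        End Using

        Return parts
    End Function

    Sub Main()
        For Each part As String In ParseLine("<SPAN CLASS='myclass'>A bit of text</SPAN>")
            Console.WriteLine(part)
        Next
        For Each part As String In ParseLine("Just some text, without tags")
            Console.WriteLine(part)
        Next
    End Sub

End Module
```

For the sample line this should yield SPAN, CLASS, myclass and A bit of text, in that order. A line that is not well-formed XML (an unclosed tag, an unquoted attribute value) makes XmlReader throw an XmlException, which is why the well-formedness guarantee matters.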
--
Dave Sexton
d***@www..jwaonline..com

Hi Dave
Actually, you have hit on something there. I write the file in the first place as HTML, but I could write it as XML, but use HTML tags. I would then have the right class structure to read it back in. Marvellous. It pays to look outside the box.

Thanks.

Charles

Charles,
| but I could write it as XML, but use HTML tags.

That would be XHTML ;-)

If you are writing the files, then this may be the way to go.

Hope this helps
Jay

Hi Jay
You won't be surprised to hear that this is a continuing theme.

Once upon a time, there was RTF, but it was slow, and the people wept, for it was very, very slow, and they got very, very bored waiting.

So, the developer chappie considered the many possible alternatives, and decided to simplify the whole thing by invoking the minor devil known as the listview. But the users came back and said, "but we liked the rich text box, because it had colours and stuff".

And the developer said, "you have colours, what are you complaining about; the listview is every bit as colourful, and quicker to boot, it just doesn't retain the colours when you save and reload".

And then he added, "you are lucky to have anything at all, so just be grateful", but he went away thinking that he had somehow done the users a disservice.

So, anyway, he came up with the idea of saving the output as html, so that it could be opened by the great God Microsoft Word; oh, and some browser thingy called IE.

But then there was the dilemma: how to load it back into the application with colour, as the users had become used to. And it was then that Regular Expression came to the developer one night in a dream. But he knew little of the Regular Expression, so he sought help from the great developers in the sky. And they said, try this ... no, try this ... and he tried it, and it worked; sort of.

But by this time, the developer had grown weary, and also his calculating machine had become defective because he had done some re-installing and it had mucked up his debugger, and it took him a day-and-a-half to put it right. So, by Sunday evening he was really very weary indeed, and then some.

Finally, a door opened, and a bright light shone in. The developer tried some stuff, and it worked. He wrote a set of classes to serialise and de-serialise an html class, which looked remarkably like real html, which is apparently something called xhtml.

So, now we are back in the present. The story is nearly at its end. The developer just needs some sleep (and the love of a good women), and all will be right with the world.

And so, to sleep, perchance to dream, ay there's the rub.

Charles

Charles,
| So, now we are back in the present. The story is nearly at its end. The
| developer just needs some sleep (and the love of a good women), and all will
| be right with the world.

Can't really help you on either of those... Other than wishing you luck in those areas...

This question and the "Easiest way to generate XML in VB.NET" post remind me of Item #29, "Always Use a Parser", from Elliotte Rusty Harold's book "Effective XML - 50 Specific Ways to Improve Your XML" (Addison Wesley), which lists a number of other reasons to use a parser. Although Item #29 is largely about reading, I find the topic apropos to writing also. Hence my suggestion, without realizing the connection, of using either the SgmlReader or XHTML...

Hope this helps
Jay

I have just spotted a Freudian slip
> | So, now we are back in the present. The story is nearly at its end. The
> | developer just needs some sleep (and the love of a good wom*e*n), and all

Maybe there is something going on in my head that I don't know about ...
wouldn't be the first time.

I don't see any specific support for XHTML in .NET, unless it goes by
another name. I have my solution, using the XmlSerializer to serialise and
de-serialise a class hierarchy that resembles the html document I want to
manipulate. It requires that I name the classes quite carefully, and there
are some things that I cannot readily do, such as put comments
(<!-- -->) into a STYLE tag, but it works.

Have I missed a trick with this XHTML?

Charles

Charles,
| I don't see any specific support for XHTML in .NET
There is no specific support per se.

XHTML is HTML tags in an XML document.

Ergo the XHTML support in .NET is the classes in the System.Xml namespace,
such as the XmlSerializer. XmlSerializer directly or indirectly uses a
System.Xml.XmlWriter to write XML output. In other words it follows Item #29
& uses a "parser".

Hope this helps
Jay

Thanks for clearing that up. I think I have probably done the best with it then

Cheers

Charles

Charles,
NOTE: The SgmlReader I mentioned in my earlier post allows you to treat any
HTML as XML.

Hope this helps
Jay
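The "use a parser" advice running through this thread can be sketched in a few lines. The thread contains no code, so this is only an illustration in Python, with the standard library's xml.etree.ElementTree standing in for the System.Xml classes discussed above. The helper names parse_line and write_line are made up for the example, and the sketch assumes each tagged line is a single well-formed element, as described in the original question. It does not attempt what SgmlReader does, i.e. coping with HTML that is not well-formed.

```python
import xml.etree.ElementTree as ET


def parse_line(line):
    """Hypothetical helper: split one line into its parts.

    A tagged line yields [tag, attr_name, attr_value, ..., text];
    a plain line yields a single-item list holding the text.
    """
    stripped = line.strip()
    if not stripped.startswith("<"):
        return [stripped]
    try:
        element = ET.fromstring(stripped)
    except ET.ParseError:
        # Assumption: a line that is not well-formed XML is kept as raw text.
        return [stripped]
    parts = [element.tag]
    for name, value in element.attrib.items():
        parts.extend([name, value])
    parts.append(element.text or "")
    return parts


def write_line(tag, attributes, text):
    """Hypothetical helper: serialise one element through the library,
    so characters such as < and & in the text are escaped for us."""
    element = ET.Element(tag, attributes)
    element.text = text
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(parse_line("<SPAN CLASS='myclass'>A bit of text</SPAN>"))
    print(parse_line("Just some text, without tags"))
    print(write_line("SPAN", {"CLASS": "myclass"}, "1 < 2 & 3"))
```

For the example line from the original question, parse_line returns the four items asked for: SPAN, CLASS, myclass, A bit of text. Writing through the same library rather than concatenating strings is what keeps the saved file well-formed enough to be read back in, which is the point of applying Item #29 to writing as well as reading.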