A Question About Regular Expressions and Capture

I am using regular expressions and a particular feature called "capture" (I think) to suck some information out of some HTML. I could never have come up with this myself, but Balena has an example which is very similar to this. The guts of the program is ...

    Dim i As Integer
    Dim rgx As Regex

    Dim Pattern As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
        "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}(?<value>.+)(</b>){0,1}</td>"

    Dim Pattern2 As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
        "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}((?<value>.+))(</b>){0,1}</td>" ' extra parentheses don't help

    rgx = New Regex(Pattern)
    tbxPattern.Text = Pattern

    Dim m As Match, g As Group
    For Each m In rgx.Matches(tbxInput.Text)
        g = m.Groups("variable")
        lstbxKeys.Items.Add(g.Value)
        g = m.Groups("value")
        lstbxValues.Items.Add(g.Value)
    Next

The data looks like this (below). It works fine for all cases except the first (the "Celular" data), where the value is picked up as "123-abc-5678</b>". I want, and I think it should be, "123-abc-5678". I can't understand why the "</b>" is included in the value. Doesn't my pattern clearly show that the value is a string of one or more characters, terminated by, optionally, "</b>" followed by "</td>"?

Is there a straightforward way to tell it not to include the "</b>" in the value? Note that the "</b>" is not always present, so the pattern has to say that it is optional.

Thanks, Bob

    <tr height=24>
    <td class=td1 width="35%"><b>Celular</td>
    <td width=1><img src="../img/p.gif" width=1 height=1></td>
    <td class=td2 width="65%"><b>123-abc-5678</b></td>
    </tr>
    <tr height=24>
    <td class=td1 width="35%">Edad</td>
    <td width=1><img src="../img/p.gif" width=1 height=1></td>
    <td class=td2 width="65%">24 Años</td>
    </tr>
    <tr height=24>
    <td class=td1 width="35%">Altura</td>
    <td width=1><img src="../img/p.gif" width=1 height=1></td>
    <td class=td2 width="65%">1.70 mts.</td>

eBob.com wrote:
> I can't understand why the "</b>" is included in the value. Doesn't my
> pattern clearly show that the value is a string of one or more characters,
> terminated by, optionally, "</b>" followed by "</td>".

Yes, but remember that regexes are 'greedy' by default - they always capture as many characters as they can. Thus when given a choice between:

    value: 123-abc-5678</b>
    optional </b>: no

and

    value: 123-abc-5678
    optional </b>: yes

since the 'value' match happens first, and it can legitimately capture everything including the </b>, it does so.

> Is there a
> straightforward way to tell it to not include the "</b>" in the value?

How about, instead of value capturing one or more of any character with

    .+

you instead capture one or more characters that aren't < with

    [^<]+

Also, there are flags you can put in to make expressions non-greedy, but I don't think that will work in this situation.

BUT

I would *urge* you to stop trying to parse HTML with regex, and instead run (don't walk) to <http://smourier.blogspot.com/2005/05/net-html-agility-pack-how-to-use.html>, and from there download HtmlAgilityPack, which is an absolutely invaluable library that converts (even malformed) HTML into a nice XML document tree. It makes HTML parsing a hundred times easier than trying to use regex.

--
Larry Lard
Replies to group please

Thank you very much Larry. It finally occurred to me that there had to be
some way to take advantage of the fact that the string I am after does not contain "<", but the only solution I could think of was very ugly. Your suggestion is much, much better. And thank you for making me aware of the HtmlAgilityPack; I will be looking into it.

Thanks, Bob

"Larry Lard" <larryl***@hotmail.com> wrote in message news:1150190711.583629.75450@u72g2000cwu.googlegroups.com...
> How about, instead of value capturing one or more of any character with
>
> .+
>
> you instead capture one or more characters that aren't < with
>
> [^<]+
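
For anyone who wants to see the greedy capture and the [^<]+ fix side by side, here is a minimal console sketch. It is an illustration only, not the poster's program: the module name, the variable names, and the cut-down pattern (just the value cell, with `?` written in place of the equivalent `{0,1}`) are assumptions made to keep it short.

```vb
Imports System
Imports System.Text.RegularExpressions

' Hypothetical demo module; not part of the original program.
Module GreedyCaptureDemo
    Sub Main()
        ' The "Celular" value cell from the sample data in the post.
        Dim html As String = "<td class=td2 width=""65%""><b>123-abc-5678</b></td>"

        ' As posted: .+ is greedy, so it also consumes the closing </b>,
        ' leaving the optional (</b>)? group to match nothing.
        Dim greedyPattern As String = _
            "<td class=td2 width=""65%"">(<b>)?(?<value>.+)(</b>)?</td>"

        ' Suggested fix: [^<]+ cannot cross a "<", so the capture must stop
        ' before </b>, which the optional group then picks up.
        Dim negatedPattern As String = _
            "<td class=td2 width=""65%"">(<b>)?(?<value>[^<]+)(</b>)?</td>"

        ' Prints: 123-abc-5678</b>
        Console.WriteLine(Regex.Match(html, greedyPattern).Groups("value").Value)

        ' Prints: 123-abc-5678
        Console.WriteLine(Regex.Match(html, negatedPattern).Groups("value").Value)
    End Sub
End Module
```

The negated class works here only because the values in the sample rows never contain a "<" themselves; a cell with nested markup would need a different approach, such as the HtmlAgilityPack route mentioned above.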