trouble reading word documents

I use code that creates a FileStream to open and read the content of a Word document. I want to save the text as plain text to a database. I have code that reads the file as UTF-8, but that doesn't always work:

```vb
Sub readdoc2(ByVal sPathName As String)
    Dim temp As UTF8Encoding = New UTF8Encoding(True)
    Dim fs As FileStream = File.OpenRead(sPathName)
    Dim b(1023) As Byte ' 1024-byte buffer
    Dim n As Integer = fs.Read(b, 0, b.Length)
    Do While n > 0
        ' Decode only the bytes actually read in this pass
        Me.RichTextBox1.Text &= temp.GetString(b, 0, n)
        n = fs.Read(b, 0, b.Length)
    Loop
    fs.Close()
End Sub
```

Some documents need my other code:

```vb
Sub readdoc(ByVal sPathName As String)
    Dim fs As FileStream = File.OpenRead(sPathName)
    ' Create a new StreamReader, passing the FileStream object fs as argument
    Dim d As New StreamReader(fs)
    ' Seek moves the cursor to a position in the file; here, to the beginning
    d.BaseStream.Seek(0, SeekOrigin.Begin)
    ' Peek returns the next character without consuming it, or -1 at the end of the file
    While d.Peek() > -1
        Me.RichTextBox1.Text &= d.ReadLine()
    End While
    d.Close()
End Sub
```

Either way I end up with some strange characters, which I first have to remove before I can save the text to the database. Is there no way to get the text from a document without having to remove these unreadable characters?

Regards
Marco
The Netherlands

Co wrote:
> Hi All,
>
> I use code that creates a FileStream to open and read the content of
> a Word document.
> I want to save the text as plain text to a database.
> I have code that reads the file as UTF-8, but that doesn't always
> work:

You cannot handle a .doc file like a plain text file. It's stored in a (proprietary) binary format.

If you really have a lot of time to read:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

Armin

> You cannot handle a .doc file like a plain text file. It's stored in a
> (proprietary) binary format.
>
> If you really have a lot of time to read:
> http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

The new Office format (e.g. .docx, .pptx) is actually a zip file. Inside the zip file, you'll find a bunch of XML files. \word\document.xml contains the text data for the file (along with some other stuff you'll need to parse out).

Don't know if this is useful to your particular situation or not...

Dale
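The two container formats described above can usually be told apart by their first few bytes, which is one way to decide how to handle a file before trying to decode it. The following is only a rough sketch: `DetectWordFormat` and `StartsWith` are hypothetical helper names, and a match only identifies the container (any OLE2 compound file or any ZIP archive), not that the file is really a Word document.

```vb
Imports System
Imports System.IO

Module WordFormatSniffer
    ' Signature of an OLE2 compound file, the container used by legacy .doc files.
    Private ReadOnly Ole2Magic As Byte() = {&HD0, &HCF, &H11, &HE0, &HA1, &HB1, &H1A, &HE1}
    ' Local file header signature of a ZIP archive ("PK" 03 04), the container used by .docx.
    Private ReadOnly ZipMagic As Byte() = {&H50, &H4B, &H3, &H4}

    ' Hypothetical helper: classifies a file by its leading bytes.
    Function DetectWordFormat(ByVal filePath As String) As String
        Dim header(7) As Byte ' 8 bytes, indices 0..7
        Dim count As Integer
        Using fs As FileStream = File.OpenRead(filePath)
            ' A single Read is assumed to be enough for the first 8 bytes of a local file.
            count = fs.Read(header, 0, header.Length)
        End Using
        If StartsWith(header, count, Ole2Magic) Then Return "OLE2 binary (e.g. .doc)"
        If StartsWith(header, count, ZipMagic) Then Return "ZIP package (e.g. .docx)"
        Return "unknown (possibly plain text or RTF)"
    End Function

    ' True when the first magic.Length bytes of data equal magic.
    Private Function StartsWith(ByVal data As Byte(), ByVal count As Integer, ByVal magic As Byte()) As Boolean
        If count < magic.Length Then Return False
        For i As Integer = 0 To magic.Length - 1
            If data(i) <> magic(i) Then Return False
        Next
        Return True
    End Function
End Module
```

Only a file reported as "unknown" is a candidate for the plain FileStream or StreamReader approach; the other two need a format-aware reader.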
"Dale Atkin" <labrad***@ibycus.com> wrote in message news:uvPlVUm2JHA.4744@TK2MSFTNGP04.phx.gbl...
> The new office format (ie. .docx, pptx, etc) is actually a zip file. Inside
> the zip file, you'll find a bunch of xml files. \word\document.xml contains
> the text data for the file (along with some other stuff you'll need to parse
> out).
>
> Don't know if this is useful to your particular situation or not...
>
> Dale

Actually for me, that was enormously helpful.
Thanks Dale Atkin, Armin Zingler, and Co.

Is there a zip/unzip function in VB2005 that can be used to expose the XML inside docx (etc.) formats...?

Thanks in advance...
____________________________________________________________
Timothy Casey GPEMC - Eleven is the num***@timothycasey.info to email.
Philosophical Essays: http://timothycasey.info
Speed Reading: http://speed-reading-comprehension.com
Software: http://fieldcraft.biz; Scientific IQ Test, Web Menus, Security.
Science & Geology: http://geologist-1011.com; http://geologist-1011.net
Technical & Web Design: http://web-design-1011.com
--
GPEMC! Anti-SPAM email conditions apply. See www.fieldcraft.biz/GPEMC
The General Public Electronic Mail Contract is free for public use.
If enough of us participate, we can launch a class action to end SPAM.
Put GPEMC in your signature to join the fight. Invoice a SPAMmer today!

On 2009-05-22, Number Eleven - GPEMC! <eleven_is_the_num***@timothycasey.info> wrote:
> Is there a zip/unzip function in VB2005 that can be used to expose the XML
> inside docx (etc.) formats...?

I recommend SharpZipLib:
http://www.icsharpcode.net/OpenSource/SharpZipLib/

--
Tom Shelton

On 2009-05-22, Number Eleven - GPEMC! <eleven_is_the_num***@timothycasey.info> wrote:
> Is there a zip/unzip function in VB2005 that can be used to expose the XML
> inside docx (etc.) formats...?

I recommend SharpZipLib:
http://www.icsharpcode.net/OpenSource/SharpZipLib/

Unless you're using .NET 3.0 in your VS2005, because then you can use System.IO.Packaging. It has native support for the style of zip files that Word is using.

--
Tom Shelton

On 22 May, 07:28, Tom Shelton <tom_shel***@comcastXXXXXXX.net> wrote:
> I recommend SharpZipLib: http://www.icsharpcode.net/OpenSource/SharpZipLib/
>
> Unless your using .NET 3.0 in your VS2005 - because then you can use
> System.IO.Packaging. It has native support for the style of zip files that
> word is using.

What if I open Word, select all text and copy that to a string?
Then paste it into a RichTextBox?

Marco

> What if I open Word, select all text and copy that to a string.

Is that an option for you? Sure, you could do that.

> Then paste it into a richtextbox?

Might even be able to work out a way to script doing that, or you might be able to code some kind of macro within VBA to do what you want that would be more efficient (do they still call it VBA?).

Dale

"Tom Shelton" <tom_shel***@comcastXXXXXXX.net> wrote in message news:%23GWLi4p2JHA.4744@TK2MSFTNGP04.phx.gbl...
> On 2009-05-22, Number Eleven - GPEMC! <eleven_is_the_num***@timothycasey.info> wrote:
> > Is there a zip/unzip function in VB2005 that can be used to expose the XML
> > inside docx (etc.) formats...?
>
> I recommend SharpZipLib:
> http://www.icsharpcode.net/OpenSource/SharpZipLib/
>
> Unless your using .NET 3.0 in your VS2005 - because then you can use
> System.IO.Packaging. It has native support for the style of zip files that
> word is using.

Thank you - that's fantastic news for me...
____________________________________________________________
Timothy Casey
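For anyone finding this thread later, here is a minimal sketch of the System.IO.Packaging route for .docx files, assuming .NET 3.0 or later and a project reference to WindowsBase.dll. `ReadDocxText` is a hypothetical helper name, and the code assumes the main part sits at /word/document.xml as described earlier in the thread; a more careful version would locate it through the package relationships. It does not help with legacy .doc files, and text boxes nested inside paragraphs may have their text repeated.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging ' needs a project reference to WindowsBase.dll (.NET 3.0 or later)
Imports System.Text
Imports System.Xml

Module DocxTextReader
    ' Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text).
    Private Const WordNs As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical helper: returns the text of a .docx file, one paragraph per line.
    Function ReadDocxText(ByVal docxPath As String) As String
        Dim sb As New StringBuilder()
        Using pkg As Package = Package.Open(docxPath, FileMode.Open, FileAccess.Read)
            ' Assumption: the main document part is at the conventional location.
            Dim partUri As New Uri("/word/document.xml", UriKind.Relative)
            If Not pkg.PartExists(partUri) Then
                Throw New InvalidDataException("No /word/document.xml part in " & docxPath)
            End If

            ' Load the XML of the main part into memory.
            Dim doc As New XmlDocument()
            Using s As Stream = pkg.GetPart(partUri).GetStream(FileMode.Open, FileAccess.Read)
                doc.Load(s)
            End Using

            Dim ns As New XmlNamespaceManager(doc.NameTable)
            ns.AddNamespace("w", WordNs)

            ' Concatenate every w:t inside each w:p, then end the line.
            For Each para As XmlNode In doc.SelectNodes("//w:p", ns)
                For Each textNode As XmlNode In para.SelectNodes(".//w:t", ns)
                    sb.Append(textNode.InnerText)
                Next
                sb.AppendLine()
            Next
        End Using
        Return sb.ToString()
    End Function
End Module
```

From the form code in the original post, this could be called as `Me.RichTextBox1.Text = ReadDocxText(sPathName)`. Because only the w:t text is collected, no markup or binary characters need to be stripped before saving to the database.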