
trouble reading word documents

Author
21 May 2009 9:59 PM
Co
Hi All,

I use code that creates a FileStream to open and read the content of
a Word document.
I want to save the text as plain text to a database.
I have code that reads the file as UTF-8, but that doesn't always
work:

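Sub readdoc2(ByVal sPathName As String)

        Dim temp As UTF8Encoding = New UTF8Encoding(True)
        Dim b(1023) As Byte ' 1024-byte buffer (VB array bounds are inclusive)
        Dim bytesRead As Integer

        ' Using closes the stream even if an exception is thrown
        Using fs As FileStream = File.OpenRead(sPathName)
            bytesRead = fs.Read(b, 0, b.Length)
            Do While bytesRead > 0
                ' Decode only the bytes actually read, not the whole buffer
                Me.RichTextBox1.Text &= temp.GetString(b, 0, bytesRead)
                bytesRead = fs.Read(b, 0, b.Length)
            Loop
        End Using

    End Sub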

Some documents need my other code:

    Sub readdoc(ByVal sPathName As String)

        ' Using closes the reader and the underlying FileStream on exit
        Using fs As FileStream = File.OpenRead(sPathName)
            ' Create a StreamReader, passing the FileStream object fs as argument
            Using d As New StreamReader(fs)

                ' Seek moves the position within the file; here, to the
                ' beginning (redundant for a freshly opened stream)
                d.BaseStream.Seek(0, SeekOrigin.Begin)

                ' Peek returns the next character without consuming it,
                ' or -1 when no more data is available
                While d.Peek() > -1
                    ' ReadLine strips the line break, so put one back
                    Me.RichTextBox1.Text &= d.ReadLine() & Environment.NewLine
                End While
            End Using
        End Using

    End Sub

Either way I end up with some strange characters, which I first have to
remove before I can save the text to the database.

Is there a way to get the text from a document without having to
remove these unreadable characters?

Regards
Marco
The Netherlands

Author
21 May 2009 10:24 PM
Armin Zingler
Co wrote:
> Hi All,
>
> I use a code that creates a FileStream to open and read the content of
> a word document.
> I want to save the text as plain text to a database.
> Now I have a code that reads UTF-8 encoding but that doesn't always
> work:

You cannot handle a .doc file like a plain text file. It's stored in a
(proprietary) binary format.

If you really have a lot of time to read:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
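
As a rough illustration of the point above, the first bytes of a file already show whether it is a binary .doc or a zip-based file. The sketch below assumes the well-known OLE compound file signature (D0 CF 11 E0 A1 B1 1A E1) and the zip local header signature ("PK" 03 04). The module name, function name and return labels are made up for this example.

```vbnet
Imports System
Imports System.IO

Module WordFormatSniffer

    ' Signature of an OLE compound file, the container used by binary .doc files
    Private ReadOnly OleSignature() As Byte = {&HD0, &HCF, &H11, &HE0, &HA1, &HB1, &H1A, &HE1}
    ' Signature of a zip local file header ("PK" 3 4), as used by .docx
    Private ReadOnly ZipSignature() As Byte = {&H50, &H4B, &H3, &H4}

    ' Returns "doc", "docx-or-zip" or "unknown" (labels are illustrative only)
    Function SniffWordFormat(ByVal sPathName As String) As String
        Dim header(7) As Byte
        Dim bytesRead As Integer
        Using fs As FileStream = File.OpenRead(sPathName)
            bytesRead = fs.Read(header, 0, header.Length)
        End Using

        If StartsWith(header, bytesRead, OleSignature) Then Return "doc"
        If StartsWith(header, bytesRead, ZipSignature) Then Return "docx-or-zip"
        Return "unknown"
    End Function

    ' True when the first count bytes of data begin with signature
    Private Function StartsWith(ByVal data() As Byte, ByVal count As Integer, ByVal signature() As Byte) As Boolean
        If count < signature.Length Then Return False
        For i As Integer = 0 To signature.Length - 1
            If data(i) <> signature(i) Then Return False
        Next
        Return True
    End Function

End Module
```

A match on the OLE signature only says the file is a compound document; other legacy Office files use the same container, so treat the result as a hint rather than proof.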


Armin
Author
21 May 2009 10:39 PM
Dale Atkin
> You can not handle a .doc file like a plain text file. It's stored in a
> (proprietary) binary format.
>
> If you have really a lot of time to read:
> http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

The new Office formats (i.e. .docx, .pptx, etc.) are actually zip files. Inside
the zip file, you'll find a bunch of XML files. word/document.xml contains
the text data for the file (along with some other markup you'll need to parse
out).
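
To give an idea of what "parsing out" the text might look like, here is a minimal sketch. It assumes the contents of word/document.xml have already been read into a string, and that the visible text sits in w:t elements inside w:p paragraphs of the WordprocessingML namespace. The module and function names are hypothetical. It ignores tabs, line breaks and other markup, and paragraphs nested inside other paragraphs (text boxes, for instance) may show up twice.

```vbnet
Imports System
Imports System.Text
Imports System.Xml

Module DocxTextSketch

    ' Namespace of the WordprocessingML main document part
    Private Const WordNs As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical helper: returns the text of each paragraph on its own line
    Function ExtractParagraphText(ByVal documentXml As String) As String
        Dim doc As New XmlDocument()
        doc.PreserveWhitespace = True   ' keep runs that contain only spaces
        doc.LoadXml(documentXml)

        Dim ns As New XmlNamespaceManager(doc.NameTable)
        ns.AddNamespace("w", WordNs)

        Dim result As New StringBuilder()
        For Each paragraph As XmlNode In doc.SelectNodes("//w:p", ns)
            ' Concatenate every text run inside this paragraph
            For Each run As XmlNode In paragraph.SelectNodes(".//w:t", ns)
                result.Append(run.InnerText)
            Next
            result.AppendLine()
        Next
        Return result.ToString()
    End Function

End Module
```

This uses only XmlDocument and XPath, so it should also compile under VB2005 without LINQ.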

Don't know if this is useful to your particular situation or not...

Dale
Author
22 May 2009 2:04 AM
Number Eleven - GPEMC!
"Dale Atkin" <labrad***@ibycus.com> wrote in message
news:uvPlVUm2JHA.4744@TK2MSFTNGP04.phx.gbl...
>
> > You can not handle a .doc file like a plain text file. It's stored in a
> > (proprietary) binary format.
> >
> > If you have really a lot of time to read:
> > http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
>
> The new office format (ie. .docx, pptx, etc) is actually a zip file. Inside
> the zip file, you'll find a bunch of xml files. \word\document.xml contains
> the text data for the file (along with some other stuff you'll need to parse
> out).
>
> Don't know if this is useful to your particular situation or not...
>
> Dale
>

Actually for me, that was enormously helpful.
Thanks Dale Atkin, Armin Zingler, and Co.

Is there a zip/unzip function in VB2005 that can be used to expose the XML
inside .docx (etc.) files?

Thanks in Advance...

____________________________________________________________
Timothy Casey GPEMC - Eleven is the num***@timothycasey.info to email.
Philosophical Essays: http://timothycasey.info
Speed Reading: http://speed-reading-comprehension.com
Software: http://fieldcraft.biz; Scientific IQ Test, Web Menus, Security.
Science & Geology: http://geologist-1011.com;http://geologist-1011.net
Technical & Web Design: http://web-design-1011.com
--
GPEMC! Anti-SPAM email conditions apply. See www.fieldcraft.biz/GPEMC
The General Public Electronic Mail Contract is free for public use.
If enough of us participate, we can launch a class action to end SPAM
Put GPEMC in your signature to join the fight. Invoice a SPAMmer today!
Author
22 May 2009 5:27 AM
Tom Shelton
On 2009-05-22, Number Eleven - GPEMC! <eleven_is_the_num***@timothycasey.info> wrote:
> "Dale Atkin" <labrad***@ibycus.com> wrote in message
> news:uvPlVUm2JHA.4744@TK2MSFTNGP04.phx.gbl...
>>
>> > You can not handle a .doc file like a plain text file. It's stored in a
>> > (proprietary) binary format.
>> >
>> > If you have really a lot of time to read:
>> > http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
>>
>> The new office format (ie. .docx, pptx, etc) is actually a zip file. Inside
>> the zip file, you'll find a bunch of xml files. \word\document.xml contains
>> the text data for the file (along with some other stuff you'll need to parse
>> out).
>>
>> Don't know if this is useful to your particular situation or not...
>>
>> Dale
>>
>
> Actually for me, that was enormously helpful.
> Thanks Dale Atkin, Armin Zingler, and Co.
>
> Is there a zip/unzip function  in VB2005 that can be used to expose the XML
> inside docx (etc.) formats...?
>
> Thanks in Advance...
>

I recommend SharpZipLib:

http://www.icsharpcode.net/OpenSource/SharpZipLib/
--
Tom Shelton
Author
22 May 2009 5:28 AM
Tom Shelton
On 2009-05-22, Number Eleven - GPEMC! <eleven_is_the_num***@timothycasey.info> wrote:
> "Dale Atkin" <labrad***@ibycus.com> wrote in message
> news:uvPlVUm2JHA.4744@TK2MSFTNGP04.phx.gbl...
>>
>> > You can not handle a .doc file like a plain text file. It's stored in a
>> > (proprietary) binary format.
>> >
>> > If you have really a lot of time to read:
>> > http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
>>
>> The new office format (ie. .docx, pptx, etc) is actually a zip file. Inside
>> the zip file, you'll find a bunch of xml files. \word\document.xml contains
>> the text data for the file (along with some other stuff you'll need to parse
>> out).
>>
>> Don't know if this is useful to your particular situation or not...
>>
>> Dale
>>
>
> Actually for me, that was enormously helpful.
> Thanks Dale Atkin, Armin Zingler, and Co.
>
> Is there a zip/unzip function  in VB2005 that can be used to expose the XML
> inside docx (etc.) formats...?
>
> Thanks in Advance...
>

I recommend SharpZipLib:
http://www.icsharpcode.net/OpenSource/SharpZipLib/

Unless you're using .NET 3.0 with VS2005, because then you can use
System.IO.Packaging. It has native support for the style of zip files that
Word is using.
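
A minimal sketch of the System.IO.Packaging route, for anyone curious. It assumes a project reference to WindowsBase.dll (.NET 3.0 or later) and that the main document part sits at the conventional /word/document.xml location; strictly speaking it should be looked up through the package relationships. The module and function names are hypothetical.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging   ' requires a reference to WindowsBase.dll (.NET 3.0+)

Module DocxPackageSketch

    ' Hypothetical helper: returns the raw XML of the main document part
    Function ReadDocumentXml(ByVal sPathName As String) As String
        ' Conventional location of the main part (an assumption, see above)
        Dim partUri As Uri = PackUriHelper.CreatePartUri(New Uri("/word/document.xml", UriKind.Relative))

        Using pkg As Package = Package.Open(sPathName, FileMode.Open, FileAccess.Read)
            If Not pkg.PartExists(partUri) Then
                Throw New FileNotFoundException("No /word/document.xml part in " & sPathName)
            End If
            ' The part is exposed as a stream; read it as text
            Using reader As New StreamReader(pkg.GetPart(partUri).GetStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

End Module
```

The returned string is still XML, so the text itself has to be pulled out of it afterwards.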

--
Tom Shelton
Author
22 May 2009 6:27 AM
Co
On 22 mei, 07:28, Tom Shelton <tom_shel***@comcastXXXXXXX.net> wrote:
> On 2009-05-22, Number Eleven - GPEMC! <eleven_is_the_num***@timothycasey.info> wrote:
>
>
>
> > "Dale Atkin" <labrad***@ibycus.com> wrote in message
> > news:uvPlVUm2JHA.4744@TK2MSFTNGP04.phx.gbl...
>
> >> > You can not handle a .doc file like a plain text file. It's stored in a
> >> > (proprietary) binary format.
>
> >> > If you have really a lot of time to read:
> >> > http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
>
> >> The new office format (ie. .docx, pptx, etc) is actually a zip file. Inside
> >> the zip file, you'll find a bunch of xml files. \word\document.xml contains
> >> the text data for the file (along with some other stuff you'll need to parse
> >> out).
>
> >> Don't know if this is useful to your particular situation or not...
>
> >> Dale
>
> > Actually for me, that was enormously helpful.
> > Thanks Dale Atkin, Armin Zingler, and Co.
>
> > Is there a zip/unzip function  in VB2005 that can be used to expose the XML
> > inside docx (etc.) formats...?
>
> > Thanks in Advance...
>
> I recommend SharpZipLib: http://www.icsharpcode.net/OpenSource/SharpZipLib/
>
> Unless your using .NET 3.0 in your VS2005 - because then you can use
> System.IO.Packaging.  It has native support for the style of zip files that
> word is using.
>
> --
> Tom Shelton

What if I open Word, select all the text and copy it to a string,
then paste it into a RichTextBox?

Marco
Author
22 May 2009 2:28 PM
Dale Atkin
>What if I open Word, select all text and copy that to a string.
>Then paste it into a richtextbox?

Is that an option for you? Sure, you could do that.

You might even be able to work out a way to script it, or you might be
able to code some kind of macro in VBA that does what you want more
efficiently (do they still call it VBA?).
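
One way the scripting idea could look from VB.NET is COM automation of Word itself. This is only a sketch under several assumptions: Word is installed on the machine, late binding is acceptable (hence Option Strict Off), and the module and function names are hypothetical. Word's Range.Text uses a lone carriage return for paragraph ends, so the helper converts those to CrLf.

```vbnet
Option Strict Off   ' late binding, so no Word interop assembly is needed

Imports System
Imports Microsoft.VisualBasic

Module WordAutomationSketch

    ' Word marks paragraph ends with a lone carriage return; convert to CrLf
    Function NormalizeWordText(ByVal wordText As String) As String
        Return wordText.Replace(vbCrLf, vbCr).Replace(vbCr, vbCrLf)
    End Function

    ' Hypothetical helper; requires Word to be installed on the machine
    Function ReadTextViaWord(ByVal sPathName As String) As String
        Dim wordApp As Object = CreateObject("Word.Application")
        Try
            ' Arguments: FileName, ConfirmConversions, ReadOnly
            Dim doc As Object = wordApp.Documents.Open(sPathName, False, True)
            Try
                Return NormalizeWordText(CStr(doc.Content.Text))
            Finally
                doc.Close(False)   ' False = do not save changes
            End Try
        Finally
            wordApp.Quit()
        End Try
    End Function

End Module
```

This lets Word do the parsing for both .doc and .docx, at the cost of starting a Word instance for every call.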

Dale
Author
23 May 2009 2:54 AM
Number Eleven - GPEMC!
"Tom Shelton" <tom_shel***@comcastXXXXXXX.net> wrote in message
news:%23GWLi4p2JHA.4744@TK2MSFTNGP04.phx.gbl...
> On 2009-05-22, Number Eleven - GPEMC! <eleven_is_the_num***@timothycasey.info> wrote:
> > "Dale Atkin" <labrad***@ibycus.com> wrote in message
> > news:uvPlVUm2JHA.4744@TK2MSFTNGP04.phx.gbl...
> >>
> >> > You can not handle a .doc file like a plain text file. It's stored in a
> >> > (proprietary) binary format.
> >> >
> >> > If you have really a lot of time to read:
> >> > http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
> >>
> >> The new office format (ie. .docx, pptx, etc) is actually a zip file. Inside
> >> the zip file, you'll find a bunch of xml files. \word\document.xml contains
> >> the text data for the file (along with some other stuff you'll need to parse
> >> out).
> >>
> >> Don't know if this is useful to your particular situation or not...
> >>
> >> Dale
> >>
> >
> > Actually for me, that was enormously helpful.
> > Thanks Dale Atkin, Armin Zingler, and Co.
> >
> > Is there a zip/unzip function  in VB2005 that can be used to expose the XML
> > inside docx (etc.) formats...?
> >
> > Thanks in Advance...
> >
>
> I recommend SharpZipLib:
> http://www.icsharpcode.net/OpenSource/SharpZipLib/
>
> Unless your using .NET 3.0 in your VS2005 - because then you can use
> System.IO.Packaging.  It has native support for the style of zip files that
> word is using.


Thank you - that's fantastic news for me...
