
Can we read the text contents of a PDF using .NET?

Author
14 Sep 2006 11:11 AM
B.N.Prabhu
Can we read the text contents of a PDF using .NET?

If it is possible, how do I do it?

Author
14 Sep 2006 7:21 PM
Spam Catcher
"B.N.Prabhu" <prabh***@officetiger.com> wrote in
news:1D395A97-D922-4795-BA82-510C94FB4C0A@microsoft.com:

> Can we Read the text contents from PDF using .net.
>
> If possible means what to do.

A PDF is a collection of objects; it's not formatted text.

So you can read the text, though maybe not in a meaningful manner.
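
To illustrate the point, here is a rough sketch (not a PDF parser) that counts the "N G obj" headers in a file's raw bytes. The name CountPdfObjects is made up for this example, and it assumes the objects are not packed into compressed object streams, which newer PDFs often do. The page text sits inside those objects, usually in compressed content streams, which is why opening a PDF as plain text rarely shows anything readable.

```vb
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module PdfObjectCount
    ' Hypothetical helper: counts indirect object headers such as "12 0 obj".
    ' This is a rough count; byte sequences inside binary streams may add false hits.
    Public Function CountPdfObjects(ByVal filename As String) As Integer
        ' Latin-1 maps every byte to exactly one character, so binary data survives the decode.
        Dim raw As String = Encoding.GetEncoding("iso-8859-1").GetString(File.ReadAllBytes(filename))
        Return Regex.Matches(raw, "\b\d+\s+\d+\s+obj\b").Count
    End Function
End Module
```

Even a one-page PDF will typically report a handful of objects (catalog, page tree, page, fonts, content stream), none of which is a plain run of text.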
Author
15 Sep 2006 8:31 PM
Chris
Here is what I do:

    ''' <summary>
    ''' Gets the text of a PDF file.
    ''' Requires pdftotext.exe from http://www.foolabs.com/xpdf
    ''' </summary>
    ''' <param name="filename">Path of the PDF file.</param>
    ''' <returns>The extracted text, or an empty string on failure.</returns>
    Public Function getPDFtext(ByVal filename As String) As String
        Dim txtStdout As String = ""

        Try
            ' Using disposes the process handle when we are done
            Using p As New System.Diagnostics.Process
                ' Location of pdftotext.exe; adjust to where it is installed
                p.StartInfo.FileName = "Asset Search\pdftotext.exe"
                ' Quote the filename so paths with spaces work;
                ' "-" tells pdftotext to write the text to standard output
                p.StartInfo.Arguments = """" & filename & """ -"
                p.StartInfo.UseShellExecute = False
                p.StartInfo.CreateNoWindow = True
                p.StartInfo.RedirectStandardOutput = True

                p.Start()

                ' Get the text from standard output, then wait for the process to exit
                txtStdout = p.StandardOutput.ReadToEnd()
                p.WaitForExit()
            End Using
        Catch ex As Exception
            MsgBox("Error while extracting PDF text: " & ex.Message)
        End Try

        Return txtStdout
    End Function

I wouldn't use it for anything serious, business-critical, or real-time. For
that you should probably go with a commercial control like
http://www.pdfonline.com/. But for quick-and-dirty text extraction it works
fine for me.
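
If the function has to run somewhere without a UI (a service or a batch job, say), a message box is not much use. A minimal sketch of a variant that throws instead and checks the exit code is below. The name ExtractPdfText and the exePath parameter are hypothetical, and it assumes pdftotext returns a non-zero exit code on failure.

```vb
Imports System
Imports System.Diagnostics

Module PdfTextExtraction
    ' Hypothetical variant of getPDFtext: takes the tool path as a parameter and
    ' throws instead of showing a message box.
    Public Function ExtractPdfText(ByVal exePath As String, ByVal pdfPath As String) As String
        ' Quote the PDF path so spaces survive; "-" sends the text to standard output.
        Dim info As New ProcessStartInfo(exePath, """" & pdfPath & """ -")
        info.UseShellExecute = False
        info.CreateNoWindow = True
        info.RedirectStandardOutput = True

        Using p As Process = Process.Start(info)
            ' Read all output before waiting, so a full pipe cannot block the child process.
            Dim output As String = p.StandardOutput.ReadToEnd()
            p.WaitForExit()
            If p.ExitCode <> 0 Then
                Throw New InvalidOperationException("pdftotext exited with code " & p.ExitCode.ToString())
            End If
            Return output
        End Using
    End Function
End Module
```

The caller can then decide what to do with a failure (log it, skip the file, retry) rather than having the helper decide.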

Best Regards,

Chris


Author
15 Sep 2006 8:36 PM
Chris
Just to clarify, the following line should point to the actual pdftotext.exe
program:

p.StartInfo.FileName = "Asset Search\pdftotext.exe" ' <--- location of pdftotext.exe

Chris
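
One way to avoid depending on the current working directory is to build the path from the application's own folder and check that the file exists before starting the process. A minimal sketch, assuming pdftotext.exe has been copied next to your executable (that location and the name FindPdfToText are assumptions for illustration, not part of the code above):

```vb
Imports System
Imports System.IO

Module PdfToolLocator
    ' Hypothetical helper: returns the full path to pdftotext.exe,
    ' assuming it sits in the same folder as the running application.
    Public Function FindPdfToText() As String
        Dim exePath As String = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "pdftotext.exe")
        If Not File.Exists(exePath) Then
            Throw New FileNotFoundException("pdftotext.exe was not found", exePath)
        End If
        Return exePath
    End Function
End Module
```

With that in place the line becomes p.StartInfo.FileName = FindPdfToText(), and a missing tool shows up as a clear error rather than a failed process start.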

Author
16 Sep 2006 11:00 AM
JimmyKoolPantz
If you want to pull the information out in some kind of structured format,
my advice is to purchase an SDK (suggestions: OmniPage or ABBYY). The
software will let you extract information or images from a PDF. I have
never used the SDKs myself, but I have used the software, and it does
extremely well when we use it to extract data from a PDF and export it
as an Excel file.