Serving the Quantitative Finance Community

 
Zmey
Topic Author
Posts: 0
Joined: March 11th, 2002, 10:46 pm

PDF to text

March 11th, 2002, 11:17 pm

Hello,

Just wondered if anyone has any tips on how to convert Adobe Acrobat PDF documents into text files using C++/VBA? The problem I am trying to solve is automating data extraction from a large number of PDF files we receive every day.

Thanks in advance!

zmey@rocketmail.com
 
ebifry
Posts: 0
Joined: December 9th, 2001, 8:34 am

PDF to text

March 22nd, 2002, 11:48 pm

There is a program called pdftotext. It is included in the xpdf package, which is open source, so you can look at the code; it might give you some ideas. Apart from that, I am sure there are commercial libraries you can use.

Cheers

Tony
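
For the automation side of the question, one low-effort option is to call pdftotext as an external program from C++ rather than reading its source. The following is only a minimal sketch under some assumptions: pdftotext is installed and on the PATH, the commands run through a POSIX-style shell (the quoting would need to change on Windows), and the helper names shell_quote, build_pdftotext_command and convert_pdf are made up for this illustration. The -layout option asks pdftotext to keep the physical page layout, which may or may not help with tabular data.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: wrap a path in single quotes for a POSIX shell,
// escaping any single quotes embedded in the path itself.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen
        else out += c;
    }
    out += "'";
    return out;
}

// Hypothetical helper: build "pdftotext [-layout] <pdf> <txt>".
std::string build_pdftotext_command(const std::string& pdf,
                                    const std::string& txt,
                                    bool keep_layout) {
    std::string cmd = "pdftotext ";
    if (keep_layout) cmd += "-layout ";
    cmd += shell_quote(pdf) + " " + shell_quote(txt);
    return cmd;
}

// Run the conversion through the shell. Returns true when the shell
// reports success (status 0); assumes pdftotext is on the PATH.
bool convert_pdf(const std::string& pdf, const std::string& txt) {
    const std::string cmd = build_pdftotext_command(pdf, txt, true);
    return std::system(cmd.c_str()) == 0;
}
```

Calling convert_pdf("report.pdf", "report.txt") for each incoming file in a loop would cover the daily batch; files for which it returns false can be set aside for manual handling.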
 
Zmey
Topic Author
Posts: 0
Joined: March 11th, 2002, 10:46 pm

PDF to text

March 25th, 2002, 1:48 pm

Thanks for the response! I'll check out the open-source code for xpdf. However, I tried "pdftotext" as a black box, and it appeared to produce garbage files of random character combinations from the PDFs I have to deal with.

 
ebifry
Posts: 0
Joined: December 9th, 2001, 8:34 am

PDF to text

March 25th, 2002, 9:38 pm

I have had the same experience with some PDF files. It seems to work for some files and not others. I haven't worked out the pattern for which files it works on and which ones produce garbage.
Any success tracking down commercial libraries to do the job? I am sure Adobe has some.

Good luck!

Tony
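
Since the conversion seems to work on some files and not others, a batch job could at least flag the suspicious outputs automatically. Here is a rough heuristic sketch, not a diagnosis of why pdftotext fails: it measures what fraction of the extracted bytes are printable ASCII or ordinary whitespace. The function names and the 0.9 threshold are arbitrary assumptions for illustration. It will only catch binary-looking output; garbage made of ordinary letters would pass, and legitimate non-English text would be flagged.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Fraction of bytes that are printable ASCII or common whitespace.
// Returns 0.0 for empty input so that empty output is treated as suspect.
double printable_ratio(const std::string& text) {
    if (text.empty()) return 0.0;
    std::size_t ok = 0;
    for (unsigned char c : text) {
        if (std::isprint(c) || c == '\n' || c == '\r' || c == '\t') ++ok;
    }
    return static_cast<double>(ok) / static_cast<double>(text.size());
}

// Hypothetical check: flag output whose printable share is below min_ratio.
// The default of 0.9 is a guess and would need tuning on real files.
bool looks_like_garbage(const std::string& text, double min_ratio = 0.9) {
    return printable_ratio(text) < min_ratio;
}
```

Reading each converted text file into a std::string and passing it to looks_like_garbage would give a first-pass list of files to inspect by hand.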
 
Paul
Posts: 7047
Joined: July 20th, 2001, 3:28 pm

PDF to text

March 25th, 2002, 10:23 pm

I would assume that it won't work for PDFs that the author has 'protected'.

P
 
Russell
Posts: 1
Joined: October 16th, 2001, 5:18 pm

PDF to text

April 3rd, 2002, 11:05 am

Hi,

There is another way to do this, although please read the caveat below.

Send the PDF to pdf2html@adobe.com or pdf2txt@sun.trace.wisc.edu, and you will receive a reply in the chosen format.

Be aware that this resource is intended for people who need PDFs translated into text so that the contents can be read back by text-to-speech programs. I'm sure that there is at present no particular strain on the resource. However, if you are planning on performing the conversion on a frequent basis or for commercial benefit, you should probably refrain from using it.

R