Show HN: Doc Converter – Convert PDF docs to Word documents on your computer

IvanK_net · on Jan 26, 2023

I wrote a free PDF editor (open a PDF, edit, export a PDF); my users edit around 500,000 PDF files every month.

I have been gradually improving it for the past five years. It is a part of my photo editor https://www.Photopea.com. I really know a lot about PDF; I wish I didn't know that much :D I am glad to see that there are others who try to "make sense" of PDF files instead of just rendering them :)

** fun fact: Often, a PDF contains text as an array of characters, each with its own X and Y coordinates and a style (whitespace characters omitted). It is up to you to "cluster" them into words, lines, paragraphs ...
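
A minimal sketch of that clustering step, in Python. The `Glyph` type, the tolerances, and the top-down Y axis are all assumptions made for illustration, not how any particular PDF library represents text: glyphs whose baselines are close together form a line, and a horizontal gap wider than a fraction of the previous glyph's width is treated as a space.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical positioned character, as an extractor might report it."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position (assumed to grow downward)
    width: float  # advance width of the glyph


def cluster_lines(glyphs, y_tol=2.0, gap_factor=0.3):
    """Group glyphs into lines of text, inserting spaces at wide gaps."""
    lines = []  # each entry is a list of glyphs sharing roughly one baseline
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # reading order within the line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                text += " "  # the space was never stored, only implied
            text += cur.char
        result.append(text)
    return result


# Glyphs arrive in arbitrary order and with slightly jittered baselines.
glyphs = [
    Glyph("o", 0, 30, 5), Glyph("y", 14, 10, 5), Glyph("H", 0, 10, 5),
    Glyph("k", 5, 30.4, 5), Glyph("i", 5, 10.3, 5), Glyph("o", 19, 10, 5),
]
print(cluster_lines(glyphs))  # ['Hi yo', 'ok']
```

Real documents need more than this (rotated text, columns, varying font sizes), which is where most of the pain comes from.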

** Often, PDF text is made uneditable (on purpose). You see a text "Hello", but in fact, there is a text "bsiin", and a font, which renders "b" with a shape that looks like a letter "H", "s" as "e", and so on. If you open that PDF in a PDF viewer, select "Hello" and copy-paste it elsewhere, you get "bsiin".
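
To make the "Hello" / "bsiin" case concrete, here is a toy model in Python. The `GLYPH_SHAPES` table is a hypothetical stand-in for such a font's glyph table (real fonts map character codes to outlines, not to letters), but it shows why rendering and copy-paste disagree.

```python
# Hypothetical glyph table of an obfuscating font:
# stored character code -> the letter shape the font actually draws.
GLYPH_SHAPES = {"b": "H", "s": "e", "i": "l", "n": "o"}


def rendered_text(stored: str) -> str:
    """What the reader sees: each stored code drawn with the font's shape."""
    return "".join(GLYPH_SHAPES.get(code, code) for code in stored)


def copied_text(stored: str) -> str:
    """What copy-paste yields: the stored codes, with no font applied."""
    return stored


stored = "bsiin"
print(rendered_text(stored))  # Hello
print(copied_text(stored))    # bsiin
```

Recovering the visible text means rebuilding a table like this one, for example by recognizing the glyph shapes themselves.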

rainbowzootsuit · on Jan 26, 2023

Photopea is fantastic. I don't use Photoshop enough to justify a cloud subscription and adobe has shut down the licensing service for the version I have on disc (CS3).

https://community.adobe.com/t5/photoshop-ecosystem-discussio...

Shared404 · on Jan 26, 2023

Photopea is a great solution, and I'm both glad it exists and that you are able to solve your issues using it.

But the fact that we as a society have accepted

> adobe has shut down the licensing service for the version I have on disc (CS3).

as something normal and acceptable is insane to me.

rainbowzootsuit · on Jan 26, 2023

I haven't accepted it. I sail harder all the time. The Adobe Creative Suite hasn't been a recent priority, but I should look it up on principle. Thank you.

Photopea also helps when you're at a random computer and can help someone do an edit that would otherwise require access to a computer with software installed.

soco · on Jan 26, 2023

I also had some exposure to PDF, and looking back, it's almost better to render it and then run OCR on the rendered page.

nashashmi · on Jan 26, 2023

> render

I think you meant raster

lhuser123 · on Jan 26, 2023

How do you deal with scanned PDFs?

IvanK_net · on Jan 26, 2023

It is usually a PDF containing a single JPG file inside, which you can see and export at the original resolution.

To edit it, I guess you could paint over the text with white, and add a new text on top of it.

kjhughes · on Jan 26, 2023

Is this a wrapper around pandoc?

If so, it's hardly noteworthy. If you've written your own PDF to DOCX converter, then you have an interesting technical story (or ten) to tell -- do tell.

kris_wayton · on Jan 26, 2023

I ran it, and it installs these python extensions:

  Successfully installed PyMuPDF-1.21.1 fire-0.5.0 fonttools-4.38.0 lxml-4.9.2 numpy-1.24.1 opencv-python-4.7.0.68 pdf2docx-0.5.6 python-docx-0.8.11 six-1.16.0 termcolor-2.2.0

kjhughes · on Jan 26, 2023

Thanks for checking it out for us.

So, it's a wrapper around not pandoc but pdf2docx,

https://github.com/dothinking/pdf2docx

which parses PDF via PyMuPDF,

https://github.com/pymupdf/PyMuPDF

which is a wrapper around MuPDF (which does the heavy lifting parsing PDF),

https://mupdf.com/

and writes DOCX via python-docx,

https://github.com/python-openxml/python-docx
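
For reference, a minimal sketch of driving that stack directly from Python, assuming pdf2docx is installed (`pip install pdf2docx`). The helper names `docx_path_for` and `convert_pdf` are made up for this example, and how Doc Converter itself calls the library is not shown in this thread.

```python
from pathlib import Path


def docx_path_for(pdf_path) -> Path:
    """Hypothetical helper: put the .docx next to the source PDF."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path) -> Path:
    """Convert one PDF to DOCX using pdf2docx's Converter class."""
    # Imported here so the module loads even when pdf2docx is not installed.
    from pdf2docx import Converter

    out_path = docx_path_for(pdf_path)
    cv = Converter(str(pdf_path))   # parsing is done via PyMuPDF
    try:
        cv.convert(str(out_path))   # writing is done via python-docx
    finally:
        cv.close()
    return out_path


if __name__ == "__main__":
    import sys

    # Usage: python convert.py input.pdf
    if len(sys.argv) == 2:
        print(convert_pdf(sys.argv[1]))
```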

ifedapo · on Jan 26, 2023

Yes, it does indeed use pdf2docx under the hood. From a technical point of view, it doesn't do anything new aside from bundling Python and Electron into one app.

However, from an everyday user point of view, it does make it rather simple to convert pdf to word document. An everyday user won't be up for doing that via cli commands. And every alternative user friendly solution requires uploading your documents to servers (which could spark privacy concerns)

anonymouse008 · on Jan 26, 2023

> And every alternative user friendly solution requires uploading your documents to servers (which could spark privacy concerns)

If the customer base is less technically adept, wouldn't most of them not care and just upload it to a cloud service? I ask sincerely - recently I've realized I don't have as firm of a grip on the 'average consumer' as I thought.

jcuenod · on Jan 26, 2023

I just started testing fixpdfs.com, and some of the first feedback I heard when I asked users about pricing was "I'd rather download an app that does this than pay a subscription"

zxspectrum1982 · on Jan 26, 2023

I would only trust my PDFs to Adobe, Microsoft, AWS, etc: the big players, very well-known, that are not going to use the content of the PDFs against me. And of course I'd rather use something that runs completely on my laptop.

happymellon · on Jan 26, 2023

Do we know how this would compare to using libpdf?

ifedapo · on Jan 26, 2023

Hmm, I haven't used libpdf enough to know, but just from a glance through its documentation, it seems libpdf is more suitable for creating and reading PDF files. If this is correct, then it'll be missing the bridge for converting the read content of the PDF file to a Word document.

kris_wayton · on Jan 26, 2023

I see a couple of things called libpdf...lib-pdf and libpdf++. One generates pdfs programmatically. The other parses pdfs, but generates only images. Maybe you meant something else?

LightFog · on Jan 26, 2023

Does it include its source/dependency licensing post extraction? Some of these dependencies are under GPL/AGPL https://github.com/dothinking/pdf2docx/blob/master/LICENSE

ifedapo · on Jan 26, 2023

what does "post extraction" refer to here exactly?

kris_wayton · on Jan 26, 2023

I believe they are asking if, after extracting all the pieces (it's shipped as a self-extracting archive), does it do the things it needs to do to comply with GPL/AGPL? Like supplying the source code, or how to get the source code.

LightFog · on Jan 26, 2023

Installation essentially - the linked website doesn't link to licensing info for third-party dependencies. I was wondering if the licensing info (and source code of this product) were included in the installation bundle or available from the running product - since this is a requirement of the GPL license used in some of the dependencies.

The Windows app is an unsigned executable - not planning on running it myself.

ifedapo · on Jan 26, 2023

While there hasn't been any actual deep dive into this on my part yet, since the app does all its work on the user's machine, the code itself also lives on the user's machine post-installation. There has been no additional effort made to obfuscate the code that powers the app.

What could possibly be missing at the moment is a written instruction that documents where to locate the code base on the user's machine post-installation

schnebbau · on Jan 26, 2023

I know right? You can build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem.

weaksauce · on Jan 28, 2023

Is this sarcasm? this sounds like the dropbox detractors from when they launched. sure you can do all that but a ton of people want packaged up and easy workflows.

It's not "will it sell", it's "how many will it sell".

KeplerBoy · on Jan 26, 2023

pandoc doesn't ingest PDFs, it can only output them.

Getting PDFs into the pandoc intermediate representation would probably work on such a small subset of PDFs that pandoc does not even bother trying.

varunsharma07 · on Jan 26, 2023

Is this functionality not included with MS Word?

voxelghost · on Jan 26, 2023

Both Word and Google Docs can do this; the results aren't always great, but the function is there.

https://support.microsoft.com/en-us/office/edit-a-pdf-b2d1d7...

ranit · on Jan 26, 2023

From the linked MS support document:

> This works best with PDFs that are mostly text

voxelghost · on Jan 26, 2023

The problem isn't really whether there are lots of images or graphics, but rather that Word isn't great at handling complex layouts.

I somehow doubt OP's converter does much better. I hope to be proven wrong though.

bayindirh · on Jan 26, 2023

The biggest problem is not Word's limitations in handling complex layouts, but PDF's complete lack of layout information.

Digory · on Jan 26, 2023

And Acrobat, if you deal with pdfs regularly.

ranit · on Jan 26, 2023

The opposite is included: doc to pdf.

Tijdreiziger · on Jan 26, 2023

No, PDF to Word has been included since Word 2013.

https://support.microsoft.com/en-us/office/opening-pdfs-in-w...

nashashmi · on Jan 26, 2023

Open pdf in the Word open document dialog window. It converts the document very very nicely. Better than adobe!

hardwaresofton · on Jan 26, 2023

What are the chances... This is exactly the idea from the latest edition of Unvalidated Ideas[0][1] I released this week.

Of course I imagined different kinds of integrations, but maybe this is a case of great minds thinking alike!

[0]: https://unvalidatedideas.com/editions/latest

[1]: https://unvalidatedideas.com/editions/39

jcuenod · on Jan 26, 2023

What are the chances? I've just started my live beta of https://fixpdfs.com to convert pdfs of scanned documents/books into better "documents" with OCR, normalized margins, etc. (for better reading, searching, and highlighting)

lhuser123 · on Jan 26, 2023

It was my experience that OCRing scanned PDFs would result in many small errors. For example “Alt” could be interpreted as “A|t”. Did you have those problems? How did you fix them? What about other languages?

jcuenod · on Jan 26, 2023

I didn't build my own OCR models; in the beta I'm using Tesseract, but I'm going to use Google or Amazon when I start charging. There's no way to compete on OCR quality, but I don't see other products automatically fixing doc scans, which is the value add I see my software really giving...

KeplerBoy · on Jan 26, 2023

With the small restriction that conversion from PDF is one of the few things pandoc does not do.

heywhatupboys · on Jan 26, 2023

yes, you obviously came up with the novel idea of PDF conversion!

0cf8612b2e1e · on Jan 26, 2023

Is there a possibility we will ever get a redesigned PDF format that removes so much unneeded complexity and makes conversions/parsing more straightforward? Or are we stuck with this until the end of time?

ss108 · on Jan 26, 2023

? Word does this itself. Does this perform this task better than Word does?

chungus · on Jan 26, 2023

Interesting! I like the single payment as opposed to a subscription service. How does this compare to pandoc?

jqr- · on Jan 26, 2023

It's a one year license. The FAQ is not clear on whether the software will continue working with an expired license.

ifedapo · on Jan 26, 2023

Hi, thank you for pointing out the room for improvement on the FAQ. What it attempted to communicate is that the App will indeed stop working after the License expires, so it will have to be renewed at that point

phonon · on Jan 26, 2023

That's terrible. Writing "Buy Now" is dishonest. More like "One Year License Now".

Its license is also linked to the computer it was activated on. Change computers... too bad!

It's particularly egregious when there don't seem to be any substantive further improvements planned, and the underlying engine was not built by you.

ifedapo · on Jan 26, 2023

I can see how the "Buy Now" text could be misleading. Tbh, that's just the default text that comes with the PayPal button, and it wasn't deliberately crafted that way to mislead buyers. I'll roll out an update to make it more explicit.

> there don't seem to be any substantive further improvements planned

About that, the truth is, it often starts with a use-case as simple as "convert PDF to Word". Improvements usually would come from user feedback, and continuous maintenance. While I deliberately started out to keep the App simple, there's a good chance that features and functionality would expand when user feedback gets in the mix.

I'm saying this from my experience with maintaining IGdm (https://github.com/igdmapps/igdm)

garganzol · on Jan 26, 2023

That's awful. Never do this to desktop apps. Their whole point is about reliability and ownership.

aw4y · on Jan 26, 2023

is it different from https://dothinking.github.io/pdf2docx/quickstart.gui.html ?

ifedapo · on Jan 26, 2023

Very much different, yes. Doc Converter doesn't re-use any of pdf2docx's GUI features. Doc Converter's UI is instead built with Electron.

trelane · on Jan 26, 2023

Going from open to proprietary formats seems to be the wrong direction.

layer8 · on Jan 26, 2023

Docx is similarly open as PDF. (And full Adobe Acrobat compatibility is similarly difficult as full Microsoft Word compatibility. PDF has an edge here in practice because most use cases are read-only, or only very limited modification.)

trelane · on Jan 29, 2023

> Docx is similarly open as PDF.

This is not true, as far as I can tell.

> PDF has an edge here in practice because most use cases are read-only

I think many folks would be extremely happy to have full-fidelity read-only access to Microsoft formats without having to have Microsoft Office.

On the other side, it's _extremely_ common to both produce and consume PDF without ever touching an Adobe product.