Show HN: Doc Converter – Convert PDF docs to Word documents on your computer (docconverter.app)
74 points by ifedapo on Jan 26, 2023 | 54 comments


I wrote a free PDF editor (open a PDF, edit, export a PDF); my users edit around 500,000 PDF files every month.

I have been gradually improving it for the past five years. It is a part of my photo editor https://www.Photopea.com. I really know a lot about PDF; I wish I didn't know that much :D I am glad to see that there are others who try to "make sense" of PDF files instead of just rendering them :)

** Fun fact: often, a PDF stores text as an array of characters, each with its own X and Y coordinates and a style (whitespace characters omitted). It is up to you to "cluster" them into words, lines, paragraphs ...
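
The clustering step can be sketched roughly as follows. This is only a minimal illustration, not Photopea's actual method: the `Glyph` record, the `y_tol` and `gap_ratio` thresholds, and the assumption that y grows down the page are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph (hypothetical units)
    y: float      # baseline position, assumed to increase down the page
    width: float  # advance width of the glyph


def cluster_lines(glyphs, y_tol=2.0):
    """Group glyphs with nearly equal baselines into lines, top to bottom."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        # Same line if the baseline is close to the previous glyph's baseline.
        if lines and abs(lines[-1][-1].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Within each line, read left to right.
    return [sorted(line, key=lambda g: g.x) for line in lines]


def line_to_text(line, gap_ratio=0.3):
    """Join the glyphs of one line, inserting a space at wide horizontal gaps."""
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * prev.width:
            out.append(" ")  # the PDF stored no space here; infer one
        out.append(cur.char)
    return "".join(out)


def extract_text(glyphs):
    """Rebuild plain text, one output line per clustered line."""
    return "\n".join(line_to_text(line) for line in cluster_lines(glyphs))
```

Real documents need more than this (rotated text, columns, varying font sizes), so fixed thresholds like these would likely have to be derived from the font size instead.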

** Often, PDF text is made uneditable (on purpose). You see a text "Hello", but in fact, there is a text "bsiin", and a font, which renders "b" with a shape that looks like a letter "H", "s" as "e", and so on. If you open that PDF in a PDF viewer, select "Hello" and copy-paste it elsewhere, you get "bsiin".
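
A toy model of that trick, using the "Hello"/"bsiin" example from above. The `GLYPH_SHAPE` table and the function names are hypothetical; a real PDF hides this mapping inside the embedded font's glyph outlines, not in a lookup table.

```python
# Hypothetical obfuscated font: stored character code -> letter its glyph draws.
GLYPH_SHAPE = {"b": "H", "s": "e", "i": "l", "n": "o"}


def rendered(stored, glyph_shape=GLYPH_SHAPE):
    """What the reader sees: each stored code drawn with its remapped glyph."""
    return "".join(glyph_shape.get(c, c) for c in stored)


def copy_paste(stored):
    """What a viewer copies: the stored character codes, unchanged."""
    return stored


def obfuscate(visible, glyph_shape=GLYPH_SHAPE):
    """Produce the stored codes that will render as the visible text."""
    inverse = {shape: code for code, shape in glyph_shape.items()}
    return "".join(inverse.get(c, c) for c in visible)
```

Undoing this without the table means recognizing each glyph by its shape, which is essentially OCR on the font.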


Photopea is fantastic. I don't use Photoshop enough to justify a cloud subscription and adobe has shut down the licensing service for the version I have on disc (CS3).

https://community.adobe.com/t5/photoshop-ecosystem-discussio...


Photopea is a great solution, and I'm both glad it exists and that you are able to solve your issues using it.

But the fact that we as a society have accepted

> adobe has shut down the licensing service for the version I have on disc (CS3).

as something normal and acceptable is insane to me.


I haven't accepted it. I sail harder all the time. The Adobe Creative Suite hasn't been a recent priority, but I should look it up on principle. Thank you.

Photopea also helps when you're at a random computer and can help someone do an edit that would otherwise require access to a computer with software installed.


I also had some exposure to PDF, and looking back, it's almost better to render it and then run OCR on the rendered page.


> render

I think you meant raster


How do you deal with scanned pdf?


It is usually a PDF containing a single JPG file inside, which you can see and export at the original resolution.

To edit it, I guess you could paint over the text with white, and add a new text on top of it.


Is this a wrapper around pandoc?

If so, it's hardly noteworthy. If you've written your own PDF to DOCX converter, then you have an interesting technical story (or ten) to tell -- do tell.


I ran it, and it installs these python extensions:

  Successfully installed PyMuPDF-1.21.1 fire-0.5.0 fonttools-4.38.0 lxml-4.9.2 numpy-1.24.1 opencv-python-4.7.0.68 pdf2docx-0.5.6 python-docx-0.8.11 six-1.16.0 termcolor-2.2.0


Thanks for checking it out for us.

So, it's a wrapper around not pandoc but pdf2docx,

https://github.com/dothinking/pdf2docx

which parses PDF via PyMuPDF,

https://github.com/pymupdf/PyMuPDF

which is a wrapper around MuPDF (which does the heavy lifting parsing PDF),

https://mupdf.com/

and writes DOCX via python-docx,

https://github.com/python-openxml/python-docx
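
For reference, driving that stack directly from Python looks roughly like this. It is a sketch based on pdf2docx's documented `Converter` class, not Doc Converter's actual source; the helper names `docx_path_for` and `convert_pdf_to_docx` are made up here.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return the output path: same folder and name, with a .docx suffix."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf_to_docx(pdf_path):
    """Convert one PDF to DOCX locally; requires pdf2docx to be installed."""
    # Imported lazily so docx_path_for() works even without pdf2docx.
    from pdf2docx import Converter

    out_path = docx_path_for(pdf_path)
    converter = Converter(str(pdf_path))   # parses the PDF via PyMuPDF
    try:
        converter.convert(str(out_path))   # writes the DOCX via python-docx
    finally:
        converter.close()
    return out_path
```

Everything runs on the local machine, so no document is uploaded anywhere.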


Yes, it does indeed use pdf2docx under the hood. From a technical point of view, it doesn't do anything new aside from straddling Python and Electron into one App.

However, from an everyday user point of view, it does make it rather simple to convert pdf to word document. An everyday user won't be up for doing that via cli commands. And every alternative user friendly solution requires uploading your documents to servers (which could spark privacy concerns)


> And every alternative user friendly solution requires uploading your documents to servers (which could spark privacy concerns)

If the customer base is less technically adept, wouldn't most of them not care and just upload it to a cloud service? I ask sincerely - recently I've realized I don't have as firm of a grip on the 'average consumer' as I thought.


I just started testing fixpdfs.com, and some of the first feedback I heard when I asked users about pricing was "I'd rather download an app that does this than pay a subscription"


I would only trust my PDFs to Adobe, Microsoft, AWS, etc: the big players, very well-known, that are not going to use the content of the PDFs against me. And of course I'd rather use something that runs completely on my laptop.


Do we know how this would compare to using libpdf?


Hmm, I haven't used libpdf enough to know, but from a glance through its documentation, it seems libpdf is more suitable for creating and reading PDF files. If this is correct, then it'll be missing the bridge to converting the read content of the PDF file to a Word document.


I see a couple of things called libpdf...lib-pdf and libpdf++. One generates pdfs programmatically. The other parses pdfs, but generates only images. Maybe you meant something else?


Does it include its source/dependency licensing post extraction? Some of these dependencies are under GPL/AGPL https://github.com/dothinking/pdf2docx/blob/master/LICENSE


what does "post extraction" refer to here exactly?


I believe they are asking if, after extracting all the pieces (it's shipped as a self-extracting archive), does it do the things it needs to do to comply with GPL/AGPL? Like supplying the source code, or how to get the source code.


Installation essentially - the linked website doesn't link to licensing info for third-party dependencies. I was wondering if the licensing info (and source code of this product) were included in the installation bundle or available from the running product - since this is a requirement of the GPL license used in some of the dependencies.

The Windows app is an unsigned executable - not planning on running it myself.


While there hasn't been any actual deep dive into this on my part yet, as the App does all its bit on the user's machine, the code itself also does live on the user machine post-installation. There has been no additional effort made to obfuscate the code that powers the App.

What could possibly be missing at the moment is a written instruction that documents where to locate the code base on the user's machine post-installation


I know right? You can build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem.


Is this sarcasm? This sounds like the Dropbox detractors from when they launched. Sure, you can do all that, but a ton of people want packaged-up, easy workflows.

It's not "will it sell?", it's "how many will it sell?"


pandoc doesn't ingest PDFs, it can only output them.

Getting PDFs into the pandoc intermediate representation would probably work on such a small subset of PDFs, pandoc does not even bother trying.


Is this functionality not included with MS Word?


Both Word and Google Docs can do this; the results aren't always great, but the function is there.

https://support.microsoft.com/en-us/office/edit-a-pdf-b2d1d7...


From the linked MS support document:

> This works best with PDFs that are mostly text


The problem isn't really whether there are lots of images or graphics, but rather that Word isn't great at handling complex layouts.

I somehow doubt OP's converter does much better. I hope to be proven wrong though.


The biggest problem is not Word's limitations in handling complex layouts, but PDF's complete lack of layout information.


And Acrobat, if you deal with pdfs regularly.


The opposite is included: doc to pdf.


No, PDF to Word has been included since Word 2013.

https://support.microsoft.com/en-us/office/opening-pdfs-in-w...


Open the PDF in Word's open-document dialog window. It converts the document very, very nicely. Better than Adobe!


What are the chances... This is exactly the idea from the latest edition of Unvalidated Ideas[0][1] I released this week.

Of course I imagined different kinds of integrations, but maybe this is a case of great minds thinking alike!

[0]: https://unvalidatedideas.com/editions/latest

[1]: https://unvalidatedideas.com/editions/39


What are the chances? I've just started my live beta of https://fixpdfs.com to convert pdfs of scanned documents/books into better "documents" with OCR, normalized margins, etc. (for better reading, searching, and highlighting)


It was my experience that OCRing scanned PDFs would result in many small errors. For example, “Alt” could be interpreted as “A|t”. Did you have those problems? How did you fix them? What about other languages?
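
To make the “A|t” case concrete, one cleanup heuristic could look like this. It is a hypothetical sketch, not something fixpdfs.com is known to do: it assumes a `|` squeezed between two letters is a misread lowercase "l", which will be wrong for text that really contains such pipes.

```python
import re

# A '|' with a letter directly on both sides, e.g. the one in "A|t".
_PIPE_IN_WORD = re.compile(r"(?<=[A-Za-z])\|(?=[A-Za-z])")


def fix_pipe_in_words(text):
    """Replace a pipe embedded inside a word with a lowercase 'l'."""
    return _PIPE_IN_WORD.sub("l", text)
```

A pipe surrounded by spaces is left alone, since that is more likely a real separator. Other languages would need their own letter classes and confusion pairs.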


I didn't build my own OCR models. In the beta I'm using Tesseract, but I'm going to use Google or Amazon when I start charging. There's no way to compete on OCR quality, but I don't see other products automatically fixing doc scans, which is the value add I see my software really giving...


With the small restriction that conversion from PDF is one of the few things pandoc does not do.


yes, you obviously came up with the novel idea of PDF conversion!


Is there a possibility we will ever get a redesigned PDF format that removes so much unneeded complexity and makes conversion and parsing more straightforward? Or are we stuck with this until the end of time?


? Word does this itself. Does this perform this task better than Word does?


Interesting! I like the single payment as opposed to a subscription service. How does this compare to pandoc?


It's a one year license. The FAQ is not clear on whether the software will continue working with an expired license.


Hi, thank you for pointing out the room for improvement on the FAQ. What it attempted to communicate is that the App will indeed stop working after the License expires, so it will have to be renewed at that point


That's terrible. Writing "Buy Now" is dishonest. More like "One Year License Now".

Its license is also linked to the computer it was activated on. Change computers... too bad!

It's particularly egregious when there don't seem to be any substantive further improvements planned, and the underlying engine was not built by you.


I can see how the "Buy Now" text could be misleading. Tbh, that's just the default text that comes with the PayPal button, and it wasn't deliberately crafted that way to mislead buyers. I'll roll out an update to make it more explicit.

> there don't seem to be any substantive further improvements planned

About that, the truth is, it often starts with a use-case as simple as "convert PDF to Word". Improvements usually would come from user feedback, and continuous maintenance. While I deliberately started out to keep the App simple, there's a good chance that features and functionality would expand when user feedback gets in the mix.

I'm saying this from my experience with maintaining IGdm (https://github.com/igdmapps/igdm)


That's awful. Never do this to desktop apps. Their whole point is about reliability and ownership.



Very much different, yes. Doc Converter doesn't re-use any of pdf2docx's GUI features. Doc Converter's UI is instead built with Electron.


Going from open to proprietary formats seems to be the wrong direction.


Docx is similarly open as PDF. (And full Adobe Acrobat compatibility is similarly difficult as full Microsoft Word compatibility. PDF has an edge here in practice because most use cases are read-only, or only very limited modification.)


> Docx is similarly open as PDF.

This is not true, as far as I can tell.

> PDF has an edge here in practice because most use cases are read-only

I think many folks would be extremely happy to have full-fidelity read-only access to Microsoft formats without having to have Microsoft Office.

On the other side, it's _extremely_ common to both produce and consume PDF without ever touching an Adobe product.



