Hacker News

This dataset includes "books3", which is a comprehensive dump of Bibliotik, a torrent tracker dedicated to pirated ebooks.

Throw a dart at a wall filled with every notable author/publisher ever and whoever you hit probably owns some of this data.

Apparently you can just do whatever as long as you say it's for AI research. Go post Blu-ray rips online; it's fine provided you have a .ai domain :^)



It also contains an archive of opensubtitles, which is also not very open source.


The subtitles aren't open?

If you meant that transcribing dialogue from a TV show violates copyright, I'm not so sure. It's relatively common to quote dialogue for various purposes, e.g. TV critics do it.

Definitely understand if you're saying the whole dialogue for a TV show is copyrighted, but I'm curious about the opensubtitles part, used to work in that area.


Quoting excerpts is different from transcribing an entire work, which is unambiguously copyright infringement. (Otherwise you would find the “book” version of any and all TV shows on Amazon.) The subtitles in question are generally translations, which likewise fall under copyright, being a derived work.


Yeah, I was just curious about the opensubtitles site because I used to work in that field (subtitles) and wasn't sure if there were some new pirate sites that were monetizing subs.

n.b. not being argumentative, please don't read it that way, I apologize if it comes off that way:

Not every derived work is a copyright violation. That's why subs and dubs don't get kicked around, why you can quote dialogue in an article, etc.[^1]

Whether it applies to AI is currently playing out in court, e.g. in NYT v. OpenAI[^2] and Sarah Silverman et al. v. OpenAI[^3] and v. Meta.[^4]

[^1] "Copyright doesn't protect against all use of the work or use of derivative works. There are a few exceptions that fall under what's commonly known as the fair use doctrine:" (https://www.legalzoom.com/articles/what-are-derivative-works...)

[^2] https://www.nytimes.com/2023/12/27/business/media/new-york-t...

[^3] https://www.theverge.com/2024/2/13/24072131/sarah-silverman-...

[^4] https://www.hollywoodreporter.com/business/business-news/sar...


Quoting is very different from posting the full contents of something. I can quote a book but I can’t reproduce it in its entirety.


Right, you can't reproduce a book. W/r/t subs and dubs, fair use has applied historically.


I wish it still included books3, but it doesn't anymore. I wish it was possible to download that 36GB books3.tar in the wild these days. I hereby promise to use this dataset according to "fair use" only...


> I wish it was possible to download that 36GB books3.tar in the wild these days.

There... is a torrent.


I know. But where I am, using a torrent means participating in distribution of the content, and that is where I'd get a huge bill for illegally sharing this file.


Use a debrid provider or seedbox to download the torrent. They torrent it for you and then you direct download from them. Should cost $10 or less.


not the domain per se, but the high-powered law firms at your fingertips. Copyright law is much easier to enforce against working-class parents of 12-year-olds than SV elites.


> Throw a dart at a wall filled with every notable author/publisher ever

copyrights do expire, and any books older than Mickey Mouse are public domain, so it's not every notable author ever


Technically true, narrow that down to merely "every notable living author and a subset of dead ones" then.

Bram Stoker's bones will be relieved to hear that their work isn't being misappropriated.



