This dataset includes "books3", which is a comprehensive dump of Bibliotik, a to...

oldgradstudent · on March 7, 2024

It also contains an archive of opensubtitles, which is also not very open source.

refulgentis · on March 7, 2024

The subtitles aren't open?

If you meant that transcribing dialogue from a TV show violates copyright, I'm not so sure; it's relatively common to quote dialogue for varied purposes, e.g. by TV critics.

Definitely understand if you're saying the whole dialogue for a TV show is copyrighted, but I'm curious about the opensubtitles part, used to work in that area.

layer8 · on March 7, 2024

Quoting excerpts is different from transcribing an entire work, which is unambiguously copyright infringement. (Otherwise you would find the “book” version of any and all TV shows on Amazon.) The subtitles in question are generally translations, which likewise fall under copyright, being a derived work.

refulgentis · on March 7, 2024

Yeah, I was just curious about the opensubtitles site because I used to work in that field (subtitles) and wasn't sure if there were some new pirate sites that were monetizing subs.

n.b. not being argumentative, please don't read it that way, I apologize if it comes off that way:

Not every derived work is a copyright violation. That's why subs and dubs don't get kicked around, why you can quote dialogue in an article, etc.[^1]

Whether it applies to AI is currently playing out in court, e.g. in NYT v. OpenAI[^2] and Sarah Silverman et al. v. OpenAI[^3] and v. Meta.[^4]

[^1] "Copyright doesn't protect against all use of the work or use of derivative works. There are a few exceptions that fall under what's commonly known as the fair use doctrine:" (https://www.legalzoom.com/articles/what-are-derivative-works...)

[^2] https://www.nytimes.com/2023/12/27/business/media/new-york-t...

[^3] https://www.theverge.com/2024/2/13/24072131/sarah-silverman-...

[^4] https://www.hollywoodreporter.com/business/business-news/sar...

PavleMiha · on March 7, 2024

Quoting is very different from posting the full contents of something. I can quote a book but I can’t reproduce it in its entirety.

refulgentis · on March 7, 2024

Right, you can't reproduce a book. W/r/t subs and dubs, fair use has applied historically.

pk-protect-ai · on March 7, 2024

I wish it still included books3, but it doesn't anymore. I wish it was possible to download that 36GB books3.tar in the wild these days. Herewith, I promise to use this dataset only in accordance with "fair use"...

SekstiNi · on March 7, 2024

> I wish it was possible to download that 36GB books3.tar in the wild these days.

There... is a torrent.

pk-protect-ai · on March 7, 2024

I know. But where I am, using a torrent means participating in the distribution of the content, and that is where I'd get a huge bill for illegally sharing this file.

MacsHeadroom · on March 8, 2024

Use a debrid provider or seedbox to download the torrent. They torrent it for you and then you direct download from them. Should cost $10 or less.

gosub100 · on March 7, 2024

not the domain per se, but the high-powered law firms at your fingertips. Copyright law is much easier to enforce against working-class parents of 12-year-olds than SV elites.

fsckboy · on March 7, 2024

> Throw a dart at a wall filled with every notable author/publisher ever

copyrights do expire, and any books older than Mickey Mouse are public domain, so it's not every notable author ever

jsheard · on March 7, 2024

Technically true, narrow that down to merely "every notable living author and a subset of dead ones" then.

Bram Stoker's bones will be relieved to hear that their work isn't being misappropriated.