There exist databases of “the hashes of known problematic photos” (CSAM), so it seems trivial to check your billions of photos against them before training an AI. You can’t catch everything, but this looks like an obvious miss considering they explicitly tried to scrape pornography.
These hashes are exactly how researchers later discovered this content, so it’s clearly not hard.
The Stanford researchers also found a substantial number of CSAM images in the LAION-5B dataset which were not recognized by PhotoDNA, probably because the images in question were not in wide distribution prior to their inclusion in LAION.
You are uploading 5 billion examples of &lt;something&gt;. You cannot filter it manually, of course, because there are five billion of them. Given that it is the year 2024, how confident can you be that a well-resourced team at Stanford in 2029 will not have better methods of identifying and filtering your data, or a better reference dataset to filter it against, than you do now?
You don’t have to do it manually; there are databases of known hashes you can check against.
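For illustration, a minimal sketch of that kind of lookup, assuming a hypothetical blocklist file of one hex digest per line (real lists like PhotoDNA’s use proprietary perceptual hashes and are access-controlled, not plain SHA-256):

```python
import hashlib
from pathlib import Path

def load_blocklist(path: str) -> set[str]:
    # Hypothetical format: one hex digest per line. Real CSAM hash
    # lists (e.g., PhotoDNA's) are perceptual and not public.
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def sha256_of(file_path: Path) -> str:
    # Hash the file in 1 MiB chunks to avoid loading it whole.
    h = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def keep_clean(image_dir: str, blocklist: set[str]) -> list[Path]:
    # Keep only files whose digest is not on the blocklist.
    return [p for p in Path(image_dir).rglob("*")
            if p.is_file() and sha256_of(p) not in blocklist]
```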
And this isn’t just “one engineer”. Companies like StabilityAI, Google, etc. have used LAION datasets. If you build a dataset, you should expend some resources on automated filtering. Don’t include explicit imagery as an intentional choice if you can’t do even basic filtering.
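Exact digests miss any copy that has been resized or re-encoded, which is presumably why PhotoDNA-style matching is perceptual. A sketch of the same check using the open-source imagehash library as a stand-in (PhotoDNA itself is proprietary, and the distance threshold here is an illustrative guess):

```python
import imagehash  # pip install imagehash pillow
from PIL import Image

MAX_DISTANCE = 4  # Hamming-distance threshold; illustrative value only

def build_blocklist(known_bad_paths: list[str]) -> list[imagehash.ImageHash]:
    # Perceptual hashes survive resizing and re-encoding,
    # unlike the exact digests in the sketch above.
    return [imagehash.phash(Image.open(p)) for p in known_bad_paths]

def is_blocked(candidate_path: str,
               blocklist: list[imagehash.ImageHash]) -> bool:
    h = imagehash.phash(Image.open(candidate_path))
    # Subtracting two ImageHash values yields their Hamming distance.
    return any(h - bad <= MAX_DISTANCE for bad in blocklist)
```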
It's impossible to assemble a dataset that size while excluding all such material; hash matching only catches images that have already been identified and added to the reference list.