Case insensitivity was a huge mistake in computing really. Most languages don't have cases and it's very non-trivial to convert between cases. Should have treated every char as completely unique.
Sure. At the user signup side, block emails that are too close in ways like case, but as a sender you should always treat them as unique emails.
So if my email is HeLLo@example.com because I want to be cute people will have to try 6 times before they finally get the right email address? Imagine telling that to someone in person. This kind of "weird" casing isn't that rare and doesn't require cute usernames: "DonaldDuck@example.com", "FreeBSD@example.com", "DrMcCoy@example.com", etc.
Languages that don't have case are not an issue; the situations where a lowercase <-> uppercase mapping is not simple are actually not that many. It's not trivial, but not all that complex either. The most annoying part is Turkish, Azeri, and Lithuanian, where the rules differ a bit but the language in use is often unknown. For the purpose of matching things ("is this email address known in our system?") it's actually not that hard, since you can just treat several characters as identical (displaying text correctly to users is harder, but that's not important here).
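To make the "treat several characters as identical" idea concrete, here is a minimal Python sketch. The `match_key` name and the choice to fold the Turkish/Azeri dotted and dotless i variants onto plain "i" are my own assumptions for illustration; the resulting key is only meant for equality checks, never for display.

```python
import unicodedata

# Assumption: for matching purposes only, treat the Turkish/Azeri
# dotless "ı" and dotted capital "İ" as the same letter as plain "i".
_EXTRA_FOLDS = str.maketrans({"ı": "i", "İ": "i"})


def match_key(text: str) -> str:
    """Return a key for 'is this the same address?' checks (hypothetical helper)."""
    # Compose first so that "I" + combining dot above becomes a single "İ".
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_EXTRA_FOLDS)
    # casefold() is the Unicode caseless-matching mapping (e.g. "ß" -> "ss").
    text = text.casefold()
    return unicodedata.normalize("NFC", text)


print(match_key("HeLLo@example.com") == match_key("hello@example.com"))  # True
print(match_key("DİLEK@example.com") == match_key("dilek@example.com"))  # True
```

The original spelling would still be stored alongside the key, so nothing is lost for display or for sending.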
I see this attitude in various situations, often under "falsehoods programmers believe" articles, which goes something like "it's hard in a few rare cases, therefore we should not do it at all for the >99% cases where it's simple and unproblematic".
It is fine to have multiple email addresses connected to a single inbox. Email providers already do normalization like this that is not baked into the spec. Gmail for example treats johndoe@gmail.com and john.doe@gmail.com the same.
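A rough Python sketch of that kind of provider-side normalization, assuming a hypothetical `mailbox_key` helper and an illustrative one-entry list of dot-insensitive domains. Only the dot rule mentioned above is modeled; this is not a description of Gmail's actual implementation.

```python
# Assumption: an illustrative list of domains whose mailboxes ignore dots.
DOT_INSENSITIVE_DOMAINS = {"gmail.com"}


def mailbox_key(address: str) -> str:
    """Map an address to the mailbox it would land in (hypothetical helper)."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    domain = domain.lower()  # domain names are case-insensitive
    if domain in DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(mailbox_key("john.doe@gmail.com"))    # johndoe@gmail.com
print(mailbox_key("john.doe@example.com"))  # john.doe@example.com
```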
Well, maybe Google should check their own implementation, because interesting things happen when accounts for both versions exist.
I have an account with the dot that was made back in the age when Gmail was invite-only. A few years ago, someone created an account without the dot. Yes, I'm receiving their mail, and have no way to contact them, because everything I send out comes back to me.
Not only do people do this, it's actually extremely common, ask anyone with commonname@gmail.com. Someone here said their original email address has become unusable because it gets thousands of messages a day that are intended for other people. "That guy" might actually be multiple people all making the same mistake with your email address.
My coworker told me there's a complete stranger who, every time she emails her son, accidentally sends the email to him first. This has been going on for years, she makes the same mistake every time, she doesn't learn her son's actual email address, and she doesn't learn to press "reply."
I'm 100% sure the "non-dot-version" doesn't exist as a separate account.
AOL used to allow users to email other users without using the @aol.com extension. Back in those days I (prior to any capacity to negotiate a sensible ISP, for the record) had an email that matched a common subject line that was inundated by people who typed their subject in the To line and then wrote an email.
And that’s why I regularly receive other people’s mail in my gmail inbox, and why I have stopped using gmail for anything important (it’s right to assume that gmail is also sending my emails to other people).
Google’s gmail people aren’t really as smart as they think they are.
That's something I don't understand. I've always given my email as john.doe@gmail.com, and I sometimes receive emails - addressed to Another John Doe - sent to johndoe@gmail.com.
That Another John Doe never, ever had access to johndoe@gmail.com, they just gave a wrong address. That's not gmail's fault.
This. My wife and I have two flavors of this.
Her address is firstmlast@gmail.com. People frequently forget the m initial, and somebody else owns firstlast@gmail.com. She's since started using dots (first.m.last) to mitigate the error.
My address is firstlast@gmail, where first and last are not globally common, but are fairly common in Scotland. Once a year or so, I receive email for somebody else that shares my name. I don't know his real email, but I've been "invited" on his family vacations 3-4 times now. Infrequent enough that I just respond "thanks for the invite, but I think you'll be disappointed when I arrive and not the Alistair you were expecting."
Or someone assumed (or just tried) that other email address.
I signed up for my university's email forwarding for alumni early on and got my first name as my email. For quite a while, I would get emails, including fairly sensitive ones, sent to me by not yet very email savvy people just assuming you could send an email to someone's first name and it would get to them.
Nah, it happens with mangled names that no bot would ever try to stuff, too. E.g. I own derefr@gmail; but I sometimes receive email from people trying to reach a man named "Derek" — who almost certainly owns the address derek.fr@gmail, but probably typoed it once as dere.fr@, and now his browser autocompletes that into registration forms for him.
This doesn't really make any sense. It's not just gmail that does this, dots are almost always ignored before the @.
Nobody else can register an email that is the same as yours but without a dot. So the only way you receive someone else's email is if they give the wrong address.
> It's not just gmail that does this, dots are almost always ignored before the @.
That's not my experience. Which non-gmail email software ignores dots before the @?
Thinking about this, I guess the sending MTA doesn't care about dots; it goes RCPT TO: <address.with.dots@example.com>. The receiving MTA then has to validate that address; it does that using some account database that isn't typically part of the MTA - it could be a unix account (no dots!), a database table, or an LDAP user. Finally it passes the mail off to a delivery agent, which hopefully relies on the same account database.
So the elision of dots appears to be a feature of certain account databases. So which account databases elide dots?
MTAs can be configured to apply additional transforms before looking up the account. For example, postfix's virtual table [0] can be used for this, and on my server it does elide dots in the local part (along with everything else).
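As a toy model of that lookup step (plain Python, not Postfix configuration), the transforms can be thought of as functions applied to the local part before the account database is consulted. The account table, the mailbox path, and the function names below are all made up for illustration.

```python
from typing import Callable, Iterable, Mapping, Optional

# Hypothetical account database: local part -> mailbox path.
ACCOUNTS = {"addresswithdots": "/var/mail/addresswithdots"}


def elide_dots(local: str) -> str:
    """One possible transform: drop dots from the local part."""
    return local.replace(".", "")


def resolve_recipient(
    rcpt: str,
    transforms: Iterable[Callable[[str], str]],
    accounts: Mapping[str, str],
) -> Optional[str]:
    """Apply each transform to the local part, then look the result up."""
    local, _, _domain = rcpt.rpartition("@")
    for transform in transforms:
        local = transform(local)
    return accounts.get(local)


rcpt = "address.with.dots@example.com"
print(resolve_recipient(rcpt, [elide_dots], ACCOUNTS))  # /var/mail/addresswithdots
print(resolve_recipient(rcpt, [], ACCOUNTS))            # None
```

Whether dots are elided is then purely a property of the receiving side's configuration, which matches the point above that the sending MTA never needs to know.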
> So if my email is HeLLo@example.com because I want to be cute people will have to try 6 times before they finally get the right email address? Imagine telling that to someone in person.
Yeah. That's fine.
If my email is LLLLLLLLLLLL@example.com because I want to be cute I have to tell people to type exactly the right number of L's. Do you think they should just be able to type a lot of L's and as long as it's somewhere near it counts as the same email address?
In a world where email addresses are always case sensitive, everyone will use lowercase (like they pretty much always already do anyway), and it'll be fine.
The only reason "LLL" and "lll" mean the same thing is that currently email addresses are (sometimes) case insensitive.
In a world where email addresses were "obviously" case sensitive, "LLL" mapping to "lll" would be just as crazy as "LLLLLLLLLL" mapping to "LLLLLLLLLLL".
They just seem similar, to humans. But they're different strings.
By maps to, they mean it counts as the same email address. 111@ and lll@ do not do that. The font has no impact on the email spec. However it can add extra confusion.
I doubt anyone is crazy enough to implement the email spec for comparing emails, to be fair. I would honestly be surprised if any publicly available mail agent or server supports that craziness.
> Case insensitivity was a huge mistake in computing really
Ah, youth. There was little choice! Sometimes you had only six bits for a character; sometimes your bytes could be from 1-36 bits wide, depending on what you wanted for your program, so you might have systems that mixed six-bit (only upper case) and early ascii (two cases) and so for matching you had to be case insensitive.
It’s easy to look back and say “those people were so stupid” but they weren’t.
That's a surefire way to increase customer support load, as users mix case in their emails all the time. They might sign up on a phone, or log in on a phone. They might just be ham-fisted. Sure, it might be their problem, but they'll make it yours.
Some places even support the same password with inverted caps. So say, if your password was "passWORD", then "passWORD" and "PASSword" would work. If the first hash fails, they'll invert the case, re-hash, then check again.
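The hash, invert, re-hash flow described above might look roughly like this in Python. The function names, the PBKDF2 parameters, and the salt handling are assumptions for the sketch, not anything a particular site is known to use.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt: bytes) -> bytes:
    # Assumption: PBKDF2-HMAC-SHA256 with an illustrative iteration count.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 50_000)


def verify_password(candidate: str, salt: bytes, stored: bytes) -> bool:
    """Accept the candidate as typed, or with every letter's case inverted."""
    for attempt in (candidate, candidate.swapcase()):
        if hmac.compare_digest(hash_password(attempt, salt), stored):
            return True
    return False


salt = os.urandom(16)
stored = hash_password("passWORD", salt)
print(verify_password("passWORD", salt, stored))  # True
print(verify_password("PASSword", salt, stored))  # True (caps lock was on)
print(verify_password("password", salt, stored))  # False
```

Only one extra variant is ever tried, so this costs at most one additional hash per failed login and does not turn the password into a fully case-insensitive one.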
Computers have to adapt to people and not the other way around.
E-Mail wouldn't have been widely accepted with case sensitive addresses.
People expect that MyName@mail.com reaches the same person as myname@mail.com or Myname@Mail.COM, just like letters reach their recipient irrespective of the upper and lower case of the recipient's name.
Computers are there to make the lives of the users easier, not the programmers.
I'm not sure this is true. In my experience a lot of people actually do think that it's case sensitive. Many times I've heard someone describe the capitals while verbally telling someone their address.
However it being insensitive has probably helped a lot of times where people make mistakes in explaining or copying those capitals.
As a sender, you should always treat email as case sensitive. As an email host/receiver, you can and probably should choose to be insensitive. But never assume any other host works like that.
Similar to how gmail ignores . in emails but other hosts do not.
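On the sending side, that rule boils down to leaving the local part exactly as given and only normalizing the domain, which is case-insensitive anyway. A small Python sketch, with `sender_normalize` as a made-up helper name:

```python
def sender_normalize(address: str) -> str:
    """Normalize only what a sender may safely assume (hypothetical helper)."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Keep the local part byte-for-byte: its meaning belongs to the receiving host.
    return f"{local}@{domain.lower()}"


print(sender_normalize("MyName@Mail.COM"))  # MyName@mail.com
print(sender_normalize("MyName@mail.com") == sender_normalize("myname@mail.com"))  # False
```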
> In my experience a lot of people actually do think that it's case sensitive.
I don't think they really do. They may think case matters somehow, and so may be careful to reproduce the exact case that they used before, but I don't think many people would expect JohnDoe@gmail.com and johnDoe@gmail.com to be two different email accounts.
In the general population, how many people do you think understand that username (including an email address) probably isn't case sensitive but that password almost certainly is?
> Case insensitivity was a huge mistake in computing really
original ASCII only had uppercase. When lowercase got added, gradually with newer systems and software, without case insensitivity you would have had massive incompatibilities which probably would have hampered or even arrested the introduction of lowercase in new systems
and all over again, when microcomputers first came out, they came out with uppercase only to cut complexity and cost on simple systems.
I don’t see how adding lowercase to ASCII could have resulted in massive backwards compatibility issues considering that uppercase-only ASCII existed for only a few weeks in 1963. Surely there was not widespread adoption of ASCII in the spring of 1963.
I'm not old enough to be an expert, just old enough to have used the leftovers (new computers were too expensive!): many many devices were uppercase only, card punches, ASR-33 teletype, the Telex system, lineprinters, "glass" teletypes, FORTRAN, COBOL, and now that I think of it, Morse code/telegraph had always been. There was a ton of infrastructure that was uppercase only. You may be right that it wasn't ASCII's fault. Perhaps the first version of ASCII made sure to encompass what had been, and then saner heads said "let's allow for future progress".
i'm not going to explore the entire history, but just looked this up. TL;DR example, the addition of lowercase characters represented a jump from 6 bits to 7 bits at the hardware level:
"A six-bit character code is a character encoding designed for use on computers with word lengths a multiple of 6. Six bits can only encode 64 distinct characters, so these codes generally include only the upper-case letters, the numerals, some punctuation characters, and sometimes control characters. The 7-track magnetic tape format was developed to store data in such codes, along with an additional parity bit."
Capital letters aren't a matter of font. There's a difference between the river phoenix, a magical bird which lives by the river River, and River Phoenix, the actor. It isn't a presentation-layer difference, it's an encoding-layer difference.
Then again, if we could start from scratch we'd probably just have a single global phonetic language without case and with a limited number of total chars.
Honestly non-phonetic glyphs are probably an easier lift.
Fun fact: the reason we pronounce "ph" like "f" is because the Greek letter was originally pronounced like p-h, at the time Romans began stealing words, but then the Greeks started pronouncing it like "f" and the Romans followed suit, but kept the old Latinization of "ph", because they'd already carved it into stone.
English spelling is largely phonetic... but it captures the phonetic spelling across dozens or hundreds of shifts in the spoken language. Unless you can stop people from changing how they speak, any phonetic spelling reboot is either going to suffer from the same problem, or words will constantly change how they are spelled to keep up with the spoken word.
if you view meme culture as a trend towards increased symbol density in linguistic communication due to their ability to convey emotions, overtones, implications, and other nuance ("shaka, when the walls fell") then the increased symbol space of chinese/japanese/korean characters looks interesting.
conversely it's certainly been an obvious disadvantage (posed a lot of problems and imposed a lot of awkward workarounds) for mechanical/electronic communication - now you have to enter the characters too, and you have to express that larger number of characters efficiently. In practice, a lot of electronic communication is just simplified to ASCII because that's the set that works universally. Someone used the example of the eszett (ß) being transliterated as "ss" in german, dunno if chinese uses anything similar, but it wouldn't surprise me, obviously Japanese has romaji too.
but at a human-interface level, fundamentally there is a limit to how many symbols people can absorb. Even with latin characters, people at best will sight-read whole words to increase symbol rate, but, the natural evolution is to use 1 character to represent 1 symbol/word, that's the highest possible rate at which humans can absorb symbols for a written system. And in turn you could in principle absorb a "word of symbols", which is a sentence, similar to how western readers can sight-read a word of our 1-character glyphs.
by "increasing the dimensionality" of the symbol, you increase the effective symbol rate, similar to how memes use subtext/etc to convey more nuance than a pure text can by itself.
I think anyone who deals with end users would disagree. It seems impossible to get users to abide by a specific casing. Things would break all the time.
> as a sender you should always treat them as unique emails
This is already how it works. Senders are not supposed to assume that local-parts are case-insensitive. (Some buggy implementations ignore this requirement and upper-case everything, but the serious implementations don’t).
The correct answer is "who cares" though. Languages which use cased alphabets.. use cased alphabets, you don't get to argue with it.
You also don't get to argue with the fusional position changes in Arabic, or the ligatures in Devanagari, or the places within a square the featural particles of Hangul must be printed in.
You are correct that it's not negotiable when supporting that language, but it is negotiable which languages and writing systems a given application supports.
I think you have correctly identified an implausible claim!
Of course, most languages aren't written at all ... or at least don't have a traditional written form that is sufficiently well established for someone to say that the "language" has case rather than a particular (proposed) way of writing it.
However, I rather suspect that the majority of languages in which books are published use some variant of the Latin alphabet and do, therefore, have case. (The only language I've heard of that uses the Latin alphabet without case is Lojban!)
On the other hand, if you weight languages by the number of (native) speakers, since a bit under half of the world's population lives in China, India, Pakistan, Bangladesh, Japan or Korea, it may well be true that most people don't use case in their main language.
It's just blown my mind that case might be a thing non-English speakers would need to learn to be able to read and write English. (Same for non-English, but that doesn't blow my mind in the same way.)
Yeah, we always say that the English alphabet has 26 letters, but there are actually 52 unique symbols you have to learn to read, or 104 if you also have to read/write cursive. Some of these symbols are very similar (if you learn 'o' you will definitely recognize 'O', and likely the cursive variants as well), while others are quite different ('g', 'G', and the cursive upper case G might as well be different letters altogether; the lower-case cursive does resemble 'g').
With joined-up letters (“cursive” in the USA I guess) different languages have different letterforms, and sometimes multiple systems.
For that matter, typesetting rules vary by language as well: not just the obvious hyphenation rules but spacing as well. Just pick up a book in, say, French or Russian and you can tell at a glance (without even looking at the letters) that it’s not in English.
Right, 104 symbols would be the minimum if you do need to read/write cursive.
However, I don't agree with your point about typeset text. You're right that the styles differ, but if you have learned one style, and know the language of the text, you will not need any significant amount of time to read a different style of typesetting.
Russian of course normally uses the Cyrillic alphabet, not the Latin one, so obviously you do have to learn a whole new set of symbols to understand it even if you can read Latin symbols. And of course French uses slightly more letters/letter forms than English, with the cedilla and four accents (aigu, grave, circonflexe, and the very rare tréma).
Lots of accents when using Cyrillic to write non-Russian text.
I didn’t mean the typesetting differences made reading a different language in any way hard, merely pointing out that there are lots of different aspects to text in different languages even when the alphabets are basically the same.
And there are a fair number of (inconsistent) rules for casing. Proper nouns vs. common nouns. Camel case (or other non-standard capitalizations). Title case. "Standard" body copy.
Are they a majority of languages if counted? I guess it also matters if you count the number of languages or if you count the number of people writing them.
Let’s just use Greek and its descendants (Latin, Greek and Cyrillic alphabets) and Brahmic-derived writing (we said “alphabet” when I was a kid but now ppl say “abugida”). There are about 200 languages spoken in Europe, all of which use these alphabets. India has over a hundred “major” languages and about 1600 others, most of which use Brahmic writing systems (the major exception, Urdu, uses a form of Arabic writing). So a big imbalance!
Oh, you want speakers? Merely counting people who read Hanzi + Arabic-alphabet readers + the Indian subcontinent gets you more than half the world’s population. And there are hundreds, maybe over a thousand writing systems.
When we sum up realistically, the number of users worldwide of writing systems with case ("bicameral") is about equally balanced with the number without.
I think that's GP's point: tolower() looks like it works well to English speakers but it's subtly wrong and will fail unexpectedly for people with other locales.
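A few concrete cases, using Python's built-in (locale-independent) Unicode case mappings rather than C's `tolower()`. The `round_trips` helper is just something made up for this demonstration.

```python
def round_trips(text: str) -> bool:
    """True if upper-casing and then lower-casing gives the original back."""
    return text.upper().lower() == text


print("ß".upper())           # SS  (one letter becomes two)
print("ẞ".lower())           # ß   (but "ß".upper() is not "ẞ")
print("I".lower())           # i   (a Turkish reader would expect "ı")
print("ı".upper())           # I   (and lower-casing that gives "i", not "ı")
print(len("İ".lower()))      # 2   ("i" followed by a combining dot above)
print(round_trips("hello"))   # True
print(round_trips("straße"))  # False
print(round_trips("ısı"))     # False
```

None of these are bugs in the mapping tables; the simple one-to-one, language-neutral model just doesn't fit these letters.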
Internationalization is hard. I think it's too much to expect software written for a specific market to handle all languages in the world.
For example, in Danish we have 3 letters (æøå) that are not common in the Latin alphabet. I can't go to Germany or Turkey and expect people to be able to write out those letters when doing input in a local system.
Fun thing I've run into in a Germany-based but increasingly international company: German always spells out umlauts and eszetts when going to lower ASCII for email addresses ("Schäßler" -> "Schaessler"), but Hungarian does not. Not sure how Turkish ö and ü get fully lower-ASCII-ized there, but in Germany, they get spelled out "oe" and "ue" as if they were German ö and ü. This isn't as much of a corner case as one might think - there are a lot of people with Turkish names in Germany.
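The two conventions can be sketched in Python like this. The German table follows the "Schäßler" -> "Schaessler" example above; the alternative of simply dropping the diacritics is my assumption about what "not spelling out" means in practice, and both function names are made up.

```python
import unicodedata

# German convention: spell out umlauts and eszett.
_GERMAN_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def german_ascii(name: str) -> str:
    """Transliterate the German way, e.g. for an email local part."""
    return name.translate(_GERMAN_ASCII)


def strip_diacritics(name: str) -> str:
    """Assumed alternative: decompose and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(german_ascii("Schäßler"))      # Schaessler
print(german_ascii("Gül"))           # Guel
print(strip_diacritics("Gül"))       # Gul
print(strip_diacritics("Szőke"))     # Szoke
```

The same name therefore ends up as two different ASCII strings depending on which convention the system applies.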
I’m always amused by these kinds of nonsensical usage for Turkish in Germany (ü->ue) but the thing that really trips people up is that Turkish has two letters that look like i, one with the tittle and one without — in both cases.
Germany recently adopted an uppercase ß (and got it into Unicode) to try to help with case roundtripping but I’ve never seen it in the wild. And let’s not get into obsolete German Fraktur ligatures like tz or ch which also had no upper case equivalents.
> Germany recently adopted an uppercase ß (and got it into Unicode)
The parenthetical part is true. It was an uphill battle, but not because of the consortium, but because of what the tropes wiki would describe as executive meddling.
The adoption is not recent, but about 110 years old. You have the wrong idea because of sloppy journalism.
> I’ve never seen it in the wild
I see it all the time. Maybe you are undercounting. Pay attention to non-standard letterforms on hand-written signs, and you also have to include print media where someone substituted lower-case ß in absence of a glyph in a font. This is a typographic mistake, but the intent is clear.