Every feature of the lexer should be testable through test cases written in the syntax of the language itself, and that includes its handling of bad lexical syntax. For instance, a malformed floating-point constant or an unterminated string literal can be tested without treating the lexer as a separate unit. It should be easy to come up with valid syntax that exercises every possible token kind, in all of its varieties.
For any token kind, it should be easy to come up with a minimal piece of syntax which includes that token.
If there is a lexical analysis case (whether a successful token extraction or an error) that is somehow not testable through the parser, then that is dead code.
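A minimal sketch of what this looks like in practice. Everything here is hypothetical (a toy token grammar and a `parse` entry point standing in for a real parser); the point is that both good tokens and lexical errors surface through the single public `parse` function, with no separate lexer-level test harness:

```python
import re

# Toy lexical grammar: numbers, double-quoted strings, identifiers, operators.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+\.\d+|\d+)
  | (?P<STRING>"[^"\n]*")
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=])
  | (?P<WS>\s+)
""", re.VERBOSE)

class LexError(Exception):
    pass

def tokenize(src):
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            # Unterminated strings and stray characters land here.
            raise LexError(f"bad token at offset {pos}: {src[pos:pos + 10]!r}")
        pos = m.end()
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

def parse(src):
    # Stand-in parser: the real grammar is omitted; it just drives the lexer.
    # Tests talk only to this function, never to tokenize() directly.
    return list(tokenize(src))
```

Tests written this way stay in the language's own syntax: `parse('x = 3.14')` exercises the float token, while `parse('"no closing quote')` exercises the error path, both through the public interface.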
The division of the processing of a language into "parser" and "lexer" is arbitrary; it's an implementation detail, stemming from the fact that lexing requires lookahead and backtracking over multiple characters (which is easily done with buffering techniques), whereas the simplest and fastest parsing algorithms, like LALR(1), have only one symbol of lookahead.
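To illustrate the buffering point, here's a sketch of multi-character lookahead in a hand-written lexer, for an invented language where `1.5` is a float but `1..5` is an integer followed by a range operator. Deciding which one you have requires peeking two characters past the digits:

```python
class Reader:
    """Buffered character source supporting arbitrary lookahead."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self, k=0):
        i = self.pos + k
        return self.text[i] if i < len(self.text) else ""

    def take(self):
        ch = self.text[self.pos]
        self.pos += 1
        return ch

def lex_number(r):
    digits = ""
    while r.peek().isdigit():
        digits += r.take()
    # Two characters of lookahead decide float vs. integer-before-range:
    # consume the '.' only if a digit follows it.
    if r.peek() == "." and r.peek(1).isdigit():
        digits += r.take()
        while r.peek().isdigit():
            digits += r.take()
        return ("FLOAT", digits)
    return ("INT", digits)
```

A cheap `peek(k)` over a buffer is all the lexer needs; an LALR(1) parser, by contrast, gets exactly one token of lookahead from its tables.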
Parsers and lexers sometimes end up integrated, in that the lexer may not know what to do without information from the parser. For instance, a lex-generated lexer can have states in the form of start conditions, which the parser may trigger. That means that to get the lexer into certain states, either the parser is required, or you need a mock-up of that situation: some test-only method that puts the lexer into that state.
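A small sketch of this coupling, loosely modeled on lex start conditions (the class, `begin` method, and raw-block syntax are all invented for illustration): inside a `[...]` block the lexer emits raw text instead of normal tokens, and only the parser knows when to flip that switch.

```python
class ModalLexer:
    NORMAL, RAW = "NORMAL", "RAW"

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.state = self.NORMAL

    def begin(self, state):
        # Analogue of lex's BEGIN(condition); called by the parser.
        self.state = state

    def next_token(self):
        if self.pos >= len(self.text):
            return ("EOF", "")
        if self.state == self.RAW:
            # In RAW state, swallow everything up to the closing bracket.
            end = self.text.find("]", self.pos)
            raw = self.text[self.pos:end]
            self.pos = end + 1
            self.state = self.NORMAL
            return ("RAW_TEXT", raw)
        ch = self.text[self.pos]
        self.pos += 1
        return ("LBRACKET", ch) if ch == "[" else ("CHAR", ch)

def parse_raw_block(src):
    # The "parser": on seeing '[', it switches the lexer into RAW mode.
    lx = ModalLexer(src)
    tok = lx.next_token()
    assert tok[0] == "LBRACKET"
    lx.begin(ModalLexer.RAW)
    return lx.next_token()
```

Note that there is no way to reach the `RAW_TEXT` path without either going through the parser or calling `begin` by hand in a test, which is exactly the mock-up scenario described above.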
Basically, treating the lexer part of a lexer/parser combo as a public interface is rarely going to be a good idea.
>For any token kind, it should be easy to come up with a minimal piece of syntax which includes that token.
There is a problem with this: any test of lexer behavior now has to reach down through the parser to the lexer. The test is too far from the point of failure, and I'll now spend my time trying to understand a problem that would have been obvious if the lexer were being tested directly.
>Basically, treating the lexer part of a lexer/parser combo as public interface is rarely going to be a good idea.
This is part of the original point: the parser is the public interface, which is why the OP was suggesting it should be the only contact point for the tests.