Question:

The language that I am lexing requires the ability to hot-swap keywords depending on runtime configuration.

This is relatively simple to do as long as you are OK with embedding target-specific code in your grammar (Java): [1]

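lexer grammar LanguageLexer;

tokens {
    If, Else, While // etc
}

@header {
    import java.util.Map;
}

@members {
    // Keyword text -> token type, supplied at construction time.
    private Map<String, Integer> keywords;

    // The constructor name must match the generated class (LanguageLexer).
    public LanguageLexer(CharStream input, Map<String, Integer> keywords) {
        this(input);
        this.keywords = keywords;
    }
}

WS: [ \n\t\r]+ -> skip;

// Retype an ID as a keyword when its text is in the runtime keyword map.
ID: [a-zA-Z]+ { if (keywords.containsKey(getText())) setType(keywords.get(getText())); };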

However, I would like to remove all target-specific code from my .g4 file, as my .g4s will be used across multiple target languages for separate projects.

In a Parser, you can use a Listener to remove embedded actions and decouple the grammar from application-specific code. However, if there is a way to do this at the Lexer level [2], I have yet to find it (hence this question).

The way to accomplish this seems to be to wrap the TokenStream pulled from the Lexer. This wrapping TokenStream would read Tokens as they were provided, and apply the transformation currently in an embedded action to any ID tokens present.
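To make the idea concrete, here is a minimal sketch of such a wrapper. It deliberately does not use the ANTLR runtime: `Tok`, `KeywordRetypingSource`, and `KeywordRetypingDemo` are hypothetical stand-ins invented for illustration, and the token type numbers are arbitrary. With the real Java runtime, the analogous wrapper would presumably implement `TokenSource`, delegate to the lexer, and be handed to a `CommonTokenStream` in the lexer's place, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an ANTLR Token: only a type and its text.
final class Tok {
    final int type;
    final String text;

    Tok(int type, String text) {
        this.type = type;
        this.text = text;
    }
}

// Hypothetical wrapper: pulls tokens from an underlying source and
// retypes ID tokens whose text appears in the runtime keyword table.
final class KeywordRetypingSource implements Iterator<Tok> {
    private final Iterator<Tok> source;
    private final int idType;
    private final Map<String, Integer> keywords;

    KeywordRetypingSource(Iterator<Tok> source, int idType, Map<String, Integer> keywords) {
        this.source = source;
        this.idType = idType;
        this.keywords = keywords;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public Tok next() {
        Tok t = source.next();
        if (t.type != idType) {
            return t; // only ID tokens are candidates for retyping
        }
        Integer keywordType = keywords.get(t.text);
        return keywordType == null ? t : new Tok(keywordType, t.text);
    }
}

final class KeywordRetypingDemo {
    // Arbitrary token type numbers, for illustration only.
    static final int ID = 1, IF = 2, ELSE = 3;

    // Runs the wrapper over the raw tokens and collects the resulting types.
    static List<Integer> retypedTypes(List<Tok> raw, Map<String, Integer> keywords) {
        List<Integer> types = new ArrayList<>();
        Iterator<Tok> it = new KeywordRetypingSource(raw.iterator(), ID, keywords);
        while (it.hasNext()) {
            types.add(it.next().type);
        }
        return types;
    }

    public static void main(String[] args) {
        // The "lexer" only ever produces ID tokens for words.
        List<Tok> raw = List.of(new Tok(ID, "wenn"), new Tok(ID, "x"), new Tok(ID, "sonst"));
        // Keyword table chosen at runtime, e.g. from configuration.
        Map<String, Integer> german = Map.of("wenn", IF, "sonst", ELSE);
        System.out.println(retypedTypes(raw, german));
    }
}
```

The grammar itself would then contain only the plain `ID` rule, and each target language would need its own small wrapper of this shape.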

This (in theory) would not be difficult to implement; however, this feels like functionality that should be possible with just the already defined ANTLR symbols. So, the question is: is it possible to conditionally change the type of tokens passing through a TokenStream within the existing ANTLR system? If not, what is the lowest-friction way of accomplishing that task? An example using the Java library would be preferred, as that is the one I am most familiar with.

And as a sub-question: if I end up creating a TokenTransformationStream for my required targets, would it be worth suggesting that it be added to the existing libraries? (I can create symbols for all currently supplied targets.)


[1] Yes, this will crash if you construct a Lexer with the regular constructor. In a real application, it might be worth fixing that, but for this example, it doesn't matter.

[2] I feel this is an appropriate task for the lexer level for a couple of reasons. The main reason is that it seems to be common practice to always pass keywords as keyword tokens and then, if necessary, allow them as identifiers at the parser level (as with context-sensitive keywords). Also, other questions asking simply how to achieve this effect suggest a method basically equivalent to the embedded-action solution above.

Answer:

This may not turn out to be the answer to the question, but it is simply too long for a comment.
I mentioned lexer modes in the comments because I was focused on the "hot-swap keywords" part. I don't know why you need to change the token type, but if you use lexer modes, you may not need to care about it.

The only catch is that there need to be some keywords in the input that signal a change of lexer mode. Basically, each lexer mode would be a sub-lexer grammar (of sorts).

RUNTIME_CFG_1 : 'runtime_cfg_1' -> mode(m_CGF_1);
...
mode m_CGF_1;
KEYWORD1 : 'key1';
...

If the same keyword appears in more than one mode, you can also use the lexer function type* to explicitly set the type of the token.

*I can't remember at the moment what these are called, but by lexer function I mean one of those like mode, skip, etc.
