//! This module implements lexing for identifiers (`foo`, `myvar`, etc.) used in the JavaScript programming language.
use super::{Cursor, Error, Tokenizer};
use crate::lexer::{StringLiteral, Token, TokenKind};
use boa_ast::{Keyword, Position, Span};
use boa_interner::Interner;
use boa_profiler::Profiler;
use boa_unicode::UnicodeProperties;
use std::io::Read;

/// Identifier lexing.
///
/// More information:
/// - [ECMAScript reference][spec]
/// - [MDN documentation][mdn]
///
/// [spec]: https://tc39.es/ecma262/#prod-Identifier
/// [mdn]: https://developer.mozilla.org/en-US/docs/Glossary/Identifier
#[derive(Debug, Clone, Copy)]
pub(super) struct Identifier {
    init: char,
}
impl Identifier {
    /// Creates a new identifier/keyword lexer.
    pub(super) fn new(init: char) -> Self {
        Self { init }
    }

    /// Checks if a character is `IdentifierStart` as per ECMAScript standards.
    ///
    /// More information:
    /// - [ECMAScript reference][spec]
    ///
    /// [spec]: https://tc39.es/ecma262/#sec-names-and-keywords
    pub(super) fn is_identifier_start(ch: u32) -> bool {
        matches!(ch, 0x0024 /* $ */ | 0x005F /* _ */)
            || if let Ok(ch) = char::try_from(ch) {
                ch.is_id_start()
            } else {
                false
            }
    }

    /// Checks if a character is `IdentifierPart` as per ECMAScript standards.
    ///
    /// More information:
    /// - [ECMAScript reference][spec]
    ///
    /// [spec]: https://tc39.es/ecma262/#sec-names-and-keywords
    fn is_identifier_part(ch: u32) -> bool {
        matches!(
            ch,
            0x0024 /* $ */ | 0x005F /* _ */ | 0x200C /* <ZWNJ> */ | 0x200D /* <ZWJ> */
        ) || if let Ok(ch) = char::try_from(ch) {
            ch.is_id_continue()
        } else {
            false
        }
    }
}

impl<R> Tokenizer<R> for Identifier {
    fn lex(
        &mut self,
        cursor: &mut Cursor<R>,
        start_pos: Position,
        interner: &mut Interner,
    ) -> Result<Token, Error>
    where
        R: Read,
    {
        let _timer = Profiler::global().start_event("Identifier", "Lexing");

        let (identifier_name, contains_escaped_chars) =
            Self::take_identifier_name(cursor, start_pos, self.init)?;

        let token_kind = if let Ok(keyword) = identifier_name.parse() {
            match keyword {
                Keyword::True => TokenKind::BooleanLiteral(true),
                Keyword::False => TokenKind::BooleanLiteral(false),
                Keyword::Null => TokenKind::NullLiteral,
                _ => TokenKind::Keyword((keyword, contains_escaped_chars)),
            }
        } else {
            TokenKind::identifier(interner.get_or_intern(identifier_name.as_str()))
        };

        Ok(Token::new(token_kind, Span::new(start_pos, cursor.pos())))
    }
}

impl Identifier {
    /// Lexes the body of an identifier, returning its name and a flag telling
    /// whether it contained any Unicode escape sequences.
    #[inline]
    pub(super) fn take_identifier_name<R>(
        cursor: &mut Cursor<R>,
        start_pos: Position,
        init: char,
    ) -> Result<(String, bool), Error>
    where
        R: Read,
    {
        let _timer = Profiler::global().start_event("Identifier::take_identifier_name", "Lexing");

        let mut contains_escaped_chars = false;
        let mut identifier_name = if init == '\\' && cursor.next_is(b'u')? {
            let ch = StringLiteral::take_unicode_escape_sequence(cursor, start_pos)?;

            if Self::is_identifier_start(ch) {
                contains_escaped_chars = true;
                String::from(
                    char::try_from(ch)
                        .expect("all identifier starts must be convertible to strings"),
                )
            } else {
                return Err(Error::Syntax("invalid identifier start".into(), start_pos));
            }
        } else {
            // The caller guarantees that `init` is a valid identifier start.
            String::from(init)
        };

        loop {
            let ch = match cursor.peek_char()? {
                Some(0x005C /* \ */) if cursor.peek_n(2)?.get(1) == Some(&0x75) /* u */ => {
                    let pos = cursor.pos();
                    let _next = cursor.next_byte();
                    let _next = cursor.next_byte();
                    let ch = StringLiteral::take_unicode_escape_sequence(cursor, pos)?;

                    if Self::is_identifier_part(ch) {
                        contains_escaped_chars = true;
                        ch
                    } else {
                        return Err(Error::Syntax("invalid identifier part".into(), pos));
                    }
                }
                Some(ch) if Self::is_identifier_part(ch) => {
                    let _ = cursor.next_char()?;
                    ch
                }
                _ => break,
            };

            identifier_name.push(char::try_from(ch).expect("checked character value"));
        }

        Ok((identifier_name, contains_escaped_chars))
    }
}