RICHARD HALLETT

2018-10-28 18:00

RustyLox - Scanners

This is a documentation and thoughts of my adventures in creating an interpreter for the lox language, in Rust, as part of following http://www.craftinginterpreters.com. I am adapting as appropriate for Rust as opposed to Java and these are my findings.

Setup

I'm skimming over all the setup part around the initial setting up for the interpreter, this is mostly much of a muchness and nothing about it was too different using Rust. I do note that there are some libraries to do CLI argument parsing, but I decided to keep in spirit of the book and just hand roll.

Token types

One of the nice things about Rust is it's influences from functional languages and the Enum type (see sum type, tagged union, discriminated union etc) I think allows for some nice readability when defining new types.

For the Tokens I decided to roll the TokenType and Token into just Token, with ancillary data that is specific to the scanned item in an 'Item' type. So the value for literals is stored next to the type so the Item only needs the lexeme it matches and the line where it was found. It is however more or less equivalent to the Java code in the book.

pub enum Token {
  // Single-character tokens.
  LeftParen, RightParen,
  LeftBrace, RightBrace,
  Comma, Dot, Minus, Plus, SemiColon, Slash, Star,

  // One or two character tokens.
  Bang, BangEqual,
  Equal, EqualEqual,
  Greater, GreaterEqual,
  Less, LessEqual,

  // Literals.
  Identifier(String),
  String(String),
  Number(f64),

  // Keywords.
  And, Class, Else, False, Fun, For, If, Nil, Or,
  Print, Return, Super, This, True, Var, While,

  Eof
}

pub struct Item {
    pub token: Token,
    pub lexeme: String,
    pub line: u32
}

And because of the Enum type we also get pattern matching to accompany us, this means when parsing a token we present something like the following:

let token = match c {
    '(' => Token::LeftParen,
    ')' => Token::RightParen,
    '{' => Token::LeftBrace,
    '}' => Token::RightBrace,
    ',' => Token::Comma,
    '.' => Token::Dot,
    '-' => Token::Minus,
    '+' => Token::Plus,
    ';' => Token::SemiColon,
    '*' => Token::Star,
    _ => {
        return Err(Error::Lexical(self.line,
                            self.lexeme.to_string(),
                            format!("Unexpected character '{}'", c).to_string()))
    }
};

Note: The above code is not the complete token match, it's just to give an example.

Peeking

One of the requirements of the scanner is to iterate through all the characters in the string while also being able to peek ahead more than just the current char, in the original Java code this is achieved through regular Java string handling and indexes.

For Rust initially I used regular iteration of Chars with std::string::String.chars() method along with Peekable iterator, however this did not give me an easy way to look multiple characters ahead, after a bit of trial and error trying to do it with just this, the answer was instead use the excellent itertools package which has "itertools::structs::MultiPeek". As per the docs "An iterator adaptor that allows the user to peek at multiple .next() values without advancing the base iterator." which is exactly what was needed.

So the state of my Scanner can be stored relativly succiently like:

pub struct Scanner<'a> {
    src_iter: MultiPeek<Chars<'a>>,
    lexeme: String,
    line: u32
}

Then operations like advance and advance until look like:

fn advance(&mut self) -> Option<char> {
    let c = self.src_iter.next();

    // Add to current lexeme
    if let Some(c) = c {
        self.lexeme.push(c);
    }

    return c;
}

fn advance_until(&mut self, until: char) -> Option<char> {
    let mut last = None;
    while let Some(n) = self.src_iter.peek() {
        if *n == until {
            break;
        } else {
            last = Some(*n);
            self.advance();
        }
    }
    last
}

TL;DR: Use itertools::structs::MultiPeek

Scan next

In the Java code this just scans over all the source and returns a big list of tokens, in my Rust version while I made a function that does scan all items and returns, the actual way I keep track of the scanners state means I can have a scan function that can be called on demand. Out of curiosity I read into the C part of this series and the scanner there actually works very similar to what I've produced in the Rust version.

Signature of the scan item:

pub fn scan_item(&mut self) -> Result<Item, Error> { ... }

Here note we use the Result type to return either a Result or an Error which brings us to...

Error handling

Instead of exceptions Rust is more akin to it's functional brethern where it likes to return Error values in a mixed "Result" Enum. The benefit of this of course is here we can define meaningful Errors that are returned at different parts of the application.

#[derive(Debug, PartialEq)]
pub enum Error {
    /// Error returned from the scanner
    Lexical(u32, String, String),

    /// Generic runtime error
    Runtime(String)
}

The above code you can see we define a Lexical Error that takes a line number, where it happened and a message associated. In the first match example you can see here we return the Lexical error.

This all gets boiled back up to the caller which can collect a list of errors and decide then how to handle, e.g. output to stderr

The other thing to mention is in the original code it had helper functions that were generic handlers for error strings, instead for rust we can implement the Display traits on the Error type.

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            Error::Lexical(ref line, ref from, ref message) =>
                write!(f, "Lexical error [line {}] Error {}: {}",
                    line, from, message),
            Error::Runtime(ref message) =>
                write!(f, "Runtime error {}", message),
        }
    }
}

impl error::Error for Error {
    fn description(&self) -> &str {
        match *self {
            Error::Lexical(_, _, _) => "lexical error",
            Error::Runtime(_) => "runtime error",
        }
    }
}

Other stuff

Mostly everything else is pretty much the same, we keep track of the current lexame in the scanner struct, which we can then store against our Scanner Items as they are found. We also keep track of the current line number, again storing where appropriate.

Non Lexical Lifetimes - Quick note

Pre Rust 2018 lifetimes are always scoped but this could sometimes lead to code which would look perfectly valid but yet error saying you borrowed multiple times, however in reality you never tried to use the original borrowed and so you can as a programmer can see it should be 'fine'.

I had this crop up when iterating over the src_iter in the scanner functions, getting mutable references to self then trying to do a later modify in the same scope (see advance_until) it caused a bit of a headscratch. One way to solve this was just to not get a mutable borrow and instead get a full clone of what we were working with, in this case the "char" which is small enough that it makes no difference. However with Non Lexical Lifetimes implemented in Rust 2018 this can make the code a bit cleaner and you avoid needing to get a full clone.