Parsing Inputs

post

software engineering

fuzzing book

How can input parsers aid fuzzers?

Authors

Pallas-Athena Cain

Coltin Colucci

Chezka Quinola

Gregory M. Kapfhammer

Published

November 20, 2024

Overview

This post discusses the Parsing Inputs chapter from The Fuzzing Book. This chapter points out that, sometimes we need inputs that follow a certain grammar structure in order to effectively use fuzzing. A good way to do that is to use a parser to break down the grammar of a template string and form new strings that follow the same grammar rules. Then you can generate a whole bunch of valid inputs to use to test to your hearts content! This article breaks down how parsers work so we can use them. How to use them is described in the next chapter.

Summary

Reminder a non-terminal is the rule or building blocks for the grammar such as <digit> and <letter> whereas a terminal is the actual digit or letter think of it as the destination for the grammar like 2 or a. Given a string you can decompose it into its constituent parts that correspond to the parts of grammar used to generate it.

You can make a derivation tree of a given string to break it down into its grammar parts. Then it can be recombined using the same grammar to produce new strings. Trees allow us to mutate, crossover, and recombine their parts in order to generate new valid slightly different inputs. There are two main parsing classes to make a string into a derivation tree.

Parsing Expression Grammar Parser (PEGParser)
Earley Parsers

The PEGParser is efficient but limited to a specific grammar structure — rather than choosing all the rules that can potentially match it stops at the first match that succeeds. The EarleyParser can accept any kind of context-free grammars and explore all parsing alternatives.

To Use

To use these parsers first you have initiate a grammar:

>>> from Grammars import US_PHONE_GRAMMAR
>>> us_phone_parser = EarleyParser(US_PHONE_GRAMMAR)

Then you use the parse method to retrieve a list of possible derivation trees:

>>> trees = us_phone_parser.parse("(555)987-6543")
>>> tree = list(trees)[0]
>>> display_tree(tree)

These trees can be used for test generation. Notably for mutating and recombining existing inputs.

Why Parsing for Fuzzing?

Sometimes trying to generate valid inputs for fuzzing can be hard. For example if you have code that can only run if the first item has to be the strings van or car to work but you are randomly generating the input the chances of getting those to output are very low.

We can give the input generator our grammar but just the grammar itself is not enough to make valid inputs. Even if you modify the fuzzer to know more about the way the inputs are formatted there is still difficulty getting valid inputs.

class PooledGrammarFuzzer(GrammarFuzzer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._node_cache = {}

    def update_cache(self, key, values):
        self._node_cache[key] = values

    def expand_node_randomly(self, node):
        (symbol, children) = node
        assert children is None
        if symbol in self._node_cache:
            if random.randint(0, 1) == 1:
                return super().expand_node_randomly(node)
            return copy.deepcopy(random.choice(self._node_cache[symbol]))
        return super().expand_node_randomly(node)

gf = PooledGrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
gf.update_cache('<item>', [
    ('<item>', [('car', [])]),
    ('<item>', [('van', [])]),
])
trials = 10
time = 0
for i in range(trials):
    vehicle_info = gf.fuzz()
    try:
        print(repr(vehicle_info), end="")
        process_vehicle(vehicle_info)
    except Exception as e:
        print("\t", e)
    else:
        print()

',h,van,|'   Invalid entry
'M,w:K,car,car,van'  Invalid entry
'J,?Y,van,van,car,J,~D+'     Invalid entry
'S4,car,car,o'   invalid literal for int() with base 10: 'S4'
'2*-,van'    not enough values to unpack (expected at least 4, got 2)
'van,%,5,]'  Invalid entry
'van,G3{y,j,h:'  Invalid entry
'$0;o,M,car,car'     Invalid entry
'2d,f,e'     not enough values to unpack (expected at least 4, got 3)
'/~NE,car,car'   not enough values to unpack (expected at least 4, got 3)

The solution: A parser. A parser can extract the template for inputs and valid values from samples and use them for fuzzing!

Using a Parser

The parser processes structured input. The parsers in this chapter take an input string and turn it into a derivation tree. The next chapter is about using the parser for testing but this chapter is about understanding how parsers work.

An Ad-Hoc Parser

# Used to extract the information
def simple_parse_csv(mystring: str) -> DerivationTree:
    children: List[DerivationTree] = []
    tree = (START_SYMBOL, children)
    for i, line in enumerate(mystring.split('\n')):
        children.append(("record %d" % i, [(cell, [])
                                           for cell in line.split(',')]))
    return tree

# Change the orientation to Left to Right
def lr_graph(dot):
    dot.attr('node', shape='plain')
    dot.graph_attr['rankdir'] = 'LR'

tree = simple_parse_csv(mystring)
display_tree(tree, graph_attr=lr_graph)

This is a simple example and if we encounter more complexity it may not parse correctly. We can separate these incorrectly parsed inputs manually but that would be insanity! It would be so much work. Instead we use formal parsers. With those changing the external structure does not have much impact on the internal structure.

Grammars in Parsing

Non-terminals are the building block rules that may be expanded. The symbols themselves are typically terminals unless they can be expanded further.

Here is an example derivation tree for this grammar:

tree = ('<start>', [('<expr>',
                     [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
                      ('+', []),
                      ('<expr>', [('<integer>', [('<digit>', [('2',
                                                               [])])])])])])
assert mystring == tree_to_string(tree)
display_tree(tree)

Recursion

Recursion is possible in grammar such as with <expr> being used in its own definition. There are two kinds of recursion to note in parsing left-recursive and right-recursive.

Left-recursive

A non-terminal is directly left-recursive if the left-most symbol of any of its productions is itself.

Example:

LR_GRAMMAR: Grammar = {
    '<start>': ['<A>'],
    '<A>': ['<A>a', ''],
}

mystring = 'aaaaaa'
display_tree(
    ('<start>', [('<A>', [('<A>', [('<A>', []), ('a', [])]), ('a', [])]),
                 ('a', [])]))

A grammar can also be indirectly left-recursive if the left-most symbols can be expanded using their definitions to produce the non-terminal as the left-most symbol of the expansion. For example <integer> will be considered hidden-left recursive if <digit> could derive an empty string.

Right-recursive

Right recursive is the same idea but it expands the other direction.

RR_GRAMMAR: Grammar = {
    '<start>': ['<A>'],
    '<A>': ['a<A>', ''],
}

display_tree(('<start>', [('<A>', [
                  ('a', []), ('<A>', [('a', []), ('<A>', [('a', []), ('<A>', [])])])])]
             ))

Discussion Question: Do you think left or right recursion is better for top-down parsers and why?

Click to Expand for the Answer

Answer: Right Recursion!

Left recursion can get caught in infinite loops with top-down parsers which is not ideal. Right recursion avoids that problem because it waits to do more recursive calls until after the regular input.

Ambiguity

Sometimes there are multiple trees from the same grammar called parses. The string 1+2+3 has two ways it can be parsed using the A1_GRAMMAR. One method to deal with this is to simply choose an order of resolution, perhaps choose the first tree in any case (PEGParser). Another approach returns all the trees (Earley parser).

A Parser Class

To develop parsers there needs to be defined interface for parsing. There are two approaches to parsing a string using a grammar.

The first approach is to use a lexer. A lexer tokenizes an incoming string then feeds the grammar one token at a time. Each token represents a meaningful unit such as a number or keyword. The lexer handles the tokenization process, while the parser only has to focus on understanding the structure of the language using these tokens. The result of parsing is usually a shallow derivation tree, which can be converted into an Abstract Syntax Tree (AST).

The second approach is using a tree pruner. This occurs after the complete parse. The tree pruner removes the nodes that correspond to individual tokens and then replaces them with their actual string values as leaf nodes. This approach is more flexible and powerful additionally there is no separate step for lexing and parsing.

Parsing Expression Grammars

Parsing Expression Grammars (PEG) are recognition based grammars that specify a sequence of steps to take to parse a given string. Parsing expression grammars are represented by a set of nonterminals and corresponding alternatives representing how to match each.

PEG1 = {
    '<start>': ['a', 'b']
}

Unlike context-free grammars alternatives represent ordered choices, meaning the parser will stop at the first matching rule it finds.

PEG2 = {
    '<start>': ['ab', 'abc']
}

The Packrat Parser for Predicate Expression Grammars

Packrat parsing is one of the simplest parsing techniques used for PEGs. It works by caching results from previous parsing steps so if the same problem comes up again it can reuse the cached solution instead of redoing the work.

mystring = "1 + (2 * 3)"
peg = PEGParser(EXPR_GRAMMAR)
for tree in peg.parse(mystring):
    assert tree_to_string(tree) == mystring
    display(display_tree(tree))

Parsing Context-Free Grammars

Problems with PEG

PEGs may at first appear simple but in some cases may include a little bit more thinking.

PEG_SURPRISE: Grammar = {
    "<A>": ["a<A>a", "aa"]
}

When interpreted as a context free grammar and used as a string generator it will produce strings of the form aa, aaaa, aaaaaa. It produces strings where the number of a is \(2 * n\). PEGs are oriented towards language recognition and it is also unclear how to translate a PEG to a CFG. With the main focusing being fuzzing next we are going to look at parsers that can accept context-free grammars.

Earley Parsers

The Earley parser is a versatile general-purpose parser capable of parsing any arbitrary Context-Free Grammar (CFG). It was developed by Jay Earley in 1970 for computational linguistics. Although its computational complexity is \(O(n^3)\) for parsing strings with arbitrary grammars, it can efficiently parse strings with unambiguous grammars in \(O(n^2)\) time. It can also parse all LR(k) grammars in linear time, \(O(n)\) as noted by Joop M.I.M. Leo in 1991. Further enhancements, particularly in handling epsilon rules, were introduced by Aycock et al. in 2002.

Discussion Question: How does the computational complexity of the Earley parser affect its efficiency when dealing with large or ambiguous grammars? Can the parser be optimized further for specific use cases?

Grammar Restrictions

A notable limitation in the Earley parser implementation is that the start symbol can only have one alternative in its alternative expressions. However, this restriction can be easily bypassed by introducing a new start symbol. For example, given a grammar:

grammar = {
    '<start>': ['<A>', '<B>'],
    ...
}

You can rewrite it to conform to the single-alternative rule:

grammar = {
    '<start>': ['<start_>'],
    '<start_>': ['<A>', '<B>'],
    ...
}

Discussion Question: Why might the Earley parser require the start symbol to have only one alternative, and what implications does this have for the flexibility of grammar representations?

Implementing the EarleyParser

To implement the EarleyParser, we can create a class derived from a basic Parser class. Here’s an example of how to use the Earley parser with some simple expressions:

Example 1: Parsing an Expression

mystring = "1 + (2 * 3)"
earley = EarleyParser(EXPR_GRAMMAR)
for tree in earley.parse(mystring):
    assert tree_to_string(tree) == mystring
    display(display_tree(tree))

The derivation tree for this example might look like this:

<start>
  <expr>
    <term>
      +
      <expr>
        <factor>
          <integer>
            <digit>
              1 (49)
        <term>
          <factor>
            ( (40)
        <expr>
          ) (41)
        <term>
          <factor>
            *
            <term>
              <integer>
                <digit>
                  2 (50)
              <factor>
                <integer>
                  <digit>
                    3 (51)

Example 2: Parsing Another Expression with Decimals

mystring = "1 * (2 + 3.35)"
for tree in earley.parse(mystring):
    assert tree_to_string(tree) == mystring
    display(display_tree(tree))

The derivation tree here might look like:

<start>
  <expr>
    <term>
      <factor>
        *
        <term>
          <integer>
            <digit>
              1 (49)
          <factor>
            ( (40)
        <expr>
          ) (41)
        <term>
          +
        <expr>
          <factor>
            <integer>
              <digit>
                2 (50)
          <term>
            <factor>
              <integer>
                . (46)
              <integer>
                <digit>
                  3 (51)

Discussion Question: How does the parser handle nested expressions, and how does the structure of the derivation tree help us understand the parsing process?

Testing the Parsers

Testing parsers can be done effectively by generating random grammars and their corresponding strings. This provides an interesting way to ensure that parsers are working as expected.

Discussion Question: What challenges might arise when testing parsers with random grammars, and how can fuzzing help identify edge cases or parsing issues?

Background

There are various parsing techniques available, each designed to handle specific types of grammars. Some common approaches include LL and LR parsing. LL parsing processes the input left-to-right, using a leftmost derivation. LR parsing works similarly but uses a rightmost derivation. These approaches are particularly efficient, with LR(1) grammars parseable in linear time.

Discussion Question: How do LL and LR parsing differ in terms of efficiency and applicability, and when might one be preferred over the other?

Complexities of Parsing

For arbitrary CFGs, the best-known algorithms typically have a worst-case complexity of \(O(n^3)\). More advanced techniques, like boolean matrix multiplication, could theoretically reduce this further, but practical implementations often remain cubic.

Discussion Question: What factors influence the complexity of parsing algorithms, and how can they be optimized for practical use cases?

Peg Parsing

Parsing Expression Grammars (PEGs) are another important concept. While PEGs are effective for many contexts, their ability to represent certain languages (like anbncn) remains an open question. However, PEGs have the advantage of providing a simple top-down interpretation, making them intuitive for writing parsers.

Discussion Question: In what scenarios would PEG parsing be more advantageous than other parsing techniques, and what are its limitations?

This concludes our overview of the Earley parser and its role in parsing arbitrary context-free grammars. By integrating it with fuzzing techniques, we can effectively test parsers against a variety of grammars, ensuring robustness and correctness.

Reflection

The article “Parsing Inputs” highlights the critical role of parsers in fuzzing by emphasizing how input parsers break down strings into derivation trees, which can then be mutated and recombined to generate new, valid inputs. I found the explanation of non-terminals, terminals, and the use of parsers like PEGParser and EarleyParser to be insightful. The examples, particularly the one about using a parser to improve fuzzing, demonstrated how incorporating grammar into the fuzzing process can significantly increase the likelihood of generating valid inputs, thus making fuzzing more effective.

I particularly appreciated the in-depth exploration of the EarleyParser and its flexibility in handling context-free grammars, which was crucial for understanding how such parsers can be used for fuzz testing. The emphasis on using parsers to simplify complex input generation is an actionable takeaway that I will apply when generating more structured inputs in future fuzzing exercises. The discussion of recursion, ambiguity, and grammar restrictions also gave me a clearer understanding of the complexities involved in parsing different types of grammars, which can be beneficial when working with more complex systems.

Action Items

The action items derived from the reading emphasize the importance of understanding and implementing parsing techniques for enhancing fuzzing capabilities. To start, it is crucial to experiment with different parsing methods, such as PEG and Earley Parsers, to determine their applicability in generating valid input for fuzzing tools. The next step is to implement a parser class with the ability to handle context-free grammars, as it can support more complex structures and enhance the flexibility of fuzzing processes.

Furthermore, the implementation of a grammar fuzzer should be explored, particularly using techniques like cache updates and random node expansions, to facilitate the generation of diverse test cases. Additionally, it is important to apply these parsing methods to generate meaningful and valid test data, as well as to test the parsers against various types of grammars, ensuring that all edge cases are addressed. Finally, an evaluation of the performance of each parsing method in terms of parsing efficiency and the quality of generated test inputs should be conducted, with the goal of optimizing fuzzing strategies.

Reference

Unless otherwise noted, the source code segments in this article are excerpted directly from the Fuzzing Book. The authors of this online book licensed the source code and its written content under the BY-NC-SA 4.0 Creative Commons License. More details about the license for the Fuzzing Book are available in the The Fuzzing Book License.

Return to Blog Post Listing