Fork me on GitHub

Assignment 1 - Parsing (due Oct 11, noon)

General assignment instructions:

Updates and clarifications

Introduction

One of the most fundamental tasks in computer science is parsing: taking in some input (usually but not always in text format) and recovering meaningful data from it. In this assignment, we'll focus on parsing HTML, taking a string of HTML text and then recovering the underlying tree structure from it.

Generally speaking, when you open your browser and navigate to webpage, what you see - a nice-looking website - is a result of your browser rendering an HTML file (along with many other files that we won't go into here). HTML files are simply plain text files that store the basic content of a webpage in a tree-like form. The nodes of the tree correspond to HTML elements, which are marked with start and end tags like <html> and </html>, respectively. Between these two tags appear the contents of the element, which can be other elements (giving HTML its recursive tree structure) or plain text.

Here's a simple example of an HTML file. Try opening it in your browser!

<html>
<body>
    <h1>This is a heading!</h1>
    <div>
        <p>This is a paragraph.</p>
        <h2>This is a subheading.</h2>
        <p>This is another paragraph.</p>
    </div>
</body>
</html>

One of the first tasks in rendering a webpage is to take the raw HTML - which is just text, remember - and get the related tree structure. This is exactly your job in this assignment!

Parsing strategy

Inspired by the functional programming paradigm, we'll view parsing as a function taking a string and returning structured data (in our case, a tree). But of course, a function which handles arbitrary HTML is surely quite complex; therefore our strategy will be to build up such a function by combining and transforming simpler functions.

This is the essence of recursive descent parsing: building functions that recognize small components of data, and then larger functions that call these functions recursively, either directly or by passing them through higher-order functions. Each of our parsing functions will take a single string parameter, and return a pair (list data rest), where data contains the data parsed from the input, and rest is a string containing the remaining part of the string that hasn't been parsed yet.

; (parser-function source) -> '(<data> <rest>)

Getting started

Download parsing.rkt and parsing_test.rkt. You are required to submit unit tests for each of the files you write on this assignment. We strongly recommend that for each function, you write some unit tests before implementing it!

Note that you'll want to look up Racket's string documentation to see what functions you have available.

We'll start with the "base case": matching a single word. This type of function will attempt to read in a word at the start of the input string. On a success, it returns the word, and the rest of the input; on a failure, it signals an error using (list 'error str), where str is the original input. (Note: this isn't exactly the most graceful way of handling errors, but we trade this for simplicity.)

  1. As a warm-up, write a function parse-html-tag, that tries to read in the word <html> at the start of its input. Check the tests for examples what what the function should do.
  2. Next, a higher-order function: write a function make-text-parser, which takes as input a word, and returns a function that tries to parse a string starting with the word. So the previous parse-html-tag function could be equivalently defined as (define parse-html-tag (make-text-parser "<html>")); we're asking for a generalization here.
  3. Our last building block will be quite simple: recognizing character classes. Racket represents character literals using #\X, where X is actual character. For example, the string "hello" consists of the characters #\h, #\e, #\l, and #\o.
    In our html language, the only special characters will be #\<, #\>, #\=, and #\", AND #\/ (added Sept 15). All other characters are non-special (you may assume we're sticking to ASCII characters, though). The only whitespace character is #\space, representing the single character in " ". Write the two methods parse-non-special-char and parse-plain-char, which recognize any non-special character and any non-special, non-whitespace character, respectively.

Parser combinators

Believe it or not, we now have all the basic ingredients we need to perform parsing. However, we still need ways of combining these base parsers. Recall from CSC236 the three language operations associated with regular languages/expressions:

  1. In parsing.rkt, fill in the bodies for the functions either, both, and star, which implement the three language operations (see the documentation for details).

Note that this is a really good opportunity to use let to bind local variables. Also, remember that you are allowed and encouraged to use helper functions (that you should document!) to make your code cleaner.

The good stuff: parsing HTML

Now that you have both the basic building blocks and some basic combinators, let's get to the actual task at hand: parsing HTML.

The HTML Element Data Structure

We'll represent an HTML element in Racket as a list with 3 or more items, where:

For example, we would represent the initial simple.html example would be represented in Racket as

'("html"
  ()
  ("body"
   ()
   ("h1" () "This is a heading!")
   ("div"
    ()
    ("p" () "This is a paragraph.")
    ("h2" () "This is a subheading.")
    ("p" () "This is another paragraph."))))

HTML Attributes

HTML elements can have attributes, which enables functionality like custom styling and dynamic behaviour (like clicking a button to submit a form). Attributes are declares within the opening tag of an HTML element, as follows: <p attr="val" attr2="val2">. By the way, this is why we consider #\= and #\" to be special characters.

An attribute consists of a name and a value, separated by an =. The value must be enclosed in double-quotes, but note that these are not stored in Racket (that would be quite redundant).

We represent a single attribute in Racket as a list of two items, the attribute name and value. For example, the following HTML:

<html>
<body>
    <p id="main" class="super">Hey</p>
</body>
</html>

has this Racket representation:

'("html"
  ()
  ("body"
   ()
   ("p"
    (("id" "main") ("class" "super"))
    "Hey")))

Your Task

  1. Your only task for this part is to write a parser parse-html, which takes as input a string, and tries to parse a valid HTML element at the start of the string, returning as its data the tree representation of the parsed HTML. See the documentation in parsing.rkt and the final section "Specification Details" for more information.

This is a large task; we strongly recommend breaking this down into smaller pieces, e.g.

Guidelines

Write helper functions! This is much too large of a task to put in just one function.

Think hard about how to break up the task into smaller parts. You should use the tools you built in the first part of the assignment - we didn't make you do these for no reason! However, you might want to write other parsers, other functions that create parsers, or other parser combinators; this is strongly encouraged. The first tasks are a starting point, not the be-all and end-all of parsing tools.

You are required to write and submit tests for all of your functions! Get in the habit of testing early and often, to save yourself a lot of work at the end.

Specification Details

Obviously, your parser will only be able to handle a rather restricted subset of valid HTML. Here are some clarifications about what you do and do not have to handle:

  1. The only whitespace character you need to handle is the space #\space. We will not be testing your code on any strings containing #\newline, #\tab, etc.
  2. Whitespace appearing in text contents and attribute values should be preserved by your parser: so <p> Hi! </p should be parsed as '("p" () " Hi! ").
  3. But remember that HTML elements can have text, or other elements, but not both. So your parser should ignore the whitespace between <body> and <p> in here: <body> <p>Heyo</p></body>.
  4. No element or attribute name can contain whitespace. So <bo dy> is invalid.
  5. You can also assume that in our tests, closing tags contain no whitespace and are always closed: < / p > will not appear, nor will an unclosed tag like </p.
  6. In our tests, element opening tags contain no whitespace if there are no attributes. However, if there are attributes, then arbitrary amounts of whitespace are allowed to separate:
    • element name from first attribute
    • around the =
    • between an attribute value and the next attribute name or #\>. For example, the following tag is allowed: <p class ="main" id= "me " >Hi</p>. This should be parsed as '("p" (("class" "main") ("id" "me ")) "Hi").
  7. You can also assume that all opening tags have a closing #\>, so you do not need to handle a string like "<html <body>Hi</body></html>, in which the opening tag is not closed.
  8. Empty elements can be interpreted as having text contents containing the empty string. So <p></p> should be parsed as '("p" () "").
  9. There are no semantic restrictions on element nesting, the element names, attribute names or values, or text contents. For example, a head element may appear after (or nested inside) a body element. The root element can be anything, not just <html>.
  10. Sept 15 clarification: like all other parsers we've asked you to create, you should only parse one root element at the beginning of the string. For example:

    > (parse-html "<hi>Hello</hi><bye>Bye</bye>")
    '(("hi" () "Hello") "<bye>Bye</bye>")

Submitting the assignment

Once you are finished and confident in your work, follow these steps to submit your assignment on MarkUs.

  1. Login to MarkUs.
  2. Go to the "Assignment 1" page.
  3. If you worked in a group, form a group and make sure your partner joins. Remember that this requires the "invitee" to actually login and join the group; otherwise, that person won't receive any credit for the assignment.
  4. Submit two files: parsing.rkt, containing the functions you implemented for the assignment, and parsing_test.rkt, containing the tests your wrote.
  5. Congratulations, you're done! Eat some chocolate.