Due: Tuesday, Feb 14, 10 p.m.
Extended to Friday, Feb 17, 10 p.m.
One of the ways that Google makes searching the web so fast is that when it finds pages on the internet it makes a copy of them and creates a search index. (Of course Google does a *lot* more, but we have to start somewhere.) Your task in this assignment is to create a search index for a set of files based on word frequency.
We will use a simple algorithm to create an index for a set of files. We will simply count the frequency of each word in each of the files. This means that we need a data structure to keep track of this information. Since we don't know how many words will be in an index, we will use a sorted linked list to store the words. The picture below shows what the linked list looks like.
Each list node contains three elements: the word, an array that
stores the number of times the word has been seen in each file, and a
pointer to the next element of the list. There is another data
structure which is an array of strings, that stores the name of each
file that is indexed. The index of a file name corresponds to the index
of the freq
array in a list node. Storing the file names separately
means that we don't need to store the name of each file many times.
In the diagram above, four words have been extracted from two files.
The two files are called Menu1
and Menu2
.
The linked list shows that the word spinach appears 2 times in the file
Menu1
and 6 times in the file Menu2
. Similarly the word
potato appears 11 times in the file Menu1
and 3 times in Menu2
.
The words in the linked list are in alphabetical order.
Since this is your first C assignment, much of the machinery to
implement the data structure described above is given to you in the
files freq_list.c
and freq_list.h
. These
files can be found in the a2 directory of your repository. The only
function you need to write to complete the set of functions is
add_word()
(20%). You should study these files carefully to
understand how the code works and why certain decisions were made.
Your first task is to write a program that will build an index for
one or more files. The names of the files to index are given as command
line arguments. The default name for the file that stores the array
of file names is filenames
and the default name for the file
that stores the index is index
. These names can be modified
through optional command line arguments as follows:
indexfile [-i FILE] [-n FILE] FILE ... OPTIONS -i FILE Write the linked list to FILE. This overrides the default file name "index". -n FILE Write the array of file names to FILE. This overrides the default file name "filenames".
The code that handles the options is given to you in indexfile.c
.
You can use it as a template for reading in arguments. You must use getopt
to process the command line arguments. Please see the examples online and in tutorial for
how to use getopt.
A Makefile is given to you as well. At this point, you should be able to run make
indexfile
and an executable file will be generated for indexfile
.
(Note that make
with no arguments will not work until you write query.c
).
If indexfile
does not encounter any errors it will produce no output on standard output.
It will simply create and write to the files. If the files already exist, they will be overwritten. Your
program should read each file one line at a time and break the line up into words. You might consider
using strsep
or strtok
to break a line into words.
If you just use strsep
or strtok
to break a line into words, the punctuation
will go along with each word and "hello!" and "hello" will be two different words.
Write a function that will remove punctuation characters from the beginning and endings of words. Your function should take a string an and argument and return a pointer to a copy of the string with the punctuation removed. Note that there may be multiple punctuation characters at the beginning and end of a words. A punctuation character is anything that is not a letter and is not a number. Your function will also change all uppercase letters to lowercase letters. Hint, there are built in functions that will make this an easy task.
Incorporate your function into your indexfile
program.
Now that you have a program to create an index, write a program called query
that takes one word as an argument and writes to standard output the
list of files containing the word ordered by the number of occurrences
of the word in the file. Each line of output will be the frequency
followed by the filename. For example, if we ran query spinach
using the index above, the output would be:
6 Menu2 2 Menu1
The query
program will take the same optional arguments as indexfile
. That is, it will use the same default names for the two files that hold the data structures, and will use -i
and -n
to specify different names for these files. It must use getopt
to read in arguments.
In a file named critique.txt
, write a short (no more
than 250 words) critique of the starter code given to you. A critique is
a critical evaluation. It can be both positive and negative. The
issues you should be considering when thinking about your critique in
the case are
We will do an example critique in class in the week of Feb 6. If you can't make it to class, please ask a friend to take notes for you. There are two reasons to ask you to write a critique: to give you more practice explaining code in written English, and to give you a reason to study the starter code more thoroughly.
The critique should be professional in tone. Don't be rude; don't insult; don't be sarcastic. The point is to identify strengths and weaknesses in the code, and to demonstrate your understanding.
Your critique will be graded on the quality of the writing, how clearly you make your arguments, and your choice of things to critique. The emphasis in the marking scheme will be on the clarity of the arguments and the structure of your writing. You should write in full sentences, not point form. and you may assume that the reader is a knowledgeable C programmer. Note that 250 words is not a lot of text, so you will need to pick a small number of points to make, and you will not be able to carry out an exhaustive review.
We are not going to be sticklers on precisely the 250 work limit, but if you go over by too much, you will not receive full marks.
Commit to your repository under a2 all of the files that are needed to run
query
and indexfile
, and your critique.txt
.
A Makefile
has been provided. The markers must be able to run
make
such that both programs are compiled with no warnings. You may need
to modify the Makefile to add more source code files.
Coding style and comments are just as important in 209 as they were in previous courses. Use good variable names, appropriate functions, and descriptive comments.
Please remember that if you submit code that does not compile, it will receive a grade of 0. The best way to avoid this potential problem is to write your code incrementally. For example, the starter code compiles and solves one small piece of the problem which can be checked. Get a small piece working, commit it, and the move on to the next piece. This is a much better approach than writing a whole bunch of code and the spending a lot of time debugging it step by step.