CSCI 136
Fundamentals of Computer Science II
Spring 2022

Montana Tech of The University of Montana
Computer Science & Software Engineering



ASSIGNMENT #8 - Regular Expressions

In this assignment, you will practice using regular expressions, recursion, and writing to a file. Your code will implement a web crawler that searches web pages for links to other web pages and then explores those links and so on.

Web Crawler. For this assignment you are to write a program that takes two command line arguments. The first argument is the full url of the page where you want your web crawler to start. For example, if you want to start on our class website you would use https://katie.cs.mtech.edu/classes/csci136. The second argument is how many links deep you want to explore. While testing your program, I'd only go two links deep; three links gets pretty large. There is not a lot of code to this assignment, but it will need to be the correct code.
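
For example, assuming the submission file name Crawler.py given at the end of this handout, a run that starts on the class website and goes two links deep would look like:

    python Crawler.py https://katie.cs.mtech.edu/classes/csci136 2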

Your program should have two parts.


Main Program. The main program will start everything off by reading the command line arguments and sending them to a function. Once the function completes, your main program should write the contents of the dictionary to a file named links.txt. Each link should be on its own line when written to the file, with its count listed afterward.
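
A minimal sketch of what the main program might look like is shown below. It assumes the crawl function described in the next item, and the file format shown (url, a space, then the count) is just one reasonable reading of the requirement, not the only acceptable one.

    import sys

    def main():
        # The two command line arguments: starting url and how many links deep to go.
        start_url = sys.argv[1]
        depth = int(sys.argv[2])

        # Dictionary mapping each url to how many times it was encountered.
        links = {}
        crawl(depth, start_url, links)

        # Write every link and its count to links.txt, one link per line.
        with open("links.txt", "w") as outfile:
            for url, count in links.items():
                outfile.write(url + " " + str(count) + "\n")

    if __name__ == "__main__":
        main()
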
Recursive Crawl Function. The function, called crawl, will take three arguments: the recursion level, the current link, and a dictionary to store links in (a skeleton sketch follows the base cases below). Crawl will:
  • Open up the web page specified in the link or url.
  • Read the contents of that page.
  • Find all the links specified on that page, and for each one, recursively call itself with a smaller recursion number, the link and the dictionary.
  • The dictionary should be used to store a count of how many times a link has been encountered.
  • The dictionary should be indexed by the url and should store the count associated with that link.

    There are two base cases:

  • If the recursion count is 0, simply return.
  • If the link has been encountered before, increment the count in the dictionary for that link and return.
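
    A rough skeleton of crawl that follows these rules is sketched below. It uses the regular expression and urllib details explained later in this handout; the try/except around the page open is not required by the assignment, just one way to keep a single bad link from stopping the whole run.

      import re
      import urllib.request

      def crawl(level, link, links):
          # Base case 1: no recursion levels left.
          if level == 0:
              return

          # Base case 2: link already seen -- bump its count and stop.
          if link in links:
              links[link] += 1
              return

          # First time this link has been encountered.
          links[link] = 1

          # Open and read the page (the try/except is optional).
          try:
              page = urllib.request.urlopen(link)
              result = str(page.read())
          except Exception:
              return

          # Recurse on every link found on this page.
          for url in re.findall(r'href="(.*?)"', result):
              crawl(level - 1, url, links)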

    You should use a regular expression to find the links on each page you read in. Links are specified by the text
    href="url"
    where url is the link to the next page. Note that the quotes around the url are part of the specification. Be careful in using Python's regular expressions - if you use .* to match 0 or more characters, it will match everything up to the end of the text instead of stopping at the end of the piece you are trying to capture. Use .*? instead.
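
    For example, the made-up html string below shows the difference between the two patterns:

      import re

      text = 'Go <a href="page1.html">here</a> or <a href="page2.html">there</a>.'

      # Greedy .* runs all the way to the last closing quote in the text:
      print(re.findall(r'href="(.*)"', text))
      # ['page1.html">here</a> or <a href="page2.html']

      # Non-greedy .*? stops at the first closing quote after each href=":
      print(re.findall(r'href="(.*?)"', text))
      # ['page1.html', 'page2.html']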

    You are probably wondering how to access a web page from Python; a short example putting these steps together follows the list below.

  • First, you will need to import urllib.request.
  • When you want to access a page, you will need to open the page:
    page = urllib.request.urlopen(link)
  • and then read that page:
    result = str(page.read())
  • The contents of "result" will be the html code from that page in string format.
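
    Put together, fetching a page might look like this (the url here is just the class site from the example above; any reachable page works):

      import urllib.request

      link = "https://katie.cs.mtech.edu/classes/csci136"

      page = urllib.request.urlopen(link)   # open the page
      result = str(page.read())             # read its html as one string

      print(result[:200])                   # peek at the first 200 characters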

    Good luck -- and have fun!


    Grading
    Grade Item                               Points Possible   Points Earned
    Program Compiles and Runs                       4
    Header Comment                                  2
    Regular expression works                        4
    crawl Function works                            3
    crawl Function is recursive                     4
    url pages opened and read correctly             4
    Links stored in dictionary                      3
    Count of link encounters is correct             3
    Results written to file                         3
    Total                                          30

    Submission. Submit your code, Crawler.py, using Moodle. Be sure your submitted code has the required header with your name and a description of the program.

    Page last updated: March 09, 2022