Social media search engine

CSCI 470
Web Science
Spring 2015

Schedule | Assignments | Resources | Course syllabus | Moodle

Social media search engine

You will implement an information retrieval system that searches a corpus of social media updates.
You will create and benchmark applications using different dynamic content technologies:

CGI
Apache module
PHP
A method of your own choice

Application overview. Using each dynamic content technology, you will be developing a web application that has the identical functionality. On your web server you will have a corpus of 100K social media updates taken from the ICWSM 2011 Spinn3r dataset. Here are the first 10 lines of the file:

0       after we get done cleaning
1       hoping to get the heat fixed today
2       what is he really doing in north louisiana
3       but got chipotle to make me feel better
4       you know a sister is ready
5       want to know a secret about me
6       they say yes and i say maybe
7       you know what night it is
8       hope it's an easy fix
9       i knew they would from the get go

The first column is an integer ID of a particular social media update. The second column contains the update. The update has been converted to all lowercase. Words are separated by a single space. All punctuation aside from apostrophe has been removed.

Your web applications take input via the required HTTP GET parameter search. Search contains whitespace separated terms that are used to search the updates. Note the search parameter will be URL encoded, so spaces will appear as + or %20. Other punctuation may be present and the terms may be in mixed case.

Your applications also take an optional GET parameter limit. If this parameter is present, and is greater than zero, it specifies the maximum number of matching updates to return. If the limit parameter is less than or equal to zero, or is not parseable as an integer, your application should fallback to simply returning all matching results with no limit.

An update matches the search query if all the words in the query also appear in the update. This matching ignores the case of the query terms and also strips any characters in the terms that are not a-z or apostrophe. Since the applications may be used as a web service, you cannot guarantee query normalization will take place before the input is passed to the application (as might be possible using client-side Javascript code). Matching updates are returned in the order they appear in the corpus. You can make the following assumptions about the search query and the social media data:

The search query and social media updates contain no more than 1024 words.
A word in the query or update will be no longer than 256 characters.

In the event of input exceeding these limits, your program should still be safe, i.e. not crash and not be susceptible to a buffer overrun attack.

Your web applications return plain ASCII content, i.e. with an HTTP header of Content-Type: text/plain;charset=us-ascii. The first line has the number of search terms followed by the list of terms (after URL decoding). This is followed by a blank line, then the zero or more matching updates from the corpus. Each line has the update's ID followed by the update's text. After all the updates, there is a blank line followed by the number of matches found. Example output for the query search=snow+storm+ICE!:

Content-Type: text/plain;charset=us-ascii

3 search term(s): storm snow ice

49742   the ice storm knocked the power out at my house and now the snow has drifted across the road
67299   ice and snow storm is closing in
94566   personally would much rather all snow than an ice storm

Matches: 3

Example output for the query search=snow+storm&limit=4:

Content-Type: text/plain;charset=us-ascii

2 search term(s): storm snow

5723    they said we were getting a snow storm
6847    ok i'm really starting to get nervous about this snow storm
8301    hope all my friends and family do well through the snow storm that's getting ready to hit michigan
13495   i hate rain in the middle of a snow storm

Matches: 4

Example output for the query search=storm+snow+epic&limit=4:

Content-Type: text/plain;charset=us-ascii

3 search term(s): storm snow epic


Matches: 0

If your application receives no GET parameters, return the message ERROR: no GET parameters!. If your application receives no search terms, return the message ERROR: search parameter must be specified!. You may also want to include your own ERROR messages for other unexpected events (e.g. the corpus file on the web server is missing or not accessible).

Benchmarking. While we have previously discussed the apache benchmark ab utility, this utility lacks the ability to benchmark using different URLs. A more realistic way to benchmark is to execute random queries from a large set of possibilities. For this, we will use the siege utility.

siege has been installed on katie. You can type siege at the command line to see the available options. Here are the options we will be using and an example run:

  -c, --concurrent=NUM      CONCURRENT users, default is 10
  -i, --internet            INTERNET user simulation, hits URLs randomly.
  -t, --time=NUMm           TIMED testing where "m" is modifier S, M, or H
                            ex: --time=1H, one hour test.
  -f, --file=FILE           FILE, select a specific URLS FILE.      
  
% siege -c 16 -i -t 60s -f urls.txt
** SIEGE 3.0.9
** Preparing 16 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200   1.63 secs:      48 bytes ==> GET  /cgi-bin/SocialSearch?search=alltel%20wireless                        
...
Lifting the server siege...      done.

Transactions:                 563 hits
Availability:                 100.00 %
Elapsed time:               59.91 secs
Data transferred:              0.07 MB
Response time:               1.20 secs
Transaction rate:       9.40 trans/sec
Throughput:                0.00 MB/sec
Concurrency:                     11.29
Successful transactions:           563
Failed transactions:                 0
Longest transaction:              2.86
Shortest transaction:             0.46

I have generated a set of 100,000 searches for benchmarking purposes. Copy the file from /home/classes/csci470/urls.txt to your home directory. You will need to change the first line to reflect the IP address of your VM as well as the path to the web application you are testing.

The main metrics you will be reporting are the transaction rate and the response time. You should also keep an eye on the availability. Report whenever this goes below 100% (which it may for some technologies under heavy load).

Part 1: CGI. Implement the above program as a CGI script or program. You can use any language you like, though you may need to install a compiler/interpreter for your chosen language. Here is a skeleton CGI program in C that shows how to obtain the query string and output a page. It also has a couple helper functions for parsing out and normalizing the search query.

Be sure your CGI script or program is in the correct location, normally /usr/lib/cgi-bin. The script/program should have an owner and group of www-data. The owner should have execute permission. You probably also want to make sure the corpus text file is located so the CGI program can find and access it. If you get an internal server error, you've done something wrong. Check the log entries in /var/log/apache2/error.log

Create an HTML page called search_cgi.html in the root directory of your web server. The page should have a form with two text input fields, one for the search query and one for the optional limit value. The form should have a submit button that executes the entered query against your CGI program.

Benchmark your CGI program, putting the data into this readme.txt file. You can test my CGI solution out at: http://104.236.117.243/cgi-bin/SocialSearch

Part 2: Apache module. Create an in-process Apache module version of the search application.

The following instructions show how to build and install a simple module that prints "Hello world!" and returns the passed in query string.

Download the source file mod_hello.c.
Compile the module: apxs2 -c mod_hello.c. Note: You may need to install development tools onto your server first:
- sudo apt-get install apache2-dev
Install the compiled module into Apache: sudo apxs2 -i -a -n hello mod_hello.la

You should now find that the module has been copied to the /usr/lib/apache2/modules directory.
You should also find that a hello.load file has been added to your /etc/apache2/mods-available directory with a symbolic link placed in /etc/apache2/mods-enabled.

Configure your Apache server to send requests to a specific subdirectory to your custom Apache module by creating /etc/apache2/conf-available/hello.conf.
```
    <IfModule mod_hello.c>
      <Location /hello>
         SetHandler hello
      </Location>
    </IfModule>
    
```
You need to make sure Apache loads the above configuration whenever it starts. To do this, add a symbolic link to the helo.conf file to the /etc/apache2/conf-enabled subdirectory.
Restart your Apache server: sudo service apache2 restart
Test that requests to the /hello subdirectory are now serviced by the Apache module. For example, try visiting http://X.X.X.X/hello

Using the hello module as a template, create a module that implements the social search application. Here are some additional helpful tips:

I had trouble naming a module in mixed case, so it might be best to stick to all lowercase for your module name and directory location.
You will either need to figure out the "home" directory the module runs at, or else put an absolute path to the corpus file. I put my corpus in /var/www/ and put this path into my module's C program.
You need to use the function ap_rputs to output text to the client.

Create an HTML page called search_module.html in the root directory of your web server. The page should have a form with two text input fields, one for the search query and one for the optional limit value. The form should have a submit button that executes the entered query against your Apache module.

Benchmark your Apache module, putting the data into your readme.txt file. You can test my Apache module solution out at: http://104.236.117.243/social

Part 3: PHP. Create a PHP server-side scripting version of the search application.

Parameters from a GET request are available in the PHP superglobal $_GET variable, e.g. $_GET["search"].
PHP supports a wide-variety of C-like file function, for example fopen.
Create an HTML page search_php.html that obtains search results using your PHP script.
During development, you probably want to see any PHP error messages, change display_errors to on in /etc/php5/apache2/php.ini.

Create an HTML page called search_php.html in the root directory of your web server. The page should have a form with two text input fields, one for the search query and one for the optional limit value. The form should have a submit button that executes the entered query against your PHP script page.

Benchmark your Apache module, putting the data into your readme.txt file. You can test my PHP solution out at: http://104.236.117.243/SocialSearch.php

Part 4: Choose your own technology. Implement the application using another dynamic web technology of your own choice (but not CGI, Apache module, or PHP). Some possibilities include: FastCGI, mod_perl, mod_python, your own custom web server, a web server besides Apache.

Benchmark your chosen technology, putting the data into your readme.txt file.

In the class immediately following the deadline, we will do a live bakeoff between everyone's solutions. The student achieving the fastest average transactions per second will be awarded extra points. Extra extra points will be award for beating my solution.

Submission. I will be testing your application by logging into your sever, starting Apache and testing your various search*.html pages. I will also be testing via siege. You submit all your source code and readme.txt to the Moodle dropbox. The Moodle timestamp is the completion time for the assignment.

Page last updated: February 02, 2015