The CFEngine Users Manual

Release 0.2 (An Alpha Release)


1      Introduction to the CFEngine
1.1       Target Audience
1.2       Services Provided by the CFEngine
1.3       Collaborative Filtering (CF) (and a rough introduction to CF terminology)
1.4       Ratings
1.4.1        Explicit and Implicit Ratings
1.5       Detailed Functionality of CFEngine
1.6       High Level Architecture of an Application Using the CFEngine

2      System Requirements

3      Server Administration Manual
3.1       Running the Server on Windows
3.2       Downloading the MySQL Connector/J library
3.3       Configuring the Server
3.3.1        The configure Script
3.3.2        Additional Server Properties Files
3.4       Configure MySQL
3.5       Loading Existing Ratings Data
3.5.1        Loading the Sample Ratings Datasets
3.6       Starting and Stopping the Server
3.7       Running the Sample Console-based Client
3.8       Supported Algorithms
3.9       Performance Features (User Caching)

4      Client Programming Manual
4.1       Available Client Access Methods
4.1.1        Java Remote Method Invocation (RMI)
4.1.2        CORBA (C++, many others)
4.2       Representing Entities within the CFEngine
4.3       Client-Server Architecture of the CFEngine
4.4       Creating a CFEngine Client in Java using RMI
4.4.1        Introduction to the Simple Console-Based Client
4.4.2        Connecting to the CFEngine
4.4.3        Common Exceptions in the CFEngine Interface
4.4.4        Sending Ratings to the CFEngine
4.4.5        Maintaining a Client-side Mapping of Ids to Names
4.4.6        Removing/Deleting Ratings from the Server
4.4.7        Retrieving Ratings from the Server
4.4.8        Predicting Ratings for Specified Items
4.4.9        Requesting a List of Recommendations
4.4.10      Querying Valid Rating Values
4.4.11      Identifying the Next Usable Id for Users and Items
4.4.12      Shutting down the Server
4.5       Creating a CFEngine Client in C++ Using CORBA


1        Introduction to the CFEngine

This document is intended to serve as a manual for the CFEngine collaborative filtering recommendation engine.

1.1    Target Audience

The expected audience for each part of this manual is as follows:

  • Chapters 1-2. Introductory material describing what the CFEngine is, how it works at a high level, and why you might want to use it. Read this if you want to know if the CFEngine is right for your personalization and recommendation needs, or if you want to know what kind of hardware or software is required.
  • Chapter 3. Administration, setup, configuration, and testing of the server.
  • Chapter 4. Programming manual and programming reference – how to integrate your applications with CFEngine recommendations and predictions.

1.2    Services Provided by the CFEngine

The CFEngine is a recommendation engine. Given the proper data, the CFEngine will predict which items an individual is likely to enjoy or find useful. It can be used in a wide variety of environments. Examples include:

  • Reducing information overload – when there are too many items to consider them all, a recommender system can predict what items have the highest likelihood of being useful, interesting, entertaining, or valuable. For example, the MovieLens web site (MovieLens.org) predicts what movies you are likely to enjoy out of thousands of possible movies and videos.
  • Improving usability – even in cases where you don’t have an enormous number of items to select from, recommender systems can improve the usability of a system by predicting items that you are most likely to find useful and organizing the user interface to place those items in the most easily accessible spots.
  • Improving sell rates – Recommender engines can predict what items your customers are most likely to be interested in purchasing, allowing you to make those items prominently available to the users, or perhaps provide them with coupons for those items.

1.3    Collaborative Filtering (CF) (and a rough introduction to CF terminology)

Collaborative filtering (a term coined by David Goldberg et al. [tapestry]) refers to an environment in which a community of people come together to share the burden of filtering information. Consider an online newspaper with fifty news articles. No one person has the time to read all fifty articles, but in a community of fifty people, each member can read one article and determine how much value that article provides. We say that they rate the article – they give it a rating. If a member’s rating for an article is sufficiently high, the article is recommended to the rest of the community. Pooling everybody’s recommendations, the community builds a list of the top 10 articles that are worth reading. In return for reading only one item, each user has been saved the time of scanning through fifty articles to find the interesting ones.

In reality, not everybody shares the same tastes, interests, or needs. Suppose instead that each member first reads and rates five articles instead of one. Now we can examine the ratings of a hypothetical member, Joe, and find the ten other members of the community whose ratings are most similar to Joe’s: the people who read some of the same articles as Joe and gave them similar ratings. For example, Joe may enjoy reading the international section, so he read and rated highly five articles from that section. Joe can now be matched up with ten other people who also read international articles and rated them highly. We call these people Joe’s neighbors. Once we have identified Joe’s neighbors, we can look for articles that Joe’s neighbors have rated highly but that Joe has not yet read. We can then recommend these articles to Joe. A list of the items most likely to meet a member’s needs is known as a set of recommendations.

As an extension to this example, consider an online news service that displays news article titles for free, but charges to display the full text of the article. Joe might want to be sure that an article is going to be worth reading before paying for it. Using the same approach described above, Joe could consult his neighbors who have already read the article to predict his rating for the article in question. When a member wants to know not just what the top items are, but what the actual predicted ratings are, we call this a prediction.

Collaborative filtering recommendation systems are software systems that enable communities to perform collaborative filtering. At the center of a collaborative filtering system is a collaborative filtering recommendation engine – the software system responsible for the computation involved in collaborative filtering. The recommendation engine is responsible for analyzing ratings, determining who is neighbor to whom, and computing the predictions and recommendations. The CFEngine is one such recommendation engine.

We refer to the people who provide ratings to the recommender engine (and request recommendations and predictions) as users.

1.4    Ratings

As described in Section 1.3, recommender systems take ratings as input and produce recommendations and predictions as output. What exactly are these ratings?

The CFEngine operates on single-dimensional numeric ratings. There are three broad classes of numeric ratings:

  • Multi-valued ratings data. Each rating is a number on a predefined scale. The low end of the scale indicates that an item was poor – the user felt that the item did not provide value, or was not interesting or entertaining. The high end of the scale means the item was great – it had high value. A multi-valued rating is similar to what you might expect on a survey: “On a scale of 1-5, rate how much you enjoyed this movie…” For example, a discrete scale might be the values 1, 2, 3, 4, or 5. A continuous scale might be any real number between 1 and 5.
  • Binary data. Each rating is either 0 or 1. For example, we might have survey data where we asked each user “Did you like product X? Yes or No?” If they said yes, we have a 1 rating, and if they said no, then we have a 0 rating.
  • Unary data. This data is trickier. Imagine that you are an e-commerce retailer. You have billing records for all of your customers. You know exactly what products they have bought. The fact that a customer has purchased a product indicates a high likelihood that they valued the item purchased. Thus we have a positive rating. However, just because a customer hasn’t purchased a particular item, we can’t really infer that they don’t like that product. Thus there is really only one rating value and the data is not binary. We call this situation unary.

The CFEngine is currently designed and tested to work best with multi-valued ratings data. However, it can also be successful with binary data. Unary data is a very hard problem. The CFEngine will only work with unary data if you first convert it to binary or multi-valued ratings data. Another approach is to combine unary purchase data (or other unary data) with some form of multi-valued observed implicit ratings as described in the next subsection.

1.4.1   Explicit and Implicit Ratings

Most of the ratings that we have given examples of so far are what we call explicit ratings. An explicit rating is a value given directly by a user in response to a query – “On a scale of 1 to 10, what would you rate this book?” Explicit ratings usually are seen as a strong source of information. A user can tell you exactly how they feel about a particular item.

Implicit ratings are ratings that are inferred from observing the behavior of a user. For example, we might observe that a user spends a lot of time reading a short news article online – from this we might infer that the user found that article valuable. Of course, we may be wrong – they may have simply started reading the article, and then were interrupted by a phone call. Implicit ratings can be collected from a variety of sources. Other examples of implicit ratings of varying utility include page views; time-spent-reading; emailing, printing, or saving a document; bookmarking; etc.

The CFEngine currently makes no distinction between explicit and implicit ratings. All rating values that are fed into the CFEngine are treated with the same strength. If you plan to use a combination of explicit and implicit ratings, you may have to think carefully about how you encode the rating values for each.

Both implicit and explicit ratings have strengths and weaknesses. Explicit ratings are often much more precise than implicit ratings. On the other hand, they can be more easily biased than observed ratings, because they represent a user’s self-evaluation of their perception of a particular item. Implicit ratings, on the other hand, are observations of the user’s behavior: less prone to bias, but much noisier. The process of inferring a rating from an observed behavior will often make mistakes.

The CFEngine will fit into your environment most easily if you have a homogeneous set of multi-valued ratings data – all explicit, all implicit, or normalized to be comparable.
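
One possible way to make implicit observations comparable to explicit ratings is to rescale them onto the explicit scale before sending them to the CFEngine. The sketch below is only an illustration of that idea. The 1-to-5 scale, the 300-second cap, and the class and method names are all assumptions and are not part of the CFEngine. It is written for a current Java version rather than the Java 1.4 baseline.

```java
/** Illustrative only: maps an observed reading time onto an assumed 1-5 rating scale. */
final class ImplicitRatingEncoder {
    static final double MIN_RATING = 1.0;     // assumed low end of the explicit scale
    static final double MAX_RATING = 5.0;     // assumed high end of the explicit scale
    static final double CAP_SECONDS = 300.0;  // assumed: longer reading times count as maximum interest

    /** Linearly rescales a reading time in seconds to the range [MIN_RATING, MAX_RATING]. */
    static double fromReadingTime(double seconds) {
        double clamped = Math.max(0.0, Math.min(seconds, CAP_SECONDS));
        return MIN_RATING + (MAX_RATING - MIN_RATING) * (clamped / CAP_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println(fromReadingTime(0));    // lowest rating on the scale
        System.out.println(fromReadingTime(150));  // midpoint of the scale
        System.out.println(fromReadingTime(900));  // capped at the highest rating
    }
}
```

Because the CFEngine treats all rating values with the same strength, a conservative encoding (for example, a narrower range for implicit ratings) may be worth considering when the observed behavior is noisy.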

1.5    Detailed Functionality of CFEngine

Once you seed the CFEngine with ratings from all your users (your customers, your employees, etc.), the CFEngine provides support for the following operations:

  • For a given user, list the top N items that the user is most likely to rate high. This is the bread and butter of the CFEngine.
  • For a given user, list the top N items from a category that the user is most likely to rate high. In many cases, we don’t want recommendations for any kind of item – we want to limit them to a subset of the items. The CFEngine provides support for this via categories. You can define any set of items to form a category, and then get the top recommendations from that category. For example, you might want to recommend the top mystery books.
  • For a given user and a given item, predict the rating that the user would give that item. This is useful for evaluating on an item-by-item basis. For example, a user might be viewing the description of a product online. You could use this functionality to predict how much value the user places on the item they are viewing. If the predicted rating is high, you might want to offer that user a coupon. Or perhaps only if the predicted rating is medium-high, because with a high predicted rating the user is likely to buy the item without the incentive of a coupon.

There are a good number of other supporting functions (such as functions for sending ratings to the CFEngine), but they are secondary to the three key functions listed above, which are the core of the system functionality.

1.6    High Level Architecture of an Application Using the CFEngine

The CFEngine will store numeric ratings, and provide recommendations based on those ratings. In the CFEngine, each user and item is identified by a unique number. The CFEngine does not support storage or retrieval of any information besides numeric ratings and numeric predictions/recommendations.

As a result, a typical application utilizing the CFEngine will need to maintain its own databases of domain-specific user and item information. For example, you may want to maintain demographic information on each user, or product catalog information about each item. Figure 1 illustrates an example architecture of a web server-side content management application that uses the CFEngine to predict which items of content should be displayed to which users. In this figure, a user – through their web browser – logs into a web site that utilizes the CFEngine. On the web server side, the content management application locates the user’s record in the user data, and from that determines the CFEngine userid associated with that user. A request for the given userid is sent to the CFEngine, which responds with itemids representing recommendations for items that the user will like. The content management application then accesses its item database to retrieve the content associated with those itemids and displays those items to the user.

Figure 1: Example architecture of a web server-side content management application that utilizes the CFEngine to generate recommendations.
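
The request flow described above can be sketched in Java as follows. This is a minimal illustration, not the CFEngine client API: the RecommendationSource interface is a hypothetical stand-in for the engine, and the two maps stand in for the application's own user and item databases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ContentManagerSketch {
    /** Hypothetical stand-in for the CFEngine: returns recommended itemids for a userid. */
    interface RecommendationSource {
        List<Integer> topItemIds(int userId, int n);
    }

    private final Map<String, Integer> userIdByLogin = new HashMap<>();    // application's user data
    private final Map<Integer, String> contentByItemId = new HashMap<>();  // application's item database
    private final RecommendationSource engine;

    ContentManagerSketch(RecommendationSource engine) { this.engine = engine; }

    void addUser(String login, int userId) { userIdByLogin.put(login, userId); }

    void addItem(int itemId, String content) { contentByItemId.put(itemId, content); }

    /** Login name -> userid -> recommended itemids -> content to display. */
    List<String> contentFor(String login, int n) {
        Integer userId = userIdByLogin.get(login);
        if (userId == null) throw new IllegalArgumentException("unknown login: " + login);
        List<String> result = new ArrayList<>();
        for (int itemId : engine.topItemIds(userId, n)) {
            String content = contentByItemId.get(itemId);
            if (content != null) result.add(content);  // skip itemids the application no longer knows
        }
        return result;
    }

    public static void main(String[] args) {
        // A fake engine that always recommends items 7 and 9.
        ContentManagerSketch app = new ContentManagerSketch((userId, n) -> Arrays.asList(7, 9));
        app.addUser("joe", 42);
        app.addItem(7, "International headlines");
        app.addItem(9, "Election analysis");
        System.out.println(app.contentFor("joe", 2));
    }
}
```

Keeping the engine behind a small interface like this also makes it straightforward to test the application without a running server.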

2        System Requirements

The CFEngine is written in platform-independent Java, and should run on any platform that fully supports Java. We have tested the CFEngine on Solaris 2.8, Red Hat Linux 8, Windows 2000, and Windows XP. 

To run the CFEngine, you will need (all freely available):

  • Java 1.4 or greater
  • MySQL (we will be adding support for other databases)

Recommended, but not absolutely required (all freely available)

  • On Windows 2000/XP: Cygwin Unix environment for Windows. If you don’t have Cygwin installed, you will have to write your own batch files to start and stop the server.

Recommended Hardware

  • The faster the processors the better. The more processors the better. A 1 GHz Pentium 4 with 256 MB RAM can answer about 80 top N requests per second from a single-threaded requester, given a database of 1000 users and 100,000 ratings, if no other processing is taking place.

To integrate your application with the CFEngine, you will need:

  • If your application is in Java (such as a JSP-based web content management system), then all the necessary software is included in the standard Java distribution. The CFEngine supports both RMI and CORBA.
  • If your application is written in some language other than Java, then the only currently supported option is to use CORBA to communicate with the CFEngine. You will need CORBA ORB software that supports your application’s programming language installed. In theory, any CORBA compliant ORB will be sufficient – we have only tested with the freely available C++ TAO ORB. In the future, we intend to provide a Web Services interface to the CFEngine.

3        Server Administration Manual

This chapter briefly describes some of the important issues in installing and administering a CFEngine server. It also describes how to load the sample MovieLens data and run the sample client, in order to get recommendations for movies or test other aspects of the CFEngine system.

3.1    Running the Server on Windows

The CFEngine server is completely written in Java, so it should run on any platform that supports Java. That being said – all of the support utilities that we have written will only run in a Unix shell environment. However, you can download a Unix shell environment for free – Cygwin. Here is the process:

  • Download the Cygwin installer from http://www.cygwin.com/.
  • Just install the defaults – no need to change any of the installation package settings.
  • After Cygwin is installed, you will need to make one more tweak to get our utility scripts to work. Start a Cygwin shell and run:

            cp /bin/sh /bin/sh.old
            cp /bin/bash /bin/sh

  • This step is needed because the default shell is a restricted shell that doesn’t have all of the necessary shell features. We believe this was done to keep the memory footprint down. However, having /bin/bash as the default sh has caused no problems for us. If you don’t feel comfortable making this change, you can instead change the first line of each utility script to point to /bin/bash instead of /bin/sh.

After following those steps, you should be able to run most or all of the utility scripts from within the Cygwin shell. Note that you will not be able to compile using the Makefile – this is currently not supported under Windows. If you want to recompile the code on Windows, we recommend that you either use a visual development environment or write your own build batch files. We use IntelliJ all the time for development on Windows without need for a Makefile.

Now just continue from the next section – same as if you were running on Unix.

3.2    Downloading the MySQL Connector/J library

First download the MySQL Connector/J JDBC library which will allow the Java CFEngine to communicate with the MySQL server. It is available from

http://www.mysql.com/downloads/api-jdbc-old.html

From there, you will download an archive file with a name similar to

            mysql-connector-java-2.0.14.tar.gz

Uncompress the archive file, and extract the .jar file with the name

            mysql-connector-java-2.0.14-bin.jar

Save this file in the lib/ subdirectory, and rename it to mysql.jar.

 

3.3    Configuring the Server

config.sh is the file where you set some of the key parameters for the CFEngine server environment. Any time that you edit config.sh, you need to re-run the “configure” script, which will propagate the configuration settings to all the necessary files. The key parameters to be set here are:

WINDOWS – Set this to 1 if you are running on Windows.

MySqlBinDir – Directory that contains the mysql program binary file.

SqlHost, SqlUser, SqlPassword, SqlDbName – Parameters defining the database host, the username and password to use when connecting to the database server, and the name of the database to connect to.

maxSampledUsers – Number of users to initially load into the cache. The users with the highest utility, as defined by the utility column in the USER_INFO table, are loaded. Initially, we recommend that you try to load all of your users, and then evaluate the performance. To do this, set maxSampledUsers to be larger than the number of users you have in your ratings data. If the performance is too slow, you can improve it by decreasing the number of users sampled into the cache. This way, there are fewer users with which to compute correlations.

maxCachedUsers – If a request to the CFEngine refers to a user that was not sampled, then that user must be fetched from the database at runtime. Since accessing the database (and thus the disk drive) is exceptionally slow, the user is cached in memory. However, to prevent this “secondary” cache from taking all of memory, this parameter limits the number of users that are cached in this manner. After this cache fills up, users who have not been used recently are removed from the cache to make room for new users.

numTopNRecords – The number of recommendations that should be computed whenever one of the getRecommendations methods is called.

SERVER_MAX_HEAP – The amount of memory that should be allocated to the CFEngine server. This should be large enough to hold all the ratings that you want to cache. It should also be less than the amount of memory you have on the machine. If you get out-of-memory errors, then you definitely need to increase this value. If performance is slow, and you have free memory, you can try increasing this value to see if there is an improvement.

JAVA_HOME – The location of your Java installation.
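
The eviction behavior described for maxCachedUsers (users who have not been used recently are removed first) is a least-recently-used policy. The sketch below shows one common way to get that behavior in Java, using java.util.LinkedHashMap in access order. It is an illustration of the policy only. The class name is hypothetical and this is not necessarily how the CFEngine implements its cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative least-recently-used cache keyed by userid; not the CFEngine's own code. */
final class UserCacheSketch<V> extends LinkedHashMap<Integer, V> {
    private final int maxCachedUsers;

    UserCacheSketch(int maxCachedUsers) {
        super(16, 0.75f, true);  // accessOrder = true: entries are ordered from least to most recently used
        this.maxCachedUsers = maxCachedUsers;
    }

    /** Called by LinkedHashMap after each insertion; returning true drops the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
        return size() > maxCachedUsers;
    }

    public static void main(String[] args) {
        UserCacheSketch<String> cache = new UserCacheSketch<>(2);
        cache.put(1, "ratings of user 1");
        cache.put(2, "ratings of user 2");
        cache.get(1);                       // user 1 becomes the most recently used
        cache.put(3, "ratings of user 3");  // cache is full, so user 2 is evicted
        System.out.println(cache.keySet());
    }
}
```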

3.3.1   The configure Script

Any time you change config.sh, you need to run the configure shell script, which will propagate the configuration values to all the necessary utility scripts and properties files. This is usually done by changing to the CFEngine root directory and typing “./configure”.

3.3.2   Additional Server Properties Files

config.sh contains the parameters that most people will need to change to get up and running. However, more detailed configuration can be found in lib/CFServer.properties. The file CFServer.properties is automatically generated from the file lib/CFServer.properties.in using the values in config.sh, so we recommend that you make your edits to CFServer.properties.in and then rerun configure. Otherwise, the next time you run configure, you will lose your changes to CFServer.properties.

3.4    Configure MySQL

If your MySQL server is not already running, you will need to start it now before you can load any ratings.

In Section 3.3, you configured the SqlUser and SqlPassword. You need to make sure that your MySQL server has security configured to allow connections for SqlUser using SqlPassword from the host that you will be connecting from. Refer to your MySQL documentation for details, but here is how we do it (assuming SqlDbName = “cfengine_db”, SqlUser = “cfengine_user”, and SqlPassword = “cfengine_pass”):

      % mysql --user=root

      grant all privileges
              on cfengine_db.*
              to cfengine_user@localhost
              identified by 'cfengine_pass';

3.5    Loading Existing Ratings Data

In some cases, you will not have any existing ratings data. If so, you can safely skip this step. If you want to try running with the sample MovieLens movie recommendations data, go to Section 3.5.1. Otherwise, jump to Section 3.6 to start the server, so that you can begin talking to it via Java RMI.

If you are transitioning from another recommendation engine, or if you have some existing source of information on user preferences for items, you can load these ratings directly into the relational database that the CFEngine server uses, without having to write a special client to load the data.

The MySQL tables that support the CFEngine are very simple. There are three relational tables:

  • RATING_TABLE. Holds the rating that each user gave to a specific item. It has three fields: userId, itemId and rating.
  • ITEM_TYPE_TABLE. Describes what items belong to which types (item categories). Some items may belong to several categories. This table has two fields: ItemId and typeId. If no types are going to be used, this table can be empty.
  • USER_INFO. This table contains one row for each user in the database. The rows are computed by the calcUserInfo utility once all the initial ratings are loaded. It has three fields: UID, NumRatings and Utility. UID is the user’s ID. NumRatings is the total count of the items that the user has rated. Utility is the likely utility of the user as a predictor for other users, which is used to prioritize users during sampling. If no sampling is needed, this table can be empty.

The load_data.sh utility in the bin/ directory will load ratings from a flat text data file into the database for you. You need separate data files for ratings and types. Each data file should contain one line for each row to be loaded into the table, with columns separated by tabs and each line terminated by a “\n” (Unix newline). load_data.sh takes several parameters. For a complete listing of the available parameters, simply run “./bin/load_data.sh” on the command line with no arguments.
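
As an illustration of the file format just described, the following Java sketch writes a small ratings file with tab-separated columns and explicit “\n” line endings. The column order (userId, itemId, rating) is an assumption based on the order in which the RATING_TABLE fields are listed above, and the default file name is hypothetical. Check the load_data.sh usage message before relying on either.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class RatingsFileWriterSketch {
    /** Formats one row; the column order userId, itemId, rating is an assumption. */
    static String formatRow(int userId, int itemId, double rating) {
        // Use a literal "\n" rather than the platform line separator, which is "\r\n" on Windows.
        return userId + "\t" + itemId + "\t" + rating + "\n";
    }

    public static void main(String[] args) throws IOException {
        Path out = Paths.get(args.length > 0 ? args[0] : "ratings.txt");  // hypothetical file name
        try (Writer w = Files.newBufferedWriter(out, StandardCharsets.US_ASCII)) {
            w.write(formatRow(1, 10, 4.0));
            w.write(formatRow(1, 11, 2.0));
            w.write(formatRow(2, 10, 5.0));
        }
    }
}
```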

The load_data.sh file will also run the calcUserInfo utility, which computes the rows of the USER_INFO table. This computation can take a long time for very large datasets, so be patient.

You don’t have to use the load_data.sh to load information – you can use any mechanism you are comfortable with in loading data into the appropriate database tables. The initialize_tables.sh script will create the necessary tables without loading any data. If you are planning to use sampling, then make sure that you run calcUserInfo after you are done loading your ratings.

3.5.1   Loading the Sample Ratings Datasets

Included with the CFEngine distribution are scripts to load two existing datasets, found in the Data directory. The first directory, simple, contains an exceptionally simple set of ratings that is only useful for minimal debugging. The second directory, MovieLens, contains a script that will load the 100,000 rating movie dataset that was released by the GroupLens Research group at the University of Minnesota. This is the dataset that you need to load to successfully run the test client. To load the MovieLens data set:

  • Make sure that you have edited config.sh appropriately and run ./configure.
  • Download the 100,000 MovieLens ratings from www.grouplens.org into the Data/MovieLens directory.
  • Run the Data/MovieLens/load.sh script. This will untar the MovieLens ratings data, convert the files to the necessary format, and then load them into the database using the methods described in Section 3.5. It will also compute the utility of each user.
  • Now you can run the supplied example console-based client. See Section 3.7.

3.6    Starting and Stopping the Server

Once you have edited config.sh appropriately, and run the configure script, you can start the server by using the bin/cf_server script:

      bin/cf_server start

To shut down the server:

      bin/cf_server stop

3.7    Running the Sample Console-based Client

Included in the CFEngine distribution is a sample console-based client that demonstrates how to create an application that uses the CFEngine. This client has been designed to work as a movie recommendation client, based on the 100,000 ratings provided by the GroupLens Research Group at the University of Minnesota. To use the client:

  • Follow the instructions in Section 3.5.1 to load the data.
  • Run bin/cf_client.

3.8    Supported Algorithms

We have implemented several of the most popular published CF algorithms, and the code for those algorithms can be found in the org.recommender.algorithms package. However, of those algorithms, only one is supported for this current release. The remaining algorithms, which can be found in org.recommender.algorithms.experimental, are not even guaranteed to run. At one point, they all worked, but in the mad rush to release the CFEngine software, we made many changes to the core system, and did not verify that those algorithms still work. As soon as the work on the core system stabilizes, we will return and update those algorithms.

The one algorithm that has been well tested is the classic user-to-user nearest neighbor prediction algorithm based on Pearson correlation [Herlocker information retrieval]. It produces both individual predictions and top N recommendations per user (as proposed in [Sarwar]). In this algorithm, the CFEngine first computes the Pearson correlation between the active user and all other users. To compute a prediction for a specific item, the algorithm then computes a weighted average of the other users’ ratings for that item. Before averaging, the system subtracts from each user’s rating that user’s mean rating. The average is weighted by the correlation values, and users with negative correlations are discarded.
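
The prediction step can be sketched in Java as follows. This is a simplified illustration, not the code in org.recommender.algorithms. Three details are assumptions that the text above does not specify: the correlation is computed over co-rated items only, each user's mean is taken over all of that user's ratings, and the active user's mean is added back to the weighted average, as in the standard published formulation.

```java
import java.util.HashMap;
import java.util.Map;

final class PearsonSketch {
    /** Mean of all ratings a user has given (0 if the user has none). */
    static double mean(Map<Integer, Double> ratings) {
        double sum = 0.0;
        for (double r : ratings.values()) sum += r;
        return ratings.isEmpty() ? 0.0 : sum / ratings.size();
    }

    /** Pearson correlation over items both users rated; 0 if fewer than 2 or no variance. */
    static double correlation(Map<Integer, Double> a, Map<Integer, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) { n++; sumA += e.getValue(); sumB += rb; }
        }
        if (n < 2) return 0.0;
        double meanA = sumA / n, meanB = sumB / n, cov = 0, varA = 0, varB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb == null) continue;
            double da = e.getValue() - meanA, db = rb - meanB;
            cov += da * db; varA += da * da; varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? 0.0 : cov / denom;
    }

    /** Predicted rating of itemId for activeUser; falls back to the active user's mean. */
    static double predict(int activeUser, int itemId, Map<Integer, Map<Integer, Double>> ratingsByUser) {
        Map<Integer, Double> active = ratingsByUser.get(activeUser);
        double weighted = 0, weightSum = 0;
        for (Map.Entry<Integer, Map<Integer, Double>> e : ratingsByUser.entrySet()) {
            if (e.getKey() == activeUser) continue;
            Double r = e.getValue().get(itemId);
            if (r == null) continue;                      // this user has not rated the item
            double w = correlation(active, e.getValue());
            if (w <= 0) continue;                         // discard non-positive correlations
            weighted += w * (r - mean(e.getValue()));     // rating as a deviation from the user's mean
            weightSum += w;
        }
        double base = mean(active);                       // assumption: add the active user's mean back
        return weightSum == 0 ? base : base + weighted / weightSum;
    }

    /** Tiny hand-made dataset: user 2 agrees with user 1, user 3 disagrees. */
    static Map<Integer, Map<Integer, Double>> sampleData() {
        Map<Integer, Map<Integer, Double>> data = new HashMap<>();
        double[][] rows = {{1, 10, 4}, {1, 11, 2}, {2, 10, 4}, {2, 11, 2}, {2, 12, 5},
                           {3, 10, 1}, {3, 11, 5}, {3, 12, 1}};
        for (double[] row : rows)
            data.computeIfAbsent((int) row[0], k -> new HashMap<>()).put((int) row[1], row[2]);
        return data;
    }

    public static void main(String[] args) {
        System.out.println(predict(1, 12, sampleData()));
    }
}
```

In the sample data, user 3 correlates negatively with user 1 and is ignored, so the prediction for item 12 comes entirely from user 2.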

This algorithm supports top N’s by type. You can assign integer type labels to each item, and then request the top N predictions from items of a particular type.
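
Conceptually, a top N request by type filters the candidate items to those carrying the requested type label and then ranks them by predicted rating. The sketch below illustrates that idea with plain Java collections. The class, method, and parameter names are hypothetical, and the real server may organize this computation differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TopNByTypeSketch {
    /** Returns up to n itemids of the given type, highest predicted rating first. */
    static List<Integer> topN(Map<Integer, Double> predictedByItem,
                              Map<Integer, Set<Integer>> typesByItem,
                              int typeId, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (Integer itemId : predictedByItem.keySet()) {
            Set<Integer> types = typesByItem.get(itemId);
            if (types != null && types.contains(typeId)) candidates.add(itemId);
        }
        candidates.sort(Comparator.comparingDouble((Integer item) -> predictedByItem.get(item)).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    public static void main(String[] args) {
        Map<Integer, Double> predicted = new HashMap<>();
        predicted.put(1, 4.5); predicted.put(2, 3.0); predicted.put(3, 5.0); predicted.put(4, 4.9);
        Map<Integer, Set<Integer>> types = new HashMap<>();
        types.put(1, new HashSet<>(Arrays.asList(7)));
        types.put(2, new HashSet<>(Arrays.asList(7)));
        types.put(3, new HashSet<>(Arrays.asList(8)));     // item 3 is not of type 7
        types.put(4, new HashSet<>(Arrays.asList(7, 8)));  // an item may have several types
        System.out.println(topN(predicted, types, 7, 2));
    }
}
```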

3.9    Performance Features (User Caching)

In theory, nearest neighbor algorithms do not scale well. Their computation time grows in proportion to the number of users in the system. This is not an issue if you do not have many users. For those who may have many users, or just really slow hardware, the CFEngine provides user sampling to keep the computation time roughly constant regardless of the total number of users.

In lib/CFServer.properties, you can specify the maximum number of users to sample from the total set of all users (CFServer.mem.maxSampledUsers). When the CFEngine server starts, it will selectively load users (i.e. load all ratings associated with a user) up to the amount specified in CFServer.properties. The CFEngine server selects those users based on the “utility” of each user, which is specified in the utility column of the USER_INFO table. There are many ways to compute the utility of particular users as predictors. We provide a program – calcUserInfo, which will compute two different measures of utility, Entropy and Inverse User Frequency. These measures compute the utility of users as predictors based on the popularity of items they have rated and the number of items they have rated. Roughly, users who have rated items that very few people have rated have higher utility, and users who have rated more items have higher utility. More investigation of these and new measures is needed – these are just trial measures. For example, the measures do not directly take into account the currency of the ratings or the coverage of items by the entire sampled data.
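
To make the sampling idea concrete, the sketch below computes a simple inverse-frequency style utility for each user and keeps the users with the highest values. The formula is an assumption chosen to match the rough description above (rarely rated items count for more, and more ratings add up to more utility). It is not the formula used by calcUserInfo, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UserUtilitySketch {
    /** Assumed measure: utility(u) = sum over items rated by u of log(totalUsers / ratersOfItem). */
    static Map<Integer, Double> utilities(Map<Integer, Set<Integer>> itemsByUser) {
        Map<Integer, Integer> raters = new HashMap<>();
        for (Set<Integer> items : itemsByUser.values())
            for (int item : items) raters.merge(item, 1, Integer::sum);
        int totalUsers = itemsByUser.size();
        Map<Integer, Double> utility = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : itemsByUser.entrySet()) {
            double u = 0.0;
            for (int item : e.getValue()) u += Math.log((double) totalUsers / raters.get(item));
            utility.put(e.getKey(), u);
        }
        return utility;
    }

    /** Keeps at most maxSampledUsers userids, highest utility first. */
    static List<Integer> sample(Map<Integer, Double> utility, int maxSampledUsers) {
        List<Integer> users = new ArrayList<>(utility.keySet());
        users.sort(Comparator.comparingDouble((Integer u) -> utility.get(u)).reversed());
        return new ArrayList<>(users.subList(0, Math.min(maxSampledUsers, users.size())));
    }

    /** Three users: user 1 rated item 10; user 2 rated 10 and 11; user 3 rated 10, 11 and 12. */
    static Map<Integer, Set<Integer>> sampleData() {
        Map<Integer, Set<Integer>> data = new HashMap<>();
        int[][] rated = {{1, 10}, {2, 10}, {2, 11}, {3, 10}, {3, 11}, {3, 12}};
        for (int[] pair : rated) data.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
        return data;
    }

    public static void main(String[] args) {
        Map<Integer, Double> utility = utilities(sampleData());
        System.out.println(utility);
        System.out.println(sample(utility, 2));
    }
}
```

Under this assumed measure, an item that every user has rated contributes nothing, so a user who has rated only very popular items ranks last for sampling.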

 

 

4        Client Programming Manual

The CFEngine is designed to run as a separate process from the application that is utilizing the functionality of the CFEngine. Thus it follows a client-server model. A CFEngine client is a software application (running in a separate process from the CFEngine) that makes requests to the CFEngine. This chapter provides the background necessary to understand how to write a CFEngine client. It also provides step-by-step recipes for creating clients in several different languages and platforms.

4.1    Available Client Access Methods

Because the CFEngine and your CFEngine client application will be running in different processes, your client must use a remote protocol to communicate with the CFEngine. The current release of the CFEngine supports two different methods for communicating with the CFEngine:

  • Java Remote Method Invocation (RMI)
  • CORBA (C++, Java, many others)

We provide a brief description of these two methods in the next two subsections.

4.1.1   Java Remote Method Invocation (RMI)

Integrated with Java is the capability to perform Remote Method Invocation, or RMI. RMI allows one Java process to execute methods on an object that exists in a separate Java process, potentially on a different computer on a network. Java RMI is a good, high-performance solution if your client application is written in Java.

One consideration is that RMI will create a new thread in the server for every single incoming request, with no upper limit (this is a “feature” of Java RMI, not the CFEngine). If you have thousands of different applications all connecting to the CFEngine at the same time, the CFEngine will probably get bogged down with thread creation and slow to a crawl. However, the CFEngine should support 10-100 concurrent threads accessing it reasonably well. One individual thread can handle around 100 recommendation requests per second.

4.1.2   CORBA (C++, many others)

CORBA is the Common Object Request Broker Architecture. In theory, CORBA allows applications written in arbitrarily different languages to communicate with each other, provided the appropriate software exists, through a language-independent communication protocol called IIOP. The CFEngine provides an interface that CORBA-compliant clients can connect to. For more information about CORBA, see http://www.omg.org/gettingstarted/corbafaq.htm

We have successfully tested a CORBA client written in C++ with the CFEngine using the freely-available TAO CORBA ORB. Both of these example client APIs are provided with the CFEngine distribution. As a result, you are guaranteed to be able to access the server from C++ without having to purchase any additional software.

We also provide the CORBA Interface Definition Language (IDL) interface file for those who wish to access the CFEngine from a different programming language or with a different ORB.

4.2    Representing Entities within the CFEngine

In an abstract sense, the CFEngine deals entirely with numeric data. Users are represented by unique integer userids, items are represented by unique integer itemids, and types (categories) are represented by unique integer typeids. When you specify a particular user, item, or type to the CFEngine, you use an integer. When the CFEngine returns a list of recommendations or ratings, those are identified by itemids and numeric ratings or predictions (both doubles).

Thus, in order to work with the CFEngine, you will need to generate numeric encodings for your users and your items. These encodings may be sparse. For example, if you had four users, their userids would not have to be “1, 2, 3, 4”. They could be “123, 3000, 10000, 34565”. The same holds for itemids. You may generate this encoding yourself offline and then load the ratings into the database before the CFEngine is started. The CFEngine server interface also provides two methods that supply unique userids and itemids at runtime (CFEngine.getNextUserId() and CFEngine.getNextItemId()).

4.3    Client-Server Architecture of the CFEngine

As described in Section 7, the CFEngine is designed to run in a client-server architecture. In theory, you could link the CFEngine classes directly into your Java application, but this has not been tested extensively.

The CFEngine does not currently manage its own thread pool for recommendation and prediction computation. Rather, it relies on the remote procedure call interface (RMI or CORBA) to initiate new computation threads for each request. If you choose to try and link the CFEngine classes directly into your Java application, then you will not get overlapping of I/O and computation unless you create multiple threads in your application that call the CFEngine interface.
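As a sketch of what "multiple threads in your application" could look like, the example below runs one call per itemid on a fixed-size thread pool and collects the results in input order. The predictor argument is a stand-in for a call into the CFEngine interface; the class and method names are hypothetical and nothing here is part of the CFEngine distribution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntToDoubleFunction;

// Hypothetical helper; "predictor" stands in for a CFEngine interface call.
final class ConcurrentCallsSketch {

    /** Runs one call per itemid on a pool of "threads" worker threads. */
    static double[] predictAll(int[] itemIds, IntToDoubleFunction predictor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int itemId : itemIds) {
                Callable<Double> task = () -> predictor.applyAsDouble(itemId);
                futures.add(pool.submit(task));
            }
            double[] results = new double[itemIds.length];
            for (int i = 0; i < results.length; i++) {
                results[i] = futures.get(i).get(); // blocks until call i finishes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a call failed", e.getCause());
        } finally {
            pool.shutdown(); // let worker threads exit once the tasks are done
        }
    }

    public static void main(String[] args) {
        double[] out = predictAll(new int[] {10, 20, 30}, id -> id / 10.0, 2);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

A fixed-size pool also caps the number of requests in flight at once, which may be useful given that the server side places no limit on request threads.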

On the other hand, the CFEngine is thread-safe, so concurrent execution of all CFEngine interface methods is supported.

For the remainder of this manual, we will assume that a client-server model is used, with the CFEngine and the client application in separate processes (possibly on separate machines).

4.4    Creating a CFEngine Client in Java using RMI

If you will be connecting to the CFEngine from a Java application, such as a Java servlet, applet, or other Java application, then using RMI will probably be the best option. In this section, we demonstrate how to use the RMI interface, using the example console-based client that is included with the CFEngine distribution.

4.4.1   Introduction to the Simple Console-Based Client

Included in the CFEngine distribution is a simple, console-based client, found in the org.recommender.clients.console package. The shell script bin/cf_client starts the console-based CFEngine client for you. The client provides a simple, menu-based interface for interacting with the CFEngine. You can find the source code for this console in the org/recommender/clients/console subdirectory. All of the code in the Console Client that communicates with the CFEngine is found in the class org.recommender.clients.console.ClientCFManager.

Figure 2. A screenshot of the console-based recommendation application that will be used as an example in this section.

4.4.2   Connecting to the CFEngine

The first step in any client application is to initialize a connection to the CFEngine server. In the Console Client, this is done in the constructor of the ClientCFManager class. The relevant code is listed below in Table 1.

Connecting to the server is as simple as getting a reference to a remote object in the server that implements the CFEngine interface, and then invoking any method on that object. By default, the CFEngine server registers itself with the RMI name server under the name “cfengine” (all lowercase). Therefore, getting a reference to the CFEngine object is as simple as calling Naming.lookup() with a URL ending in “cfengine” and typecasting the result to CFEngine.

Once you have a reference to a remote object, you can test the connection to the server by invoking any method. The CFEngine interface has a simple method called test(), designed just for this purpose. The test() method executes a simple method on the server that returns a string indicating that the server is alive.

If your CFEngine server is not currently running on the host defined by cfengineHost, then either Naming.lookup() or server.test() will throw an exception. Usually the exception will be thrown by Naming.lookup() because the RMI registry runs as a thread in the CFEngine server.

import org.recommender.server.CFEngine;
import java.rmi.Naming;

CFEngine server;
String cfengineHost = "localhost";

private void connectToServer() {
    String url = "rmi://" + cfengineHost + "/";
    try {
        // Look up the remote object registered under the name "cfengine"
        server = (CFEngine) Naming.lookup(url + "cfengine");

        // Invoke a trivial remote method to confirm that the server is alive
        String resultString = server.test();
        System.err.println(resultString);
    } catch (Exception e) {
        System.err.println("Error: Cannot connect to service \"cfengine\" via registry on "
                + cfengineHost + ": " + e);
        e.printStackTrace();
        System.exit(1);
    }
}

Table 1: A code segment for initiating a connection to the CFEngine server
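The lookup URL in Table 1 omits a port, so the lookup uses the default RMI registry port (1099). If your registry listens elsewhere, the port can be placed in the URL. The helper below is a minimal sketch of building such a URL; the class and method names are hypothetical, and whether your CFEngine installation uses a non-default port is an assumption you would need to check.

```java
import java.rmi.registry.Registry;

// Hypothetical helper for building the URL passed to Naming.lookup().
final class RmiUrlSketch {

    /** Builds a lookup URL for the "cfengine" service on the given host and port. */
    static String cfengineUrl(String host, int port) {
        return "rmi://" + host + ":" + port + "/cfengine";
    }

    /** Same as above, using the default RMI registry port. */
    static String cfengineUrl(String host) {
        return cfengineUrl(host, Registry.REGISTRY_PORT);
    }

    public static void main(String[] args) {
        System.out.println(cfengineUrl("localhost"));
        System.out.println(cfengineUrl("recs.example.org", 2099));
    }
}
```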

           

4.4.3   Common Exceptions in the CFEngine Interface

There are three commonly seen exceptions within the CFEngine interface. For the most part, these exceptions will probably be handled in the same way regardless of which CFEngine method you are calling.

java.rmi.RemoteException. This exception is thrown by the Java RMI subsystem if there is a communication problem with the CFEngine server. This could mean that a) your CFEngine server is not running on the host you are trying to connect to, b) your connection to the server has been closed for some reason (timed out, perhaps), or c) there is a network outage between your client and your server.

CFIllegalParameterException. This exception is thrown whenever one or more of the parameters passed to the method are not valid. Most commonly this is due to passing a null reference or a negative id. Specifying a positive userid or itemid is never illegal: the CFEngine server assumes that every possible userid or itemid exists, even if it has no existing data for that id.

CFIllegalListParameterException. This exception is thrown by methods that are passed an array (i.e., a list) of input. The exception is identical to CFIllegalParameterException, with the addition of a method getNumSuccessful(), which returns the number of elements of the array that were successfully processed before the erroneous parameter occurred. For example, if you called setRatingList(), and the first ten ratings were valid, but the eleventh rating had an id of -1, CFIllegalListParameterException would be thrown, and getNumSuccessful() on the exception would return 10.

4.4.4   Sending Ratings to the CFEngine

The CFEngine needs ratings to compute recommendations. Before you start the CFEngine server process, you can load the ratings directly into the relational database on which the CFEngine server operates. However, while the server is running, new ratings must be sent to the server through the CFEngine interface. This is because the server maintains a cache of ratings to increase the performance of recommendation computation. Loading ratings into the database while the server is running could lead to inconsistencies in the data.

The CFEngine interface provides two methods for sending ratings to the server:

public void setRating(int user, int item, double rating)
      throws java.rmi.RemoteException, CFIllegalParameterException;

public int setRatingList(ItemRating[] newRatings)
      throws java.rmi.RemoteException, CFIllegalListParameterException;

Table 2.  CFEngine interface methods for sending ratings to the server.

 

 

setRating() sends a single rating, while setRatingList() is used when you want to send more than one rating at a time. setRatingList() should be more efficient if you have more than one rating to send over a short period of time.

All methods in the CFEngine interface will throw the java.rmi.RemoteException if there is a network communication problem with the server. This could happen if the server crashes or is shut down while the client is still running, if there is a network outage between the client and the server, or if the socket connection between the client and the server is shut down for any reason.

Open question: can an RMI connection time out if it is not used for some time, or is there a keep-alive mechanism that ensures connections never time out?

Table 3 demonstrates sending one rating using the CFEngine interface. The setRating() method can throw two exceptions. java.rmi.RemoteException will be thrown if there is an error communicating with the server (for example, if the server is not running). In our example, we try to reconnect to the server; the connectToServer() method (Table 1) will exit the application if it fails to reconnect.

CFIllegalParameterException will be thrown if the userid, itemid, or rating is invalid. The only invalid userids and itemids are negative numbers. An invalid rating is any rating outside of the range specified in the server's CFServer.properties file.

do {
    try {
        server.setRating(userid, itemid, rating);
        return;
    } catch (java.rmi.RemoteException e1) {
        // Communication failure: reconnect (exits on failure), then retry
        connectToServer();
    } catch (CFIllegalParameterException e1) {
        e1.printStackTrace();
        System.exit(1);
    }
} while (true);

Table 3: Example of sending a new rating to the CFEngine server.
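The loop in Table 3 retries indefinitely and relies on connectToServer() to exit the application when reconnection fails. If you would rather give up after a fixed number of attempts and let the caller decide what to do, a bounded retry helper is one option. The sketch below is an illustration under that assumption; RemoteCall, withRetry, and the reconnect callback are hypothetical names, not part of the CFEngine distribution.

```java
import java.rmi.RemoteException;

// Hypothetical helper; not part of the CFEngine distribution.
final class RetrySketch {

    /** A remote call that may fail with RemoteException. */
    interface RemoteCall<T> {
        T call() throws RemoteException;
    }

    /**
     * Invokes the call. After each RemoteException it runs "reconnect" and
     * tries again, making at most maxAttempts attempts in total. The last
     * RemoteException is rethrown if every attempt fails.
     */
    static <T> T withRetry(RemoteCall<T> call, Runnable reconnect, int maxAttempts)
            throws RemoteException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RemoteException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (RemoteException e) {
                last = e;
                if (attempt < maxAttempts) {
                    reconnect.run(); // e.g. look the server up again
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws RemoteException {
        // A stand-in call that always succeeds.
        String reply = withRetry(() -> "server is alive", () -> { }, 3);
        System.out.println(reply);
    }
}
```

With a helper like this, a call such as setRating() could be wrapped in a lambda, while parameter exceptions are still handled separately by the caller.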

 

There are no CFEngine methods for creating a user - sending a rating for a userid that has not been seen before by the server will automatically result in that user being created by the server.

If you send a rating for a (userid, itemid) pair for which there is already a rating, the new rating will overwrite the existing rating.

setRatingList() works in much the same way as setRating(), except that an array of ItemRating objects is passed, rather than a single (userid, itemid, rating) triplet. Table 4 shows the methods exported by the ItemRating class, by which you can create an ItemRating object and access its elements.

 

// org.recommender.server.ItemRating

// Constructor

  public ItemRating(int userID, int itemID, double rating)

 

// Accessors

  public int getUserID()

  public int getItemID()

  public double getRating()

Table 4.  Methods exported by the ItemRating class – a CFEngine server data type used by methods in the CFEngine interface.

Our example Console Client uses its own class, org.recommender.clients.console.Rating, to store ratings internally. Thus in the example shown in Table 5, we first must convert from the client’s internal representation of ratings (using the Rating object) to the representation supported by setRatingList() (using ItemRating).

// "ratings" is an array of objects representing ratings
// on the client side that need to be sent to the server
ItemRating[] newRatingList = new ItemRating[ratings.length];
int number = 0;

// Convert the client-side Rating objects to server-side ItemRating objects
for (int i = 0; i < ratings.length; i++) {
    Rating r = ratings[i];
    newRatingList[i] = new ItemRating(r.getUser().getID(),
                                      r.getItem().getID(),
                                      r.getRatingValue());
}

do {
    try {
        number = server.setRatingList(newRatingList);
        return number;
    } catch (java.rmi.RemoteException e1) {
        connectToServer();
    } catch (CFIllegalListParameterException e1) {
        System.err.println("Only " + e1.getNumSuccessfull()
                + " ratings were successfully sent");
        e1.printStackTrace();
        System.exit(1);
    }
} while (true);

Table 5.  Example code for sending a list of ratings to the CFEngine server.

 

In Table 5, notice that we catch the CFIllegalListParameterException. For performance reasons, the CFEngine server does not validate all elements of the array before starting to add ratings to the database. Thus, an illegal parameter may be encountered halfway through an array, with some ratings having been processed and some not. In such a case, the CFEngine server immediately throws an exception, and you can query the exception to determine exactly how many ratings were successfully processed before the exception occurred.
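Rather than exiting, a client could use the count of successfully processed elements to resubmit only the ratings that were never processed. The sketch below assumes, as in the earlier example of ten valid ratings followed by an invalid eleventh, that elements are processed in array order, so the offending element sits at the index equal to that count. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper; assumes elements are processed in array order.
final class PartialListSketch {

    /**
     * Returns the elements that were never processed, skipping the offending
     * element at index numSuccessful.
     */
    static <T> T[] remainingAfterFailure(T[] submitted, int numSuccessful) {
        if (numSuccessful < 0 || numSuccessful > submitted.length) {
            throw new IllegalArgumentException("numSuccessful out of range");
        }
        int next = Math.min(numSuccessful + 1, submitted.length);
        return Arrays.copyOfRange(submitted, next, submitted.length);
    }

    public static void main(String[] args) {
        String[] sent = {"r0", "r1", "bad", "r3", "r4"};
        // Two elements succeeded, so index 2 was rejected; r3 and r4 remain.
        System.out.println(Arrays.toString(remainingAfterFailure(sent, 2)));
    }
}
```

The rejected element itself would need to be corrected or dropped before the remainder is sent again.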

4.4.5   Maintaining a Client-side Mapping of Ids to Names

While the CFEngine identifies each user, item, and type with an integer, we have much more useful representations of those items that we will want to show to the user. For example, in a movie recommender, we will want to display the titles of movies, not their itemids. And users will find it more useful to log in with a text login name and not a long numeric userid. As a result, your client will most likely need to maintain data structures that map to and from numeric identifiers and application specific data types.

As an example, the Console Client assumes that each user, item, and type has a name that is represented by a String. The Console Client maintains its own database tables that define relations between numeric identifiers and String names. The code to manage these mappings can be found in org.recommender.clients.console.ClientDBManager. The ClientDBManager maintains three simple tables in its own relational database manager that map from userid, itemid, or typeid to names. We will not discuss in detail how the Console Client handles this mapping. See the source code for more details.

The appropriate data structures for performing the mapping will depend on your application. In many cases, you will want to store additional information about each user, item, or type beyond just its name.
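For small applications, an in-memory two-way map may be enough. The sketch below is a minimal illustration of mapping numeric ids to names and back; it is not how the Console Client stores its mappings (that client uses database tables), and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory mapping between numeric ids and display names.
final class IdNameMapSketch {
    private final Map<Integer, String> nameById = new HashMap<>();
    private final Map<String, Integer> idByName = new HashMap<>();

    /** Registers a pair; rejects ids or names that are already mapped. */
    void put(int id, String name) {
        if (nameById.containsKey(id) || idByName.containsKey(name)) {
            throw new IllegalArgumentException("id or name already mapped");
        }
        nameById.put(id, name);
        idByName.put(name, id);
    }

    /** Returns the name for an id, or null if the id is unknown. */
    String nameOf(int id) {
        return nameById.get(id);
    }

    /** Returns the id for a name, or null if the name is unknown. */
    Integer idOf(String name) {
        return idByName.get(name);
    }

    public static void main(String[] args) {
        IdNameMapSketch movies = new IdNameMapSketch();
        movies.put(3000, "Casablanca");
        System.out.println(movies.nameOf(3000) + " has itemid " + movies.idOf("Casablanca"));
    }
}
```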

4.4.6   Removing/Deleting Ratings from the Server

For the most part, you probably won’t need to delete ratings, but there are a few circumstances where it is useful. For example, you might have a privacy policy that allows users to explicitly request that their ratings be deleted from the system. Or on a retailer web site, a customer might want to delete a rating that was implicitly recorded due to their purchase of a gift for another person. Or a user may want to tinker with their profile by deleting ratings and seeing the change in the recommendations. The CFEngine interface provides two methods for deleting ratings from the server database.

public void removeRating(int user, int item)

      throws java.rmi.RemoteException, CFIllegalParameterException;

 

public void removeRatingList(int user, int[] itemIDs)

      throws java.rmi.RemoteException, CFIllegalParameterException;

 

Table 6. CFEngine methods for deleting ratings stored by the server.

 

 

4.4.7   Retrieving Ratings from the Server

Often the client may want to determine what items the user has already rated and what those rating values are. Or the client may want to determine if a user has rated a specific item, or a list of items. In such cases, you can make use of the methods shown in Table 7.

public ItemRating getRating(int userID, int itemID)
      throws RemoteException, CFIllegalParameterException;

public ItemRating[] getRatingList(int userID, int[] itemIDs)
      throws RemoteException, CFIllegalListParameterException;

public ItemRating[] getUserRatingList(int userID)
      throws java.rmi.RemoteException, CFIllegalParameterException;

public ItemRating[] getItemRatingList(int itemID)
      throws java.rmi.RemoteException, CFIllegalParameterException;

Table 7. CFEngine interface methods for retrieving ratings from the server.

All methods return zero or more objects of class ItemRating, which is described in Table 4.

In Table 7, note that getUserRatingList() and getItemRatingList() do not throw the CFIllegalListParameterException. This is because neither takes an array as a parameter.

If a rating is requested for an item that the user has not yet rated, the CFEngine interface does not throw an exception. Rather, it simply reports that user’s rating as the “error rating”, which is the value returned by CFEngine.getErrRating() (see Table 13).

4.4.8   Predicting Ratings for Specified Items

Now we get to the core functionality of the CFEngine server – predicting ratings for items that the user has not already rated. This section describes methods whereby you specify exactly what items you would like ratings predicted for. The next section describes how you can ask for the list of items with the highest predicted ratings, without having to specify all the different options.

As in the previous sections, there is a method to request a prediction for a single item, and a separate method to request a prediction for a list of items. These are listed in Table 8.

public ItemPrediction getPredictedRating(int userID, int itemID)

            throws RemoteException, CFIllegalParameterException;

public ItemPrediction[] getPredictedRatingList(int userID, int[] itemID)

            throws java.rmi.RemoteException, CFIllegalListParameterException;

Table 8. Methods for requesting predictions from the server via the CFEngine interface.

 

These methods and the methods in the next section return ItemPrediction objects, which are described in Table 9.

// org.recommender.server.ItemPrediction class

public ItemPrediction(int userID, int itemID, float pred);

public int getUserID()

public int getItemID()

public float getPrediction()

Table 9. Interfaces to the ItemPrediction class, which is returned by the getPredictedRating…() and getRecommendations…() methods.

These methods are most useful in situations where the user specifically requests to see an item, perhaps by searching for it by name or browsing for it.

Table 10 shows an example of how to request a single prediction for a given userid and itemid. Requesting a list of predictions using getPredictedRatingList() is handled very similarly. See Table 5 in Section 4.4.4 for an example of using a CFEngine method that takes an array as a parameter.

do {
    try {
        predictedRating = server.getPredictedRating(user.getID(), item.getID());
        return new Prediction(item, predictedRating.getPrediction());
    } catch (RemoteException e) {
        connectToServer();
    } catch (CFIllegalParameterException e) {
        e.printStackTrace();
        System.exit(1);
    }
} while (true);

Table 10. Example code for predicting a single rating using the CFEngine interface. Note that “Prediction” is a client-side datatype defined in the sample client application.

 

If you ask for a prediction for a user or an item (or both) that does not exist, the CFEngine methods will not throw exceptions (unless the id < 0, which is illegal). Rather, the CFEngine will simply predict the “error rating”. You can determine what the error rating is using the CFEngine method getErrRating().

If you request a prediction for an item for which the CFEngine server already has a rating, getPredictedRating() and getPredictedRatingList() will not return that rating. Rather they will still try and compute a prediction using the defined algorithm. If this is not the desired result for you, then you should first call getRating(), getRatingList(), or getUserRatingList() to determine what items the user has already rated.
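One way a client might combine these calls is to prefer the user's stored rating, fall back to the prediction, and treat the error rating as "no value". The sketch below shows only that decision logic on plain numbers; the values would come from the corresponding CFEngine calls. The class and method names are hypothetical, and the sentinel value used in main is an assumption for the demonstration, not the server's actual error rating.

```java
import java.util.OptionalDouble;

// Hypothetical client-side helper; operates on values already fetched.
final class ErrorRatingSketch {

    /**
     * Returns the stored rating if there is one, otherwise the prediction,
     * otherwise empty. A value equal to errRating means "not available".
     * Double.compare is used so that a NaN sentinel would also match.
     */
    static OptionalDouble bestKnownValue(double storedRating, double prediction,
                                         double errRating) {
        if (Double.compare(storedRating, errRating) != 0) {
            return OptionalDouble.of(storedRating);
        }
        if (Double.compare(prediction, errRating) != 0) {
            return OptionalDouble.of(prediction);
        }
        return OptionalDouble.empty();
    }

    public static void main(String[] args) {
        double err = -1.0; // assumed sentinel for this demonstration only
        System.out.println(bestKnownValue(4.0, 3.5, err));
        System.out.println(bestKnownValue(err, 3.5, err));
        System.out.println(bestKnownValue(err, err, err));
    }
}
```

Because the interface mixes float and double return types, an exact comparison like this is only reliable when the error rating is a value that both types represent exactly.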

4.4.9   Requesting a List of Recommendations

While the previous section describes methods that will predict ratings for items that you specify, most users will want a list of “best bets”. Collaborative filtering is most often used in information overload situations, where there are just too many items to evaluate individually. Rather, the user wants to be immediately recommended the items they are most likely to enjoy or find useful. To support this, the CFEngine interface provides two methods, shown in Table 11.

public ItemPrediction[] getRecommendations(int curUser, int number, int offset)
      throws RemoteException, CFIllegalParameterException;

public ItemPrediction[] getRecommendationsByType(int curUser, int number,
                                                 int offset, int type)
      throws RemoteException, CFIllegalParameterException;

Table 11.  Methods for requesting recommendations (best bets) from the CFEngine interface.

See Table 9 for a description of the ItemPrediction class.

The CFEngine recommendation methods compute the top N most recommended items (those with the highest predicted ratings), where N is specified at boot time in the CFServer.properties file (CFServer.mem.numTopNRecords). From this list of recommendations, they return up to “number” recommendations, starting from the recommendation rank specified in “offset”. For example, getRecommendations(20, 10, 15) might compute 100 recommendations (CFServer.mem.numTopNRecords=100), yet it would return only the 10 recommendations ranked 15 through 24.

The getRecommendations…() methods are designed to work in the fashion commonly seen in search engine interfaces. For example, on a search engine, you might have 1000 matching results, but you want to display ten results per page. Rather than making the client cache those 1000 results, the server caches the recommended items for a user until new ratings are received or until the space in the cache is needed for other things. With this approach, you can design an interface that allows the user to click a “See next 10 recommendations” button without having to repeat any computation.
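The paging behavior can be pictured as taking a window out of a ranked list. The sketch below illustrates that idea on an ordinary Java list; it is not the server's implementation, the names are hypothetical, and it assumes that “offset” is a zero-based position in the ranked list, which the interface description does not state.

```java
import java.util.List;

// Hypothetical illustration of windowing a ranked list; offset is assumed zero-based.
final class PagingSketch {

    /** Returns up to "number" elements of "ranked", starting at position "offset". */
    static <T> List<T> page(List<T> ranked, int number, int offset) {
        if (number < 0 || offset < 0) {
            throw new IllegalArgumentException("number and offset must be non-negative");
        }
        int from = Math.min(offset, ranked.size());
        // Widen to long so that from + number cannot overflow.
        int to = (int) Math.min((long) from + number, ranked.size());
        return ranked.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("a", "b", "c", "d", "e");
        System.out.println(page(ranked, 2, 0)); // first page
        System.out.println(page(ranked, 2, 4)); // last page holds one element
    }
}
```

Near the end of the list the window shrinks, which is why a request can come back with fewer recommendations than were asked for.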

The getRecommendationsByType() method allows you to request “best bets” from a subset of the items in your database. For example, if you are recommending movies, you might want to get recommendations for the best bets in comedy movies. Types are defined at boot time via a table in the CFEngine server database. See the administration guide for instructions on how to define types. Each type is given a unique identifier – much like userids or itemids.  

Table 12 shows an example of using getRecommendations(). Note that the getRecommendations…() methods may return fewer recommendations than the number you asked for. Check the length of the returned array (using array.length) to determine exactly how many recommendations were returned.

 

ItemPrediction[] recommendations = null;

do {
    try {
        recommendations = server.getRecommendations(user.getID(), number, offset);
        break;
    } catch (RemoteException e) {
        connectToServer();
    } catch (CFIllegalParameterException e) {
        e.printStackTrace();
        System.exit(1);
    }
} while (true);

Table 12. Example of getting a list of recommendations via the CFEngine server interface.

4.4.10   Querying Valid Rating Values

The CFEngine server interface provides three methods that allow a client to identify appropriate rating values. These methods are shown in Table 13.

public float getMinRating() throws java.rmi.RemoteException;

public float getMaxRating() throws java.rmi.RemoteException;

public float getErrRating() throws java.rmi.RemoteException;

Table 13. Methods for identifying valid ratings recognized by the server.

 

getMinRating() and getMaxRating() will identify the minimum and maximum values (inclusive) of acceptable ratings.

getErrRating() returns the numeric value that is used when the CFEngine server does not have a rating for a requested item, or cannot predict a rating for a requested item.
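A client can use the minimum and maximum values to check a rating before sending it, which avoids a round trip that would end in CFIllegalParameterException. The sketch below shows the inclusive range check on plain numbers; in a real client the bounds would come from getMinRating() and getMaxRating(). The class and method names are hypothetical, and the 1 to 5 scale in main is only an example.

```java
// Hypothetical client-side check; bounds would come from the server.
final class RatingRangeSketch {

    /** Returns true if rating lies within [min, max], inclusive. */
    static boolean isValidRating(double rating, double min, double max) {
        // A NaN rating fails both comparisons and is therefore rejected.
        return rating >= min && rating <= max;
    }

    public static void main(String[] args) {
        System.out.println(isValidRating(5.0, 1.0, 5.0));
        System.out.println(isValidRating(5.5, 1.0, 5.0));
    }
}
```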

4.4.11   Identifying the Next Usable Id for Users and Items

When you have multiple client processes connecting to a single CFEngine, you need to ensure that each new identifier assigned is unique. That is, we don’t want client #1 to create a new user as userid=100, while client #2 creates a separate user as userid=100. To ensure that this doesn’t happen, the CFEngine interface provides two methods that will generate identifiers for users and items that are guaranteed to be unique. The same id will never be given out twice, even across reboots of the server.

public int getNextUserId() throws java.rmi.RemoteException;

public int getNextItemId() throws java.rmi.RemoteException;

Table 14. Methods for generating unique identifiers.
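Within a single client, you would typically request a new id only the first time a login name is seen, and reuse it afterwards. The sketch below illustrates that pattern; the IntSupplier stands in for a call to getNextUserId(), and the class and method names are hypothetical. It only prevents duplicates inside one client process; several clients that share login names would need a shared mapping store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical client-side registry; nextId stands in for getNextUserId().
final class UserRegistrySketch {
    private final Map<String, Integer> idByLogin = new ConcurrentHashMap<>();
    private final IntSupplier nextId;

    UserRegistrySketch(IntSupplier nextId) {
        this.nextId = nextId;
    }

    /** Returns the id for a login, requesting a new id only on first use. */
    int idFor(String login) {
        // computeIfAbsent calls the supplier at most once per new login,
        // even when several threads ask for the same login concurrently.
        return idByLogin.computeIfAbsent(login, key -> nextId.getAsInt());
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(100); // fake id source
        UserRegistrySketch registry = new UserRegistrySketch(counter::getAndIncrement);
        System.out.println(registry.idFor("alice"));
        System.out.println(registry.idFor("bob"));
        System.out.println(registry.idFor("alice")); // same id as the first call
    }
}
```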

4.4.12   Shutting down the Server

To ease administration, it is possible for clients to shut down the server. Note that this means that the CFEngine is only intended to be run in a controlled network environment. There is no access control.

public void shutdown() throws java.rmi.RemoteException;

4.5    Creating a CFEngine Client in C++ Using CORBA

To be written.