The CFEngine Users Manual
Release 0.2 (An Alpha Release)
1 Introduction to the CFEngine
1.2 Services Provided by the CFEngine
1.3 Collaborative Filtering (CF) (and a rough introduction to CF terminology)
1.4.1 Explicit and Implicit Ratings
1.5 Detailed Functionality of CFEngine
1.6 High Level Architecture of an Application Using the CFEngine
3 Server Administration Manual
3.1 Running the Server on Windows
3.2 Downloading the MySQL Connector/J library
3.3.2 Additional Server Properties Files
3.5 Loading Existing Ratings Data
3.5.1 Loading the Sample Ratings Datasets
3.6 Starting and Stopping the Server
3.7 Running the Sample Console-based Client
3.9 Performance Features (User Caching)
4.1 Available Client Access Methods
4.1.1 Java Remote Method Invocation (RMI)
4.1.2 CORBA (C++, many others)
4.2 Representing Entities within the CFEngine
4.3 Client-Server Architecture of the CFEngine
4.4 Creating a CFEngine Client in Java using RMI
4.4.1 Introduction to the Simple Console-Based Client
4.4.2 Connecting to the CFEngine
4.4.3 Common Exceptions in the CFEngine Interface
4.4.4 Sending Ratings to the CFEngine
4.4.5 Maintaining a Client-side Mapping of Ids to Names
4.4.6 Removing/Deleting Ratings from the Server
4.4.7 Retrieving Ratings from the Server
4.4.8 Predicting Ratings for Specified Items
4.4.9 Requesting a List of Recommendations
4.4.10 Querying Valid Rating Values
4.4.11 Identifying the Next Usable Id for Users and Items
1 Introduction to the CFEngine
This document is intended to serve as a manual for the CFEngine collaborative filtering recommendation engine.
Expected target audience is as follows:
The CFEngine is a recommendation engine. Given the proper data, the CFEngine will predict exactly what items an individual is likely to enjoy or find useful. It can be used in a wide variety of environments. Examples include:
Collaborative filtering (a term coined by David Goldberg et al. [tapestry]) refers to an environment in which a community of people come together to share the burden of filtering information. Consider an online newspaper with fifty news articles. No one person has the time to read fifty news articles, but in a community of fifty people, each member can read one article and determine just how much value that article provides. We say that they rate the article: they give it a rating. If a member’s rating for an article is sufficiently high, the article is recommended to the rest of the community. Pooling everybody’s recommendations, the community builds a list of the top ten articles that are worth reading. In return for reading only one item, each user is saved the time of scanning through fifty articles to find the interesting ones.
In reality, not everybody shares the same tastes, interests, or needs. Suppose instead that each member first reads and rates five articles instead of one. Now we can examine the ratings of a hypothetical member, Joe, and find the ten other members of the community whose ratings are most similar to Joe’s, that is, the people who read some of the same five articles as Joe and gave them similar ratings. For example, Joe may enjoy reading the international section, so he read and rated highly five articles from that section. Joe can now be matched up with ten other people who also read international articles and rated them highly. We call these people Joe’s neighbors. Once we have identified Joe’s neighbors, we can look for articles that they have rated highly but that Joe has not yet read, and recommend these articles to Joe. A list of the items most likely to meet a member’s needs is known as a set of recommendations.
As an extension to this example, consider an online news service that displays news article titles for free, but charges to display the full text of an article. Joe might want to be sure that an article is going to be worth reading before paying for it. Using the same approach described above, Joe could consult the ratings of his neighbors who have already read the article to predict his own rating for it. When a member wants to know not just what the top items are, but what the actual predicted ratings are, we call this a prediction.
Collaborative filtering recommendation systems are software systems that enable communities to perform collaborative filtering. At the center of a collaborative filtering system is a collaborative filtering recommendation engine – the software system responsible for the computation involved in collaborative filtering. The recommendation engine is responsible for analyzing ratings, determining who is neighbor to whom, and computing the predictions and recommendations. The CFEngine is one such recommendation engine.
We refer to the people who provide ratings to the recommender engine (and request recommendations and predictions) as users.
As described in Section 1.3, recommender systems take ratings as input and can then output recommendations and predictions. What exactly are these ratings?
The CFEngine operates on single-dimensional numeric ratings. There are three broad classes of numeric ratings:
The CFEngine is currently designed and tested to work best with multi-valued ratings data. However, it can also be successful with binary data. Unary data is a very hard problem: the CFEngine will only work with unary data if you first convert it to binary or multi-valued ratings data. Another approach is to combine unary purchase data (or other unary data) with some form of multi-valued observed implicit ratings, as described in the next subsection.
Most of the ratings that we have given examples of so far are what we call explicit ratings. An explicit rating is a value given directly by a user in response to a query – “On a scale of 1 to 10, what would you rate this book?” Explicit ratings usually are seen as a strong source of information. A user can tell you exactly how they feel about a particular item.
Implicit ratings are ratings that are inferred from observing the behavior of a user. For example, we might observe that a user spends a lot of time reading a short news article online – from this we might infer that the user found that article valuable. Of course, we may be wrong – they may have simply started reading the article, and then were interrupted by a phone call. Implicit ratings can be collected from a variety of sources. Other examples of implicit ratings of varying utility include page views; time-spent-reading; emailing, printing, or saving a document; bookmarking; etc.
The CFEngine currently makes no distinction between explicit and implicit ratings. All rating values that are fed into the CFEngine are treated with the same strength. If you plan to use a combination of explicit and implicit ratings, you may have to think carefully about how you encode the rating values for each.
Both implicit and explicit ratings have strengths and weaknesses. Explicit ratings are often much more precise than implicit ratings. On the other hand, they can be more easily biased than observed ratings, because they essentially represent a user’s self-evaluation of their perception of a particular item. Implicit ratings, in contrast, are observations of the user’s behavior: less prone to bias, but much noisier. The process of inferring a rating from an observed behavior will often make mistakes.
The CFEngine will fit into your environment most easily if you have a homogeneous set of multi-valued ratings data: all explicit, all implicit, or normalized to be comparable.
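One way to make implicit observations comparable with explicit ratings is to map them onto the same numeric scale before sending them to the CFEngine. The sketch below is an illustration only: the class name, the choice of reading time as the observed behavior, and the linear mapping with a cap are all assumptions, not part of the CFEngine.

```java
/** Hypothetical helper (not part of the CFEngine) that rescales an observed behavior onto a rating scale. */
final class ImplicitRatingEncoder {
    private ImplicitRatingEncoder() {}

    /**
     * Maps a reading time linearly onto [minRating, maxRating].
     * Times at or above capSeconds receive maxRating; zero or negative times receive minRating.
     */
    static double fromReadingTime(double secondsRead, double capSeconds,
                                  double minRating, double maxRating) {
        if (capSeconds <= 0) {
            throw new IllegalArgumentException("capSeconds must be positive");
        }
        // Clamp the fraction of the cap that was actually spent reading to [0, 1].
        double fraction = Math.max(0.0, Math.min(1.0, secondsRead / capSeconds));
        return minRating + fraction * (maxRating - minRating);
    }

    public static void main(String[] args) {
        // A reader who spent 60 of a capped 120 seconds, encoded on a 1-to-5 scale.
        System.out.println(fromReadingTime(60, 120, 1, 5));
    }
}
```

Whether a linear mapping is appropriate depends on the behavior being observed; the point is only that explicit and implicit values end up on one shared scale.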
Once you seed the CFEngine with ratings from all your users (your customers, your employees, etc), the CFEngine provides support for the following operations
There are a good number of other supporting functions (such as functions for sending ratings to the CFEngine), but they are secondary to the three key functions listed above, which are the core of the system functionality.
The CFEngine will store numeric ratings, and provide recommendations based on those ratings. In the CFEngine, each user and item is identified by a unique number. The CFEngine does not support storage or retrieval of any information besides numeric ratings and numeric predictions/recommendations.
As a result, a typical application utilizing the CFEngine will need to maintain its own databases of domain-specific user and item information. For example, you may want to maintain demographic information on each user, or product catalog information about each item. Figure 1 illustrates an example architecture of a web server-side content management application that uses the CFEngine to predict which items of content should be displayed to which users. In this figure, a user, through their web browser, logs into a web site that utilizes the CFEngine. On the web server side, the content management application locates the user’s record in the user data, and from that determines the CFEngine userid associated with that user. A request for the given userid is sent to the CFEngine, which responds with itemids representing recommendations for items that the user will like. The content management application then accesses its item database to retrieve the content associated with those itemids and displays those items to the user.
Figure 1: Example architecture of a web server-side content management application that utilizes the CFEngine to generate recommendations.
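The request flow in Figure 1 can be sketched in a few lines of Java. Everything named here is hypothetical: the two maps stand in for the application's own user and item databases, and the Recommender interface stands in for whichever CFEngine call returns recommended itemids for a userid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the Figure 1 flow; none of these names come from the CFEngine. */
final class ContentFlowSketch {
    private ContentFlowSketch() {}

    /** Stand-in for the recommendation engine: maps a userid to recommended itemids. */
    interface Recommender {
        int[] recommendedItemIds(int userId);
    }

    /** Resolves a login to a userid, asks the engine for itemids, and looks up their content. */
    static List<String> contentFor(String login,
                                   Map<String, Integer> userIdByLogin,
                                   Recommender engine,
                                   Map<Integer, String> contentByItemId) {
        List<String> content = new ArrayList<>();
        Integer userId = userIdByLogin.get(login);   // application's own user data
        if (userId == null) {
            return content;                          // unknown user: nothing to recommend
        }
        for (int itemId : engine.recommendedItemIds(userId)) {
            String item = contentByItemId.get(itemId); // application's own item database
            if (item != null) {
                content.add(item);
            }
        }
        return content;
    }
}
```

The engine only ever sees integers; names, demographics, and content stay in the application's own stores.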
The CFEngine is written in platform-independent Java, and should run on any platform that fully supports Java. We have tested the CFEngine on Solaris 2.8, Red Hat Linux 8, Windows 2000, and Windows XP.
To run the CFEngine, you will need (all freely available):
Recommended, but not absolutely required (all freely available):
Recommended Hardware
To integrate your application with the CFEngine, you will need:
3 Server Administration Manual
This chapter briefly describes some of the important issues in installing and administering a CFEngine server. It also describes how to load the sample MovieLens data and run the sample client, in order to get recommendations for movies or test other aspects of the CFEngine system.
The CFEngine server is completely written in Java, so it should run on any platform that supports Java. That being said – all of the support utilities that we have written will only run in a Unix shell environment. However, you can download a Unix shell environment for free – Cygwin. Here is the process:
o cp /bin/sh /bin/sh.old
o cp /bin/bash /bin/sh
After following those steps, you should be able to run most or all of the utility scripts from within the Cygwin shell. Note that you will not be able to compile using the Makefile; this is currently not supported under Windows. If you want to recompile the code on Windows, we recommend that you either use a visual development environment or write your own build batch files. We use IntelliJ all the time for development on Windows without need for a Makefile.
Now just continue from the next section – same as if you were running on Unix.
First download the MySQL Connector/J JDBC library which will allow the Java CFEngine to communicate with the MySQL server. It is available from
http://www.mysql.com/downloads/api-jdbc-old.html
From there, you will download an archive file with a name similar to
mysql-connector-java-2.0.14.tar.gz
Uncompress the archive file, and extract the .jar file with the name
mysql-connector-java-2.0.14-bin.jar
Save this file in the lib/ subdirectory, and rename it as mysql.jar
config.sh is the file where you set some of the key parameters for the CFEngine server environment. Any time that you edit config.sh, you need to re-run the “configure” script, which will propagate the configuration settings to all the necessary files. The key parameters to be set here are:
WINDOWS – Make sure you set this to 1 if you are running on Windows.
MySqlBinDir – Directory that contains the mysql program binary file.
SqlHost, SqlUser, SqlPassword, SqlDbName – Parameters defining the database host, the username and password to use when connecting to the database server, and the name of the database to connect to.
maxSampledUsers – Number of users to initially load into the cache. The users with the highest utility, as defined by the utility column in the USER_INFO table are loaded. Initially, we recommend that you try and load all of your users, and then evaluate the performance. To do this, set maxSampledUsers to be larger than the number of users you have in your ratings data. If the performance is too slow, then you can increase the performance by decreasing the number of users sampled into the cache. This way, there are fewer users with which to compute correlations.
maxCachedUsers – If a request to the CFEngine refers to a user that was not sampled, then that user must be fetched from the database at runtime. Since accessing the database (and thus the disk drive) is exceptionally slow, the user is cached in memory. However, to prevent this “secondary” cache from taking all of memory, this parameter limits the number of users that are cached in this manner. After this cache fills up, users who have not been used recently are removed from the cache to make room for new users.
numTopNRecords – The number of recommendations that should be computed whenever one of the getRecommendations methods is called.
SERVER_MAX_HEAP – The amount of memory that should be allocated to the CFEngine server. This should be large enough to hold all the ratings that you want to cache. It should also be less than the amount of memory you have on the machine. If you get out of memory errors, then you definitely need to increase this value. If performance is slow, and you have free memory, you can try increasing this value to see if there is an improvement.
JAVA_HOME – The location of your Java installation.
Any time you change config.sh, you need to run the configure shell script, which will propagate the configuration values to all the necessary utility scripts and properties files. This is usually done by changing to the CFEngine root directory and typing “./configure”
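The eviction behavior described for maxCachedUsers (once the cache is full, drop the users that have not been used recently) is what is commonly called a least-recently-used (LRU) cache. The following is a minimal sketch of that policy using java.util.LinkedHashMap. The class name LruUserCache is hypothetical, and this is not the CFEngine's own cache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache keyed by userid; illustrates LRU eviction only. */
final class LruUserCache<V> extends LinkedHashMap<Integer, V> {
    private final int maxCachedUsers;

    LruUserCache(int maxCachedUsers) {
        // The third argument (true) orders entries by most recent access
        // rather than by insertion, which is what makes eviction "least recently used".
        super(16, 0.75f, true);
        this.maxCachedUsers = maxCachedUsers;
    }

    /** Called by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
        return size() > maxCachedUsers;
    }
}
```

With a limit of two users, reading user 1 and then inserting user 3 evicts user 2, because user 2 is the entry that was touched least recently.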
config.sh contains the parameters that most people will need to change to get up and running. However, more detailed configuration can be found in lib/CFServer.properties. The file CFServer.properties is automatically generated from config.sh and the file lib/CFServer.properties.in, so we recommend that you make your edits to CFServer.properties.in, and then rerun configure. Otherwise, the next time you run configure, you will lose your changes to CFServer.properties.
If your MySQL server is not already running, you will need to start it now before you can load any ratings.
In Section 3.3, you configured the SqlUser and SqlPassword. You need to make sure that your MySQL server has security configured to allow connections for SqlUser using SqlPassword from the host that you will be connecting from. Refer to your MySQL documentation for details, but here is how we do it (assuming SqlDbName = “cfengine_db”, SqlUser = “cfengine_user”, and SqlPassword = “cfengine_pass”):
% mysql --user=root
grant all privileges
on cfengine_db.*
to cfengine_user@localhost
identified by 'cfengine_pass';
In some cases, you will not have any existing ratings data. In such a case, you can safely skip this step. If you want to try running with the sample MovieLens movie recommendations data, go to Section 3.5.1. Otherwise jump to Section 3.6 to start the server, so that you can begin talking to it via Java RMI.
If you are transitioning from another recommendation engine, or if you have some existing source of information on user preferences for items, you can load these ratings directly into the relational database that the CFEngine server uses, without having to write a special client to load the data.
The MySQL tables that support the CFEngine are very simple. There are three relational tables:
The load_data.sh utility in the bin/ directory will load ratings from a flat text data file into the database for you. You need separate data files for ratings and types. Each data file should contain one line for each row to be loaded into the table, with columns separated by tabs and each line terminated by a “\n” (Unix newline). load_data.sh takes several different parameters. For a complete listing of the available parameters, simply run “./bin/load_data.sh” on the command line with no arguments.
The load_data.sh file will also run the calcUserInfo utility, which computes the rows of the USER_INFO table. This computation can take a long time for very large datasets, so be patient.
You don’t have to use the load_data.sh to load information – you can use any mechanism you are comfortable with in loading data into the appropriate database tables. The initialize_tables.sh script will create the necessary tables without loading any data. If you are planning to use sampling, then make sure that you run calcUserInfo after you are done loading your ratings.
Included with the CFEngine distribution are scripts to load two existing datasets, found in the Data directory. The first directory, simple, contains an exceptionally simple set of ratings that is only useful for minimal debugging. The second directory, MovieLens, contains a script that will load the 100,000 rating movie dataset that was released by the GroupLens Research group at the
· Make sure that you have edited config.sh appropriately and run ./configure.
· Download the 100,000 MovieLens ratings from www.grouplens.org into the Data/MovieLens directory
· Run the Data/MovieLens/load.sh script. This will untar the MovieLens ratings data, convert the files to the necessary format, and then load them into the database using the methods described in Section 3.4. It will also compute the utility of each user.
· Now you can run the supplied example console-based client. See Section 3.7.
Once you have edited config.sh appropriately and run the configure script, you can start the server by using the bin/cf_server script:
bin/cf_server start
To shut down the server:
bin/cf_server stop
Included in the CFEngine distribution is a sample console-based client that demonstrates how to create an application that uses the CFEngine. This client has been designed to work as a movie recommendation client, based on the 100,000 ratings provided by the GroupLens Research Group at the
· bin/cf_client
We have implemented several of the most popular published CF algorithms, and the code for those algorithms can be found in the org.recommender.algorithms package. However, of those algorithms, only one is supported for this current release. The remaining algorithms, which can be found in org.recommender.algorithms.experimental, are not even guaranteed to run. At one point, they all worked, but in the mad rush to release the CFEngine software, we made many changes to the core system, and did not verify that those algorithms still work. As soon as the work on the core system stabilizes, we will return and update those algorithms.
The one algorithm that has been well tested is the classic user-to-user nearest neighbor prediction algorithm based on Pearson correlation [Herlocker information retrieval]. It generates both individual predictions and top N recommendations per user (as proposed in [Sarwar]). To compute a prediction for the active user, the CFEngine first computes the Pearson correlation between the active user and all other users. The prediction for a specific item is then a weighted average of the other users’ ratings for that item. Each rating is first adjusted by subtracting that user’s mean rating, the average is weighted by the correlation values, and users with negative correlations are discarded.
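The prediction step described above can be sketched as follows. This is an illustration under stated assumptions, not the code in org.recommender.algorithms: the class and method names are hypothetical, correlations are computed over co-rated items only, users with fewer than two co-rated items get a correlation of zero, and the active user's own mean rating is added back to the weighted average of deviations (the usual formulation of this algorithm, but not something this manual states explicitly).

```java
import java.util.List;
import java.util.Map;

/** Illustrative user-to-user Pearson prediction; each user is a map of itemid to rating. */
final class PearsonPredictionSketch {
    private PearsonPredictionSketch() {}

    /** Mean of all ratings in one user's profile, or 0 for an empty profile. */
    static double mean(Map<Integer, Double> ratings) {
        if (ratings.isEmpty()) return 0.0;
        double sum = 0.0;
        for (double r : ratings.values()) sum += r;
        return sum / ratings.size();
    }

    /** Pearson correlation over items rated by both users; 0 when it is undefined. */
    static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        int n = 0;
        double sumA = 0.0, sumB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) { sumA += e.getValue(); sumB += rb; n++; }
        }
        if (n < 2) return 0.0; // assumption: too little overlap counts as no correlation
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb == null) continue;
            double da = e.getValue() - meanA;
            double db = rb - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? 0.0 : cov / denom;
    }

    /** Predicts the active user's rating for itemId from positively correlated users. */
    static double predict(Map<Integer, Double> active,
                          List<Map<Integer, Double>> others, int itemId) {
        double activeMean = mean(active);
        double weightedDeviation = 0.0, weightSum = 0.0;
        for (Map<Integer, Double> other : others) {
            Double rating = other.get(itemId);
            if (rating == null) continue;            // this user has not rated the item
            double weight = pearson(active, other);
            if (weight <= 0.0) continue;             // discard non-positive correlations
            weightedDeviation += weight * (rating - mean(other));
            weightSum += weight;
        }
        // Assumption: fall back to the active user's mean when no neighbor qualifies.
        return weightSum == 0.0 ? activeMean : activeMean + weightedDeviation / weightSum;
    }
}
```

Subtracting each neighbor's mean before averaging compensates for users who rate everything high or everything low, so that only their relative preference for the item carries over.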
This algorithm supports top N’s by type. You can assign integer type labels to each item, and then request the top N predictions from items of a particular type.
In theory, nearest neighbor algorithms do not scale well: the time needed to compute a prediction grows in proportion to the number of users in the system. This is not an issue if you do not have many users. For those who may have many users, or just really slow hardware, the CFEngine provides user sampling to keep the computation time roughly constant regardless of the total number of users.
In lib/CFServer.properties, you can specify the maximum number of users to sample from the total set of all users (CFServer.mem.maxSampledUsers). When the CFEngine server starts, it will selectively load users (i.e. load all ratings associated with a user) up to the amount specified in CFServer.properties. The CFEngine server selects those users based on the “utility” of each user, which is specified in the utility column of the USER_INFO table. There are many ways to compute the utility of particular users as predictors. We provide a program – calcUserInfo, which will compute two different measures of utility, Entropy and Inverse User Frequency. These measures compute the utility of users as predictors based on the popularity of items they have rated and the number of items they have rated. Roughly, users who have rated items that very few people have rated have higher utility, and users who have rated more items have higher utility. More investigation of these and new measures is needed – these are just trial measures. For example, the measures do not directly take into account the currency of the ratings or the coverage of items by the entire sampled data.
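The selection step described above amounts to "keep the maxSampledUsers users with the highest utility". A minimal sketch of that selection is shown below. The class and method names are hypothetical, the utility values are assumed to have been read already from the utility column of USER_INFO, and the way ties are broken is an assumption. It does not show how calcUserInfo computes Entropy or Inverse User Frequency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative top-utility user selection; not the CFEngine server's own loader. */
final class UtilitySampling {
    private UtilitySampling() {}

    /** Returns up to maxSampledUsers userids, ordered from highest to lowest utility. */
    static List<Integer> selectUsersToLoad(Map<Integer, Double> utilityByUser,
                                           int maxSampledUsers) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(utilityByUser.entrySet());
        // Sort by utility, highest first.
        entries.sort(Map.Entry.<Integer, Double>comparingByValue().reversed());
        List<Integer> selected = new ArrayList<>();
        for (Map.Entry<Integer, Double> entry : entries) {
            if (selected.size() >= maxSampledUsers) break;
            selected.add(entry.getKey());
        }
        return selected;
    }
}
```

Setting maxSampledUsers above the number of users simply selects everyone, which matches the advice to start by loading all users and only reduce the sample if performance is too slow.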
The CFEngine is designed to run as a separate process from the application that is utilizing the functionality of the CFEngine. Thus it follows a client-server model. A CFEngine client is a software application (running in a separate process from the CFEngine) that makes requests to the CFEngine. This chapter provides the background necessary to understand how to write a CFEngine client. It also provides step-by-step recipes for creating clients in several different languages and platforms.
Because the CFEngine and your CFEngine client application will be running in different processes, your client must use a remote protocol to communicate with the CFEngine. The current release of the CFEngine supports two different methods for communicating with the CFEngine:
We provide a brief description of these two methods in the next two subsections.
Integrated with Java is the capability to perform Remote Method Invocation, or RMI. RMI allows one Java process to execute methods on an object that exists in a separate Java process, potentially on a different computer on the network. Java RMI is a good, high-performance solution if your client application is written in Java.
One consideration is that RMI will create a new thread in the server for every single incoming request, with no upper limit (this is a “feature” of Java RMI, not the CFEngine), and if you have thousands of different applications all connecting to the CFEngine at the same time, then the CFEngine will probably get bogged down with thread creation and slow to a crawl. However, the CFEngine should support 10-100 concurrent threads accessing it reasonably well. One individual thread can handle around 100 recommendation requests per second.
CORBA is the Common Object Request Broker Architecture. In theory, CORBA allows applications written in arbitrarily different languages to communicate with each other, if the appropriate software exists, through the use of a language-independent communication protocol called IIOP. The CFEngine provides an interface that CORBA-compliant clients can connect to. For more information about CORBA, see http://www.omg.org/gettingstarted/corbafaq.htm
We have successfully tested a CORBA client written in C++ with the CFEngine using the freely-available TAO CORBA ORB. Both of these example client APIs are provided with the CFEngine distribution. As a result, you are guaranteed to be able to access the server from C++ without having to purchase any additional software.
We also provide the CORBA Interface Definition Language (IDL) interface file for those who wish to access the CFEngine from a different programming language, or using a different ORB.
In an abstract sense, the CFEngine deals entirely with numeric data. Users are represented by unique integer userids, items are represented by unique integer itemids, and types (categories) are represented by unique typeids. When you specify a particular user, item, or type to the CFEngine, you use an integer. When the CFEngine returns a list of recommendations or ratings, those are identified by itemids and numeric ratings or predictions (both doubles).
Thus, in order to work with the CFEngine, you will need to generate numeric encodings for your users and your items. These encodings may be sparse: for example, if you had four users, their userids would not have to be “1, 2, 3, 4”. They could be “123, 3000, 10000, 34565”. The same holds for itemids. You may generate this encoding yourself offline and then load the ratings into the database before the CFEngine is started. The CFEngine server interface also provides two methods to provide you with unique userIds and itemIds during runtime (CFEngine.getNextUserId() and CFEngine.getNextItemId()).
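Because the CFEngine only knows integers, the client has to remember which integer belongs to which user or item. A minimal in-memory sketch of such a mapping follows. The class IdRegistry and its methods are hypothetical, and the local counter is only a stand-in: in a running system the ids could instead come from CFEngine.getNextUserId() or CFEngine.getNextItemId(), and the mapping would normally be persisted in the application's own database.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client-side mapping between application names and numeric ids. */
final class IdRegistry {
    private final Map<String, Integer> idByName = new HashMap<>();
    private final Map<Integer, String> nameById = new HashMap<>();
    private int nextId;

    /** firstId is the first id to hand out; ids need not start at 1 or be dense. */
    IdRegistry(int firstId) {
        this.nextId = firstId;
    }

    /** Returns the existing id for a name, or assigns and records a new one. */
    synchronized int idFor(String name) {
        Integer existing = idByName.get(name);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        idByName.put(name, id);
        nameById.put(id, name);
        return id;
    }

    /** Returns the name recorded for an id, or null if the id is unknown. */
    synchronized String nameFor(int id) {
        return nameById.get(id);
    }
}
```

One registry would typically be kept for users and a separate one for items, since userids and itemids are independent number spaces.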
As described in Section 7, the CFEngine is designed to run in a client-server architecture. In theory, you could link the CFEngine classes directly into your Java application, but this has not been tested extensively.
The CFEngine does not currently manage its own thread pool for recommendation and prediction computation. Rather, it relies on the remote procedure call interface (RMI or CORBA) to initiate new computation threads for each request. If you choose to try and link the CFEngine classes directly into your Java application, then you will not get overlapping of I/O and computation unless you create multiple threads in your application that call the CFEngine interface.
On the other hand, the CFEngine is thread-safe, so concurrent execution of all CFEngine interface methods is supported.
For the remainder of this manual, we will assume that a client-server model is used, with the CFEngine and the client application in separate processes (possibly on separate machines).
If you will be connecting to the CFEngine from a Java application, such as a Java servlet, applet, or other Java application, then using RMI will probably be the best option. In this section, we demonstrate how to use the RMI interface, using the example console-based client that is included with the CFEngine distribution.
Included in the CFEngine distribution is a simple, console-based client, found in the org.recommender.clients.console package. The shell script bin/cf_client will start the console-based CFEngine client for you. The client application provides a simple, menu-based interface for interacting with the CFEngine. You can find the source code for this console in the org/recommender/clients/console subdirectory. All of the code in the Console Client that communicates with the CFEngine is found in the class org.recommender.clients.console.ClientCFManager.
Figure 2. A screenshot of the console-based recommendation application that will be used as an example in this section.
The first step in any client application is to initialize a connection to the CFEngine server. In the Console Client, this is done in the constructor of the ClientCFManager class. The relevant code is listed below in Table 1.
Connecting to the server is as simple as getting a reference to a remote object in the server that implements the CFEngine interface, and then invoking any method on that object. By default, the CFEngine server registers itself with the RMI name server under the name “cfengine”, all lowercase. Therefore, getting a reference to the CFEngine object is as simple as calling Naming.lookup("cfengine") and typecasting the result to CFEngine.
Once you have a reference to a remote object, you can test the connection to the server by invoking any method. The CFEngine interface has a simple method called test(), designed just for this purpose. The test() method executes a simple method on the server that returns a string indicating that the server is alive.
If your CFEngine server is not currently running on the host defined by cfengineHost, then either Naming.lookup() or server.test() will throw an exception. Usually the exception will be thrown by Naming.lookup() because the RMI registry runs as a thread in the CFEngine server.
import java.rmi.Naming;
import org.recommender.server.CFEngine;

CFEngine server;
String cfengineHost = "localhost";

private void connectToServer() {
    String url = "rmi://" + cfengineHost + "/";
    try {
        // Look up the remote CFEngine object registered under the name "cfengine".
        server = (CFEngine) Naming.lookup(url + "cfengine");
        // Invoke a trivial remote method to confirm that the server is alive.
        String resultString = server.test();
        System.err.println(resultString);
    } catch (Exception e) {
        System.err.println("Error: Cannot connect to service \"cfengine\" via registry on "
                + cfengineHost + ": " + e);
        e.printStackTrace();
        System.exit(1);
    }
}
Table 1: A code segment for initiating a connection to the CFEngine server
There are three commonly seen exceptions within the CFEngine interface. For the most part, these exceptions will probably be handled in the same way regardless of which CFEngine method you are calling.
java.rmi.RemoteException. This is an exception that will be thrown by the Java RMI subsystem if there is a communication problem with the CFEngine server. This could mean that a) your CFEngine server is not running on the host you are trying to connect to, b) your connection to the server has been closed for some reason (timed out, perhaps), or c) there is a network outage between your client and your server.
CFIllegalParameterException. This exception is thrown whenever one or more of the parameters passed to the method are not valid. Most commonly this is due to passing a null reference or a negative id. Specifying a positive userid or itemid is never illegal: the CFEngine server assumes that every possible userid or itemid exists, even if it has no existing data for that id.
CFIllegalListParameterException. This exception is thrown by methods that are passed an array (i.e. a list) of input. The exception is identical to CFIllegalParameterException, with the addition of a method getNumSuccessful() that returns the number of elements of the array that were successfully processed before the erroneous parameter occurred. For example, if you called setRatingList(), and the first ten ratings were valid, but the eleventh rating had an id of -1, CFIllegalListParameterException would be thrown, and getNumSuccessful() on the exception would return 10.
The CFEngine needs ratings to compute recommendations. Before you start the CFEngine server process, you can load the ratings directly into the relational database on which the CFEngine server operates. However, while the server is running, new ratings must be sent to the server through the CFEngine interface. This is because the server maintains a cache of ratings to increase the performance of recommendation computation. Loading ratings into the database while the server is running could lead to inconsistencies in the data.
The CFEngine interface provides two methods for sending ratings to the server:
public void setRating(int user, int item, double rating)
    throws java.rmi.RemoteException, CFIllegalParameterException;

public int setRatingList(ItemRating[] newRatings)
    throws java.rmi.RemoteException, CFIllegalListParameterException;
Table 2. CFEngine interface methods for sending ratings to the server.
setRating() sends a single rating, while setRatingList() is used when you want to send more than one rating at a time. setRatingList() should be more efficient if you have more than one rating to send over a short period of time.
All methods in the CFEngine interface will throw the java.rmi.RemoteException if there is a network communication problem with the server. This could happen if the server crashes or is shut down while the client is still running, if there is a network outage between the client and the server, or if the socket connection between the client and the server is shut down for any reason.
Is it possible for an RMI connection to time out if it is not used for some time? Or is there some sort of keep-alive mechanism that ensures the connections never time out?
Table 3 demonstrates sending one rating using the CFEngine interface. The setRating() method can throw two exceptions. java.rmi.RemoteException will be thrown if there is an error communicating with the server (for example, if the server is not running). In our example, we try to reconnect to the server: the connectToServer() method (Table 1) will exit the application if it fails to reconnect.
CFIllegalParameterException will be thrown if the userid, itemid, or rating is invalid. The only invalid userids and itemids are negative numbers. An invalid rating is any rating outside of the range specified in the server CFServer.properties file.
do {
    try {
        server.setRating(userid, itemid, rating);
        return;
    } catch (java.rmi.RemoteException e1) {
        // Communication problem: reconnect, then retry the call
        connectToServer();
    } catch (CFIllegalParameterException e1) {
        // Invalid userid, itemid, or rating: report and give up
        e1.printStackTrace();
        System.exit(1);
    }
} while (true);
Table 3. Example of sending a new rating to the CFEngine server.
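The loop in Table 3 retries indefinitely for as long as reconnection succeeds. If you would rather give up after a fixed number of attempts, the same pattern can be wrapped in a small helper. The following is only a sketch: RetryingCall, RemoteAction, and withRetries are names invented for this illustration and are not part of the CFEngine distribution, and the reconnect argument stands in for whatever your client does in connectToServer().

```java
import java.rmi.RemoteException;

/** Hypothetical helper, not part of the CFEngine distribution. */
final class RetryingCall {
    /** A remote operation that may fail with a communication error. */
    interface RemoteAction<T> {
        T run() throws Exception;
    }

    /**
     * Runs the action. On RemoteException it calls reconnect and tries again,
     * for at most maxAttempts attempts. Any other exception (for example
     * CFIllegalParameterException) propagates immediately.
     */
    static <T> T withRetries(RemoteAction<T> action, Runnable reconnect, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RemoteException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (RemoteException e) {
                last = e;
                if (attempt < maxAttempts) {
                    reconnect.run(); // only reconnect if another attempt follows
                }
            }
        }
        throw last; // every attempt failed with a communication error
    }
}
```

With such a helper, a call like server.getMinRating() could be passed as the action and connectToServer() as the reconnect step, provided connectToServer() is adapted so that it does not exit the application on failure.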
There are no CFEngine methods for creating a user - sending a rating for a userid that has not been seen before by the server will automatically result in that user being created by the server.
If you send a rating for a (userid, itemid) pair for which there is already a rating, the new rating will overwrite the existing rating.
setRatingList() works in much the same way as setRating(), except that an array of ItemRating objects is passed, rather than a single (userid, itemid, rating) triplet. Table 4 shows the methods exported by the ItemRating class, by which you can create an ItemRating object and access its elements.
// org.recommender.server.ItemRating

// Constructor
public ItemRating(int userID, int itemID, double rating)

// Accessors
public int getUserID()
public int getItemID()
public double getRating()
Table 4. Methods exported by the ItemRating class – a CFEngine server data type used by methods in the CFEngine interface.
Our example Console Client uses its own class – org.recommender.clients.console.Rating – to store ratings internally. Thus in the example shown in Table 5, we first must convert from the client’s internal representation of ratings (using the Rating object) to the representation supported by setRatingList() (using ItemRating ).
// "ratings" is an array of objects representing ratings
// on the client side that need to be sent to the server
ItemRating[] newRatingList = new ItemRating[ratings.length];
int number = 0;

// Convert each client-side Rating into a server-side ItemRating
for (int i = 0; i < ratings.length; i++) {
    Rating r = ratings[i];
    newRatingList[i] = new ItemRating(
        r.getUser().getID(),
        r.getItem().getID(),
        r.getRatingValue()
    );
}

do {
    try {
        number = server.setRatingList(newRatingList);
        return number;
    } catch (java.rmi.RemoteException e1) {
        // Communication problem: reconnect, then retry the call
        connectToServer();
    } catch (CFIllegalListParameterException e1) {
        System.err.println("Only " + e1.getNumSuccessful()
            + " ratings were successfully sent");
        e1.printStackTrace();
        System.exit(1);
    }
} while (true);
Table 5. Example code for sending a list of ratings to the CFEngine server.
In Table 5, notice that we catch the CFIllegalListParameterException. For performance reasons, the CFEngine server does not validate all elements of the array before starting to add ratings to the database. Thus, an illegal parameter may be encountered halfway through an array, with some ratings having been processed and some not. In such a case, the CFEngine server immediately throws an exception, and you can query the exception to determine exactly how many ratings were successfully processed before the exception occurred.
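Rather than exiting as Table 5 does, a client could use the count to skip the offending element and resend the rest. The sketch below illustrates that idea under stated assumptions: ListParameterException and ListCall are stand-ins invented for this example (they play the roles of CFIllegalListParameterException and setRatingList()), and the element at index getNumSuccessful() is taken to be the illegal one, as in the "eleventh rating" example given for that exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in for CFIllegalListParameterException (hypothetical, for illustration). */
class ListParameterException extends Exception {
    private final int numSuccessful;
    ListParameterException(int numSuccessful) { this.numSuccessful = numSuccessful; }
    int getNumSuccessful() { return numSuccessful; }
}

/** Stand-in for a list-taking CFEngine call such as setRatingList(). */
interface ListCall<T> {
    void send(T[] batch) throws ListParameterException;
}

final class PartialListSender {
    /**
     * Sends the batch. Whenever the call reports an illegal element, records
     * that element, skips it, and resends only the unprocessed remainder.
     * Returns the rejected elements in their original order.
     */
    static <T> List<T> sendSkippingInvalid(T[] batch, ListCall<T> call) {
        List<T> rejected = new ArrayList<>();
        int start = 0;
        while (start < batch.length) {
            try {
                call.send(Arrays.copyOfRange(batch, start, batch.length));
                break; // the remainder was accepted in full
            } catch (ListParameterException e) {
                int bad = start + e.getNumSuccessful(); // index of the offending element
                rejected.add(batch[bad]);
                start = bad + 1;
            }
        }
        return rejected;
    }
}
```

Whether silently skipping bad ratings is appropriate depends on your application; in many cases an illegal id signals a client bug that is better reported than ignored.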
While the CFEngine identifies each user, item, and type with an integer, we have much more useful representations of those items that we will want to show to the user. For example, in a movie recommender, we will want to display the titles of movies, not their itemids. And users will find it more useful to log in with a text login name and not a long numeric userid. As a result, your client will most likely need to maintain data structures that map to and from numeric identifiers and application specific data types.
As an example, the Console Client assumes that each user, item, and type has a name that is represented by a String. The Console Client maintains its own database tables that define relations between numeric identifiers and String names. The code to manage these mappings can be found in org.recommender.clients.console.ClientDBManager. The ClientDBManager maintains three simple tables in its own relational database manager that map from userid, itemid, or typeid to names. We will not discuss in detail how the Console Client handles this mapping. See the source code for more details.
The appropriate data structures for performing the mapping will depend on your application. In many cases, you will want to store additional information about each user, item, or type beyond just its name.
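For a small application, the mapping does not have to live in a database. As a rough illustration only, the class below keeps a two-way mapping between numeric ids and names in memory. IdNameMap is a name made up for this sketch; it is not how the Console Client's ClientDBManager is implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory two-way mapping between numeric ids and names. */
final class IdNameMap {
    private final Map<Integer, String> nameById = new HashMap<>();
    private final Map<String, Integer> idByName = new HashMap<>();

    /** Registers a pair; rejects negative ids and duplicates on either side. */
    void put(int id, String name) {
        if (id < 0) {
            throw new IllegalArgumentException("ids must be non-negative");
        }
        if (nameById.containsKey(id) || idByName.containsKey(name)) {
            throw new IllegalStateException("id or name already mapped");
        }
        nameById.put(id, name);
        idByName.put(name, id);
    }

    /** Returns the name for an id, or null if the id is unknown. */
    String nameOf(int id) {
        return nameById.get(id);
    }

    /** Returns the id for a name, or -1 (never a legal CFEngine id) if unknown. */
    int idOf(String name) {
        Integer id = idByName.get(name);
        return id == null ? -1 : id;
    }
}
```

You would keep one such map each for users, items, and types, and replace it with persistent storage once the names need to survive a client restart.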
For the most part, you probably won’t need to delete ratings, but there are a few circumstances in which you might. For example, you might have a privacy policy that allows users to explicitly request that their ratings be deleted from the system. Or on a retailer web site, a customer might want to delete a rating that was implicitly recorded due to their purchase of a gift for another person. Or a user may want to tinker with their profile by deleting ratings and seeing the change in the recommendations. The CFEngine interface provides two methods to support deleting ratings from the server database.
public void removeRating(int user, int item)
    throws java.rmi.RemoteException, CFIllegalParameterException;

public void removeRatingList(int user, int[] itemIDs)
    throws java.rmi.RemoteException, CFIllegalParameterException;
Table 6. CFEngine methods for deleting ratings stored by the server.
Often the client may want to determine what items the user has already rated and what those rating values are. Or the client may want to determine if a user has rated a specific item, or a list of items. In such cases, you can make use of the methods shown in Table 7.
public ItemRating getRating(int userID, int itemID)
    throws RemoteException, CFIllegalParameterException;

public ItemRating[] getRatingList(int userID, int[] itemIDs)
    throws RemoteException, CFIllegalListParameterException;

public ItemRating[] getUserRatingList(int userID)
    throws java.rmi.RemoteException, CFIllegalParameterException;

public ItemRating[] getItemRatingList(int itemID)
    throws java.rmi.RemoteException, CFIllegalParameterException;
Table 7. CFEngine interface methods for retrieving ratings from the server.
All methods return zero or more objects of class ItemRating, which is described in Table 4.
In Table 7, note that getUserRatingList() and getItemRatingList() do not throw the CFIllegalListParameterException. This is because neither takes an array as a parameter.
If a rating is requested for an item that the user has not yet rated, the CFEngine interface does not throw an exception. Rather, it simply reports that user’s rating as the “error rating”, which is the value returned by the CFEngine method getErrRating() (see Table 13).
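This means that a client can test whether a user has rated an item by comparing the returned value against the error rating. The following sketch shows the comparison; RatingSource is a stand-in interface invented for this example so that the code is self-contained. In a real client, its two methods would be backed by getRating(...).getRating() and getErrRating().

```java
/** Minimal stand-in for the two CFEngine calls used here (hypothetical). */
interface RatingSource {
    double ratingOf(int userID, int itemID); // plays the role of getRating(...).getRating()
    float errRating();                       // plays the role of getErrRating()
}

final class RatedCheck {
    /** True if the stored rating differs from the server's error rating. */
    static boolean hasRated(RatingSource src, int userID, int itemID) {
        // ItemRating holds a double while getErrRating() returns a float,
        // so the comparison is done at float precision.
        return (float) src.ratingOf(userID, itemID) != src.errRating();
    }
}
```

The sketch assumes that the error rating lies outside the range of valid ratings, so that it can never be confused with a rating the user actually gave.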
Now we get to the core functionality of the CFEngine server – predicting ratings for items that the user has not already rated. This section describes methods whereby you specify exactly what items you would like ratings predicted for. The next section describes how you can ask for the list of items with the highest predicted ratings, without having to specify all the different options.
As in the previous sections, there is a method to request a prediction for a single item, and a separate method to request a prediction for a list of items. These are listed in Table 8.
public ItemPrediction getPredictedRating(int userID, int itemID)
    throws RemoteException, CFIllegalParameterException;

public ItemPrediction[] getPredictedRatingList(int userID, int[] itemID)
    throws java.rmi.RemoteException, CFIllegalListParameterException;
Table 8. Methods for requesting predictions from the server via the CFEngine interface.
These methods and the methods in the next section return ItemPrediction objects, which are described in Table 9.
// org.recommender.server.ItemPrediction class
public ItemPrediction(int userID, int itemID, float pred);
public int getUserID()
public int getItemID()
public float getPrediction()
Table 9. Interfaces to the ItemPrediction class, which is returned by getPredictedRating…() and getRecommendation…() methods.
These methods are most useful in situations where the user specifically requests to see an item, perhaps by searching for it by name or browsing for it.
Table 10 shows an example of how to request a single prediction for a given userid and itemid. Requesting a list of predictions using getPredictedRatingList() is handled very similarly. See Section 18 for an example of using a CFEngine method that takes an array as a parameter.
do {
    try {
        predictedRating = server.getPredictedRating(user.getID(), item.getID());
        return new Prediction(item, predictedRating.getPrediction());
    } catch (RemoteException e) {
        // Communication problem: reconnect, then retry the call
        connectToServer();
    } catch (CFIllegalParameterException e) {
        e.printStackTrace();
        System.exit(1);
    }
} while (true);
Table 10. Example code for predicting a single rating using the CFEngine interface. Note that “Prediction” is a client-side datatype defined in the sample client application.
If you ask for a prediction for a user or an item (or both) that does not exist, the CFEngine methods will not throw exceptions (unless the id < 0, which is illegal). Rather, the CFEngine will simply predict the “error rating”. You can determine what the error rating is using the CFEngine method getErrRating().
If you request a prediction for an item for which the CFEngine server already has a rating, getPredictedRating() and getPredictedRatingList() will not return that rating. Rather they will still try and compute a prediction using the defined algorithm. If this is not the desired result for you, then you should first call getRating(), getRatingList(), or getUserRatingList() to determine what items the user has already rated.
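One way to do this is to collect the ids of the items the user has already rated and remove them from the list of candidates before asking for predictions. The helper below is a sketch of that filtering step only. UnratedFilter is a hypothetical name, and the ratedItemIDs array is assumed to have been built by calling getItemID() on each ItemRating returned by getUserRatingList().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical client-side helper; not part of the CFEngine interface. */
final class UnratedFilter {
    /**
     * Returns the candidate item ids that do not appear in ratedItemIDs,
     * preserving the order of the candidates.
     */
    static int[] unrated(int[] candidateItemIDs, int[] ratedItemIDs) {
        Set<Integer> rated = new HashSet<>();
        for (int id : ratedItemIDs) {
            rated.add(id);
        }
        int[] out = new int[candidateItemIDs.length];
        int n = 0;
        for (int id : candidateItemIDs) {
            if (!rated.contains(id)) {
                out[n++] = id;
            }
        }
        return Arrays.copyOf(out, n); // trim to the number of ids kept
    }
}
```

The resulting array can then be passed to getPredictedRatingList(), while the ratings for the remaining items are shown as the user's own.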
While the previous section describes methods that will predict ratings for items that you specify, most users will want a list of “best bets”. Collaborative filtering is most often used in information overload situations, where there are just too many items to evaluate individually. Rather, the user wants to be immediately recommended the items they are most likely to enjoy or find useful. To support this, the CFEngine interface provides two methods, shown in Table 11.
public ItemPrediction[] getRecommendations(int curUser, int number, int offset)
    throws RemoteException, CFIllegalParameterException;

public ItemPrediction[] getRecommendationsByType(int curUser, int number, int offset, int type)
    throws RemoteException, CFIllegalParameterException;
Table 11. Methods for requesting recommendations (best bets) from the CFEngine interface.
See Table 9 for a description of the ItemPrediction class.
The CFEngine recommendation methods will compute the top N most recommended items (those with the highest predicted ratings), where N is specified at boot time in the CFServer.properties file (CFServer.mem.numTopNRecords). From this list of recommendations, it will return up to “number” recommendations, starting from the recommendation rank specified in “offset.” For example, getRecommendations(20, 10, 15) might compute 100 recommendations (CFServer.mem.numTopNRecords=100), yet it would return only the 10 recommendations starting at rank 15.
The getRecommendations…() methods are designed to work in the fashion that we commonly see with search engine interfaces. For example, on a search engine, you might have 1000 matching results, but you want to display ten results per page. Rather than making the client cache those 1000 results, the server will cache the recommended items for a user until new ratings are received or until the space in the cache is needed for other things. With this approach, you can design an interface that allows the user to click a “See next 10 recommendations” button without having to repeat any computation.
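To make the paging idea concrete, the sketch below walks through the recommendations one page at a time until the server returns fewer items than requested. RecommendationSource is a stand-in invented for this example; it returns plain item ids, where the real getRecommendations() returns ItemPrediction objects and can throw exceptions. Because this manual does not state whether the top-ranked recommendation has offset 0 or 1, the starting offset is left as a parameter.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for getRecommendations(); returns item ids only (hypothetical). */
interface RecommendationSource {
    int[] recommendations(int userID, int number, int offset);
}

final class RecommendationPager {
    /**
     * Collects recommendations page by page until the source returns a short
     * page. firstOffset is the offset of the top-ranked recommendation.
     */
    static List<Integer> fetchAll(RecommendationSource src, int userID,
                                  int pageSize, int firstOffset) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
        List<Integer> all = new ArrayList<>();
        int offset = firstOffset;
        while (true) {
            int[] page = src.recommendations(userID, pageSize, offset);
            for (int id : page) {
                all.add(id);
            }
            if (page.length < pageSize) {
                return all; // short page: nothing further to fetch
            }
            offset += pageSize;
        }
    }
}
```

An interactive client would normally fetch a single page per button click rather than looping, but the offset arithmetic is the same.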
The getRecommendationsByType() method allows you to request “best bets” from a subset of the items in your database. For example, if you are recommending movies, you might want to get recommendations for the best bets in comedy movies. Types are defined at boot time via a table in the CFEngine server database. See the administration guide for instructions on how to define types. Each type is given a unique identifier – much like userids or itemids.
Table 12 shows an example of using getRecommendations(). Note that getRecommendations…() may return fewer than the number of predictions you asked for. Check the length of the returned array (using array.length) to determine exactly how many recommendations were returned.
ItemPrediction[] recommendations = null;
do {
    try {
        recommendations = server.getRecommendations(user.getID(), number, offset);
        break;
    } catch (RemoteException e) {
        // Communication problem: reconnect, then retry the call
        connectToServer();
    } catch (CFIllegalParameterException e) {
        e.printStackTrace();
        System.exit(1);
    }
} while (true);
Table 12. Example of getting a list of recommendations via the CFEngine server interface.
The CFEngine server interface provides three methods to allow a client to identify what are appropriate rating values. These methods are shown in Table 13.
public float getMinRating() throws java.rmi.RemoteException;
public float getMaxRating() throws java.rmi.RemoteException;
public float getErrRating() throws java.rmi.RemoteException;
Table 13. Methods for identifying valid ratings recognized by the server.
getMinRating() and getMaxRating() will identify the minimum and maximum values (inclusive) of acceptable ratings.
getErrRating() returns the numeric value that is used when the CFEngine server does not have a rating for a requested item, or cannot predict a rating for a requested item.
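A client can use these values to check a rating locally before sending it, instead of waiting for the server to reject it with CFIllegalParameterException. The class below is a small sketch of such a check. RatingRange is a hypothetical client-side class whose bounds would be filled in once from getMinRating() and getMaxRating().

```java
/** Hypothetical client-side holder for the server's valid rating range. */
final class RatingRange {
    private final float min;
    private final float max;

    /** min and max would come from getMinRating() and getMaxRating(). */
    RatingRange(float min, float max) {
        if (min > max) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.min = min;
        this.max = max;
    }

    /** True if the rating lies within [min, max], both ends inclusive. */
    boolean isValid(double rating) {
        return rating >= min && rating <= max;
    }
}
```

For example, with a range of 1 to 5, isValid(5.0) is true and isValid(5.5) is false.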
When you have multiple client processes connecting to a single CFEngine, you need to ensure that each new identifier assigned is unique. That is, we don’t want client #1 to create a new user as userid=100, while client #2 creates a separate user as userid=100. To ensure that this doesn’t happen, the CFEngine interface provides two methods that will generate identifiers for users and items that are guaranteed to be unique. The same id will never be given out twice, even across reboots of the server.
public int getNextUserId() throws java.rmi.RemoteException;
public int getNextItemId() throws java.rmi.RemoteException;
Table 14. Methods for generating unique identifiers.
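A typical use is to request a new id only the first time a login name is seen, and to reuse the stored id afterwards. The sketch below shows that "get or create" step. UserRegistry is a hypothetical class, and the IntSupplier stands in for a wrapper around getNextUserId(); since IntSupplier cannot throw checked exceptions, a real wrapper would have to handle java.rmi.RemoteException itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntSupplier;

/** Hypothetical client-side registry of login names and their userids. */
final class UserRegistry {
    private final Map<String, Integer> idByLogin = new HashMap<>();
    private final IntSupplier nextId; // would wrap server.getNextUserId()

    UserRegistry(IntSupplier nextId) {
        this.nextId = nextId;
    }

    /** Returns the id already assigned to login, or allocates a fresh one. */
    synchronized int idFor(String login) {
        Integer id = idByLogin.get(login);
        if (id == null) {
            id = nextId.getAsInt();
            idByLogin.put(login, id);
        }
        return id;
    }
}
```

Note that this registry is local to one client process. If several clients must agree on which login name owns which userid, the mapping itself has to be stored somewhere they all share.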
To ease administration, it is possible for clients to shut down the server. Note that this means that the CFEngine is only intended to be run in a controlled network environment. There is no access control.
public void shutdown() throws java.rmi.RemoteException;
To be written.