Dinosaur Riding Scrum Ninja Jedi Masters

Agile is so last year...

Similarity Engine - Alternate Methods

Neural Network Approach

After training the neural network on approximately 200 training samples, it was evident that the network did not necessarily generalize to games outside the scope of the ones rated by experts in the training set. Further input will be necessary to complete the model; this will hopefully happen in the coming weeks as Alex makes more visits to CABS meetings. Once a fuller picture of the training data develops, the network should generalize better to the regions of the input space where it currently performs poorly (for instance, war games). This portion of the project will continue to be a work in progress. In the meantime, as seen in our progress reports, I have also set up an alternate method that provides users with a very literal measure of similarity between games.

Simple Statistical Approach

In order to provide results that were more relevant in the immediate context of publicizing our project, I introduced an alternate method to the neural network approach that returns to some of the work I had done earlier in the project. Rather than train a network with the training data, the stats method performs principal component analysis to produce a 10-dimensional vector for each game. The confidence level for each game relative to another is then calculated simply from the Euclidean distance between the two vectors.
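To make the distance calculation concrete, here is a minimal sketch of that scoring step. Our implementation is in Clojure; this illustration uses Python, the function names are ours rather than the project's, and the 1/(1+d) confidence mapping is an illustrative choice, not the exact formula we used:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_confidence(a, b):
    """Map distance to a score in (0, 1]; closer games score higher.

    The 1/(1+d) mapping is illustrative, not the project's formula.
    """
    return 1.0 / (1.0 + euclidean_distance(a, b))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # → 5.0
```

In the real system, `a` and `b` would be the 10-dimensional per-game vectors produced by the component analysis.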

Drawbacks

While this method returns games that are numerically similar to each other, often surfacing different versions of the same title in the search results, it fails to meet our initial goal of providing users with recommendations based on the more esoteric relationships between games. Two games may be incredibly similar by their attributes and still make a poor recommendation, particularly if the user is looking for variety and new options in the games they play. It is for these reasons that, while we are leaving this method as an option, we would much prefer the neural net approach to be used and to succeed in providing users with a diverse listing of similar games.

Tag Administration

Purpose

In all the Board Game Geek data, there are many different Mechanics and Categories. Within the system they are treated identically and are collectively called Tags. Each Tag has a name along with a detailed description.

We wanted to be able to modify the metadata related to the Tags quickly and easily. We already had a web server set up, so a simple web interface seemed a natural way to accomplish this.

Features

With the Tag Administration page, we were able to edit the titles and descriptions of each tag as we saw fit. A script automatically pulled all the descriptions in from Board Game Geek, and we could then quickly edit them here.

We also wanted to be able to modify the weight of each Tag within the query scoring system. I added a “Positive Influence” and a “Negative Influence” field to each Tag. By modifying the Positive Influence, an administrator can change the maximum score the query engine assigns to an attribute. Similarly, the Negative Influence adjusts the maximum penalty when an attribute is marked thumbs down by the user.
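As a rough sketch of how those two fields could feed into scoring (the real query engine lives in our Clojure codebase; the names and data below are hypothetical):

```python
def score_game(game_tags, prefs, influences):
    """Score a game against a user's tri-state tag preferences.

    prefs maps tag -> +1 (thumbs up) or -1 (thumbs down); unset tags are absent.
    influences maps tag -> (positive_influence, negative_influence), the
    admin-set maximum reward and penalty for that tag (names hypothetical).
    """
    score = 0.0
    for tag, vote in prefs.items():
        if tag not in game_tags:
            continue  # the game doesn't carry this tag; no effect on its score
        pos, neg = influences.get(tag, (1.0, 1.0))
        score += pos if vote > 0 else -neg
    return score

influences = {"Dice Rolling": (2.0, 1.0), "Auction": (1.0, 3.0)}
prefs = {"Dice Rolling": +1, "Auction": -1}
print(score_game({"Dice Rolling", "Auction", "Theme"}, prefs, influences))  # → -1.0
```

The point of the two separate fields is that reward and penalty need not be symmetric: here a thumbs-down on “Auction” costs more than a thumbs-up on it would earn.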

Lastly, we had plans to break up the giant wall of tri-state checkboxes on the query page by grouping related Tags. A common example of this was to group all “Theme” tags together. The administration page gave us a simple way to change the grouping.

Preview

You can see the final version of the Tag Administration Page here.

Expert Interface

Purpose

Data for the similarity engine (neural network) did not come out of thin air. It needed to be collected from our experts. The purpose of expert authentication and the expert interface is to allow our experts to easily contribute to the training of the neural network. The end goal is for an expert to rate how similar each game is to every other one. With 1,000 games there are 499,500 pairs (1,000 choose 2). Getting that many ratings at our scale is difficult, so the neural network is needed to extrapolate from the ratings we do have.
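The pair count above is just n choose 2, which a one-liner confirms:

```python
from math import comb

# Number of unordered pairs among 1,000 games: 1000 * 999 / 2
print(comb(1000, 2))  # → 499500
```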

Issues & Design

Our initial design was very simple. A random pair of games is selected. If the expert is familiar with both games, they rate how similar the games are; otherwise, they skip that pair. Unfortunately, most experts are not familiar with a significant proportion of the games in our database, which means that under normal conditions the expert would be skipping a lot.

To address this the interface was turned into a wizard with the following steps:

  1. Game Selection – 12 random games are selected from the database. The expert selects the ones they are familiar with.
  2. Pair Rating – For each pair combination of the games selected in step 1, obtain a rating from the expert.

An expert is far more likely to be familiar with 2 or more games out of a randomly selected twelve. Thus, less skipping is involved.
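The pair-generation step behind the wizard is a standard combinations-minus-already-rated filter. Sketched in Python (our app is Clojure, and these names and game titles are illustrative):

```python
from itertools import combinations

def pairs_to_rate(familiar_games, already_rated):
    """All unordered pairs of games the expert knows, minus pairs already rated.

    Pairs are normalized to sorted tuples so (a, b) and (b, a) count as the
    same pair regardless of the order they were stored in.
    """
    rated = {tuple(sorted(p)) for p in already_rated}
    return [p for p in combinations(sorted(familiar_games), 2)
            if p not in rated]

selected = ["Carcassonne", "Catan", "Dominion"]
done = [("Catan", "Carcassonne")]
print(pairs_to_rate(selected, done))
# → [('Carcassonne', 'Dominion'), ('Catan', 'Dominion')]
```

This also covers the "do not show combinations the expert has already rated" feature listed below: previously rated pairs simply never reach step 2.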

Features

  • Do not display games the expert has not seen before.
  • Limit the number of combinations shown.
  • Do not show combinations the expert has already rated.

Source Code

Development Setup

Setting up board-ultimatum for development is not a complicated task, but it has several dependencies. Most notably, you will need to install Leiningen, a project and dependency management tool for Clojure. Installation and usage instructions for Leiningen can be found on its GitHub page.

You can get a zip of our current source code, but we suggest using git, which can be downloaded here.

Once you have Leiningen set up, the instructions on our GitHub repository should suffice. Further dependencies exist if you want to modify or use styles; instructions for doing this are also available in the project's README.

Final System Design

The overall design of the system is now solidified. Here’s a quick look at our design:

System Design

It may be a bit simplified, but it captures all of the core elements of the system.

Final Presentation

We’ll be presenting our capstone project at CETI day. You can view the poster we made here.

The relative breakdown of work on the poster and presentation was as follows:

Worklog

Poster

  • Template styling and modification — David
  • Initial outline content — Ryan
  • Created and arranged design flowchart — Chris
  • Added some parts to the design text — Chris

board-ultimatum

  • Create simple statistical method for determining similarity of games. — David
  • Allow the user to specify the similarity engine provider. See the pull request. — Ryan

    Providers:

    • Neural Network
    • Simple Stats

Timebox 6

Worklog

Chris Powers

Alex Burkhart

Ryan McGowan

David Albert

  • Created icons for recommendation GUI
  • Added cross validation to the Neural Net system
  • Refined similar search results UI

Commit History

For more information on who has done what, see the commit history on our GitHub project.

Timebox 5

Worklog

Chris Powers

Alex Burkhart

Ryan McGowan

  • Completed expert functionality. (Commits)
    Highlights:
    • Designed and implemented routes and markup for expert-select and expert-compare interfaces
    • Designed and implemented basic insecure expert authentication.
    • Some minor style changes/improvements.
    • Rewrite of expert.js
    • Created the following helper namespaces (engine.model.expert, engine.model.relationship, form-validators, session, flash). Explore here.

David Albert

  • Create auto-completing search UI for finding similar games
  • Connect UI to database containing results from neural network

Commit History

For more information on who has done what, see the commit history on our GitHub project.

Timebox 4

Worklog

Chris Powers

Alex Burkhart

Ryan McGowan

  • Created expert interface. (Commits)
    Highlights:
    • Designed and implemented routes and markup for expert-select and expert-compare interfaces
    • Designed and implemented basic insecure expert authentication.
    • Some minor style changes/improvements.
    • Rewrite of expert.js
    • Created the following helper namespaces (engine.model.expert, engine.model.relationship, form-validators, session, flash). Explore here.

David Albert

Commit History

For more information on who has done what, see the commit history on our GitHub project.

Data Analysis and Neural Nets

Data Conversion and Analysis

The first task that I took on during this timebox was converting the data from the board game database into numerical vectors representing each game. The source code for this script can be found here. The vector includes a component for each of the many categories and mechanics; a component's value is 1.0 if the game carries that tag and 0.0 if it does not.

As you can tell by examining the code, the resulting vector has 100+ dimensions. For the sake of performance, I decided that this raw input was impractical for a neural net application. To solve this problem, I turned to PCA (Principal Component Analysis). PCA uses an orthogonal transform to find the directions along which the data varies most. By projecting my set of 1,000 vectors of 100+ dimensions onto the first 10 principal components, I was able to reduce each vector down to ten components. The following is a graph of the data on the first two principal components.

PC Analysis

It is worthwhile to note that the outliers in the bottom right are party games that can accommodate 50+ people and are fairly unique in that regard.
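For readers curious about the mechanics of the projection itself, here is a sketch using NumPy's SVD as a stand-in for the Incanter-based analysis described above. The 0/1 tag matrix below is randomly generated for illustration; the real one comes from the board game database:

```python
import numpy as np

def project_onto_principal_components(X, k=10):
    """Project rows of X (games x tags, 0/1 entries) onto the first k PCs."""
    Xc = X - X.mean(axis=0)              # center each tag column
    # SVD of the centered matrix; rows of Vt are the principal directions,
    # ordered by how much variance they capture
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                 # k-dimensional coordinates per game

rng = np.random.default_rng(0)
X = (rng.random((1000, 120)) < 0.2).astype(float)  # fake 1000-game tag matrix
reduced = project_onto_principal_components(X, k=10)
print(reduced.shape)  # → (1000, 10)
```

Plotting the first two columns of `reduced` against each other gives a chart like the one above.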

Once the data is converted, it is stored in the Mongo database.

The library I used to perform this analysis is Incanter, which contains much of the functionality found in R, if you are familiar with stats packages.

Neural Net and Engine output

The next step was to run the data through the neural net. Since the expert interface is not accessible to public users at this point, an untrained network is used. The network's hidden layers and propagation algorithm are arbitrary at this point, since there is no training data. I ran each combination of games through the network, with the input being the two game vectors concatenated: [GameA GameB]. The results were then stored in another Mongo collection.

One problem that I encountered was the amount of storage that the output takes. This problem was mitigated by storing only the 50 best-scoring games for each game, reducing the size by a factor of 20. All 1,000 scores for a game would be calculated and then trimmed down. This ensures that the database never holds more than ~50,000 records, versus 1,000,000.
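That trimming step amounts to a per-game top-k selection. A Python stand-in for our Clojure code (the game IDs and scores here are made up):

```python
import heapq

def top_matches(scores, k=50):
    """Keep only the k best-scoring partners for each game.

    scores maps game_id -> {other_id: score}. With 1,000 games the full
    table has ~1,000,000 entries; trimming to k=50 leaves ~50,000.
    """
    return {gid: dict(heapq.nlargest(k, others.items(), key=lambda kv: kv[1]))
            for gid, others in scores.items()}

scores = {"A": {"B": 0.9, "C": 0.2, "D": 0.7}}
print(top_matches(scores, k=2))  # → {'A': {'B': 0.9, 'D': 0.7}}
```

Using `heapq.nlargest` avoids fully sorting each game's 1,000 scores when only the top 50 are kept.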