Building a Personalized Steam Game Recommendation System Using BERT and LDA
Introduction
Recommendation systems are everywhere, from Netflix suggesting your next binge-worthy show to Amazon recommending products you might like. This tutorial takes inspiration from a research paper that combined sentiment analysis and matrix factorization for recommendations. Here, we instead combine BERT embeddings with LDA topic modeling to recommend Steam games.
What we’ll cover:
- Data Preparation: Fetching and cleaning data from the Steam API.
- Word Embeddings with BERT: Understanding and implementing BERT for word embeddings.
- Topic Modeling with LDA: Using LDA to extract topics from game reviews.
- Combining BERT and LDA: Merging the two feature sets to power the recommendation engine.
- Building the Streamlit App: Deploying the model in a user-friendly web app.
Step 1: Data Preparation
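Before we dive into modeling, we need to fetch and clean the data. We’ll be using data from the Steam API, which provides details on thousands of games. The code below assumes the game details and reviews have already been fetched and stored in a local SQLite database, steam_games.db.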
import sqlite3
import pandas as pd
# Connect to the SQLite database
conn = sqlite3.connect('steam_games.db')
# Load the game details and reviews into pandas DataFrames
games_df = pd.read_sql_query("SELECT * FROM game_details", conn)
reviews_df = pd.read_sql_query("SELECT * FROM game_reviews", conn)
# Close the connection
conn.close()
# Remove unwanted entries like DLCs, soundtracks, and demos
unwanted = 'soundtrack|OST|demo|DLC|playtest|resource pack'
filtered_games_df = games_df[~games_df['name'].str.contains(unwanted, case=False, na=False)].copy()
# Filter reviews based on the filtered games; .copy() avoids a SettingWithCopyWarning when columns are added later
filtered_reviews_df = reviews_df[reviews_df['appid'].isin(filtered_games_df['appid'])].copy()
# Save the filtered games for the Streamlit app
filtered_games_df.to_csv('filtered_games_df.csv', index=False)
Here, we use SQLite to store our data locally, making it easy to manipulate and filter. We remove unnecessary entries like soundtracks, demos, and DLCs to focus on actual games.
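One thing to watch with this kind of name filter: the pattern matches substrings, and the match is case-insensitive, so "OST" also matches inside ordinary words. The sketch below uses made-up game rows (not the real database) to compare the tutorial's pattern with a word-boundary variant. The variant is only one possible refinement, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical sample rows; the real table comes from steam_games.db.
games = pd.DataFrame({
    "appid": [1, 2, 3, 4],
    "name": ["Portal 2", "Portal 2 Soundtrack", "Ghostrunner", "Celeste Demo"],
})

# Pattern used in the tutorial: plain substrings, matched case-insensitively.
substring_pattern = "soundtrack|OST|demo|DLC|playtest|resource pack"
# Variant with word boundaries, so "OST" no longer matches inside "Ghostrunner".
boundary_pattern = r"\b(?:soundtrack|OST|demo|DLC|playtest|resource pack)\b"

kept_substring = games[~games["name"].str.contains(substring_pattern, case=False, na=False)]
kept_boundary = games[~games["name"].str.contains(boundary_pattern, case=False, na=False)]

print(kept_substring["name"].tolist())
print(kept_boundary["name"].tolist())
```

With these sample rows, the substring pattern also drops "Ghostrunner", while the word-boundary variant keeps it and still removes the soundtrack and demo entries.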
Step 2: Word Embeddings with BERT
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based language model from Google. Unlike static word embeddings such as Word2Vec or GloVe, which assign each word a single fixed vector, BERT produces a vector for each word that depends on the sentence around it.
What Are Word Embeddings?
Word embeddings map words or phrases from a vocabulary to vectors of real numbers. BERT’s embeddings are context-aware, meaning the word “bank” will have different embeddings in “river bank” and “bank account.”
Implementing BERT for Game Descriptions
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
# Load pre-trained BERT model and tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
# Function to get BERT embeddings
def get_embedding(text):
    # Tokenize a single string, truncating to BERT's 512-token limit
    inputs = bert_tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Mean-pool the token vectors into one vector of shape (1, 768)
    return outputs.last_hidden_state.mean(dim=1).cpu().numpy()
# Generate embeddings for all game descriptions (missing descriptions become empty strings)
embeddings = []
for description in filtered_games_df['description'].fillna(''):
    embeddings.append(get_embedding(description).flatten())
bert_item_feature_matrix = np.array(embeddings)
np.save('bert_item_feature_matrix.npy', bert_item_feature_matrix)
In the code above, we use BERT to convert game descriptions into vector representations. These vectors capture the semantic meaning of the descriptions, allowing us to compare them easily.
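The pooling step, which averages the token vectors into one vector per description, can be illustrated without loading a model. The sketch below uses a tiny hand-written array in place of BERT's last hidden state. It also shows a masked mean, which matters only if several descriptions are embedded in one padded batch. The tutorial embeds one description at a time, so the function name and the batching scenario here are assumptions for illustration.

```python
import numpy as np

def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_vectors: array of shape (batch, tokens, hidden)
    attention_mask: array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against empty sequences
    return summed / counts

# Toy "last hidden state": 2 sequences, 3 token slots, hidden size 2.
token_vectors = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(token_vectors, attention_mask)
plain_mean = token_vectors.mean(axis=1)  # would let the padding slot leak into row 0
print(pooled)
print(plain_mean)
```

For the first sequence the masked mean averages only the two real tokens, whereas the plain mean is pulled toward the padding slot's values.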
Step 3: Topic Modeling with LDA
What is LDA?
Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar. In simpler terms, LDA identifies topics within a set of documents (in our case, game reviews).
Implementing LDA for Game Reviews
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# Text preprocessing function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'\d+', '', text)      # drop digits
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace
    text = text.strip()
    return text
# Apply the clean_text function to the reviews (missing reviews become empty strings)
filtered_reviews_df['cleaned_text'] = filtered_reviews_df['review_text'].fillna('').apply(clean_text)
# Vectorize the reviews
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
reviews_vectorized = vectorizer.fit_transform(filtered_reviews_df['cleaned_text'])
# Fit the LDA model
lda_model = LatentDirichletAllocation(n_components=20, random_state=42)
lda_topic_matrix = lda_model.fit_transform(reviews_vectorized)
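# Average the review-level topic distributions per game
lda_df = pd.DataFrame(lda_topic_matrix, columns=[f'topic_{i}' for i in range(lda_topic_matrix.shape[1])])
lda_df['appid'] = filtered_reviews_df['appid'].values
# Reorder rows to match filtered_games_df (the row order of the BERT matrix);
# games without reviews get an all-zero topic vector
lda_topic_matrix_per_game = (
    lda_df.groupby('appid').mean()
    .reindex(filtered_games_df['appid'])
    .fillna(0)
    .to_numpy()
)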
Here, LDA is applied to game reviews, extracting topics that represent the major themes within the reviews. Averaging the per-review topic distributions gives each game a single topic vector that can be used alongside its BERT embedding.
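Row order matters for the next step, where the per-game topic vectors are stacked next to the BERT vectors. A groupby result is sorted by appid and contains only games that have reviews, so it may not line up with the games table. The sketch below uses made-up appids and topic values to show one way to align the two. Filling games that have no reviews with zeros is an assumption, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three games in catalogue order, reviews for only two of them.
game_appids = pd.Series([30, 10, 20], name="appid")
review_topics = pd.DataFrame({
    "appid":   [10, 10, 30],
    "topic_0": [0.8, 0.6, 0.1],
    "topic_1": [0.2, 0.4, 0.9],
})

# groupby sorts by appid and only contains games that have reviews.
per_game = review_topics.groupby("appid").mean()

# Reindex to the catalogue order; games with no reviews become all-zero rows.
aligned = per_game.reindex(game_appids).fillna(0.0).to_numpy()
print(aligned)
```

After the reindex, row i of the topic matrix describes the same game as row i of the games table, which is what a side-by-side stack requires.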
Step 4: Combining BERT and LDA
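Now that we have two sets of features, we can concatenate them so that each game is described by both its description embedding and its review topics. This requires both matrices to have one row per game, in the same order.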
# Combine BERT and LDA features
combined_feature_matrix = np.hstack((bert_item_feature_matrix, lda_topic_matrix_per_game))
np.save('combined_feature_matrix.npy', combined_feature_matrix)
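When two feature blocks are concatenated, their relative scale decides how much each one influences a similarity score. One common option, which is not part of the pipeline above, is to L2-normalize each block before stacking and to weight the blocks explicitly. The sketch below uses small stand-in matrices. The helper l2_normalize_rows and the lda_weight knob are hypothetical names introduced for illustration.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)

# Stand-ins for the two feature matrices (2 games).
bert_part = np.array([[3.0, 4.0, 0.0], [0.0, 6.0, 8.0]])
lda_part = np.array([[0.9, 0.1], [0.0, 0.0]])  # second game has no reviews

lda_weight = 1.0  # hypothetical knob: how much the topic block counts
combined = np.hstack((
    l2_normalize_rows(bert_part),
    lda_weight * l2_normalize_rows(lda_part),
))
print(combined.shape)
```

Whether this helps depends on the data, so it is worth comparing recommendations with and without normalization.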
Step 5: Building the Streamlit App
Finally, we’ll use Streamlit to create a web app that allows users to input game descriptions and get recommendations.
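import numpy as np
import pandas as pd
import streamlit as st
from sklearn.metrics.pairwise import cosine_similarity
# get_embedding (and the BERT model it uses) from Step 2 must be defined or imported in this script
@st.cache_data
def load_data():
    games_df = pd.read_csv('filtered_games_df.csv')
    combined_feature_matrix = np.load('combined_feature_matrix.npy')
    return games_df, combined_feature_matrix
def recommend_games(user_input, combined_feature_matrix, games_df, top_n=5):
    user_embedding = get_embedding(user_input)
    # A free-text query has only BERT dimensions (no reviews, so no LDA topics),
    # so compare it against the BERT columns of the combined matrix;
    # passing the full matrix would fail with a dimension mismatch
    bert_features = combined_feature_matrix[:, :user_embedding.shape[1]]
    similarities = cosine_similarity(user_embedding, bert_features)
    # Indices of the top_n most similar games, best match first
    recommendations = similarities[0].argsort()[-top_n:][::-1]
    return recommendations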
# Streamlit app
st.title("Steam Game Recommendation System")
user_input = st.text_input("Describe your ideal game:")
if user_input:
    games_df, combined_feature_matrix = load_data()
    recommendations = recommend_games(user_input, combined_feature_matrix, games_df)
    st.subheader("Top 5 Recommended Games")
    for idx in recommendations:
        game_info = games_df.iloc[idx]
        st.image(f"https://steamcdn-a.akamaihd.net/steam/apps/{game_info['appid']}/header.jpg")
        st.write(f"**{game_info['name']}**")
        st.write(f"Description: {game_info['description']}")
        st.write(f"Price: {game_info['price']}")
        st.write(f"Release Date: {game_info['release_date']}")
        st.write("---")
The recommend_games function uses cosine similarity to match the user’s input to the games in the dataset, returning the top 5 recommendations.
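The ranking step can be tried on toy numbers without BERT or Streamlit. The sketch below computes cosine similarity with plain NumPy and picks the best-matching rows. A text query only has the embedding dimensions, not the topic dimensions, so it is compared against the matching columns of the combined matrix. The function name top_n_by_cosine and all of the values are made up for illustration.

```python
import numpy as np

def top_n_by_cosine(query, item_matrix, n=2):
    """Return (indices of the n most similar rows, best first; all scores)."""
    query = query / np.linalg.norm(query)
    items = item_matrix / np.linalg.norm(item_matrix, axis=1, keepdims=True)
    scores = items @ query  # cosine similarity of each row with the query
    return np.argsort(scores)[::-1][:n], scores

# Toy combined matrix: 2 "BERT" columns followed by 1 "LDA" column.
combined = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.7, 0.7, 0.0],
])
query_embedding = np.array([1.0, 0.1])  # a text query has only the BERT columns

# Compare against the matching columns only.
bert_columns = combined[:, :query_embedding.shape[0]]
top_idx, scores = top_n_by_cosine(query_embedding, bert_columns, n=2)
print(top_idx, scores.round(3))
```

Cosine similarity depends only on direction, not vector length, so a short query and a long description can still score as close matches.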
Conclusion
In this project, we built a recommendation system using BERT for word embeddings and LDA for topic modeling, combining them to create a robust feature matrix that powers our recommendations. By deploying the model using Streamlit, we can easily make our work accessible to others.
This approach can be extended or modified to fit different types of recommendation tasks, and with the flexibility of BERT and LDA, it’s possible to adapt this framework for various domains.
Feel free to experiment with more data, tweak the models, or even incorporate additional features to improve the recommendations further!
Final Notes:
- Optimization: Consider using PCA or a similar dimensionality-reduction technique to shrink the feature matrix if similarity computation becomes too slow or memory-hungry.
- Scalability: For larger datasets or real-time applications, consider moving from the local SQLite file to a more robust backend like Google Firestore or an SQL-based cloud service.
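As a rough illustration of the PCA suggestion, the sketch below reduces a random stand-in matrix with scikit-learn's PCA. The matrix shape (768 BERT dimensions plus 20 topics) mirrors the tutorial. The random data and the choice of 50 components are placeholders and would need tuning on the real features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for combined_feature_matrix (200 games, 768 BERT + 20 LDA features).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 788))

# Keep 50 components; this number is an arbitrary placeholder to tune.
pca = PCA(n_components=50, random_state=0)
reduced = pca.fit_transform(features)

# Fraction of the original variance retained by the kept components.
retained = float(pca.explained_variance_ratio_.sum())
print(reduced.shape, round(retained, 3))
```

Any vector that is later compared against the reduced matrix has to pass through the same fitted pca.transform, so the fitted object needs to be saved along with the matrix.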