MLRecommender in practice

I was researching how to make personalized recommendations for my app “Code Conf”. It turned out that I can use machine learning for this, specifically the MLRecommender model from Apple’s Create ML.

To train this model you need two or three input columns (users, items and, optionally, ratings), and as output you get a set of recommendations based on that data.

It looks very easy, and it is, once you know what I know now after almost three weeks of research.

An MLRecommender model can be trained in two ways:

using only the relation between users and items
using that relation plus the users’ ratings of the items
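To make the difference between the two ways concrete, here is a minimal sketch of what the two training files could look like. It uses only Python's standard library; the short IDs (`u1`, `a`, ...) and the helper name `to_csv` are made up for illustration and are not part of Create ML.

```python
import csv
import io

# Hypothetical sample rows in the same layout as the article's data.
rows = [
    {"user_id": "u1", "item_id": "a", "score": 6},
    {"user_id": "u2", "item_id": "a", "score": 1},
    {"user_id": "u2", "item_id": "b", "score": 35},
]


def to_csv(rows, columns):
    """Serialize only the chosen columns of each row to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Way 1: only the user-item relation (two input columns).
implicit_csv = to_csv(rows, ["user_id", "item_id"])

# Way 2: the relation plus the user's rating (three input columns).
explicit_csv = to_csv(rows, ["user_id", "item_id", "score"])

print(implicit_csv)
print(explicit_csv)
```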

My data set, dumped from the database, looks like this:

user_id,                              item_id,                              score
f69df0fd-3b7e-489d-9197-28d94be3d281, 53fb60b1-7d6c-473f-91bf-42fd670ae055, 6
d1d1dad9-af15-4bc6-9066-5bc39a830eb0, 10034e91-1698-4f16-9cc2-483aa2e84372, 1
d1d1dad9-af15-4bc6-9066-5bc39a830eb0, 396a1d47-ca8f-4189-8526-85e40875c363, 35
8730655f-b7b9-4d36-a4c2-f48e866e4533, 53fb60b1-7d6c-473f-91bf-42fd670ae055, 1
d1d1dad9-af15-4bc6-9066-5bc39a830eb0, 4827fb22-4bd1-47f8-aee4-fb45b8900cb6, 1
d1d1dad9-af15-4bc6-9066-5bc39a830eb0, be9d114d-2a16-4e24-b142-44fc97351cc6, 1
d1d1dad9-af15-4bc6-9066-5bc39a830eb0, d8ab8d67-7efa-4370-b7d0-6d176c81901f, 1
9155b83d-d443-46d8-a24d-cff329eb0d07, 73bfd56b-b799-43cd-b17b-4ef259d18fcc, 35
d1d1dad9-af15-4bc6-9066-5bc39a830eb0, 53fb60b1-7d6c-473f-91bf-42fd670ae055, 1
d1d1dad9-af15-4bc6-9066-5bc39a830eb0, 22ca0dc8-1607-4f48-bef3-84a267607cf5, 1

So here I have a table of user_ids and item_ids, both of type UUID, and a score, which is an Int but could also be a Double.

If you look closer at this data you will see that some users rated only a few items and some rated more. Likewise, some items were rated by only one user and others by several.

It looks as if this example data should be sufficient to train a model. As I have learned, it is not. Every time I tried to train my model I got this error:
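A quick way to see this unevenness is to count the ratings per user and per item. The sketch below uses the ten rows from the dump above, with every UUID shortened to its first eight characters for readability (that shortening is mine); it needs nothing beyond the standard library.

```python
from collections import Counter

# The ten rows from the dump above; UUIDs shortened to their first
# eight characters for readability.
ratings = [
    ("f69df0fd", "53fb60b1", 6),
    ("d1d1dad9", "10034e91", 1),
    ("d1d1dad9", "396a1d47", 35),
    ("8730655f", "53fb60b1", 1),
    ("d1d1dad9", "4827fb22", 1),
    ("d1d1dad9", "be9d114d", 1),
    ("d1d1dad9", "d8ab8d67", 1),
    ("9155b83d", "73bfd56b", 35),
    ("d1d1dad9", "53fb60b1", 1),
    ("d1d1dad9", "22ca0dc8", 1),
]

# How many items each user rated, and how many users rated each item.
ratings_per_user = Counter(user for user, _item, _score in ratings)
ratings_per_item = Counter(item for _user, item, _score in ratings)

print(ratings_per_user.most_common())
print(ratings_per_item.most_common())
```

In this sample one user accounts for seven of the ten ratings, and only one item was rated by more than one user.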

Training Error: Item IDs in the recommender model must be numbered 0, 1, ..., num_items - 1

A very “informative” error that tells you nothing about what is actually going on.

Item IDs have to be numbered? Hey … but they are allowed to be of type String.

It turned out that the MLRecommender model wants normalised data as input. What does that mean?

It means that you have to provide data in which every item is rated by every user. You may ask: how do I force every user to rate every item to make the data set normalised?

That is hardly possible, so we have to go another way. My example data contains a small number of users and items, so such a data set could be created by hand. It is not very productive, but it is feasible for testing purposes. What if the data set contains hundreds of users and thousands of items? Good luck doing that manually.

Fortunately there is a tool that can transform the data from a CSV file and update it to suit the MLRecommender requirements.

This tool is called pandas, a Python module for transforming data sets. I have only a very brief idea of how it works, so I won’t explain it in depth here, but I will share the script that works for me.
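To get a feeling for how much is missing, the gaps can be listed as every user and item combination that has no rating yet. This is only a sketch on hypothetical toy data (the function name `missing_pairs` and the short IDs are assumptions for illustration), using the standard library.

```python
from itertools import product


def missing_pairs(ratings):
    """Return every (user, item) combination that has no rating yet."""
    users = sorted({user for user, _item, _score in ratings})
    items = sorted({item for _user, item, _score in ratings})
    rated = {(user, item) for user, item, _score in ratings}
    return [pair for pair in product(users, items) if pair not in rated]


# Hypothetical toy data: 2 users, 3 items, 3 real ratings.
toy_ratings = [("u1", "a", 6), ("u1", "b", 1), ("u2", "c", 35)]

gaps = missing_pairs(toy_ratings)
print(len(gaps), "of", 2 * 3, "pairs are missing:", gaps)
```

For the sample dump above, with 4 users and 8 items, the full set has 32 pairs, of which only 10 are rated, so 22 rows would have to be added.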

To install pandas use Python’s pip command:

$ pip install pandas

Here is my pandas script, with comments on what is going on. It is probably not the most efficient one, but I have zero experience with writing Python code, so once it returned a valid data set I left it as is.

import csv
import pandas as pd

# 1. Read input CSV file
ratings = 'x7.csv'
ratings_df = pd.read_csv(ratings)

# 2. get unique user_id's and item_id's
item_ids = ratings_df.item_id.unique()
user_ids = ratings_df.user_id.unique()

# 3. make a working copy of the ratings dataset
new_ratings_df = ratings_df.copy()

# 4. iterate over items and add a placeholder rating from every user
for item_id in item_ids:
    mock = pd.DataFrame({'item_id': item_id, 'user_id': user_ids, 'score': 0.000001})
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    new_ratings_df = pd.concat([new_ratings_df, mock])

# 5. iterate over users and add a placeholder rating for every item
for user_id in user_ids:
    mock = pd.DataFrame({'item_id': item_ids, 'user_id': user_id, 'score': 0.000001})
    new_ratings_df = pd.concat([new_ratings_df, mock])

# 6. drop all duplicates and leave only first values (the original ratings + new added where they were missing)
new_ratings_df.drop_duplicates(subset=['user_id', 'item_id'], keep='first', inplace=True)

# 7. sort by items column
new_ratings_df = new_ratings_df.sort_values(by=['item_id'], ignore_index=True)

# 8. export data to new CSV file
new_ratings_df.to_csv('new_ratings.csv', quoting=csv.QUOTE_NONNUMERIC, index=False)
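For comparison, the same fill-in can probably be done without loops by building every user and item combination up front and reindexing against it. This is a sketch of an alternative, not the script I actually ran: the function name `fill_missing_pairs` and the toy data are assumptions, and it relies on pandas' `MultiIndex.from_product` and `Series.reindex`.

```python
import pandas as pd

PLACEHOLDER = 0.000001  # same dummy score as in the script above


def fill_missing_pairs(ratings_df, placeholder=PLACEHOLDER):
    """Return a frame with one row for every (user_id, item_id) combination."""
    # Keep the first real rating if a pair appears more than once.
    deduped = ratings_df.drop_duplicates(subset=['user_id', 'item_id'], keep='first')

    # Every possible (item, user) pair, ordered by item and then by user.
    full_index = pd.MultiIndex.from_product(
        [sorted(deduped['item_id'].unique()), sorted(deduped['user_id'].unique())],
        names=['item_id', 'user_id'],
    )

    # Existing ratings are kept; pairs without a rating get the placeholder.
    filled = (
        deduped.set_index(['item_id', 'user_id'])['score']
        .astype(float)
        .reindex(full_index, fill_value=placeholder)
        .reset_index()
    )

    # Restore the column order of the original CSV.
    return filled[['user_id', 'item_id', 'score']]


# Hypothetical toy data: 2 users, 3 items, 3 real ratings.
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'item_id': ['a', 'b', 'c'],
    'score': [6, 1, 35],
})

full = fill_missing_pairs(toy)
print(full)
```

The result can be written out with `to_csv` exactly as in step 8.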

You have probably noticed the score I added to this dummy data. In my use case I am working with ratings from 1 to 40, and the dummy rating must not interfere with real user ratings. This is why it’s 0.000001.

Be warned: you cannot use 0 as a rating value. The rating has to be greater than zero to meet the MLRecommender requirements. If there is only one thing you remember from this article, make it this.

Now, when we run the pandas script on the original data set, it gives us this working data:
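A small guard run before the export can catch a zero or negative score early. This is a minimal sketch; the function name `validate_scores` and the `lowest_real` parameter are my assumptions, with the default of 1 taken from the 1 to 40 rating range mentioned above.

```python
PLACEHOLDER = 0.000001  # dummy score used for the missing pairs


def validate_scores(scores, placeholder=PLACEHOLDER, lowest_real=1):
    """Raise ValueError if the placeholder or any score breaks the > 0 rule."""
    if placeholder <= 0:
        raise ValueError("placeholder must be greater than zero")
    if placeholder >= lowest_real:
        raise ValueError("placeholder must stay below the lowest real rating")
    non_positive = [score for score in scores if score <= 0]
    if non_positive:
        raise ValueError(f"scores must be greater than zero, got {non_positive}")


# Passes silently: real ratings plus the placeholder, all above zero.
validate_scores([1.0, 35.0, PLACEHOLDER])
print("all scores are valid")
```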

"user_id"                             , "item_id"                             , "score"
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "10034e91-1698-4f16-9cc2-483aa2e84372", 1.0
"9155b83d-d443-46d8-a24d-cff329eb0d07", "10034e91-1698-4f16-9cc2-483aa2e84372", 1e-06
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "10034e91-1698-4f16-9cc2-483aa2e84372", 1e-06
"f69df0fd-3b7e-489d-9197-28d94be3d281", "10034e91-1698-4f16-9cc2-483aa2e84372", 1e-06
"9155b83d-d443-46d8-a24d-cff329eb0d07", "22ca0dc8-1607-4f48-bef3-84a267607cf5", 1e-06
"f69df0fd-3b7e-489d-9197-28d94be3d281", "22ca0dc8-1607-4f48-bef3-84a267607cf5", 1e-06
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "22ca0dc8-1607-4f48-bef3-84a267607cf5", 1e-06
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "22ca0dc8-1607-4f48-bef3-84a267607cf5", 1.0
"9155b83d-d443-46d8-a24d-cff329eb0d07", "396a1d47-ca8f-4189-8526-85e40875c363", 1e-06
"f69df0fd-3b7e-489d-9197-28d94be3d281", "396a1d47-ca8f-4189-8526-85e40875c363", 1e-06
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "396a1d47-ca8f-4189-8526-85e40875c363", 1e-06
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "396a1d47-ca8f-4189-8526-85e40875c363", 35.0
"9155b83d-d443-46d8-a24d-cff329eb0d07", "4827fb22-4bd1-47f8-aee4-fb45b8900cb6", 1e-06
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "4827fb22-4bd1-47f8-aee4-fb45b8900cb6", 1.0
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "4827fb22-4bd1-47f8-aee4-fb45b8900cb6", 1e-06
"f69df0fd-3b7e-489d-9197-28d94be3d281", "4827fb22-4bd1-47f8-aee4-fb45b8900cb6", 1e-06
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "53fb60b1-7d6c-473f-91bf-42fd670ae055", 1.0
"9155b83d-d443-46d8-a24d-cff329eb0d07", "53fb60b1-7d6c-473f-91bf-42fd670ae055", 1e-06
"f69df0fd-3b7e-489d-9197-28d94be3d281", "53fb60b1-7d6c-473f-91bf-42fd670ae055", 6.0
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "53fb60b1-7d6c-473f-91bf-42fd670ae055", 1.0
"f69df0fd-3b7e-489d-9197-28d94be3d281", "73bfd56b-b799-43cd-b17b-4ef259d18fcc", 1e-06
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "73bfd56b-b799-43cd-b17b-4ef259d18fcc", 1e-06
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "73bfd56b-b799-43cd-b17b-4ef259d18fcc", 1e-06
"9155b83d-d443-46d8-a24d-cff329eb0d07", "73bfd56b-b799-43cd-b17b-4ef259d18fcc", 35.0
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "be9d114d-2a16-4e24-b142-44fc97351cc6", 1.0
"f69df0fd-3b7e-489d-9197-28d94be3d281", "be9d114d-2a16-4e24-b142-44fc97351cc6", 1e-06
"9155b83d-d443-46d8-a24d-cff329eb0d07", "be9d114d-2a16-4e24-b142-44fc97351cc6", 1e-06
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "be9d114d-2a16-4e24-b142-44fc97351cc6", 1e-06
"d1d1dad9-af15-4bc6-9066-5bc39a830eb0", "d8ab8d67-7efa-4370-b7d0-6d176c81901f", 1.0
"f69df0fd-3b7e-489d-9197-28d94be3d281", "d8ab8d67-7efa-4370-b7d0-6d176c81901f", 1e-06
"8730655f-b7b9-4d36-a4c2-f48e866e4533", "d8ab8d67-7efa-4370-b7d0-6d176c81901f", 1e-06
"9155b83d-d443-46d8-a24d-cff329eb0d07", "d8ab8d67-7efa-4370-b7d0-6d176c81901f", 1e-06

With this file exported we can train our MLRecommender model using the Create ML app, new in Xcode 11.

Create a new project, choose the Recommender model type and then use the new normalised data set to train the model. Once it is trained you can copy it to your app and start using it.

Here I want to say thank you to the person on StackOverflow who was struggling with the same error and who helped me find a solution. It’s mpmontanez and his topic on StackOverflow: https://stackoverflow.com/questions/62270353

I also want to thank the Apple engineers on the forums and on Twitter who took an interest in my case. They did not provide the solution above, but I know that this issue will probably be fixed in the new macOS Big Sur.

Other useful resources:

My feedback number for this issue is FB7854032; there is also an Apple Developer Forums topic.

If you have any questions, or something about my MLRecommender journey is not clear enough, feel free to ask me on Twitter.

Spaces were added after the commas in the CSV data to make the article easier to read.

- Jul 8, 2020 | Paweł Madej

Licensing: Content Code Apache 2.0
