Over the last decade, the renaissance of Web technologies has transformed the online world into an application (App) driven society. While the abundant Apps provide great convenience, their sheer number also leads to severe information overload, making it difficult for users to identify desired Apps. To alleviate this information overload, recommender systems have been proposed and deployed in the online App domain. However, existing work on App recommendation has largely focused on a single platform (e.g., smartphones), while ignoring the rich and relevant data of other platforms (e.g., tablets and computers).
In this paper, we tackle the problem of cross-platform App recommendation, aiming to leverage users' and Apps' data on multiple platforms to enhance the recommendation accuracy achievable on any single platform. One key advantage of our proposal is that, by leveraging multi-platform data, the perpetual issues in personalized recommender systems -- data sparsity and cold start -- can be largely alleviated. To this end, we propose a sound and principled solution, STAR (short for ''croSs-plaTform App Recommendation''), that integrates both the numerical ratings and the textual content from multiple platforms. In STAR, we innovatively represent an App as an aggregation of the common features shared across platforms (e.g., the App's functionalities) and the specific features dependent on the platform of release. In light of this, STAR can discriminate a user's preference for an App by separating the user's interest into two parts: interest in the App's inherent factors and interest in the platform-relevant features. To evaluate our proposal, we construct two real-world datasets crawled from the App stores of iPhone, iPad, and iMac. Through extensive experiments on these datasets, we show that our STAR method consistently outperforms highly competitive recommendation methods, justifying the rationality of our cross-platform App recommendation proposal and the soundness of our solution.
We constructed two datasets -- one covering two platforms and the other covering three -- to evaluate our method. The first dataset is constructed from the iPhone and iPad platforms. The name of an App on iPad usually contains the word ''HD'', which stands for high definition. We therefore first chose the Apps whose names contain ''HD'', and then found their corresponding versions on iPhone. We found that 3,800 pairs of Apps exist on both platforms. Since most users use the same Apple account to download Apps on both iPhone and iPad, we could identify the same users by matching user IDs across the two platforms. We further processed the dataset by retaining users who rated at least once on both platforms. Ultimately, we obtained 112,024 users, 2,704 pairs of Apps, and 320,535 ratings (168,489 on iPhone and 152,046 on iPad). The user-App rating matrix has a sparsity of 99.95%. The ratings span from September 13th, 2008 to October 24th, 2015.
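The name-based pairing described above can be sketched as follows. This is a minimal illustration of the heuristic, not the actual crawling code; the helper name and the toy App names are our own and do not come from the released datasets.

```python
def match_hd_pairs(iphone_apps, ipad_apps):
    """Pair each iPad App whose name contains 'HD' with the iPhone App
    whose name matches once the 'HD' marker is stripped.

    Each input is a list of (app_id, name) tuples; the result is a list
    of (iphone_app_id, ipad_app_id) pairs.
    """
    iphone_by_name = {name.strip().lower(): app_id for app_id, name in iphone_apps}
    pairs = []
    for app_id, name in ipad_apps:
        if "HD" not in name:
            continue  # only iPad Apps marked as high definition are candidates
        base = name.replace("HD", "").strip().lower()
        base = " ".join(base.split())  # collapse doubled spaces left by the removal
        if base in iphone_by_name:
            pairs.append((iphone_by_name[base], app_id))
    return pairs

# Toy example with hypothetical App names:
iphone = [(1, "Doodle Jump"), (2, "Weather Now")]
ipad = [(10, "Doodle Jump HD"), (11, "Sketch Pad HD")]
print(match_hd_pairs(iphone, ipad))  # → [(1, 10)]
```

A real pipeline would also need to handle naming variants (e.g., ''HD'' appearing mid-name or punctuation differences), which this sketch ignores.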
The second dataset is constructed from the iPhone, iPad, and iMac platforms. Since the Apps in the iTunes Store and the Mac App Store share the same name, they can be easily linked. We used the same method as for the first dataset to link the Apps on the iPhone and iPad platforms. In this way, we obtained 260 triples of Apps that exist on all three platforms. We then selected users who had at least two ratings over the 260 triples, and ultimately obtained 121,905 users (102,789 who rated on one platform, 18,960 on two platforms, and 156 on all three platforms), 201 triples of Apps, and 268,929 ratings (224,591 on iPhone, 41,530 on iPad, and 2,808 on iMac). The user-App rating matrix has a sparsity of 99.63%. The ratings span from January 27th, 2008 to November 16th, 2015. The datasets can be downloaded from the following links.
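As a sanity check, the reported sparsity figures follow directly from the counts above, treating each (user, App, platform) cell as a potential rating:

```python
def sparsity(num_ratings, num_users, num_apps, num_platforms):
    """Fraction of empty cells in the user-App rating matrix, where each
    (user, App, platform) combination is one potential rating."""
    total_cells = num_users * num_apps * num_platforms
    return 1.0 - num_ratings / total_cells

# iPhone-iPad dataset: 112,024 users, 2,704 App pairs, 320,535 ratings
s1 = sparsity(320_535, 112_024, 2_704, 2)
# iPhone-iPad-iMac dataset: 121,905 users, 201 App triples, 268,929 ratings
s2 = sparsity(268_929, 121_905, 201, 3)
print(f"{s1:.2%}, {s2:.2%}")  # → 99.95%, 99.63%
```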
We conducted extensive experiments on our two collected datasets aiming to answer the following five research questions:
RQ1. How does our designed STAR approach perform as compared with other state-of-the-art recommendation algorithms?
RQ2. How does STAR perform in handling the new-user and new-App cold-start problems?
RQ3. Do users exhibit distinct preferences for different platforms of an App? Is STAR able to target the exact platform of an App that the user has rated?
RQ4. How do the common features and specific features of Apps contribute to the overall effectiveness of STAR?
RQ5. Beyond rating prediction, the prevalent evaluation for a recommendation algorithm, how does STAR perform on the more practical top-N recommendation task?
Overall Performance Comparisons (RQ1)
We carried out experiments on the iPhone-iPad and iPhone-iPad-iMac datasets, respectively. The experimental results were based on five-fold cross-validation. To demonstrate the overall effectiveness of our proposed STAR model, we compared STAR with several state-of-the-art recommendation approaches: 1) SVD++; 2) RMR; 3) FM; and 4) CMF. The code can be downloaded from the following links.
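A five-fold cross-validation split over rating records can be sketched as below; the (user, App, platform, rating) record layout is our assumption for illustration and is not tied to the released data format.

```python
import random

def five_fold_splits(ratings, seed=42):
    """Yield five (train, test) splits over rating records, where each
    record is held out in exactly one fold."""
    records = list(ratings)
    rng = random.Random(seed)  # fixed seed so the folds are reproducible
    rng.shuffle(records)
    k = 5
    for fold in range(k):
        test = records[fold::k]  # every k-th shuffled record, offset by fold
        train = [r for i, r in enumerate(records) if i % k != fold]
        yield train, test

# Toy data: 4 users x 5 Apps x 2 platforms = 40 rating records
data = [(u, a, p, 5.0) for u in range(4) for a in range(5)
        for p in ("iPhone", "iPad")]
for train, test in five_fold_splits(data):
    print(len(train), len(test))  # → 32 8 (printed five times)
```

A metric such as RMSE would be computed on each held-out fold and averaged over the five folds.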
Handling Cold-Start Problems (RQ2)
The new-user cold-start problem refers to an existing user appearing on a new platform. We conducted this experiment on the iPhone-iPad-iMac dataset. There are 19,116 users who have rated on at least two platforms, accounting for 15.68% of the 121,905 users. For each of these users, we first randomly selected one platform on which he/she has rating records, and removed the ratings on that platform. Thereafter, we used the remaining ratings for training and the removed ratings for testing. We repeated this experimental setting five times and reported the average results. Since SVD++ regards the same user on different platforms as different users, it cannot handle the new-user cold-start problem. We therefore compared our STAR method with three approaches: 1) RMR; 2) FM; and 3) CMF.
The new-App cold-start problem refers to cases where developers release an existing App on a new platform and the App has no ratings there yet. We carried out experiments on the iPhone-iPad-iMac dataset, using App descriptions as the textual content. Each App exists on all three platforms. For each App, we first randomly selected one of its platforms and removed the ratings on that platform. The remaining ratings of each App were used for training, and the removed ratings for testing. We repeated this experimental setting five times and reported the average results. Both RMR and CTR are semantics-enhanced techniques, but CTR is designed to handle the out-of-matrix cold-start problem, which matches our scenario; we therefore used CTR for the new-App cold-start comparison. FM treats the platform as a context, and the context information for a new App can be obtained from other Apps. CMF regards an App on all platforms as a single item, so it has sufficient information about the App from the known platforms. In brief, we compared our STAR method with: 1) CTR; 2) FM; and 3) CMF.
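The hold-out protocol used in both cold-start experiments (removing all of a user's or App's ratings on one randomly chosen platform) can be sketched for the new-user case as follows; the record layout is again an illustrative assumption.

```python
import random
from collections import defaultdict

def new_user_split(ratings, seed=0):
    """For every user with ratings on >= 2 platforms, hold out all ratings
    on one randomly chosen platform, simulating the user being 'new' there.

    `ratings` is an iterable of (user, app, platform, rating) tuples.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for rec in ratings:
        by_user[rec[0]].append(rec)
    train, test = [], []
    for user, recs in by_user.items():
        platforms = sorted({r[2] for r in recs})
        if len(platforms) < 2:
            train.extend(recs)  # single-platform users stay in training
            continue
        held_out = rng.choice(platforms)
        for r in recs:
            (test if r[2] == held_out else train).append(r)
    return train, test
```

Repeating this with five different seeds and averaging, as described above, reduces the variance introduced by the random platform choice. The new-App variant is analogous, grouping by App instead of by user.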
User Preference on App-Platform (RQ3)
In our datasets, we only know that a user rated an App on a specific platform. However, we are not sure whether the user prefers the App on the rated platform over the other platforms, or whether it is simply the first platform on which the user encountered the App. To answer this question, we explored not only the prediction of a user's rating of an App on the current platform (i.e., the platform on which the target user rated it), but also the prediction of the ratings on the other platforms. If the rating prediction for the current platform is statistically significantly higher than those for the other platforms, it demonstrates that the current platform is favoured by the users.
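One simple way to operationalize this significance check is a paired t-statistic over per-rating prediction pairs, sketched below; the toy numbers are illustrative, and the source does not specify which statistical test was used.

```python
import math
from statistics import mean, stdev

def paired_t(current, other):
    """Paired t-statistic testing whether predicted ratings on the platform
    a user actually rated exceed those predicted for another platform.

    `current[i]` and `other[i]` are the two predictions for the same
    (user, App) pair; a large positive t suggests a platform preference.
    """
    diffs = [c - o for c, o in zip(current, other)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Toy predictions for four (user, App) pairs:
t = paired_t([4.5, 4.0, 4.8, 3.9], [4.2, 3.8, 4.5, 3.9])
```

The resulting t would then be compared against the critical value for n - 1 degrees of freedom to decide significance.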
Justification of Common Features and Specific Features (RQ4)
In the overall performance comparison, STAR outperforms CMF on both datasets, which demonstrates the importance of capturing the common features and specific features of Apps across different platforms. To further understand the influence of the common and specific features of the App latent factors, we performed experiments that remove the common features and the specific features separately. For convenience, we use cSTAR to denote ''STAR with common features only'', and sSTAR to denote ''STAR with specific features only''.
Conclusion and Future Work
In this paper, we presented a novel cross-platform App recommendation framework that jointly models numerical ratings and textual content. In particular, it improves rating prediction by capturing the common features and distinguishing the specific features of Apps across multiple platforms. It alleviates the data sparsity problem by enforcing common-feature sharing across platforms, which is especially beneficial to unpopular platforms. Meanwhile, it is able to address the new-user and new-App cold-start problems. To validate the effectiveness of our approach, we constructed two benchmark datasets. Experimental results on these datasets have demonstrated the advantages of our work. We also performed micro-analyses to show how our method targets particular platforms of Apps and how the common and specific features of Apps affect the results.
In the future, we plan to extend our work in two aspects: 1) Modeling user preferences on different platforms. Ample implicit feedback (e.g., browsing history and purchase history) has accumulated on different platforms, and it is beneficial for building user preferences on a specific platform. 2) Training the topic model and the latent factor method in a unified framework, so that their performance can be mutually reinforced.
Evaluation of the Top-N Recommendation (RQ5)
The optimization of recommender systems has long been divided into rating prediction and top-N recommendation, which leads to two branches of evaluation metrics -- error-based (e.g., MAE and RMSE) and accuracy-based (e.g., recall and NDCG). According to the conclusion drawn in [Cremonesi et al. 2010], there is no monotonic relation between error metrics and accuracy metrics; even though a method achieves a lower error rate in rating prediction, it does not necessarily outperform other algorithms in top-N recommendation. One key insight lies in the modeling of missing data, which is crucial for a model to rank unconsumed items well for users [He et al. 2016]. Since rating prediction models [Hu et al. 2008; Koren 2010] account for observed entries only and forgo the missing data, they may be suboptimal for the top-N task. As such, we made some adjustments to STAR to apply it to the top-N task.
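The two metric families can be sketched as follows: `rmse` scores predicted rating values against the ground truth, while `recall_at_n` scores the ranked list of recommended items, which is why improvements on one need not carry over to the other.

```python
import math

def rmse(preds, truths):
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truths)) / len(preds))

def recall_at_n(ranked_items, relevant_items, n):
    """Fraction of a user's relevant items that appear in the top-n list."""
    hits = sum(1 for item in ranked_items[:n] if item in relevant_items)
    return hits / len(relevant_items)

# Toy example: the error metric looks at values, the accuracy metric at ranks.
print(round(rmse([4.1, 3.2], [4.0, 3.0]), 3))        # → 0.158
print(recall_at_n(["a", "b", "c"], {"a", "c"}, 2))   # → 0.5
```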