Monday, November 4, 2019

Predicting book sales prior to publication [#DH]

Wang, X., Yucesoy, B., Varol, O. et al., Success in books: predicting book sales before publication, EPJ Data Sci. (2019) 8: 31. https://doi.org/10.1140/epjds/s13688-019-0208-6.
Abstract: Reading remains a preferred leisure activity fueling an exceptionally competitive publishing market: among more than three million books published each year, only a tiny fraction are read widely. It is largely unpredictable, however, which book that will be, and how many copies it will sell. Here we aim to unveil the features that affect the success of books by predicting a book’s sales prior to its publication. We do so by employing the Learning to Place machine learning approach, which can predict sales for both fiction and nonfiction books as well as explain its predictions by comparing and contrasting each book with similar ones. We analyze features contributing to the success of a book by feature importance analysis, finding that a strong driving factor of book sales across all genres is the publishing house. We also uncover differences between genres: for thrillers and mystery, the publishing history of an author (as measured by previous book sales) is highly important, while in literary fiction and religion, the author’s visibility plays a more central role. These observations provide insights into the driving forces behind success within the current publishing industry, as well as how individuals choose what books to read.

1 Introduction

Books, important cultural products, play a big role in our daily lives: they both educate and entertain. Publishing is also big business: the industry’s revenue is projected to exceed 43 billion dollars, with more than 2.7 billion books sold every year in the United States alone [1]. Meanwhile, authors enter a very competitive marketplace: of the over three million books published in 2015 in the United States [1], only about 4000 new titles sold more than 1000 copies within a year, and only about 500 of them became New York Times bestsellers. There are more than 45,000 published authors in the US market; while most of them struggle to get published, a few of them, like J.K. Rowling, earn hundreds of millions of dollars from their books [1].

The driving forces shaping the success of books have been studied by various researchers over the years, explaining the role of writing styles [2], critics [3], book reviews [4], awards [5], advertisements [6], social networks [7], and word-of-mouth effects [8]. However, predicting book success from multiple factors has received much less attention. The only published study in this area focused on book sales in the German market; it applied a linear model [9] and reported limited accuracy.

Similar studies have focused on other cultural products, from music to movies, such as using online reviews to forecast motion picture sales [10], predicting the success of music and movie products by analyzing blogs [11], and predicting success within the fashion industry using social media such as Instagram [12]. Early prediction of success is of particular importance for cultural products. It has been studied in various papers to address market needs for introducing new products [13], to predict movie box office success using Wikipedia [14], or to detect promoted social media campaigns [15]. Yet predicting which cultural product will succeed before its release, and understanding the mechanisms behind its success or failure, remains a difficult task.

In our previous work [16], we analyzed and modeled the dynamics of book sales, identifying a series of reproducible patterns: (i) most bestsellers reach their sales peak in less than ten weeks after release; (ii) sales follow a universal “early peak, slow decay” pattern that can be described by an accurate statistical model; (iii) the formula predicted by the model helps us predict future sales. Yet, to accurately predict future sales using the model of Ref. [16], we need at least the first 25 weeks of sales after publication, a period within which most books have already reached their peak sales and started to lose momentum. Therefore, predictions derived from this statistical model, while potentially useful for long-term inventory management, are not particularly effective for foreseeing the sales potential of a new book.

In the publishing industry, limited information is available to publishers to assist their decisions on publishing (including how many copies to print, how much advance to provide, how much to invest in marketing, etc.). Currently, publishers base their decisions on the authors’ previous success, the appeal of the topic, and insights from writing samples and sales of similar books, rather than relying on data specifically linked to the book considered for publication. Early prediction of book success using the available pre-publication information could be instrumental in supporting decision makers. Indeed, we would like to predict the performance of a book prior to its publication. To offer such predictions, here we focus on variables available before the actual publication date, pertaining to the book’s author, topic and publisher, and use machine learning to unearth their predictive power. As we show, the machine learning approach we employ is able to accurately predict sales and to discover which features are the most influential in determining the sales of the book.

[...]

6 Conclusions

In this paper, our goal was to develop tools capable of predicting a book’s sales prior to the book’s publication, helping us understand what factors contribute to the success of a book. To do that, we first extracted the pertinent features of each book, focusing on those that are available to readers before or at publication, and employed a new machine-learning approach, Learning to Place, which solves the prediction problem of heavy-tailed outcome distributions [30].

We extracted features from three categories: author, book and publisher. For the author feature group, we measure the visibility and the previous sales of an author; for the book feature group, we consider the genre, topic and publication month of the book; and for the publisher, we measure the reputation of the publisher.

An important challenge of our prediction task is that we have far more low-selling books than high-selling books; therefore, traditional methods like Linear Regression systematically underpredict high-selling books. We employed the Learning to Place algorithm to correct this limitation. For this, we first obtain the pairwise preferences between books, and use them to assign the place of the book relative to other books and obtain its sales prediction. Similar pairwise relations have been used to rank items using tournament graphs [34], to infer the fitness of each instance [35], and to optimize constraints of pairwise relations [36]. Our task, however, is to accurately estimate book sales rather than only to rank books. We found that with our Learning to Place algorithm, we can predict the sales of fiction and nonfiction fairly accurately, and that, compared to Linear Regression and k-nearest neighbors, the algorithm does not suffer from systematic underprediction for high-selling books.
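The two steps described above (pairwise preferences, then placement) can be illustrated with a minimal sketch. This is a simplified illustration and not the authors' implementation: the function name `place_by_votes`, the `sells_more` callable standing in for a trained pairwise classifier, the dictionary keys, and the choice of the midpoint between neighbouring books as the sales estimate are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence


def place_by_votes(
    new_item: Dict[str, float],
    training_items: Sequence[Dict[str, float]],
    sells_more: Callable[[Dict[str, float], Dict[str, float]], bool],
) -> float:
    """Estimate sales of new_item by placing it among books with known sales.

    sells_more(a, b) is a hypothetical stand-in for a trained pairwise
    classifier; it returns True when it predicts that a outsells b.
    """
    ranked = sorted(training_items, key=lambda b: b["sales"])
    n = len(ranked)
    # votes[i] is the support for the slot just below ranked[i];
    # votes[n] is the slot above the best-selling training book.
    votes = [0] * (n + 1)
    for i, other in enumerate(ranked):
        if sells_more(new_item, other):
            # Predicted to outsell `other`: vote for every slot above it.
            for slot in range(i + 1, n + 1):
                votes[slot] += 1
        else:
            # Predicted to sell less: vote for every slot at or below it.
            for slot in range(0, i + 1):
                votes[slot] += 1
    best_slot = max(range(n + 1), key=lambda s: votes[s])
    # Assumption: use the midpoint of the neighbouring books' sales,
    # clamping to the lowest or highest known sales at the ends.
    lower = ranked[best_slot - 1]["sales"] if best_slot > 0 else ranked[0]["sales"]
    upper = ranked[best_slot]["sales"] if best_slot < n else ranked[-1]["sales"]
    return (lower + upper) / 2


# Toy data: "score" is a made-up feature used only by the stand-in comparator.
books = [
    {"score": 1.0, "sales": 100.0},
    {"score": 2.0, "sales": 1000.0},
    {"score": 3.0, "sales": 10000.0},
    {"score": 4.0, "sales": 100000.0},
]


def by_score(a: Dict[str, float], b: Dict[str, float]) -> bool:
    return a["score"] > b["score"]


estimate = place_by_votes({"score": 3.5}, books, by_score)
```

Because the estimate is read off the sales of the neighbouring books rather than fitted as a conditional mean, a book placed near the top of the ranking receives a correspondingly high sales estimate, which is the intuition behind avoiding underprediction in the heavy tail.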

The developed framework also allows us to understand the features driving the book sales. We found that for both fiction and nonfiction, the publisher quality and experience is the most important feature, due to the fact that the publisher both pre-selects and advertises the book. Previous publishing history and visibility of the author are very important as well since readers are more likely to read books written by experienced authors or celebrities. The genre, topic and publication month of the book, however, have only limited influence on the sales of the book.

We also found that feature importance differs slightly across genres. For Thrillers and Mystery & Detective, the author’s visibility and previous sales are more important than in other fiction genres. In nonfiction genres, Biography relies more on visibility than on previous sales, while the opposite holds for History. Using the ternary plot, we also find that author and publisher are very important for most books, and that for most high-selling books, author, publisher and book contribute equally to the sales.
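As a reminder of how a ternary plot represents three contributions, the sketch below normalizes three non-negative values so that they sum to one and maps them to a point inside an equilateral triangle. The function name `ternary_point`, the corner layout, and the idea of passing one number per feature group are assumptions for illustration; the excerpt does not specify how the per-book contributions were computed.

```python
import math
from typing import Tuple


def ternary_point(author: float, publisher: float, book: float) -> Tuple[float, float]:
    """Map three non-negative contributions to 2D ternary-plot coordinates.

    Assumed corner layout: author at (0, 0), publisher at (1, 0),
    book at (0.5, sqrt(3) / 2).
    """
    total = author + publisher + book
    if total <= 0:
        raise ValueError("contributions must sum to a positive number")
    # Normalize so that the three shares sum to 1.
    p = publisher / total
    b = book / total
    x = p + 0.5 * b
    y = (math.sqrt(3) / 2) * b
    return x, y


# Equal contributions land at the centre of the triangle.
centre = ternary_point(1.0, 1.0, 1.0)
```

A book dominated by one feature group lands near that group's corner, while a book whose three groups contribute equally lands at the centre of the triangle.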

We expect our methodology and findings to serve as a starting point towards a better understanding of the mechanisms driving the publishing industry and reader preferences. We hope that our research will inspire more investigation into the success of books and authors, helping us to create a more innovative, predictive and profitable environment for authors as well as for the publishing industry.

See also Chaos in the Movie Biz: A Review of Hollywood Economics [#DH].
