Predicting the Success of Online Petitions Leveraging Multidimensional Time-Series

Applying classical time-series analysis techniques to online content is challenging, as web data tends to have data quality issues and is often incomplete, noisy, or poorly aligned. In this paper, we tackle the problem of predicting the evolution of a time series of user activity on the web in a manner that is both accurate and interpretable, using related time series to produce a more accurate prediction. We test our methods in the context of predicting signatures for online petitions using data from thousands of petitions posted on The Petition Site, one of the largest platforms of its kind. We observe that the success of these petitions is driven by a number of factors, including promotion through social media channels and on the front page of the petitions platform. We propose an interpretable model that incorporates seasonality, aging effects, self-excitation, and external effects. The interpretability of the model is important for understanding the elements that drive the activity around a piece of online content. We show through an extensive empirical evaluation that our model is significantly better at predicting the outcome of a petition than state-of-the-art techniques.


INTRODUCTION
The ability to predict user activity or engagement on the web has many applications in a wide range of domains. These include predicting the number of people who will install an application from an app marketplace, buy a product from an online retailer, or participate in an e-government action. Ideally, a forecast of user involvement should be generated as early as possible, in a manner that is both accurate and interpretable. Interpretability matters because knowing which elements drive predictions up or down as a process unfolds makes it possible to take corrective actions. The problem of generating early, accurate, and interpretable predictions on the web challenges our understanding of complex interactions over time, and is further complicated by the presence of multiple confounders. Web data is almost invariably noisy and incomplete, and often comes from several heterogeneous sources. Additionally, and despite recent advances in empirical methods for predicting information dissemination [35,39], we lack a general parametric modeling framework to predict user involvement in a reinforced process, such as a petition accompanied by an active campaign to gather signatures by mobilizing people online. Such a campaign might involve several sources of reinforcement: social media, traditional news media, and word-of-mouth or viral advertising.
In this paper, we present new forecasting models for online content dissemination that are able to take into account several elements: seasonality, aging effects, self-excitation, and external influence (e.g., in the form of social media postings). Our main contribution, beyond presenting a combined parametric model that has better predictive power than the state of the art, is being able to incorporate a time series of related observations to produce a more accurate and earlier prediction, and to further enhance the interpretability of the results.
We evaluate our models by using them to predict the number of signatures an online petition will gather over time. Online petitions are representative of a broad class of online phenomena involving active public mobilization, and thus represent a relevant scenario for testing our methods. The setting we consider might generalize to the active spread of ideas or memes, in the sense that it goes beyond passive diffusion. People promoting online petitions and people who sign petitions tend to actively encourage others to sign, instead of passively expecting that people simply learn about these petitions through a contagion process. Often, such promoters use external platforms for dissemination; thus, it is crucial to capture the external signals.
Our contributions. In this work, we present models for user behavior with respect to online petitions. We make the following contributions:
• we analyze thousands of online petitions from one of the largest petitions sites on the web (Sections 3, 4);
• we present a model to predict user involvement in a reinforced manner combining seasonality, aging, self-excitation, and external evidence as a continuous signal; this model has easily interpretable parameters (Section 5);
• we show that our model is more accurate in both short-term and long-term predictions of user involvement, when compared with state-of-the-art methods (Section 6).
The rest of the paper is organized as follows. We start with an overview of related work in Section 2. We describe our process for collecting petition data in Section 3, and the insights we gained in Section 4. We present our new predictive model and compare it to existing models in Section 5. We experimentally evaluate the models and discuss them in Section 6. Finally, we summarize our results and outline future work in Section 7.

RELATED WORK
In this section, we position our paper with respect to prior work on popularity prediction for the web and for online petitions.

Popularity prediction on the web
Predicting the popularity of user-generated content on the web has been studied extensively [36]. Many different settings have been considered; common content types include online videos [23], online news [4], social bookmarking sites [22], social networking services [39], and crowdfunding campaigns [9], among others. Most works on this topic tackle one of three main tasks: (i) classification as successful/unsuccessful, i.e., predicting whether a particular piece of content will exceed a certain popularity threshold; (ii) prediction of overall popularity, i.e., predicting the final number of views or votes a piece of content will receive; and (iii) time series forecasting, i.e., modeling the popularity dynamics over time. Regardless of the specific task, two main types of approaches are observed: feature-based and model-based. Feature-based techniques rely on a set of (hand-)crafted features extracted from a single or multiple sources, for the purpose of classification or regression. Model-based techniques assume a specific parametric model for the process that drives the phenomenon; they are usually harder to formulate, but often produce better insight into the studied phenomenon. We summarize these approaches and include references for each one in Table 1.
This paper goes beyond analyzing "meme"-like content that spreads virally, and studies a phenomenon that involves active promotion; hence, we need to consider external signals. External information is used by previous work adopting feature-based approaches that extend Szabo and Huberman [35] (such as [4]), but not in model-based methods, as we do in this work. Our approach is based on modeling the conditional mean of a Hawkes process, as Kobayashi and Lambiotte [20] suggested. However, we extend their model with a flexible aging term that includes a rise and a decay, and that allows both for internal dynamics (self-excitation) and external factors (social media, front page). Moreover, each external factor is modeled as a continuous effect on the signature dynamics, rather than a series of individual external shocks. To the best of our knowledge, we are the first to present a model that captures interaction between multiple platforms in a model-based framework and with easily interpretable parameters.

Analyzing the dynamics of online petitions
Signature acquisition in online petitions is a complex and multi-dimensional problem. From the perspective of online activism, it is not only important to predict whether a petition will gain the required number of signatures or not, and what the final number of signatures will be, but also to start from valid assumptions about how the number of signatures evolves over time, and how external factors shape this evolution. Understanding these factors can help the organizers of these petitions to further enhance the engagement of the public with their campaigns.
Hale et al. [15] describe a temporal analysis of 8,000 petitions and discuss early signs of success (e.g., a large number of signatures during the first days). However, it remains unclear why some petitions become popular and others do not, or which factors can lead to an increase in popularity. Huang et al. [18] analyze "power" users on petitions platforms and how user involvement changes over time on a petitions platform. Proskurnia et al. [30] study the effect of petition success on user involvement in public online campaigns [31,10]. In contrast, we link social media and petitions together to model their evolution considering multiple factors, including external influence. Online petitions can be compared to crowdfunding campaigns, as both efforts work towards obtaining a given level of support over a bounded period of time. Etter et al. [9] study various prediction techniques for crowdfunding campaigns on Kickstarter. An et al. [2] analyze investor activity on Kickstarter and make recommendations based on their activity on Twitter. Unlike these works, we focus on signature rate dynamics using co-evolving time series information, and we do not limit ourselves to signals from social media, but also utilize further available information, including the effect of being featured on the front page.

DATA COLLECTION
Our study is based on petitions obtained from The Petition Site, one of the top-3 sites of its type according to Alexa. The Petition Site allows anyone to create an online petition and to gather signatures. There are 14 categories in which petitions can be started, including Environment and Climate, Education, Health, and Human Rights. Petitions have a headline (e.g., "Help stop the Taiji dolphin slaughter"), the name of the person or entity to whom the petition is addressed (e.g., "International Marine Trainers Association"), the name of the person who creates the petition, dates of opening and closing of the signature gathering, and a description and/or letter describing the contents of the petition. Each petition also includes a target number of signatures, decided by its author; we consider petitions that reach this target successful, and all others failed.
We collect two kinds of information on those petitions: the list of signatories and the tweets pointing to the petitions. The entire data collection pipeline is illustrated in Figure 1. The overall characteristics of the collection are shown in Table 2.
Petitions data. Petitions data were obtained using a custom-made web crawler and scraper to collect petitions created after August 1st, 2016 across all the topics. The resulting petitions garnered around 85 million signatures from about 5 million unique users. While there are old petitions in the data we collected (some dating back to 2003), we decided to focus solely on petitions that started after August 1st and were active for at least 10 days. These petitions comprise 85% of the total number of signatures in the entire collection. We additionally removed five outlier petitions having unattainable goals (requiring more than 1 billion signatures). Each petition has a web page including public information about the people who signed the petition. Each signer is authenticated on the platform by providing an e-mail address, whose ownership must be verified before the signature is recorded. Once the e-mail address is verified, signers may choose to remain anonymous (listing only the signature timestamp on the website), or to disclose more information (such as their first name and country of residence). Additionally, we collected hourly data for the top 10 petitions promoted on the front page starting August 1st.
Twitter data. In addition, we used Twitter's streaming API to collect all tweets containing a link to any URL containing "thepetitionsite.com". Tweet collection was conducted from August 1st, 2016 through October 1st, 2016, and yielded over 250K tweets.
Table 2 shows that the median numbers of signatures collected by successful and failed petitions differ significantly (p < 0.001). We also observe that successful petitions have more modest goals than failed ones; indeed, the goals of successful petitions are 10 times smaller than the goals of failed petitions (target of 4.3K signatures in successful petitions vs. 43.8K in failed ones), while the successful petitions collect about 9 times more signatures (51.9K signatures in successful petitions vs. 5.6K in failed ones). Both successful and failed petitions have similar timespans, 50 and 42 days on average, respectively.

DATA ANALYSIS
The majority of people include their first name and country, but signatories of failed petitions are almost twice as likely to remain anonymous (2.3% anonymous signatures in successful petitions vs. 4.4% in failed ones); they might be less willing to be publicly associated with these petitions. We also observe that successful petitions have on average more activity on Twitter: they are three times more likely to have tweets (90% vs. 27%), and their average number of tweets is more than twice that of failed petitions (83.3 vs. 37.1).
The cumulative distribution of signatures for over 4,000 petitions is shown in Figure 2 (left). From the figure, we observe that over 70% of the failed petitions did not reach 1,000 signatures, while nearly all successful petitions obtained at least 1,000 signatures and over 20% of the successful petitions reached over 100,000 signatures.
As in previous work [15,34], we observe that the higher the number of signatures a petition receives early on, the more likely it is to gain the required number of signatures. Figure 2 (right) shows the distribution of the number of signatures for the first 3 hours of a petition. Almost all failed petitions acquire fewer than 10 signatures during the first 3 hours, but a slow start does not guarantee failure: almost 60% of the successful petitions also acquire fewer than 10 signatures during their first 3 hours. As a result, a significant part of the successful petitions are indistinguishable from failed petitions during the first hours and, thus, it is not trivial to accurately predict whether they will succeed using only this data. Observations made using the first 24 hours of each petition, omitted for brevity, show a similar lack of separation between successful and failed petitions.
To understand the behavior of different classes of petitions, we clustered the petitions' time series using Dynamic Time Warping [12] into four clusters (we experimented with values from 2 to 30 clusters, and found that the inter-cluster distance stabilizes at about 4 clusters). The corresponding centroids are shown in Figure 3. Each cumulative distribution function for the petition signatures has been rescaled to the unit interval and to have the same number of time bins. Again, we observe that successful petitions tend to gather a large share of their signatures early on.
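The clustering step relies on a pairwise distance between rescaled cumulative series. The following is a minimal sketch of such a distance, assuming the classic dynamic-programming formulation of Dynamic Time Warping with an absolute-difference local cost. The helper names `rescale_cumulative` and `dtw_distance` and the choice of 20 time bins are illustrative assumptions, not details taken from the paper.

```python
from itertools import accumulate


def rescale_cumulative(hourly_counts, n_bins=20):
    """Cumulative signature curve rescaled to [0, 1] and resampled to n_bins points."""
    cumulative = list(accumulate(hourly_counts))
    total = cumulative[-1]
    if total == 0:
        return [0.0] * n_bins
    last = len(cumulative) - 1
    # Sample the cumulative curve at n_bins evenly spaced positions (n_bins >= 2).
    return [cumulative[round(i * last / (n_bins - 1))] / total for i in range(n_bins)]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| as local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cost matrix
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of the three admissible warping moves.
            curr.append(cost + min(prev[j], prev[j - 1], curr[j - 1]))
        prev = curr
    return prev[-1]
```

A pairwise distance matrix built this way could then be passed to any clustering routine that accepts precomputed distances; which routine the authors used is not stated in the text.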

Circadian Cycles and External Influence
In this section we observe two key characteristics of the time series of signatures that we subsequently use for building our prediction model.
Circadian cycles. We binned the petition signatures and corresponding tweets into 10-minute time intervals. In addition, we aligned the petition signatures and tweets with the corresponding time of the day in the users' country. Both activities clearly follow a circadian rhythm, with the signature activity showing a stronger circadian pattern than the tweets. In particular, we can observe a peak (at around 10am) in signature activity, as shown in Figure 4.
External effects. In order to estimate whether social media and being featured on the front page affect the signatures, we performed a Granger causality [14] study between signature time series, social media and front page appearances. We examined a random sample of 30 petitions from each cluster in Figure 3 with their corresponding tweets and their presence on the front page of The Petition Site (as detailed in Section 5.2). Specifically, we ran the algorithm to discover the latent network structure for point processes from Linderman and Adams [24], which determines the influence of a time series on the prediction of another time series, e.g., whether signatures affect tweets or vice versa. As a result, we discovered that for the cluster containing more successful petitions, Granger causality from Twitter to the number of signatures can be observed in 90% of the cases. This fraction is lower for the remaining clusters, which have a lower probability of success: 72%, 35%, and 20% respectively. This suggests that Twitter can accelerate the signatures early in the lifetime of a petition. We confirm this later, in Section 6.3, by showing that it mostly influences our predictive capability early in the petition lifetime. Interestingly, in the case of petitions that were promoted to the front page of The Petition Site, we identified cases where signatures influenced the front page time series and vice versa equally. We further study the front page effect in Section 4.3.
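The circadian binning can be sketched as follows. This is an illustration only: it assumes each event is available as a naive UTC timestamp paired with a fixed UTC offset (in hours) for the user's country, which simplifies the alignment described in the text, and the names `local_time_bin` and `circadian_profile` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

BIN_MINUTES = 10
BINS_PER_DAY = 24 * 60 // BIN_MINUTES  # 144 ten-minute bins per day


def local_time_bin(utc_time, utc_offset_hours):
    """Index of the 10-minute bin of the local day in which an event falls."""
    local = utc_time + timedelta(hours=utc_offset_hours)
    return (local.hour * 60 + local.minute) // BIN_MINUTES


def circadian_profile(events):
    """events: iterable of (utc_time, utc_offset_hours); returns counts per bin."""
    counts = Counter(local_time_bin(ts, off) for ts, off in events)
    return [counts.get(b, 0) for b in range(BINS_PER_DAY)]
```

Applying `circadian_profile` separately to signatures and tweets gives two 144-point profiles that can be compared directly, for example to locate the morning peak in signature activity.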

Matching Twitter Users and Signers
The main goal of this subsection is to establish a clearer connection between signatures and social media postings (tweets) beyond Granger causality. We performed a one-to-one matching between Twitter accounts and the names of petition signers/owners. Information about signers is represented in a structured format on the petitions platform. We adapted the method by Goga et al. [13] with matching parameters set according to our data. In particular, we used the following attributes to match the profiles: (1) signer full name and Twitter name/user name, (2) signer location and Twitter user location, (3) signer petitions and tweeted petitions. We tried various combinations of these three matching dimensions, and found that using all of them resulted in the maximum number of unambiguously matched users.
The main idea behind the matching is to investigate user patterns while signing the petition, specifically whether people post a tweet after signing, or sign after posting a tweet. This fine-grained matching further allows us to trace the number of followers that signed the petition and retweeted it. Overall, we were able to match 3,157 accounts (out of 37K unique Twitter users). On average, each signer was matched to 1.47 Twitter accounts (with the maximum number of matches being 45); 2,641 accounts were matched one-to-one to Twitter accounts in a non-ambiguous manner; these are the ones included in Table 3. The first observation from this table is that most people who both sign a petition and post a tweet sign the petition first, and then tweet. The difference in minutes between signing and tweeting can be depicted as a sparkline (sparkline omitted), in which a red line corresponds to the case where a petition was signed and tweeted simultaneously, and users who first tweet and then sign appear to the left of the red line. About 80% of the users perform signing and tweeting almost at the same time. In particular, 74% of the users who sign and tweet almost simultaneously tweet less than 10 minutes after signing a petition. We note that no matching scheme across websites is perfect, and this particular one might have false positives (some of the signer profiles had several identical matches on Twitter); however, we believe it provides relevant insight into the interaction between these platforms.

Front Page Effect
We identified 75 petitions that were promoted to the front page, and examined whether petitions that are promoted to the front page are already on track to be successful, and whether promoting those petitions causes their success. The short answer corroborates the results of the Granger causality analysis of Section 4.1: yes to both. To arrive at this answer, we used a standard tool from observational experiments, a matching study, in which we matched these 75 petitions featured on the front page with 75 similar petitions that were not featured on the front page. First, we computed the number of signatures that each of the 75 petitions promoted to the front page obtained before it got promoted at time $t^*_S$. Second, we matched each petition promoted to the front page with one that is within a 10% range of the number of signatures but was not promoted (¬FP) at time $t^*_S$. On average petitions appear on the front page after 27 hours (79 hours median) and remain for 14 days (6 days median). Statistics of these two samples are compared in Table 4. Table 4 strongly suggests that the petitions that are promoted are not randomly chosen. Failed petitions constitute about 75% of our sample, and hence a petition chosen uniformly at random should have about a 25% success rate. In comparison, the matched ¬FP set has a success rate above 80%. However, the same observations also confirm that being promoted on the front page has a drastic effect on these petitions. Beyond ensuring success (as the success rate of promoted petitions is 100%), it significantly increases the number of signatures received. For example, after only 2 days of being promoted on the front page, petitions gained almost twice as many signatures as ¬FP petitions.
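The matching step can be illustrated with a short sketch. It assumes a greedy strategy that pairs each promoted petition with the closest not-yet-used control whose signature count lies within 10% of its own; the paper only states the 10% criterion, so the greedy choice, the matching without replacement, and the name `match_controls` are assumptions made for illustration.

```python
def match_controls(promoted, candidates, tolerance=0.10):
    """Pair promoted petitions with non-promoted ones of similar size.

    promoted / candidates: dict mapping petition id -> signatures gathered
    up to the promotion time. Returns a dict promoted id -> control id.
    """
    available = dict(candidates)
    pairs = {}
    for pid, count in promoted.items():
        low, high = count * (1 - tolerance), count * (1 + tolerance)
        in_range = [cid for cid, c in available.items() if low <= c <= high]
        if not in_range:
            continue  # no comparable control for this petition
        best = min(in_range, key=lambda cid: abs(available[cid] - count))
        pairs[pid] = best
        del available[best]  # each control is used at most once
    return pairs
```

Comparing the later signature counts of each matched pair then isolates the front page effect from the selection effect discussed above, to the extent that the matching variable captures what makes petitions comparable.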

PETITIONS MODELING
In this section, we introduce new methods to model the evolution of the number of signatures. Our models take into account circadian rhythms, aging effects, self-excitation, and external signals that influence the signature rate over time. Experimentally, these signals correspond to postings related to each petition on a social media platform, and the position in which a particular petition was present on the front page of the petitions site.
First, we introduce a new deterministic model that mimics the circadian nature of the underlying phenomenon we are studying and that includes information aging and self-excitation. Next, we extend this model by incorporating the external influence of social media and front page display, describing an end-to-end prediction pipeline.

Circadian Rhythm and Aging
The engagement of users with petitions, that is, the signature rate over time, exhibits two important temporal characteristics: circadian cycles and temporal decay. Circadian cycles are visible as daily oscillations in the signature rate, as we showed in Figure 4; they affect all petitions and remain stable within a particular time zone. Decay is expected due to the aging of the petition; sometimes the signature rate starts to decrease immediately, while in other cases it increases and then decreases. Based on these observations, we propose a model called Circadian rhythm with Rise and Decay (CRD). We discretize time using a time step $\delta t = 1$ h, and describe the signature rate (the number of signatures between $t$ and $t + 1$) by $\hat{s}_p(t)$ (Eq. 1), where $t$ is the time since the birth of petition $p$, $a_p$ is the intensity, $b_p$ is the amplitude of the oscillation, $\phi_p$ its phase (with respect to an oscillation cycle of $T = 24$ h), $\tau_p$ is the decay parameter, and $k_p$ describes the initial rise in the petition activity. Parameters are fitted by minimizing the squared error $E_p = \sum_{t=1}^{T_{train}} \{\hat{s}_p(t) - s_p(t)\}^2$ using the Levenberg-Marquardt algorithm [26]. The range of $\tau_p$ is restricted to $0.5 < \tau_p < 75$ hours, similarly to Kobayashi and Lambiotte [20].
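To make the roles of the five parameters concrete, the sketch below evaluates one possible rise-and-decay curve with a circadian oscillation, together with the squared error used for fitting. The functional form $a \, (1 + b \cos(2\pi t / T + \phi)) \, t^{k} e^{-t/\tau}$ is an assumption chosen to match the parameter descriptions in the text (only the factor $e^{-t/\tau}$ is confirmed later in the paper); it should not be read as the paper's exact Eq. 1.

```python
import math

PERIOD_HOURS = 24.0  # oscillation cycle T


def crd_rate(t, a, b, phi, tau, k):
    """Assumed CRD-like rate: circadian oscillation times a rise-and-decay envelope."""
    oscillation = 1.0 + b * math.cos(2.0 * math.pi * t / PERIOD_HOURS + phi)
    envelope = (t ** k) * math.exp(-t / tau)  # rises like t^k, decays like e^(-t/tau)
    return a * oscillation * envelope


def squared_error(params, observed):
    """E_p: sum over t = 1..T_train of (predicted rate - observed rate)^2."""
    return sum((crd_rate(t, *params) - s) ** 2
               for t, s in enumerate(observed, start=1))
```

Under this assumed form the envelope peaks at $t = k\tau$, so $k = 0$ reproduces an immediate decay and larger $k$ delays the peak. A general-purpose solver such as SciPy's `least_squares` could minimize `squared_error`; note that its Levenberg-Marquardt mode does not accept bounds, so the restriction on $\tau_p$ would need a bounded method or a reparameterization.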

Self-Excitation and External Influence
The CRD model is extended to incorporate self-excitation and external influence. The external influence we model comes from two sources. The first one is social media, expressed as $n_{sm}(t)$, the number of social media exposures at time $t$ (the number of tweets multiplied by the average number of the authors' followers). The second one is being featured on the front page of The Petition Site, expressed as the rank $n_{srank}(t)$ on the front page, which contains 10 petitions at a time. An arbitrary value of 1,000 was chosen for $n_{srank}$ for petitions not featured on the front page, which are the majority. The resulting signature rate $\hat{s}_p(t)$ (Eq. 2) uses a memory window of size $T_{mem} = 10$ h, indicating the number of time steps used in the estimation, and memory kernels $c_{self}$, $c_{sm}$, $c_{front}$ that give, respectively, the relative importance over time of self-excitation, of the external influence from social media, and of being featured on the front page of The Petition Site. The memory kernels are determined by minimizing the squared error after fitting the CRD parameters $a_p$, $b_p$, $k_p$, $\tau_p$ and $\phi_p$.
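The role of the memory window and kernels can be sketched as follows. The additive combination below (a CRD baseline plus kernel-weighted sums of the last $T_{mem}$ hours of signatures and exposures) is an assumption made for illustration and is not necessarily the form of the paper's Eq. 2; the front page term is left out for brevity and would enter in the same way. The names `lagged_window` and `predict_rate` are hypothetical.

```python
T_MEM = 10  # memory window in hours, as stated in the text


def lagged_window(series, t, size=T_MEM):
    """Values series[t-1], ..., series[t-size], zero-padded before the petition started."""
    return [series[t - i] if t - i >= 0 else 0.0 for i in range(1, size + 1)]


def predict_rate(t, crd_baseline, signatures, exposures, c_self, c_sm):
    """Assumed additive form: CRD baseline plus kernel-weighted recent history."""
    past_s = lagged_window(signatures, t)
    past_n = lagged_window(exposures, t)
    correction = sum(c * s for c, s in zip(c_self, past_s))
    correction += sum(c * n for c, n in zip(c_sm, past_n))
    return crd_baseline[t] + correction
```

With this layout each kernel is a vector of $T_{mem}$ weights, one per lag, which is what makes the fitted kernels directly readable as "how much does an event $i$ hours ago matter".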

EXPERIMENTS
In our experiments, we consider two main prediction tasks: short-term prediction, with $T_{tot} = 72$ hours (3 days), and long-term prediction, with $T_{tot} = 168$ hours (1 week). We vary the size $T_{train}$ of the input that is available to each model (from 12 hours to 71 or 167 hours, respectively).

Metrics
We use two metrics to measure the prediction performance of the different models.

Symmetric Median Absolute Percentage Error (SMAPE) measures the median hourly deviation between the predicted and actual time series of signature counts for a predicted period over $N$ petitions, where $\hat{s}_p(t)$ and $s_p(t)$ are the predicted and actual numbers of signatures of the $p$-th petition between $t$ and $t + 1$. We use the median to reduce the effect of outliers, similarly to previous works on web predictions [20,39].
Cumulative Symmetric Median Absolute Percentage Error (CSMAPE) measures the median deviation between the predicted and actual cumulative signature counts for a predicted period over $N$ petitions, where $\hat{S}_p(T_{train}, T_{tot})$ and $S_p(T_{train}, T_{tot})$ are the predicted and actual numbers of signatures of the $p$-th petition in the prediction period $(T_{train}, T_{tot}]$, respectively.
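The two metrics can be sketched as follows. The normalization $|\hat{x} - x| / (|\hat{x}| + |x|)$ and the convention of averaging the hourly errors of a petition before taking the median across petitions are assumptions, chosen as one common reading of a symmetric percentage error; the function names are hypothetical.

```python
from statistics import median


def symmetric_error(predicted, actual):
    """Symmetric relative error in [0, 1]; defined as 0 when both values are 0."""
    denom = abs(predicted) + abs(actual)
    return 0.0 if denom == 0 else abs(predicted - actual) / denom


def smape(predicted_series, actual_series):
    """Median over petitions of the mean hourly symmetric error."""
    per_petition = []
    for pred, act in zip(predicted_series, actual_series):
        hourly = [symmetric_error(p, a) for p, a in zip(pred, act)]
        per_petition.append(sum(hourly) / len(hourly))
    return median(per_petition)


def csmape(predicted_series, actual_series):
    """Median over petitions of the symmetric error on cumulative counts."""
    errors = [symmetric_error(sum(pred), sum(act))
              for pred, act in zip(predicted_series, actual_series)]
    return median(errors)
```

The two metrics can disagree: hourly over- and under-predictions that cancel out leave the cumulative error small while the hourly error stays large, which is why both are reported.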

Baselines
We compared our methods against three state-of-the-art baselines.
Linear Regression. We trained the linear regression model proposed by Szabo and Huberman [35], which is a standard method for popularity prediction. The logarithm of the cumulative number of signatures $S(T)$ at time $T$ is fitted by a linear function $\log S(T) = \alpha_T + \log S(T_{train}) + \sigma_T \epsilon_T$. Parameters $\alpha_T$ and $\sigma_T$ are obtained by minimizing the squared error of the prediction on a training set, and $\epsilon_T$ is a Gaussian random variable with zero mean and unit variance.
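Because the slope on $\log S(T_{train})$ is fixed to one, the least-squares estimate of $\alpha_T$ reduces to the mean log-ratio between final and early counts, and $\sigma_T$ to the spread of the residuals. A minimal sketch under the assumption that all counts are strictly positive (the function names are hypothetical):

```python
import math
from statistics import mean, pstdev


def fit_log_linear(train_counts, final_counts):
    """Fit log S(T) = alpha + log S(T_train) + sigma * eps by least squares."""
    log_ratios = [math.log(final) - math.log(early)
                  for early, final in zip(train_counts, final_counts)]
    alpha = mean(log_ratios)    # minimizes the squared error for a pure offset
    sigma = pstdev(log_ratios)  # spread of the residuals around alpha
    return alpha, sigma


def predict_final(train_count, alpha):
    """Point prediction S(T) = S(T_train) * exp(alpha)."""
    return train_count * math.exp(alpha)
```

For example, if every training petition doubles between $T_{train}$ and $T$, the fit yields $\alpha_T = \log 2$ and a new petition with 50 early signatures is predicted to reach about 100.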
SVM with self-excitation and SVM with social media. A strong and simple baseline to predict complex time series is SVM regression with the Gaussian radial basis function (RBF) kernel [7]. Similarly to our model, SVM with self-excitation and SVM with social media are given the time series $s_p(t - i)$ and $n_{sm}(t - i)$, respectively, for a time window $T_{mem} = 10$. The best performing parameters for the model, determined experimentally for our case, are $C = 1000$ and $\gamma = 0.1$, where $C$ is the soft margin penalty parameter and $\gamma$ is the kernel coefficient.

Reinforced Poisson Process (RPP). The RPP model has been used for modeling the cumulative number of citations to journal papers published by the American Physical Society [33]. The signature rate $\lambda_t$ is expressed as $\lambda_t = c \, f_\gamma(t) \, r_\alpha(R_t)$, where $c$ represents the attractiveness, $f_\gamma(t) \propto t^{-\gamma}$ ($\gamma > 0$) describes the aging, and the reinforcement function $r_\alpha(R_t)$ ($\alpha > 0$) models the "rich-get-richer" phenomenon. The parameters $c$, $\gamma$, $\alpha$ are determined by maximizing the likelihood function [11,20].

Prediction
We train linear regression and SVM models for each input size $T_{train}$ and prediction length $T_{tot} - T_{train}$. As training data, we use 70% of the petitions, selected uniformly at random. We train the model to predict the signature rate at an arbitrary hour in the future, as well as the cumulative number of signatures up to that point, using hourly signature rates $s_p(t)$ and tweet rates $n_{sm}(t)$ from the training dataset. We then test the prediction on the rest of the petitions. These experiments are performed 10 times, and we report their average performance.
Estimation of the parameters of our model is performed in two steps. First, we estimate the parameters of seasonality and aging using the plain CRD model for each petition. Second, we train a linear regression model with either the self-excitation component $c_{self}(i)$ or the social media component $c_{sm}(i)$, separately, using the results of the previous step. Since future social media postings are not known at prediction time, we estimate the social media exposures on the training set using Eq. 1. Figure 7 shows the hourly average of social media exposures as well as its estimation by the CRD model. At prediction time, we re-estimate parameters $a$ and $b$ of Eq. 1 based on the actual social media exposures, and then use the predicted values as $n_{sm}(t)$ in Eq. 2.
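The evaluation protocol for the trained baselines (a random 70/30 split of petitions, repeated 10 times and averaged) can be sketched as below. The `evaluate` callback, which would train on the first list and return a score on the second, and the fixed seed are assumptions introduced for illustration.

```python
import random
from statistics import mean


def repeated_holdout(petition_ids, evaluate, train_fraction=0.7, repeats=10, seed=0):
    """Average score over repeated uniformly random train/test splits."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    scores = []
    for _ in range(repeats):
        shuffled = list(petition_ids)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return mean(scores)
```

Splitting by petition, rather than by hour, keeps every hour of a test petition unseen during training.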
Prediction accuracy. Figure 5 shows an example of an actual time series for signatures and the result of predictions with our models and the baselines. It illustrates the advantage of incorporating information from social media: the resulting prediction follows the actual evolution of the number of signatures more closely. Note that our models significantly outperform the baselines. We systematically evaluate all models using the metrics introduced above in Figure 6, which shows the results of predicting the number of signatures for up to 3 days (upper plots) or 7 days (bottom plots). The x-axis corresponds to the amount (in hours) of training data each method receives. We observe that the performance of the SVM-based methods is the lowest, linear regression and reinforced Poisson process have intermediate performance, and the performance of CRD, CRD with social media, and CRD with self-excitation is the highest. The latter two behave similarly, except when little training data is available, at the very beginning of a petition. In that case, CRD with social media is better than CRD with self-excitation. Given the size of the entire collection, the average improvement of considering front page information for 75 petitions is relatively small. However, among the 150 petitions described in Section 4.3, the front page effect brings an improvement of about 4% on average in terms of SMAPE for the prediction of up to 3 days, with respect to CRD with social media and front page effect in which $c_{front}$ is forced to be 0.

Analysis of Estimated Parameters
This subsection analyzes the estimated parameters of the CRD models as well as their external influence functions.
Circadian Rhythm and Aging. As a by-product of modeling each petition using the Circadian with Rise and Decay (CRD) model given in Eq. 1, we obtain a distribution for each parameter across all petitions. These distributions are shown in Figure 9, where we separate failed petitions from successful ones, as well as a special case of successful petitions, which are the ones promoted on the front page.
As expected, we observe that the intensity parameter $a$, which corresponds to the offset in the signature rate, is higher for successful petitions than for unsuccessful ones. Interestingly, the amplitude parameter $b$ shows that the oscillations of the series are larger for failed petitions, perhaps because failed petitions are more localized within a single time zone. The growth parameter $k$, which influences the day at which a petition reaches its peak, shows that successful petitions tend to be more popular early on in comparison with failed petitions, and that the peak of the petitions that are promoted on the front page happens later in time, likely at the moment when the petition ranks the highest on the front page. The decay parameter $\tau$ can be much larger for successful petitions, meaning that they sustain interest for a longer period of time (in the model this appears as $e^{-t/\tau}$). Finally, most of the petitions have a similar shift of the circadian rhythm, given by the phase parameter $\phi$, since most of them are created in the USA and signed by people in the same country, in time zones that are close to each other (the distributions are almost equal, so they are omitted from the figure).
Self-Excitation vs External Influence. Our model uses a time window of size T_mem hours, which allows it to incorporate information from the recent past in its estimation of the future. Each of the coefficients for the influence of self-excitation c_self(i), social media c_sm(i), and front-page effect c_front(i) can be seen as a time-indexed vector reflecting the importance of different moments of the recent past for each specific influence across successful petitions. If we are predicting the popularity at hour t + 1, the influence function corresponds to the vector of size T_mem that contains the impact of each prior hour t − i of past signatures, social media exposures, or front-page rank, where i = 0, 1, ..., T_mem. The centroids of these vectors are shown in Figure 8.
Several interesting observations can be made from Figure 8. First, self-excitation seems to be largely memoryless, with the immediately preceding step being the most influential element. Second, social media (Twitter in this case) has an influence that can last up to four hours for the successful petitions, and peaks about 2 hours after posting; this means that posting at time t mostly affects the signature rate between times t + 1h and t + 3h. Failed petitions are less affected by social media, and only within 1 hour after the posting. Third, being featured on the front page significantly boosts the signature rate for up to 3 hours, in agreement with our observations from Section 4.3. In relative terms, a post on social media has a stronger (external) impact on the future number of signatures than adding one signature (self-excitation), and being featured on the front page has a stronger effect than social media activity.
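The influence functions discussed above can be read as lagged weights applied to a recent window of an input series. The following minimal sketch computes such a weighted sum for one influence; the function name `lagged_influence`, the example weights, and the example tweet counts are hypothetical, and how this term is combined with the base CRD rate is not specified here.

```python
def lagged_influence(weights, series, t):
    """Weighted sum of weights[i] * series[t - i] over the memory window.

    weights: influence vector, e.g. c_sm(i) for lags i = 0, 1, ...
    series:  hourly input, e.g. number of tweets per hour
    t:       current hour index; lags reaching before hour 0 are skipped
    """
    total = 0.0
    for i, w in enumerate(weights):
        if 0 <= t - i < len(series):
            total += w * series[t - i]
    return total


# Hypothetical social media weights that peak at a lag of 2 hours.
c_sm = [0.0, 0.5, 1.0, 0.5]
# Hypothetical hourly tweet counts: a burst of 3 tweets at hour 2.
tweets = [0, 0, 3, 0, 0, 0]
contribution_at_hour_4 = lagged_influence(c_sm, tweets, 4)
```

With weights shaped like this, a burst of posts at hour t contributes most to the estimate a couple of hours later and nothing once it falls outside the window, mirroring the qualitative shape reported for successful petitions.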

CONCLUSIONS
Online user engagement is a complex phenomenon, challenging us to understand interdependent activities across websites that are less studied than those happening on a particular website. In this paper, we studied an important form of engagement, signing an e-petition, and modeled two external influences: activity on social media, and promotion to the front page. We demonstrated significant improvement in modeling and predicting engagement when those influences are taken into account. In addition, we showed that the circadian rhythm of human activity, and the fact that interest decays over time, also need to be considered. We analyzed the effect of social media and found it to be impactful in two ways. First, at a micro level, as demonstrated by the matching of people signing a petition and then posting about it shortly afterwards. Second, at a macro level, where we analyzed the effect of Twitter on the signature rate using a Granger causality test, and showed significant improvement in prediction accuracy when using social media; these improvements are particularly important for reducing the amount of time and data needed to perform an accurate prediction. We were also able to determine that the effect of Twitter posts lasts for about 5 hours and peaks at about 1-3 hours after posting. These findings are relevant beyond online petitions, as many campaigners in social media (e.g., promoting brands, causes, or candidates) also perform similar activities in order to boost user engagement.
Specifically for online petitions, we showed that successful petitions tend to peak early and continue receiving attention for a longer time. In other words, it is not just about having a "strong start," but about being able to sustain this engagement day after day. Petitions can be boosted by activity on social media, and by featuring them prominently to a large audience of potential signatories, as demonstrated by the front page effect that we have modeled and measured. These findings are probably relevant for people running other types of campaigns, and may be particularly important for crowdfunding campaigns. In general, running a successful online campaign requires sustained attention and punctual interventions. In that context, interpretable models that can provide actionable insight about how a campaign is evolving are vastly more useful than opaque models, even if the latter were to provide small advantages in terms of prediction accuracy.
Future Work. We believe that this paper is an important step towards better modeling and predicting how reinforced information spreads online. It can be extended in a number of ways. In terms of new methods, it would be interesting to explore how the effects of several petitions on each other could be modeled, and how social media communities and influencers, defined both topically and through network structures, could be incorporated into our models. Moreover, impact functions could be represented through parametric distribution functions. In terms of enhancing the prediction accuracy, further sources of social media, and new features, could easily be incorporated into our model. Since we are modeling the petitions at an individual level, it might also be interesting to build a batch model, compare it with our model, and apply it over specific clusters of petitions. Finally, a prediction using a stochastic Hawkes process might be compared to the deterministic one presented in this paper.