Affinity Paths and information diffusion in social networks

José Luis Iribarren and Esteban Moro arbol
Social Networks 33, 134-142 (2011)  [pdf]

Abstract
Widespread interest in the diffusion of information through social networks has produced a large number of Social Dynamics models. A majority of them use theoretical hypothesis to explain their diffusion mechanisms while the few empirically based ones average out their measures over many messages of different contents. Our empirical research tracking the step-by-step email propagation of an invariable viral marketing message delves into the content impact and has discovered new and striking features. The topology and dynamics of the propagation cascades display patterns not inherited from the email networks carrying the message. Their disconnected, low transitivity, tree-like cascades present positive correlation between their nodes probability to forward the message and the average number of neighbors they target and show increased participants’ involvement as the propagation paths length grows. Such patterns not described before, nor replicated by any of the existing models of information diffusion, can be explained if participants make their pass-along decisions based uniquely on local knowledge of their network neighbors affinity with the message content. We prove the plausibility of such mechanism through a stylized, agent-based model that replicates the Affinity Paths observed in real information diffusion cascades.

The speed and reach of forwarded emails, rumors, and hoaxes in electronic social networks

large_spain_5We have just published an experimental/theoretical work on the speed of information diffusion in social networks in Physical Review Letters. Specifically we have studied the impact of the heterogeneity of human activity in propagation of emails, rumors, hoaxes, etc. Tracking email marketing campaigns, executed by IBM Corporation in 11 European countries, we were able to compare their viral propagation with our theory (see below the campaigns details).

The results are very simple. Let me give you an example: the typical time between two emails sent by the same person is around 1 day. Traditional models of information diffusion will then yield to an infection speed of 1 day. However, some email computer viruses spread widely in a matter of hours (minutes, sometimes), while some viral propagation (for example the Veuve-Clicquot hoax) last for years. How can that occur? The reason is that traditional models are not correct because they neglect the large heterogeneity in the frequency of human activity: the average time between emails (1 day) does not actually represent the collectivity. In fact, most of us respond very quickly to emails, but some take a lot of time to do it. This fact (known and discovered previously by others) has a profound consequence in the way information spreads:

  1. When information spreads “successfully”, in the sense that it propagates and reaches most of the collectivity (i.e. it surpasses the tipping-point), its propagation speed of is determined by the people that have higher activity.
  2. However, when information reaches just a small fraction of the population (below the tipping-point), its propagation is controlled by those who take a lot of time to respond/forward and the spreading is very slow.

This phenomenon, as explained in our paper, has consequences for viral marketing, fads and hoaxes diffusion or opinion dynamics because the speed of their messages propagation depends strongly on the size of the sub-communities of very active and not-so active people. For example, in our campaigns (which were below the tipping-point yet successful from a viral marketing perspective), endogenous propagation of the commercial message lasted for months while the average time between getting the message and forwarding was only 1 day. We also found that messages do not “go viral”: They are viral because of the diffusion mechanism they use, but their spreading success largely depends on the social network propensity and heterogeneous behavior.

Finally, our work has some consequences for the way we model and understand human dynamics, since it shows that there is no such a thing as a typical time scale in the human dynamics. This is in sharp contrast with epidemic models, information diffusion models, etc. in which the heterogeneity in human activity and frequency is usually neglected, in favor of a more homogeneous picture of the activity of humans.

About the empirical data:
The viral marketing campaigns were conducted by IBM using the typical “refer-a-friend” mechanism which led to the endogenous diffusion of information. The campaigns’ offerings were promoted at the IBM. homepage where initial participants heard about them. Their primary marketing objective was to generate subscriptions to the company’s on-line newsletter. Subscriptions were entered through a form located in the campaign main web page (a.k.a. registration page). Additionally, a viral propagation mechanism accessible through a button located at the registration page was available to foster the message propagation. The button caption enticed visitors to recommend the page to friends and colleagues by offering, as additional incentive for people to forward the page, tickets for a prize draw to win a laptop computer. More technical details about the campaign can be found at Appendix D of the arXiv version of our paper

Press coverage:

  • ‘Infectious’ people spread memes across the web, New Scientist (12/08/09)
  • Email hoaxes are like viruses, The Inquirer (10/08/09)
  • The flow of viral video, ABC News (8/08/09)
  • New model for social marketing campaigns details why some information ‘goes viral’, PhysOrg (6/08/09)
  • Los perezosos frenan los rumores en Internet, ABC.es (14/8/09)
  • Party people spread viral internet memes, ComputerWeekly (14/8/09)
  • Desvelan las claves de la difusión de la información en las redes sociales, PlataformaSINC.es (7/9/09)
  • Nuevas claves para la difusión de información en las redes sociales, Noticias Madri+d (7/9/09)
  • The probability of going through a bad patch

    We’ve heard it: people that invest on the stock market or that gamble in lotteries, casinos, etc usually say “I’m going through a bad patch” (or bad spell). That is, they have been losing money for a while, but hey! better times are ahead and there’s no reason to quit. Are they sure? Are better times ahead? How close is “ahead” to today? Let’s work through a specific example to see how far is “ahead”. Suppose we play a fair game: we toss a coin and with probability 1/2 we get $1 (heads) and with probability 1/2 we lose $1 (tails). We play the game n times and compute our capital C(n) up to time n. If our initial capital is zero, then we expect that our capital fluctuate around zero as the coin-tossing game goes on. Sometimes we will be in the “winning area”, where our capital is positive C(n) > 0. However, we can also be in the “losing area” in which our capital is negative C(n) < 0. If we are going through a bad patch (being in the losing area) we expect that waiting long enough we will recover and come back to the winning area.

    But this is incorrect. Let me show you why: let’s use some mathematics. Suppose that x_i is the gain (+$1) or lose (-$1) in toss i of the coin. Since our coin is fair, then x_i is a random number which takes +1 or -1 with equal probability (1/2). Thus the capital up to time n is the sum of those random numbers

    \displaystyle{C(n) = \sum_{i=1}^n x_i}

    C(n) is then the sum of n equally distributed random numbers. In other contexts, C(n) is also know as a random walk. We can apply the law of large numbers and the central limit theorem to know something about C(n). For example, the expected value of C(n) is

    E[C(n)] = 0

    as expected, since it is a fair game. Thus we have equal probability of being winning or losing at time n. However, C(n) fluctuates wildly around zero and in fact

    Var[C(n)] = n

    Thus our capital at time n is mostly in an interval of area \sqrt{n} around zero, as shown in the next graph.

    cnThe graphs shows 4 realizations of the game (colors) and the lines are the \sqrt{n} areas in which our capital is mostly expected. As we can see in the “red” game, we starting losing money, but after a while we recover and went back to the “winning area”. Now the question is: what is the probability that we are in the winning area? Specifically, what is the probability P(\alpha) that we are in the winning area (C(n) > 0) for a fraction \alpha of the total n turns? The naive reasoning in the introduction will tell us that since C(n) is fluctuating around zero we expect that the probability will be peaked and 1/2 and thus half of the time we will be in a bad patch and half of the time we will be going through a good spell. Thus, if we are going through a bad patch, we have only to wait to come back to black numbers. However, this is not true. The probability P(\alpha) can be worked out (although not trivially) to get

    \displaystyle{P(\alpha) = \frac{1}{\pi\sqrt{\alpha(1-\alpha)}}}pdfa

    in the limit n\to \infty, which is known as the arc-sine law (since the cumulative distribution of P(\alpha) is the arc-sin function). As the plot in the right shows the probability is peaked at 1 and 0 (actually, it diverges there!). Thus, for most of the realizations of the game we are most of the time in the winning area or in the losing area. This means that our naive reasoning above does not work: if you expect to recover from a bad patch, your chances are very small. This is obvious if we look at the colored figure above: the orange and black trajectories do not change from one winning/losing area to the other and, apart from the initial steps of the game, they remain in the winning/losing areas forever. The explanation for this behavior is that the first return time of C(n) to zero is oftenly large. Actually, its expected time is infinite, which means that once you get into the positive/negative area you remain (mostly) there.

    Note however, that there is no paradox in what we have found and the fact that E[C(n)] = 0, since P(\alpha) is symmetric around \alpha = 1/2 and thus if we play the game a large number of times, on average, we have the same chances of winning and losing. But not for an individual game in which mostly we will be in a bad or good patch forever.

    What is the moral? Simple: if you get into a bad patch, leave the game. Because chances to recover from a bad patch are small.

    Waiting for the bus

    Although the public transportation system in Madrid is very good, I don’t usually take the bus or the train to move around. But sometimes my car decides I should take public transportation (bus or train). One of things I always found intriguing is that I always wait way more for the bus than what is expected according to the frequency quoted by the transportation companies at the bus or train stop. For example: if the frequency of the buses is 4 per hour, you should expect a bus each 15 minutes on average, although fluctuations around this value may occur due to traffic conditions. Thus, arriving at the bus stop at random will get you an average of 15/2 = 7.5 minutes of wait. I have the impression I always wait 15 minutes instead. How can this be? Is it that I am so unlucky I always arrive at the bus stop when the previous bus has just left?

    The answer to this question is the famous waiting time paradox in queueing theory: in short, let’s assume that the times between two buses are given by the variable T. In the ideal world, T will be a deterministic variable. In the example above, T = 15 minutes. However, due to traffic conditions and other variables, T is distributed around 15 minutes. For example, the following graphic shows the “lateness”of non-frequent buses in Great Britain, extracted from the Bus Punctuality Statistics GB 2007 report 

    lateness1

    As we can see, although most of the buses were late only less than a couple of minutes, but there is an significant fraction of them that were late more than 5 minutes. And even some of them came earlier than expected! Assuming that our buses come at times T drawn from the above distribution, let’s see how a day will look like from the bus stop:

    diabus

    Buses arrived at the bus stop (vertical lines) with time between them given by T_i. After my car breaks down, I arrive at the bus stop at the time given by the vertical arrow. Thus I have to wait a time \tau. The question is: what is the average value of \tau? Given that my car breaks at random, I can assume that my arrival time is completely at random, uncorrelated with bus time tables. However, if the time I get to the bus stop is random, I have more probability to get to the bus stop in an interval in which the time between buses T is bigger. Specifically, the probability that I arrive at an interval with time between buses T is 

    \displaystyle{\frac{T P(T)}{\overline{T}}}

    where \overline{T} is the average value of time between buses. Given the interval T, the waiting time \tau is equally distributed (since my arrival is random), and thus we have to multiply the above probability by 1/T. Finally, we average over all possible \tau \leq T, giving

    \displaystyle{Q(\tau) = \int_\tau^\infty \frac{TP(T)}{\overline T} \frac{1}{T}dT = \frac{1}{\overline T}\int_\tau^\infty P(T) dT}

    which give us the distribution of waiting times. The average value of \tau is the given by

    \displaystyle{\overline \tau = \int_0^\infty \tau Q(\tau) d\tau = \frac{\overline{T^2}}{2\overline T^2} = \frac{\overline T}{2}\left(1 + \frac{\sigma_T^2}{\overline{T}^2}\right) }

    where \sigma_T^2 is the variance of the times between buses. This equation is the main result. It says:

    • If buses come at perfect and regular time intervals, then \sigma_T = 0 and the waiting time is what we expect: \overline{\tau} = \overline{T}/2. That is if we expect a bus each 15 minutes, I will wait (in average) 7.5 minutes.
    • However, in real world, buses do not arrive at perfect times and then \sigma_T > 0. Thus, waiting time is always greater than \overline{T}/2. In fact, the distribution of lateness above shows that there is a large fraction of buses with long delays and then \sigma_T could be very big, controlling the waiting time.

    In fact, the GB Bus Punctuality Report shows that in average, waiting time exceeds 40% the expected time \overline{T}/2. That is: due to the variance in times between buses, you (and me) end up waiting a time greater than the average time between buses. And that is my feeling: if trains/buses come each 15 minutes, I end up waiting 15 minutes, not 7.5.

    In a ideal world, bus timetables would quote both the frequency of the buses \overline{T} and their variance \sigma_T, so we can estimate the waiting time. Unfortunately, they just tell us part of the story…

    Brown’s observations on Brownian motion

    Robert Brown (1773–1858)

    Robert Brown

    In 1828, Robert Brown published the manuscript entitled “A brief account of microscopical observations made in the months of June, July and August, 1827, on the particles contained in the pollen of plants; and on the general existence of active molecules in organic and inorganic bodies” in the Edinburgh new Philosophical Journal [download it in pdf format here]. He suspended some of the pollen grains of the species Clarkia pulchella in water and examined them closely, only to see them “filled with particles” of around 5 µm diameter  that were “very evidently in motion”. He was soon satisfied that the movement “arose neither from currents in the fluid nor form its gradual evaporation, but belonged to the particle itself”. Brown’s work was the first comprehensive observation of a phenomena called Brownian motion which remained unexplained until the beginning of the 20th century by Bachelier and most notably by Einstein in his famous paper in 1905. Brownian motion is the most basic description of the dynamics of a particle, price, etc. under the influence of external noise. 

    microscopio-brown

    Picture of the microscope used by Brown

    A typical mistake found in books, encyclopedias and articles (even in the Nature journal and even by the great Giorgio Parisi) is that Brown observed the motion of the pollen grains themselves. This might be the most clear example of a propagated mistake in the scientific literature, since it is obvious from the very title that the particles he observed were “in the pollen grains”. In fact, using the explanation of the motion by Einstein is easy to convince ourselves that the pollen grains were too big to wander around enough to be observable: Einstein in 1905 published a paper (original in german) in which he derived the famous Einstein relation in kinetic theory

    \displaystyle{D = \frac{k_B T}{6 \pi \eta r}}

    which relates the diffusion constant of the Brownian motion of a particle D with the radius of the particle r and the viscosity of the medium in which the particle is moving \eta. The diffusion constant D is also proportional to the temperature T and k_B is the Boltzmann constant. In order to observe the motion by eye, the particle should move considerably in a matter of seconds. We can evaluate the size of the movement by looking at the root-mean-square fluctuations in the position which are given by

    \langle x^2(t)\rangle = D t

    For a pollen grain the radius is around r \simeq 100 \mu m and in water at T = 25ºC we get that \langle x^2(t)\rangle = 19 \mu m in … one day!! This slow pace motion is hard to observe with today optical microscopes by eye, not to say with the microscope used by Brown (see the above picture).

    To observe the Brownian motion by eye, the particles should be smaller. If we assume that the radius should be small enough so that the RMS fluctuations in one second are greater or of the order of the particle size, we get that

    r \leq 1 \mu m

    that is, of the same length of the particles observed by Brown inside the pollen grain. In the following video you can see this molecular motion in an experiment made a research group in the Universidad Complutense de Madrid (Luis Dinis, Julio Serna, Rodrigo Soto and Ricardo Brito) using polystyrene spheres of 0.75-0.89\mu m diameter in water. The observations are made with an optical microscope using a 60x objective. The Brownian motion is evident.

    More information about Brown’s observations and related work:

    Update: Julio Serna (thanks) sent me this interesting reference: