Relationship mining

2010 January 11

Every day, billions of emails, phone calls, blog comments, Twitter messages and exchanges in online social networks take place. Not only has the number of communications increased, but each of these transactions also leaves a digital trace that can be recorded to reconstruct our human activity at high frequency. What matters is not only the amount and variety of the recorded data: its high-frequency character and comprehensive nature have allowed researchers, companies and agencies to investigate individual and group dynamics at an unprecedented level of detail and to apply the results to client modeling, organizational analysis and epidemic spreading [1].

However, for technical or privacy reasons, only the existence of those exchanges is known, not their content. Thus we can quantify the intensity and frequency of an interaction but not its type. For decades, social science has measured relationships between individuals in the currency of tie strength, introduced by Granovetter [2]. Weak ties (loose acquaintances) help to disseminate ideas and innovations between different groups and to find a job or new information, while strong ties (family, trusted friends) hold organizations and social groups together and can affect emotional health. Despite its success in explaining these phenomena, the tie strength of human relationships is vaguely defined in most large-scale empirical social work. Specifically, relationships are generally quantified by the intensity or duration of communication, although both are known to have significant drawbacks as predictors of tie strength [3,4]. Multiplexity, rhythm and depth of communication seem to be better predictors of tie strength than intensity [4]. Incorporating those metrics into the data mining of online communication might improve the definition of relationships between individuals and, in turn, transform our understanding of individual dynamics and their impact on our lives, organizations and society [5]. The challenge is to unveil social relationships in social media rather than mere interactions between individuals, which in general over-represent the real structure of a social group [6] (see figure). This is of paramount importance for understanding the propagation of ideas, opinions, commercial messages, etc. in social networks, since most links declared in social networks might be meaningless from a relationship point of view.

Figure. Undressing the social network: considering all e-mail interactions in an academic social network (left) yields a highly dense and connected social network, while keeping only strong interactions (based on each individual's relative frequency of communication) renders the social group sparser and disconnected.
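The filtering described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names (`strong_ties`, `min_share`), the 20% threshold and the toy message counts are all hypothetical, and the criterion used for the figure itself may differ in its details.

```python
from collections import defaultdict

def strong_ties(message_counts, min_share=0.2):
    """Keep directed ties that carry at least `min_share` of the sender's messages.

    `message_counts` maps (sender, receiver) to a number of messages.
    The threshold is an illustrative assumption, not a value from the post.
    """
    # Total number of messages sent by each individual
    totals = defaultdict(int)
    for (sender, _receiver), count in message_counts.items():
        totals[sender] += count

    # A tie is "strong" relative to the sender's own activity, not in absolute terms
    return {
        (sender, receiver)
        for (sender, receiver), count in message_counts.items()
        if totals[sender] > 0 and count / totals[sender] >= min_share
    }

# Hypothetical toy data: who e-mailed whom, and how often
counts = {
    ("ana", "ben"): 40,
    ("ana", "carl"): 2,
    ("ana", "dana"): 3,
    ("ben", "ana"): 10,
    ("ben", "carl"): 10,
    ("carl", "ana"): 1,
}

all_ties = set(counts)
kept = strong_ties(counts, min_share=0.2)
print(len(all_ties), "interactions,", len(kept), "of them strong")
```

Because the threshold is relative to each sender, a person who writes a single e-mail keeps that tie, while a very active person loses most occasional contacts. That is the sense in which the strong-tie network is sparser than the network of all interactions.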

References

  1. D. Lazer et al. Computational Social Science, Science 323, 721 (2009)
  2. M. S. Granovetter, The Strength of Weak Ties, The American Journal of Sociology 78(6), 1360 (1973)
  3. P. V. Marsden and K. E. Campbell, Measuring Tie Strength, Social Forces 63(2), 482 (1984)
  4. E. Gilbert and K. Karahalios, Predicting Tie Strength with Social Media, presented at CHI 2009
  5. C. T. Butts, Revisiting the Foundations of Network Analysis, Science 325, 414 (2009)
  6. B. A. Huberman, D. M. Romero, and F. Wu, Social networks that matter, First Monday 14(1) (2009).

Note: This article appears in the catalog of the exhibition “Culturas del Cambio: Átomos Sociales y Vidas Electrónicas” at the Center Arts Santa Mónica. Thanks to Josep Perelló for his kind invitation to contribute.

Spanish science does not need scissors

2009 October 7
by admin

The crisis and its effect on the 2010 budget have put to the test the government's commitment to changing the productive model and increasing spending on R&D&I. In particular, cuts hit the spending of the Ministry of Science and Innovation (by up to 17%), chapter 7 (grants to researchers), down 17%, and the budgets of some OPIs (public research organizations) that depend on the Ministry, down 15% on average. The situation is such that even the government's own ministers believe that, if it continues, it is worrying to say the least.

But science does not need these budget scissors. The main reason is that science in this country, and R&D&I in general, is a fabric that is not yet mature. We have been growing at a good pace in recent years and we are not doing badly, but investment and planning in science must be long-term, especially when it comes to hiring people. Those of us who have been in this for some time know that the money available is not always the same, and we can tighten our belts as many of us have done before. But, as Margarita Salas says, those who will suffer most from this budget cut are young people. In short, the future of R&D&I: the people who have to arrive, surpass us and do much better than we did. If it was already hard to convince anyone to do science in this country, how are we going to do it now, when they read in the newspapers that there is no money?

Finally, if the Government and the political parties agreed on a Plan E worth billions of euros, which has filled our landscape with pretty signs and so improved the roundabouts of this country, how is it possible that no agreement can be reached to create a Plan I that guarantees growth in R&D&I until we reach European levels?

This is my contribution to the initiative “La Ciencia Española NO Necesita Tijeras” (“Spanish Science Does NOT Need Scissors”), promoted by La Aldea Irreductible.

The speed and reach of forwarded emails, rumors, and hoaxes in electronic social networks

2009 August 4

We have just published an experimental and theoretical work on the speed of information diffusion in social networks in Physical Review Letters. Specifically, we studied the impact of the heterogeneity of human activity on the propagation of emails, rumors, hoaxes, etc. By tracking email marketing campaigns run by IBM Corporation in 11 European countries, we were able to compare their viral propagation with our theory (see the campaign details below).

The results are very simple. Let me give you an example: the typical time between two emails sent by the same person is around 1 day. Traditional models of information diffusion would then yield an infection time scale of about 1 day. However, some email computer viruses spread widely in a matter of hours (sometimes minutes), while some viral propagation (for example the Veuve-Clicquot hoax) lasts for years. How can that occur? The reason is that traditional models are not correct, because they neglect the large heterogeneity in the frequency of human activity: the average time between emails (1 day) does not actually represent the collectivity. In fact, most of us respond very quickly to emails, but some take a long time to do so. This fact (discovered previously by others) has a profound consequence for the way information spreads:

  1. When information spreads “successfully”, in the sense that it propagates and reaches most of the collectivity (i.e. it surpasses the tipping point), its propagation speed is determined by the people with the highest activity.
  2. However, when information reaches just a small fraction of the population (below the tipping point), its propagation is controlled by those who take a long time to respond or forward, and the spreading is very slow.
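To see why an average of 1 day can misrepresent the collectivity, here is a minimal sketch. It assumes, purely for illustration, that response times follow a log-normal distribution with a mean of 1 day; the distribution, its parameters (`MU`, `SIGMA`) and the sample size are assumptions of this example and are not taken from the campaign data.

```python
import random
from statistics import mean, median

rng = random.Random(0)  # fixed seed so the sketch is reproducible

SIGMA = 2.0            # assumed spread of the logarithm of the response time
MU = -SIGMA ** 2 / 2   # chosen so that the theoretical mean is exactly 1 day

# Hypothetical response times, in days, for 100,000 forwarded messages
times = [rng.lognormvariate(MU, SIGMA) for _ in range(100_000)]

avg = mean(times)
med = median(times)
within_hour = sum(t <= 1 / 24 for t in times) / len(times)
over_week = sum(t >= 7 for t in times) / len(times)

print(f"mean response time:      {avg:.2f} days")
print(f"median response time:    {med * 24:.1f} hours")
print(f"answered within an hour: {within_hour:.1%}")
print(f"took more than a week:   {over_week:.1%}")
```

Under these assumptions the median is exp(-2) days, roughly 3 hours, even though the mean is 1 day: a sizeable group answers within the hour while a small minority takes more than a week. Which of the two groups sets the pace depends on whether the message is above or below the tipping point, as described in the list above.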

This phenomenon, as explained in our paper, has consequences for viral marketing, the diffusion of fads and hoaxes, and opinion dynamics, because the speed at which their messages propagate depends strongly on the sizes of the sub-communities of very active and not-so-active people. For example, in our campaigns (which were below the tipping point yet successful from a viral marketing perspective), endogenous propagation of the commercial message lasted for months, while the average time between getting the message and forwarding it was only 1 day. We also found that messages do not “go viral”: they are viral because of the diffusion mechanism they use, but their spreading success largely depends on the propensity of the social network and its heterogeneous behavior.

Finally, our work has consequences for the way we model and understand human dynamics, since it shows that there is no such thing as a typical time scale in human dynamics. This is in sharp contrast with epidemic models, information diffusion models, etc., in which the heterogeneity of human activity is usually neglected in favor of a more homogeneous picture.

About the empirical data:
The viral marketing campaigns were conducted by IBM using the typical “refer-a-friend” mechanism, which led to the endogenous diffusion of information. The campaigns’ offerings were promoted on the IBM homepage, where initial participants heard about them. Their primary marketing objective was to generate subscriptions to the company’s online newsletter. Subscriptions were entered through a form located on the campaign's main web page (the registration page). In addition, a viral propagation mechanism, accessible through a button on the registration page, was available to foster the propagation of the message. The button caption encouraged visitors to recommend the page to friends and colleagues by offering, as an additional incentive to forward the page, tickets for a prize draw to win a laptop computer. More technical details about the campaigns can be found in Appendix D of the arXiv version of our paper.

Press coverage:

  • ‘Infectious’ people spread memes across the web, New Scientist (12/08/09)
  • Email hoaxes are like viruses, The Inquirer (10/08/09)
  • The flow of viral video, ABC News (8/08/09)
  • New model for social marketing campaigns details why some information ‘goes viral’, PhysOrg (6/08/09)
  • Los perezosos frenan los rumores en Internet, ABC.es (14/8/09)
  • Party people spread viral internet memes, ComputerWeekly (14/8/09)
  • Desvelan las claves de la difusión de la información en las redes sociales, PlataformaSINC.es (7/9/09)
  • Nuevas claves para la difusión de información en las redes sociales, Noticias Madri+d (7/9/09)

    Market impact and trading profile of large trading orders in stock markets

    2009 August 4

    Esteban Moro, Javier Vicente, Luis G. Moyano, Austin Gerig, J. Doyne Farmer, Gabriella Vaglica, Fabrizio Lillo and Rosario N. Mantegna
    Submitted to PRE (2009) [pdf]

    Impact of Human Activity Patterns on the Dynamics of Information Diffusion

    2009 August 4

    J. L. Iribarren and E. Moro
    Physical Review Letters 103, 038702 (2009) [pdf]

    Giving a talk

    2009 July 24
    by admin

    Giving a good talk is not an easy task, but with time and practice you learn how to communicate (hopefully I’ve learned too!). There are a number of places on the web with advice on giving a good talk, but I like Paul N. Edwards’s short manual on how to give an academic talk. My experience as an audience member at many talks tells me that the most important things are (quoting Paul’s manual):

    1. Presentations are not journal articles. Think of a talk as a series of 5-minute presentations (one per transparency) held together by a common thread.
    2. Each transparency is an idea unit, and the title of the transparency must summarize the idea.
    3. Move, don’t stand still.
    4. Make eye contact, especially in the introduction, at the key point of your talk and at the end.
    5. Focus on the main points; skip technical details unless you are asked for them.
    6. Do not put many graphs on one transparency. A graph is a lot of cryptic information for the audience and you must explain it fully, so more than one graph per transparency is too much.
    7. Do not write on the transparency what you are going to say. Transparencies are not to be read aloud, but to complement your speech.
    8. Plan for disaster: have your presentation in different formats and on a USB thumb drive and a CD-ROM, just in case.

    Ph.D. offer… interested?

    2009 June 5
    by admin

    Our research group is looking for Ph.D. candidates. Here is the announcement:

    We offer contracts to work towards a Ph.D. within the project MOSAICO (Modelling, Analysis and Simulations of Complex Systems). Candidates must have a degree in physics, mathematics or a related discipline, with outstanding marks. Information on the research lines is available from http://www.gisc.es and the work will be carried out at Universidad Complutense or Universidad Carlos III de Madrid. Work will begin on October 1st, 2009.
    Interested candidates must send a CV that explicitly states their marks to contratos.mosaico.2009@gmail.com before July 15, 2009.

    The probability of going through a bad patch

    2009 April 14

    We’ve all heard it: people who invest in the stock market or gamble in lotteries, casinos, etc. often say “I’m going through a bad patch” (or a bad spell). That is, they have been losing money for a while, but hey, better times are ahead and there’s no reason to quit. Are they sure? Are better times ahead? How close is “ahead” to today? Let’s work through a specific example to see how far away “ahead” is. Suppose we play a fair game: we toss a coin, and with probability 1/2 we win $1 (heads) and with probability 1/2 we lose $1 (tails). We play the game n times and compute our capital C(n) up to time n. If our initial capital is zero, we expect our capital to fluctuate around zero as the coin-tossing game goes on. Sometimes we will be in the “winning area”, where our capital is positive, C(n) > 0. However, we can also be in the “losing area”, where our capital is negative, C(n) < 0. If we are going through a bad patch (being in the losing area), we expect that by waiting long enough we will recover and come back to the winning area.

    But this is incorrect. Let me show you why, using some mathematics. Suppose that x_i is the gain (+$1) or loss (-$1) in toss i of the coin. Since our coin is fair, x_i is a random number which takes the value +1 or -1 with equal probability (1/2). Thus the capital up to time n is the sum of those random numbers:

    \displaystyle{C(n) = \sum_{i=1}^n x_i}

    C(n) is then the sum of n independent, identically distributed random numbers. In other contexts, C(n) is also known as a random walk. We can apply the law of large numbers and the central limit theorem to learn something about C(n). For example, the expected value of C(n) is

    E[C(n)] = 0

    as expected, since it is a fair game. Thus we have equal probability of winning or losing at time n. However, C(n) fluctuates wildly around zero, and in fact

    Var[C(n)] = n

    Thus our capital at time n lies mostly within an interval of width of order \sqrt{n} around zero, as shown in the next graph.
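A quick numerical check of E[C(n)] = 0 and Var[C(n)] = n. This is a minimal sketch: the number of games, the values of n and the helper name `final_capital` are arbitrary choices made for this illustration.

```python
import random
from statistics import mean, pvariance

def final_capital(n, rng):
    """Capital C(n) after n tosses of a fair coin (+1 for heads, -1 for tails)."""
    # Each of the n random bits is one toss: 1 = heads, 0 = tails
    heads = bin(rng.getrandbits(n)).count("1")
    return heads - (n - heads)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
games = 5000            # number of independent games per value of n

results = {}
for n in (100, 400, 1600):
    capitals = [final_capital(n, rng) for _ in range(games)]
    results[n] = (mean(capitals), pvariance(capitals))
    avg, var = results[n]
    print(f"n={n:5d}  mean={avg:+.2f}  variance/n={var / n:.3f}")
```

The sample mean should stay close to zero while the ratio variance/n stays close to 1, so the typical size of the fluctuations grows like \sqrt{n}.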

    The graph shows 4 realizations of the game (colors); the lines mark the \sqrt{n} region in which our capital is mostly expected to lie. As we can see in the “red” game, we start by losing money, but after a while we recover and go back to the “winning area”. Now the question is: what is the probability that we are in the winning area? Specifically, what is the probability P(\alpha) that we are in the winning area (C(n) > 0) for a fraction \alpha of the total n turns? The naive reasoning in the introduction suggests that, since C(n) fluctuates around zero, the probability will be peaked at 1/2: half of the time we will be in a bad patch and half of the time we will be going through a good spell. Thus, if we are going through a bad patch, we only have to wait to get back in the black. However, this is not true. The probability density P(\alpha) can be worked out (although not trivially) to get

    \displaystyle{P(\alpha) = \frac{1}{\pi\sqrt{\alpha(1-\alpha)}}}

    in the limit n\to \infty, which is known as the arc-sine law (since the cumulative distribution of P(\alpha) is an arc-sine function, (2/\pi)\arcsin\sqrt{\alpha}). As the plot shows, the probability density is peaked at 1 and 0 (actually, it diverges there!). Thus, in most realizations of the game we spend most of the time either in the winning area or in the losing area. This means that our naive reasoning above does not work: if you expect to recover from a bad patch, your chances are very small. This is apparent if we look at the colored figure above: the orange and black trajectories do not change from one area to the other and, apart from the initial steps of the game, they remain in the winning or losing area for the whole game shown. The explanation for this behavior is that the first return time of C(n) to zero is often large. Actually, its expected value is infinite, which means that once you get into the positive or negative area you (mostly) remain there.
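The arc-sine law can be checked with a small simulation. This is a sketch under stated assumptions: the number of games, the length n and the windows used to define "about half the time" and "almost always on one side" are arbitrary choices for this illustration, and turns with C(n) = 0 are counted as not winning, which only matters for finite n.

```python
import math
import random

def fraction_winning(n, rng):
    """Fraction of the n turns of one game in which the capital is positive."""
    capital = 0
    winning = 0
    for _ in range(n):
        capital += 1 if rng.random() < 0.5 else -1
        if capital > 0:
            winning += 1
    return winning / n

def arcsine_cdf(alpha):
    """Cumulative distribution of the arc-sine law on [0, 1]."""
    return 2 / math.pi * math.asin(math.sqrt(alpha))

rng = random.Random(2)  # fixed seed so the sketch is reproducible
games, n = 2000, 1000
alphas = [fraction_winning(n, rng) for _ in range(games)]

# Games spent about half the time winning vs. almost always on one side
middle = sum(0.45 <= a <= 0.55 for a in alphas) / games
extreme = sum(a <= 0.1 or a >= 0.9 for a in alphas) / games

middle_theory = arcsine_cdf(0.55) - arcsine_cdf(0.45)
extreme_theory = arcsine_cdf(0.1) + (1 - arcsine_cdf(0.9))

print(f"winning 45%-55% of the time: simulated {middle:.3f}, arc-sine law {middle_theory:.3f}")
print(f"winning <10% or >90%:        simulated {extreme:.3f}, arc-sine law {extreme_theory:.3f}")
```

Both windows have the same total width (0.1 and 0.1 + 0.1 would be a fairer comparison only if the density were flat), yet the games that stay almost entirely on one side are several times more frequent than the "balanced" ones, in line with the divergence of P(\alpha) at 0 and 1.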

    Note, however, that there is no contradiction between this result and the fact that E[C(n)] = 0: P(\alpha) is symmetric around \alpha = 1/2, so if we play the game a large number of times we have, on average, the same chances of winning and losing. This does not hold for an individual game, in which we will mostly stay in a bad or a good patch.

    What is the moral? Simple: if you get into a bad patch, leave the game, because the chances of recovering from it are small.

    Being part of “Best of 2008”

    2009 March 30
    by admin

    I got wonderful news today. Our paper “Specialization and herding behavior of trading firms in a financial market” (pdf) has been selected by the Editorial Board of New Journal of Physics as part of the journal’s Best of 2008. According to their site, “Best of 2008” is a compilation of articles selected by the Editorial Board and staff team on the basis of criteria including referee endorsements, readership and citation levels and simple broad appeal. All articles are permanently free to read.

    Thanks to NJP for this boost, and congratulations to their editorial board and staff on their work. I hope we make it into the 2009 selection too.

    The use of statistics

    2009 February 24
    by admin

    Mark Twain (1924) probably had politicians in mind when he reiterated Disraeli’s famous remark (“There are three kinds of lies: lies, damned lies and statistics”). Scientists, we hope, would never use data in such a selective manner to suit their own ends. But, alas, the analysis of data is often the source of some exasperation even in an academic context. On hearing comments like ‘the result of this experiment was inconclusive, so we had to use statistics’, we are frequently left wondering as to what strange tricks have been played on the data.

    D. S. Sivia, in Data Analysis: A Bayesian Tutorial
