Joe Mosby's blog

Notes on ACM survey of social network analysis techniques

Social Network Analysis: A Survey on Tools, Process, and Application, Singh et al., ACM Computing Surveys, April 2024

The ACM Computing Surveys provide a way to quickly get up to speed in a particular subject area. Social network analysis is not a new area of focus for me (I did a talk on it back in 2019), but I'm excited to dive back in after nearly five years out of practice. Truth be told, I was a little surprised to see this topic in an ACM Survey, as I had been operating under the assumption that much of the research had dried up after the major platforms started locking down their data. I was dead wrong.

SNA can analyze a variety of practical aspects of multidisciplinary applications, including finding subject matter experts in a particular area, finding patterns in clickstreams on the web, finding faculty at different universities that have emerging communities of interest, mapping communities of expertise in various fields, mapping interactions among blogs on different topics, mapping an executive’s network based on email flows, revealing how infections spread among patients and staff in the hospital, accelerating diffusion by identifying opinion leaders, and more...

A social network is represented by a graph G(V, E) where V denotes the set of nodes (our persons, countries, companies, words, or whatever else is the core entity that's interacting in the network) and E denotes the edges (connections) between nodes. As a result, traditional graph analytics techniques apply neatly to the problem of social network analysis. Graph processing is a nontrivial application: entire supercomputers have been built just to focus on it.
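
To make that concrete, here's a minimal sketch of a social network as G(V, E) using Python's networkx library (my choice, not something the survey prescribes), with a handful of made-up people as the nodes:

    import networkx as nx

    # V: the entities interacting in the network (people, in this toy example)
    # E: the connections between them
    G = nx.Graph()
    G.add_nodes_from(["alice", "bob", "charlie", "devon"])
    G.add_edges_from([("alice", "bob"), ("bob", "charlie"), ("alice", "charlie")])

    print(G.number_of_nodes(), G.number_of_edges())  # 4 3
    print(nx.degree_centrality(G))  # one of the traditional graph measures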

Social network graphs have three sets of properties that are of interest for research:

Our research verticals here are really exciting. We don't just study the graph for the sake of the graph itself; we want to see how these graphs influence behavior. Four key areas of research interest attract most of our attention.

Information dissemination is the way information moves throughout a network. When COVID-19 hit, this was a core area of interest for governments at all levels. Agencies wanted to get pandemic information out in the fastest possible way, and one of the lowest cost ways to do this was to pursue targeted influence campaigns to disseminate their information across social networks.

Information dissemination has a few different models used to measure information flow, but they all coalesce around the idea that every node has some probability of spreading a particular message. This starts to look a lot like infectious disease research: based on certain properties of the message/disease and the nature of the social network, the message/disease will spread.

Link prediction is the task of predicting a future link between nodes. As a general idea, if I have three friends, Alice, Bob, and Charlie, and Devon also shares those three friends, there may be a high likelihood that I'll become friends with Devon soon enough. Or not! Alice, Bob, and Charlie might be the types of people who never connect other friends, and there are ways to inspect that.

Influence maximization is the mechanism of identifying the most influential individuals within a network: those whose behaviors are most likely to change the behaviors of other nodes in the network. This is a hard computational problem. The survey shows several attempts to efficiently execute this sort of research, all of which moved towards optimization techniques as the network increased in size and complexity.

Community detection is one of the oldest areas of research, and the one that can rely the most on traditional graph analytic techniques, as it's primarily concerned with the structure of the network rather than how information flows within it. Graph partitioning, random walks, and clustering algorithms all get a play here.

The survey continues on to look at data collection techniques, research applications, and challenges in the general landscape, all of which are fairly apparent: there is still a lot of value in understanding and exploiting social networks, the big tech companies are aware of their network value, and they are not too excited about sharing their data. So let's pass over those parts and dig into the research verticals, because that's where the fun problems are.

Information dissemination

The probabilistic threshold model is one of the most-studied models of information dissemination. Its most well-known manifestation is the notion of early adopters versus late majority in innovation adoption.

Threshold models posit that our proclivity to adopt new innovations or accept new information is not an inherent characteristic of an individual, but rather one influenced by the people around us. Some individuals need only one person in their network to adopt a new technology before they're willing to try it themselves. Others need everyone in their network to adopt it before they'll do the same. This behavior manifests in activities well beyond technology adoption: family planning, recycling, and collective opinions all show similar dynamics.
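
Here's a rough sketch of that mechanic as a linear-threshold-style simulation. The thresholds, the propagation rule (fraction of adopting neighbors), and the use of networkx's karate club graph as a stand-in network are all my own illustrative choices:

    import random

    import networkx as nx

    def threshold_spread(G, seeds, thresholds):
        """Spread adoption until no node's threshold is newly crossed."""
        adopted = set(seeds)
        changed = True
        while changed:
            changed = False
            for node in G.nodes():
                if node in adopted:
                    continue
                neighbors = list(G.neighbors(node))
                if not neighbors:
                    continue
                adopting_fraction = sum(n in adopted for n in neighbors) / len(neighbors)
                if adopting_fraction >= thresholds[node]:  # enough of my network adopted
                    adopted.add(node)
                    changed = True
        return adopted

    G = nx.karate_club_graph()  # classic toy social network
    random.seed(0)
    thresholds = {n: random.uniform(0.1, 0.6) for n in G.nodes()}  # invented values
    print(len(threshold_spread(G, seeds=[0, 33], thresholds=thresholds)))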

The threshold model influenced the later cascade model. With the cascade model, we attempt to find a Minimum-Sized Influential Node Set: the smallest set of nodes that will proceed to have an influence on every other node in the network. The mechanics are much the same as the threshold model: it's simply an extension of this idea to focus on a specific influential set.
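
The cascade mechanic looks something like the sketch below: each newly activated node gets one chance to activate each of its inactive neighbors with some probability p. Finding a Minimum-Sized Influential Node Set then amounts to searching for the smallest seed set whose cascade reaches everyone. The probability p and the toy graph are assumptions on my part:

    import random

    import networkx as nx

    def independent_cascade(G, seeds, p=0.1, rng=None):
        """One simulated cascade: new adopters get a single shot at each neighbor."""
        rng = rng or random.Random(42)
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            next_frontier = []
            for node in frontier:
                for neighbor in G.neighbors(node):
                    if neighbor not in active and rng.random() < p:
                        active.add(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return active

    G = nx.karate_club_graph()
    print(len(independent_cascade(G, seeds=[0])))  # how far one seed's cascade reached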

The epidemic model borrows techniques from the study of disease transmission to describe information transmission. In this framing, every node is susceptible to influence each time it encounters new information. The model places nodes into one of three states (a rough simulation sketch follows the list):

  1. Healthy nodes have not confronted the information, though they can become vulnerable if they become connected to at least one infected node.
  2. Vulnerable nodes are in an at-risk, intermediate state between healthy and infected. They may have been "infected" by the information, or not; it isn't yet clear whether it has taken hold. They may revert back to a healthy state, or they may become fully infected.
  3. Infected nodes have become infused with the information. They can no longer revert to a healthy state.
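
A loose simulation of those three states might look like the sketch below. The transition probabilities are invented for illustration; the survey doesn't prescribe them:

    import random

    import networkx as nx

    HEALTHY, VULNERABLE, INFECTED = "healthy", "vulnerable", "infected"

    def epidemic_step(G, state, p_infect=0.2, p_revert=0.3, rng=random):
        new_state = dict(state)
        for node in G.nodes():
            if state[node] == HEALTHY:
                # contact with at least one infected node makes a healthy node vulnerable
                if any(state[n] == INFECTED for n in G.neighbors(node)):
                    new_state[node] = VULNERABLE
            elif state[node] == VULNERABLE:
                if rng.random() < p_infect:
                    new_state[node] = INFECTED   # the information takes hold, permanently
                elif rng.random() < p_revert:
                    new_state[node] = HEALTHY    # shrugs it off, back to healthy
            # INFECTED nodes never revert
        return new_state

    G = nx.karate_club_graph()
    state = {n: HEALTHY for n in G.nodes()}
    state[0] = INFECTED  # patient zero for the message
    for _ in range(10):
        state = epidemic_step(G, state)
    print(sum(s == INFECTED for s in state.values()))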

Finally, the triggering model is a more generalized version of the threshold and cascade models. It treats those models as optimization problems, and seeks to find a way to maximize influence throughout the network.

Link prediction

Similarity scoring is the simplest kind of link prediction approach. Different approaches take into account node information from nearby nodes (local similarity-based) versus the entire network (global similarity-based), and there is ongoing research into the right balance between the two.
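
As a sketch, a local similarity score like the Jaccard coefficient (one of several that networkx ships; my pick here is illustrative) simply ranks unconnected pairs by how much their neighborhoods overlap:

    import networkx as nx

    G = nx.karate_club_graph()
    candidate_pairs = list(nx.non_edges(G))  # pairs that aren't linked yet

    # Jaccard: shared neighbors divided by total distinct neighbors
    scores = nx.jaccard_coefficient(G, candidate_pairs)
    top = sorted(scores, key=lambda triple: triple[2], reverse=True)[:5]
    for u, v, score in top:
        print(f"predicted link ({u}, {v}) with score {score:.2f}")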

Probabilistic modeling estimates the likelihood of a link given different parameters of the graph and seeks to optimize this function. It takes the idea that social networks seek to enhance their overall connectivity, so links that improve this connectivity are preferred.

Dimensionality reduction applies standard matrix decomposition techniques, leaning heavily on the representation of a graph as a matrix. It looks for small subsets of the original graph whose vector representations are very similar to each other, hunting for patterns that would map these neighborhoods onto near-identical subgraphs.
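
One hedged sketch of that idea: factor the adjacency matrix into a low-rank approximation and treat large reconstructed values at empty positions as likely future links. The rank k is an arbitrary choice here:

    import networkx as nx
    import numpy as np

    G = nx.karate_club_graph()
    A = nx.to_numpy_array(G)

    k = 8  # illustrative embedding dimension
    U, S, Vt = np.linalg.svd(A)
    A_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # low-rank reconstruction

    # score only pairs that aren't already linked
    scores = [(i, j, A_hat[i, j])
              for i in range(len(A)) for j in range(i + 1, len(A))
              if A[i, j] == 0]
    print("most likely new link:", max(scores, key=lambda t: t[2]))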

Influence maximization

As with information dissemination and link prediction, research into influence maximization has coalesced around a few different approaches to finding a workable solution. The area first emerged from attempts to mine the influence value of particular customers.

Approximation-based models pursue a greedy solution for identifying influential users for viral marketing. While these algorithms produce extremely high-quality solutions, their complexity does not allow them to scale well. Beyond a certain size of network, these algorithms become computationally impractical.
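
The greedy mechanic looks roughly like this sketch: repeatedly add whichever node gives the biggest estimated jump in spread, where spread is estimated with Monte Carlo cascade runs. The cascade probability, the number of runs, and the seed-set size are all illustrative choices, and the nested loops make the scaling problem obvious:

    import random

    import networkx as nx

    def estimate_spread(G, seeds, p=0.1, runs=100, rng=random.Random(1)):
        """Average cascade size over a number of Monte Carlo simulations."""
        total = 0
        for _ in range(runs):
            active, frontier = set(seeds), list(seeds)
            while frontier:
                nxt = []
                for u in frontier:
                    for v in G.neighbors(u):
                        if v not in active and rng.random() < p:
                            active.add(v)
                            nxt.append(v)
                frontier = nxt
            total += len(active)
        return total / runs

    def greedy_seed_set(G, k):
        seeds = []
        for _ in range(k):  # add the node with the largest marginal gain, k times
            best = max((n for n in G.nodes() if n not in seeds),
                       key=lambda n: estimate_spread(G, seeds + [n]))
            seeds.append(best)
        return seeds

    G = nx.karate_club_graph()
    print(greedy_seed_set(G, k=3))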

Heuristics improve upon the approximation-based models' scalability, though they offer no guarantee on the quality of the solution. In exchange, the techniques are practical in far more use cases.
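
A degree-based heuristic is about the simplest example: just take the k highest-degree nodes as seeds. It's dramatically cheaper than the greedy search above, with no promise about how good the resulting seed set is:

    import networkx as nx

    G = nx.karate_club_graph()
    k = 3
    seeds = sorted(G.nodes(), key=lambda n: G.degree(n), reverse=True)[:k]
    print(seeds)  # high-degree nodes as a cheap guess at influential users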

Community-based methods also improve upon approximation-based models by reducing the search space for seed nodes. Rather than attempt to traverse the entire graph, community algorithms focus on high-degree nodes and the communities around them to increase the likelihood of identifying good seed nodes.
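
A hedged sketch of that search-space reduction: detect communities first (the modularity-based routine below is just one convenient choice), then consider only the highest-degree node inside each community as a candidate seed:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()
    communities = greedy_modularity_communities(G)

    # one candidate seed per community instead of a search over the whole graph
    seeds = [max(c, key=lambda n: G.degree(n)) for c in communities]
    print(seeds)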

Meta-heuristic methods attempt to improve approximation-based models by identifying correct tradeoffs between influence spread and efficiency. Research in this area has focused on avoiding locally optimal solutions that can arise with heuristic approaches.

Community detection

This class of algorithms seeks to break nodes into subsets, or communities, that achieve a given objective.

Graph partitioning methods divide a network into K different communities, and provide a mechanism for optimizing these subgraphs based on given parameters.
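
Kernighan-Lin bisection is a classic example of the partitioning approach: split the graph into two balanced halves while minimizing the number of edges cut, applying it recursively for K greater than 2. Here's the K = 2 case on a toy graph, as a sketch:

    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    G = nx.karate_club_graph()
    half_a, half_b = kernighan_lin_bisection(G, seed=0)  # two balanced communities
    print(sorted(half_a))
    print(sorted(half_b))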

Spectral clustering methods are similar to the dimensionality reduction methods we find for link prediction.
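
The simplest spectral sketch I know: take the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the graph Laplacian) and cut the graph on its sign. Full spectral clustering generalizes this to k communities:

    import networkx as nx
    import numpy as np

    G = nx.karate_club_graph()
    L = nx.laplacian_matrix(G).toarray().astype(float)

    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigh: the Laplacian is symmetric
    fiedler = eigenvectors[:, 1]                   # second-smallest eigenvalue's vector

    community_a = [n for n, value in zip(G.nodes(), fiedler) if value >= 0]
    community_b = [n for n, value in zip(G.nodes(), fiedler) if value < 0]
    print(len(community_a), len(community_b))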

Dynamical methods execute a random walk through the graph, computing node distances and community membership probabilities. These methods then use information dissemination mechanisms to identify specific communities.
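
A rough sketch of the random-walk intuition: take t steps of a walk from each node and compare where the walks tend to land; nodes whose landing distributions look alike probably share a community. The walk length t is an arbitrary choice here:

    import networkx as nx
    import numpy as np

    G = nx.karate_club_graph()
    A = nx.to_numpy_array(G)
    P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    P_t = np.linalg.matrix_power(P, 4)    # landing distribution after t = 4 steps

    # Euclidean distance between landing distributions as a crude node distance
    print(np.linalg.norm(P_t[0] - P_t[1]))   # nodes in the same community: closer
    print(np.linalg.norm(P_t[0] - P_t[33]))  # nodes in different communities: farther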

Hierarchical clustering methods either iteratively remove edges or iteratively merge nodes to reveal the overall community structure, guided by the structural similarity of nodes within the graph.
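
Girvan-Newman is the canonical edge-removal version: keep deleting the highest-betweenness edge and the graph peels apart into progressively more communities. networkx ships it directly:

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    G = nx.karate_club_graph()
    levels = girvan_newman(G)   # generator of successively finer splits
    first_split = next(levels)  # the first split: two communities
    print([sorted(c) for c in first_split])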

Density-based methods are the rarest in the research, looking at the density of different subgraphs to identify both communities and outliers.

Applications

The authors close their survey with a table of applications, which catalogs the data format, volume, source, network type, research verticals, temporal nature of the graph, and challenges addressed by each.