Visualizing Malicious Twitter Users

An exploration of the following / follower patterns of large networks (~10,000 edges) of malicious Twitter users.

Context: University of Texas at Dallas research project; advised by Dr. Alvaro Cardenas

Time: 6 weeks

Team: Individual

Role: Conceptual and visualization design, programming

Tools: R, Adobe Illustrator

Is there a way to identify spammers on Twitter using visual patterns/network structure?

My dataset of Twitter users had well over 10,000 edges, which made it challenging to discern larger patterns in the structure. To address this, I used a new(ish) visualization technique, hive plots, developed by Martin Krzywinski to visualize large networks more effectively.

This project was my first real taste of data visualization, graciously funded through a grant from the Committee on the Status of Women in Computing Research (CRA-W). I ended up presenting this research at the 2014 Grace Hopper Celebration of Women in Computing and BPViz’14: Broadening Participation in Visualization Workshop.

In this project, I had my first taste of:

  • Big Data visualization
  • R statistical computing language
  • Design and usability for security

For the entire report, read below.

Introduction

Spam-producing Twitter accounts continue to be a nuisance on the platform. These accounts send malicious links through Tweets and Direct Messages, follow and unfollow users en masse, and generally add clutter to the Twittersphere.

Malicious links can be used to compromise the computers of honest users, to steal their login information through phishing scams, or to defraud them through fake online stores. Malicious users also find ways to broadcast spam links despite Twitter's filters on malicious links and other countermeasures. Large numbers of fake accounts still exist because they remain easy to create in bulk.

Identifying spam accounts is vital to reducing clutter on Twitter and to easing security concerns on this and related social networks. Malicious users can be identified by monitoring indicators of abuse, such as the content of their posts or their social structure.

As data collection and analysis grow, presenting visualizations of large-scale datasets to security analysts is becoming increasingly important. In this work, we focus on the social network patterns that can be used to identify spammers by visualizing the social network structure of Twitter users.

Background and Related Work

The first step in removing malicious users from Twitter is developing a method for identifying them. Yang et al. proposed several indicators, based on unique characteristics of spam-producing users, for identifying suspicious accounts on Twitter [4].

In their work, Yang et al. visualized portions of their crawled Twitter network to show their dataset [4]. Their visualization analysis is preliminary, since visualization research was not the focus of their work, although they do provide a graph representation of the data (Figure 1).

This research used a new visualization approach to answer the following question: What social patterns can be revealed by visualizing the network structure of Twitter accounts?

Approach

A hive plot is a relatively new way of visualizing networks, developed by Martin Krzywinski [3]. Nodes are mapped to two or more axes and positioned radially based on certain properties (centrality, in-degree, etc.).
Edges are drawn between nodes as curved links.
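
As a rough illustration of that mapping, the Python sketch below (not the R code used in this project) assigns each node to one of three axes and places it along its axis by total degree. The axis rule (sources, sinks, everything else), the degree-based radius, and the function name `hive_layout` are assumptions chosen for the example, not the axis assignment used in our figures.

```python
import math
from collections import defaultdict

NUM_AXES = 3  # assumed: three axes spaced 120 degrees apart


def hive_layout(edges):
    """Return {node: (axis, radius, x, y)} for a list of directed (src, dst) edges.

    Assumed axis rule (illustrative only):
      axis 0 = nodes with no incoming edges (sources)
      axis 1 = nodes with no outgoing edges (sinks)
      axis 2 = all other nodes
    The radius is the node's total degree scaled into (0, 1].
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    nodes = set(in_deg) | set(out_deg)
    if not nodes:
        return {}
    max_deg = max(in_deg[n] + out_deg[n] for n in nodes)

    layout = {}
    for n in sorted(nodes):
        if in_deg[n] == 0:
            axis = 0
        elif out_deg[n] == 0:
            axis = 1
        else:
            axis = 2
        radius = (in_deg[n] + out_deg[n]) / max_deg
        angle = 2 * math.pi * axis / NUM_AXES
        layout[n] = (axis, radius, radius * math.cos(angle), radius * math.sin(angle))
    return layout
```

Because every position is computed from node properties alone, feeding the same edges in a different order yields the same layout.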

Hive plots were designed to address the limitations of traditional graph visualization approaches, which rely on local-optimization algorithms. With those approaches, the resulting layout depends on the initial conditions of the graph, so plotting isomorphisms of the same graph can produce different pictures.

The main advantage of hive plots is the ability to produce consistently reproducible graph plots, since the plots are defined by the metrics assigned to the axes. Graph isomorphisms are therefore mapped to the same hive plot, allowing for a more rigorous scientific approach to visualization and discovery in graphs.

In our work, hive plots are used to visualize the network structure of Twitter users. The end goal is to discover what characteristics differentiate malicious users from normal Twitter users.

In the context of visualizing Twitter users, an arc signifies a following/follower relationship between two users, or nodes. An example of a simple axis assignment is shown in Figure 2.

Two types of social connections were analyzed:
Followings – what types of accounts a particular user is following.
Followers – what types of accounts are following a particular user.

Plotting Challenges

Our goal is to explore multiple graph metrics and how they can be represented and visualized with hive plots. The main challenge in plotting Twitter data is scalability. Many tools have been built to draw and analyze hive plots, but few have been optimized to handle dense graphs with thousands of edges. By attempting to plot large hive plots, we explored how viable different tools are for handling large-scale data sets.

We predicted that, once plotted, malicious users would visually reflect some of the graph properties quantitatively derived by Yang et al. [4].

Approach 1: JHive

JHive is a robust tool for analyzing graph data. The graphical user interface eases the process of mining graph data and changing the mapping of nodes to different hive plot axes. However, JHive has a low tolerance for input files greater than about 1,000 connections, rendering it unsuitable for Big Data.

Approach 2: D3.js and JSON

D3.js, a popular JavaScript tool for visualization, has been used to generate hive plots. Seth Brown has even leveraged D3.js to visualize his own Twitter network [1]. However, rendering such large numbers of nodes and links in D3.js overloads the browser at a threshold of approximately 1,000 edges.

Approach 3: HiveR

HiveR is a package for the statistical computing language R, designed by Bryan Hanson to create hive plots [2]. HiveR begins to lose efficiency when processing data formatted in the DOT specification at approximately 100,000 edges. Other input types, such as data frames, are slightly more efficient. We looked at different ways to maximize the efficiency of producing graphs with over a million edges within R, which are described in the next section.

Plotting with HiveR

Leveraging the igraph package

The igraph package provided a more efficient way of processing data in R. Instead of importing DOT files directly with the HiveR package, we read our node and edge files into data frames in R for easier processing.

From the data frames, we converted the node and edge data into graph objects in R. This sped up processing considerably: going from a data frame to a hive plot object was much more efficient than going from a DOT file.

Sampling the Graph

Despite the reduction in conversion time from using the igraph package in R, plotting the hive plots themselves was still the bottleneck. We decided to sample as many nodes as possible, multiple times, to get a thorough sampling of the dataset.

We also combined the followings and followers data sets for both malicious and normal users, creating one large data frame for each. At first, we tried sampling edges. However, sampling edge data does not preserve the graph structure. Instead, we began sampling subsets of nodes. We set up a mechanism that sampled a subset of nodes and created a graph containing only the edges those nodes were involved in.

We started by sampling 10 nodes (about 1,000 edges) from the data set at large. Plotting the graph took only a few hours. We subsequently increased the number of nodes plotted in increments up to 100 nodes, at which point each plot took approximately 12 hours.
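
The node-based sampling described above can be sketched as follows. This is a minimal Python illustration rather than the R mechanism we actually used; the function name `sample_subgraph`, its parameters, and the `seed` argument are all hypothetical.

```python
import random


def sample_subgraph(edges, candidate_nodes, k, seed=None):
    """Sample k nodes, then keep every edge that touches a sampled node.

    edges           -- iterable of directed (src, dst) pairs
    candidate_nodes -- the nodes eligible for sampling
    k               -- how many nodes to draw (must not exceed the candidates)
    seed            -- optional seed so a sample can be reproduced
    """
    rng = random.Random(seed)
    # Sort first so the same seed gives the same sample regardless of set order.
    chosen = set(rng.sample(sorted(candidate_nodes), k))
    kept = [(src, dst) for src, dst in edges if src in chosen or dst in chosen]
    return chosen, kept
```

Sampling nodes and keeping all of their edges preserves each sampled user's full neighborhood, which sampling edges uniformly at random does not.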

Handling unknown users

In our data sets we encountered many users with ids that were not identified in the data sets. We added a level of identification for these unknown users, who may or may not be malicious users. In the final graphs, unidentified users appear as black nodes and their incoming and outgoing edges are also black.

In sampling the entire data set, we only sampled users from a list of users known to be either malicious or non-malicious. In the final graphs, only the followings and followers of identified users were taken into account. Unknown users are shown for completeness.
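
A minimal Python sketch of this labeling step is shown below; it is not the project's R code. Only the black color for unknown users comes from the description above. The colors for malicious and normal users, the rule of coloring a known-to-known edge by its source node, and the function names are assumptions made for the example.

```python
# Only "black" for unknown users is taken from the text; the other two
# colors are placeholders chosen for this sketch.
COLORS = {"malicious": "red", "normal": "blue", "unknown": "black"}


def label_user(user_id, malicious_ids, normal_ids):
    """Classify a user ID as 'malicious', 'normal', or 'unknown'."""
    if user_id in malicious_ids:
        return "malicious"
    if user_id in normal_ids:
        return "normal"
    return "unknown"


def edge_color(src, dst, malicious_ids, normal_ids):
    """Color an edge black if either endpoint is unknown.

    Otherwise the edge takes the color of its source node (an assumed rule).
    """
    src_label = label_user(src, malicious_ids, normal_ids)
    dst_label = label_user(dst, malicious_ids, normal_ids)
    if "unknown" in (src_label, dst_label):
        return COLORS["unknown"]
    return COLORS[src_label]
```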

Results

Initial Results

A subsection of the total set of followers and followings of both normal and malicious users was plotted (Figures 3, 4). Initial comparisons between the two graphs display a propensity for spam accounts to follow other spam accounts.

 

In these data sets, normal and malicious users are visualized separately. We combined normal and malicious users in the final plot to give a better snapshot of the network at large.

Final Results

We generated the final graphs using sampled networks of 50 and then 100 normal and malicious users (Figures 3, 4).

 

As the sample size grew, some nodes and edges began to block others. The sparser types of connections involving malicious users became obscured behind other connections, which was a significant problem when analyzing the results. In the graphs of 50 sampled nodes, we were able to reorder the edges to uncover the obscured links.

The visualizations reveal patterns in the underlying structure of the Twitter network, including the following:

  • Malicious users tend to be source nodes rather than sink nodes.
  • The bulk of normal users in the same network do not follow malicious users; those that do tend to have a low degree.
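
Observations like these can also be checked numerically. The Python sketch below is hypothetical and separate from the R code used in this project. It assumes an edge (a, b) means "a follows b", treats a node with no incoming edges as a source and a node with no outgoing edges as a sink, and tallies those roles for each user label.

```python
from collections import Counter


def node_roles(edges):
    """Map each node to 'source', 'sink', or 'intermediate'.

    Assumes a directed edge (a, b) means user a follows user b.
    """
    has_in, has_out = set(), set()
    for src, dst in edges:
        has_out.add(src)
        has_in.add(dst)

    roles = {}
    for node in has_in | has_out:
        if node not in has_in:
            roles[node] = "source"
        elif node not in has_out:
            roles[node] = "sink"
        else:
            roles[node] = "intermediate"
    return roles


def roles_by_label(edges, labels):
    """Count (label, role) pairs; nodes missing from labels count as 'unknown'."""
    counts = Counter()
    for node, role in node_roles(edges).items():
        counts[(labels.get(node, "unknown"), role)] += 1
    return counts
```

Comparing, for example, `counts[("malicious", "source")]` with `counts[("malicious", "sink")]` gives a quick numeric counterpart to the first pattern in the list.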

Conclusions

Despite scalability issues, we managed to plot large samples of the network. Hive plots provide a useful insight into network structures in a single glance. Improvements and further applications of the work could include:

  • Building an interactive tool for manipulating large hive plot visualizations.
  • Querying the Twitter API to plot users in real time and flag potentially malicious users as they appear.

In attempting to visualize very large social networks, we made progress toward the project's two overarching goals:

  • Adding a layer of meaningful visual network analysis that allows faster, repeatable human recognition of interesting network patterns before quantitative analysis.
  • Exploring and uncovering useful methods for displaying social network data for security purposes.

Acknowledgments

My research is supported by a Collaborative Research Experience for Undergraduates (CREU) for 2013-2014 sponsored by CRA-W and a UT Dallas Undergraduate Research Scholar Award. The project was supervised by Dr. Alvaro Cardenas. Other collaborators include Junia Valente, Sagar Davasam, and Mitsu Deshpande.

References

[1] S. Brown. Twitter hive plots, July 2012.
[2] B. A. Hanson. The HiveR package version 0.2-1. 2011.
[3] M. Krzywinski, I. Birol, S. J. Jones, and M. A. Marra. Hive plots - rational approach to visualizing networks. Briefings in Bioinformatics, 13(5):627-644, 2012.
[4] C. Yang, R. Harkreader, J. Zhang, S. Shin, and G. Gu. Analyzing spammers' social networks for fun and profit: a case study of cyber criminal ecosystem on Twitter. In Proceedings of the 21st International Conference on World Wide Web, pages 71-80. ACM, 2012.