Tutorial – KDMiLe 2016


Apache Spark was born as a research project at UC Berkeley in 2009, was released as an open source project in 2010, and became a Top-Level Apache Project in 2014. Today, over 1000 volunteers contribute to its source code.

Apache Spark is a fast, general-purpose, scalable data processing system. It provides high-level APIs in Java, Scala, Python and R that facilitate both batch processing and interactive data analysis, executed in parallel and distributed across large clusters. It also supports a rich set of higher-level libraries, including Spark SQL for structured analytical queries, MLlib for machine learning, Spark Streaming for data-in-motion analyses, and GraphX for graph analytics.

Graph analytics is of major interest both for real-world problems across different industries and for academic research. It has been applied successfully in many cases to mine relationships among nodes.

In this tutorial, we provide an overview of the recently released Spark 2.0 and of graph analytics theory. We then focus on how to process and mine large graphs using the Spark GraphX library. We show how frequently used graph mining tasks, such as link prediction, community detection, and recommendation, can easily be implemented in GraphX and applied to large real-world graphs.


Ana Paula Appel – IBM Research Brazil
Renan Souza – IBM Research Brazil

Ana Paula is a Research Staff Member at the recently created IBM Research - Brazil, working in the Social Data Analytics Group. She joined IBM Brazil in February 2012. Ana Paula has a B.S. in Computer Science from the Federal University of Sao Carlos (UFSCar), and an M.Sc. and a Ph.D., also in Computer Science, from the State University of Sao Paulo (USP) under the guidance of Prof. Dr. Caetano Traina Jr. She completed a one-year internship at Carnegie Mellon University (CMU) under the supervision of Prof. Dr. Christos Faloutsos. She also held a postdoctoral position at the Federal University of Sao Carlos (UFSCar), working with Prof. Dr. Estevam Hruschka. Ana Paula's research interests are in graph mining and data mining, especially as applied to social data.

Renan Souza is a Research Software Engineer at IBM Research Brazil and a computer science PhD student at COPPE - Federal University of Rio de Janeiro (UFRJ). He holds a master's degree (2015) and a bachelor's degree (2013) in computer science from UFRJ. He spent an academic year at Missouri State University and completed a summer internship at Stanford University. His main interests include parallel and distributed data processing, high performance computing, and big data management and analytics.