DataDiversityConvergence 2016 Abstracts


Full Papers
Paper Nr: 1
Title:

Reducing Data Transfer in Parallel Processing of SQL Window Functions

Authors:

Fábio Coelho, José Pereira, Ricardo Vilaça and Rui Oliveira

Abstract: Window functions are a sub-class of analytical operators that allow data to be handled in a derived view of a given relation, with each tuple evaluated in the context of its neighboring tuples. We propose a technique that can be used in the parallel execution of this operator when data is naturally partitioned. The proposed method benefits cases where the required partitioning differs from the natural partitioning employed. Preliminary evaluation shows that we are able to limit data transfer among parallel workers to 14% of the transfer registered with a naive approach.
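
For readers unfamiliar with the operator, the following minimal Python sketch emulates a single window function, a running sum computed per partition. It only illustrates how each output row depends on its neighboring tuples; it is not the parallel technique proposed in the paper. The function name `running_sum` and the row layout are assumptions made for this example.

```python
from collections import defaultdict
from itertools import accumulate


def running_sum(rows, partition_key, order_key, value_key):
    """Emulate SUM(value) OVER (PARTITION BY p ORDER BY o
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) on a list of dicts."""
    # Group rows by the partitioning column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for key in sorted(partitions):
        # Order the rows inside each partition, then accumulate the values.
        ordered = sorted(partitions[key], key=lambda r: r[order_key])
        totals = accumulate(r[value_key] for r in ordered)
        for row, total in zip(ordered, totals):
            # Every input row is kept; the window value is added as a new column.
            result.append({**row, "running_sum": total})
    return result


rows = [
    {"dept": "a", "day": 2, "amt": 5},
    {"dept": "a", "day": 1, "amt": 3},
    {"dept": "b", "day": 1, "amt": 7},
]
for row in running_sum(rows, "dept", "day", "amt"):
    print(row)
```

In a parallel setting, all rows of one partition must be visible to the worker that evaluates it, which is why a mismatch between the required and the natural partitioning causes data transfer.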

Paper Nr: 2
Title:

Design of an RDMA Communication Middleware for Asynchronous Shuffling in Analytical Processing

Authors:

Rui C. Gonçalves, José Pereira and Ricardo Jimenez-Peris

Abstract: A key component in a distributed parallel analytical processing engine is shuffling, the distribution of data to multiple nodes such that the computation can be done in parallel. In this paper we describe the initial design of a communication middleware to support asynchronous shuffling of data among multiple processes in a distributed memory environment. The proposed middleware relies on RDMA (Remote Direct Memory Access) operations to transfer data, and provides basic operations to send and queue data on remote machines, and to retrieve this queued data. Preliminary results show that the RDMA-based middleware can provide a 75% reduction in communication costs, when compared with a traditional sockets implementation.
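
The send, queue and retrieve operations mentioned above can be pictured with a small in-process Python sketch. It uses plain in-memory queues in place of RDMA and involves no network at all, so it shows only the shape of a hash-based shuffle, not the middleware's design. The names `ShuffleQueues`, `target_node` and `shuffle` are hypothetical.

```python
import zlib
from collections import deque


class ShuffleQueues:
    """In-process stand-in for per-node receive queues (no RDMA, no network)."""

    def __init__(self, num_nodes):
        self.queues = [deque() for _ in range(num_nodes)]

    def send(self, node, item):
        # Queue the item on the destination node without waiting for a reader.
        self.queues[node].append(item)

    def receive(self, node):
        # Retrieve the oldest queued item, or None if nothing is pending.
        queue = self.queues[node]
        return queue.popleft() if queue else None


def target_node(key, num_nodes):
    # Deterministic hash so that equal keys always reach the same node.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def shuffle(records, queues):
    """Route each (key, value) record to the node that owns its key."""
    num_nodes = len(queues.queues)
    for key, value in records:
        queues.send(target_node(key, num_nodes), (key, value))


queues = ShuffleQueues(num_nodes=3)
shuffle([("a", 1), ("b", 2), ("a", 3)], queues)
print([len(q) for q in queues.queues])
```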

Paper Nr: 3
Title:

Design and Implementation of the CloudMdsQL Multistore System

Authors:

Boyan Kolev, Carlyna Bondiombouy, Oleksandra Levchenko, Patrick Valduriez, Ricardo Jimenez-Peris, Raquel Pau and José Pereira

Abstract: The blooming of different cloud data management infrastructures has made multistore systems a major topic in today’s cloud landscape. In this paper, we give an overview of the design of a Cloud Multidatastore Query Language (CloudMdsQL), and the implementation of its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational, NoSQL, HDFS) within a single query that can contain embedded invocations to each data store’s native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized.

Paper Nr: 4
Title:

KVFS: An HDFS Library over NoSQL Databases

Authors:

Emmanouil Pavlidakis, Stelios Mavridis, Giorgos Saloustros and Angelos Bilas

Abstract: Recently, NoSQL stores, such as HBase, have gained acceptance and popularity due to their ability to scale out and perform queries over large amounts of data. NoSQL stores typically arrange data in tables of (key,value) pairs and support a few simple operations: get, insert, delete, and scan. Despite its simplicity, this API has proven to be extremely powerful. Nowadays most data analytics frameworks utilize distributed file systems (DFS) for storing and accessing data. HDFS has emerged as the most popular choice due to its scalability. In this paper we explore how popular NoSQL stores, such as HBase, can provide an HDFS scale-out file system abstraction. We show how we can design an HDFS-compliant filesystem on top of a key-value store. We implement our design as a user-space library (KVFS) providing an HDFS filesystem over an HBase key-value store. KVFS is designed to run Hadoop-style analytics such as MapReduce, Hive, Pig and Mahout over NoSQL stores without the use of HDFS. We perform a preliminary evaluation of KVFS against a native HDFS setup using DFSIO with a varying number of threads. Our results show that the approach of providing a filesystem API over a key-value store is a promising direction: read and write throughput of KVFS and HDFS, for big and small datasets, is identical. Both HDFS and KVFS throughput is limited by the network for small datasets and by device I/O for bigger datasets.
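
To make the idea of a file abstraction over get, insert, delete and scan concrete, here is a minimal Python sketch that stores file contents as fixed-size blocks in an in-memory key-value store. The key layout (path, a NUL separator, a zero-padded block index), the tiny block size and all names are assumptions chosen for illustration; they are not taken from KVFS or HBase.

```python
class KVStore:
    """Toy in-memory store exposing only get, insert, delete and scan."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def scan(self, start, end):
        # Yield (key, value) pairs with start <= key < end, in key order.
        for key in sorted(self._data):
            if start <= key < end:
                yield key, self._data[key]


BLOCK_SIZE = 4  # Deliberately tiny so the example spans several blocks.


def block_key(path, index):
    # Zero-padding keeps lexicographic key order equal to block order.
    return f"{path}\x00{index:010d}"


def delete_file(store, path):
    # Collect the keys first, then delete, so the scan is not disturbed.
    keys = [key for key, _ in store.scan(f"{path}\x00", f"{path}\x01")]
    for key in keys:
        store.delete(key)


def write_file(store, path, data):
    delete_file(store, path)  # Drop stale blocks from an earlier, longer file.
    for offset in range(0, len(data), BLOCK_SIZE):
        store.insert(block_key(path, offset // BLOCK_SIZE),
                     data[offset:offset + BLOCK_SIZE])


def read_file(store, path):
    # One range scan returns every block of the file in order.
    return b"".join(value for _, value in
                    store.scan(f"{path}\x00", f"{path}\x01"))


store = KVStore()
write_file(store, "/logs/a.txt", b"hello world")
print(read_file(store, "/logs/a.txt"))
```

This sketch keeps no metadata, so an empty file is indistinguishable from a missing one; a real design would need separate entries for file attributes and directories.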

Paper Nr: 5
Title:

Towards Quantifiable Eventual Consistency

Authors:

Francisco Maia, Miguel Matos and Fábio Coelho

Abstract: In the pursuit of highly available systems, storage systems began offering eventually consistent data models. These models are suitable for a number of applications but not for all. In this paper we discuss a system that can offer an eventually consistent data model but can also, when needed, offer a strongly consistent one.

Paper Nr: 6
Title:

Towards Performance Prediction in Massive Scale Datastores

Authors:

Francisco Cruz, Fábio Coelho and Rui Oliveira

Abstract: Buffer caching mechanisms are paramount to improve the performance of today’s massive scale NoSQL databases. In this work, we show that in fact there is a direct and univocal relationship between the resource usage and the cache hit ratio in NoSQL databases. In addition, this relationship can be leveraged to build a mechanism that is able to estimate resource usage of the nodes composing the NoSQL cluster.

Paper Nr: 7
Title:

Data Collection Framework - A Flexible and Efficient Tool for Heterogeneous Data Acquisition

Authors:

Luigi Sgaglione, Gaetano Papale, Giovanni Mazzeo, Gianfranco Cerullo, Pasquale Starace and Ferdinando Campanile

Abstract: Collecting data for later analysis is an old concept that is receiving renewed interest today due to the emergence of new research trends such as Big Data. Furthermore, since a current market trend is to provide integrated solutions that serve multiple purposes (such as ISOC, SIEM, CEP, etc.), the data has become very heterogeneous. This paper presents a flexible and efficient solution for collecting heterogeneous data, describing the approach used to collect the data and the additional pre-processing features provided with it.

Paper Nr: 8
Title:

Direct Debit Frauds: A Novel Detection Approach

Authors:

Gaetano Papale, Luigi Sgaglione, Gianfranco Cerullo, Giovanni Mazzeo, Pasquale Starace and Ferdinando Campanile

Abstract: The Single Euro Payments Area (SEPA) is an initiative of the European banking industry that aims to make all electronic payments across the Euro area as easy as domestic payments currently are. One of the payment schemes defined by the SEPA mandate is the SEPA Direct Debit (SDD), which allows a creditor (biller) to collect funds directly from a debtor’s (payer’s) account. The use of this standard scheme facilitates access to new markets for enterprises and public administrations and allows for a substantial cost reduction. However, the other side of the coin is the security issues concerning this type of electronic payment. A study conducted by the Center of Economics and Business Research (CEBR) of Britain showed that Direct Debit frauds increased by 288% from 2006 to 2010. In this paper a comprehensive analysis of real SDD data provided by the EU FP7 LeanBigData project is performed. The results of this analysis lead to the definition of emerging attack patterns that can be executed against SDD and of the related effective detection criteria. The work aims to inspire the design of a security system that supports analysts in detecting Direct Debit frauds.

Paper Nr: 9
Title:

3D Visualization of Large Scale Data Centres

Authors:

Giannis Drossis, Chryssi Birliraki, Nikolaos Patsiouras, George Margetis and Constantine Stephanidis

Abstract: This paper reports on ongoing work regarding interactive 3D visualization of large scale data centres in the context of Big Data and data centre infrastructure management. The proposed approach renders a virtual area of real data centres, preserving the actual arrangement of their servers, and visualizes their current state while notifying users of potential server anomalies. The visualization includes several condition indicators, updated in real time, as well as a color-coding scheme for each server’s current condition on a scale from normal to critical. Furthermore, the system supports on-demand exploration of an individual server, providing detailed information about its condition for a specific timespan and combining historical analysis of previous values with the prediction of potential future states. Additionally, natural interaction through hand gestures is supported for 3D navigation and item selection, based on a computer-vision approach.

Paper Nr: 11
Title:

Big IoT and Social Networking Data for Smart Cities - Algorithmic Improvements on Big Data Analysis in the Context of RADICAL City Applications

Authors:

Evangelos Psomakelis, Fotis Aisopos, Antonios Litke, Konstantinos Tserpes, Magdalini Kardara and Pablo Martínez Campo

Abstract: In this paper we present a SOA (Service Oriented Architecture)-based platform, enabling the retrieval and analysis of big datasets stemming from social networking (SN) sites and Internet of Things (IoT) devices, collected by smart city applications and socially-aware data aggregation services. A large set of city applications in the areas of Participating Urbanism, Augmented Reality and Sound-Mapping is being deployed throughout participating cities, producing sets of millions of user-generated events and online SN reports that are fed into the RADICAL platform. Moreover, we study the application of data analytics such as sentiment analysis to the combined IoT and SN data saved into an SQL database, further investigating algorithmic improvements and configurations to minimize delays in dataset processing and results retrieval.

Paper Nr: 12
Title:

PaaS-CEP - A Query Language for Complex Event Processing and Databases

Authors:

Ricardo Jiménez-Peris, Valerio Vianello and Marta Patiño-Martinez

Abstract: Nowadays many applications must process events at a very high rate. These events are processed on the fly, without being stored. Complex Event Processing (CEP) technology is used to implement such applications. Some CEP systems, such as Apache Storm, one of the most popular, lack a query language and operators for programming queries as is done in traditional relational databases. This paper presents PaaS-CEP, a SQL-like language for programming CEP queries, together with its integration with data stores (database or key-value store). Our current implementation is built on top of Apache Storm; however, the CEP language can be used with any CEP system. The paper describes the architecture of PaaS-CEP, its query language and the algebraic operators. The paper also details the integration of the CEP system with traditional data stores, which allows the correlation of live streaming data with stored data.
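
Correlating live streaming data with stored data can be pictured as a lookup join between an event stream and a keyed store. The Python sketch below is a generic illustration under that assumption; it does not use PaaS-CEP syntax or the Apache Storm API, and the function `enrich` and the field names are hypothetical.

```python
def enrich(events, store, key_field):
    """Join each live event with the stored record that shares its key."""
    for event in events:
        stored = store.get(event[key_field])
        if stored is not None:
            # Inner-join semantics: events without a stored match are dropped.
            # On a field-name clash, the live event's value wins.
            yield {**stored, **event}


# A plain dict stands in for the database or key-value store.
customers = {"c1": {"customer": "c1", "tier": "gold"}}
stream = iter([
    {"customer": "c1", "amount": 40},
    {"customer": "c9", "amount": 15},
])
for joined in enrich(stream, customers, "customer"):
    print(joined)
```

Because `enrich` is a generator, events are processed one at a time as they arrive and are never buffered or stored.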