Research A – Autumn Semester 2013/14


Recent news


Basic information


Deadlines and Meetings

Deadlines
All deadlines are at 15:00, CET (Amsterdam time).
Lectures
Progress Meetings

Topics

Please also consider the topics proposed on http://www.ru.nl/is/ml/education/student_projects/.
Topics marked with a red frame are already taken.

Title: Fencing with security: using iBeacons for things other than advertising
Chosen by: s4346998, s4354559
Supervisor: Roland van Rijswijk-Deij <rijswijk@cs.ru.nl>
Description: When they announced iOS 7, Apple rather quietly introduced an interesting new technology they call "iBeacon". The idea behind this technology is to provide location information in areas where, e.g., GPS reception is problematic (read: indoors). The technology relies on Bluetooth Low Energy devices called "beacons" that can be placed in an area and are automatically detected and read by the CoreLocation service on an iOS device. Use cases that pop up on the web show how this technology could be used for things like advertising, or to provide rich content at sites like museums. The goal of this research project is to investigate how this new technology can be leveraged in security applications. In order to do this, students are encouraged to:
  • Research the state-of-the-art on how location can be used as a security factor
  • Investigate the iBeacon technology
  • See if iBeacon technology can be used for any of the applications you found while researching the state-of-the-art of location-based security
  • Come up with new applications for location-based security using iBeacon (hint: iBeacons seem to be rather more intelligent than a simple GPS signal; can you leverage this property to construct new security applications?)
  • Build a prototype of a new application. If hardware is available by the time your project is underway, a limited number of iBeacon-compatible devices may be acquired for you to use; a simulation is also acceptable.
Alternative research directions could be to investigate the privacy aspects of this technology.
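As a starting point, the sketch below shows the published iBeacon payload layout (a 16-byte proximity UUID, major/minor identifiers, and a calibrated TX power at 1 m) together with the common log-distance path-loss estimate. The parsing helper and the propagation exponent n are illustrative assumptions; a real app would use CoreLocation rather than raw packets.

    import struct
    import uuid

    def parse_ibeacon(payload: bytes):
        """Parse the 21-byte iBeacon payload: UUID, major, minor, TX power."""
        raw_uuid, major, minor, tx_power = struct.unpack(">16sHHb", payload)
        return uuid.UUID(bytes=raw_uuid), major, minor, tx_power

    def estimate_distance(rssi: int, tx_power: int, n: float = 2.0) -> float:
        """Log-distance path-loss model; n ~ 2 in free space (an assumption)."""
        return 10 ** ((tx_power - rssi) / (10 * n))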
Literature:
Type of work: 40% researching the state-of-the-art, iBeacon technology and Bluetooth technology; 30% experiments; 30% documenting results.

Title: "Balance" in Real-Time Strategy (RTS) games: a causal perspective.
Chosen by: s4024443, s4068459
Supervisor: Tom Claassen <tomc@cs.ru.nl>
Description: In games like Starcraft II and League of Legends players battle online against each other as one of multiple factions/heroes. One of the most hotly debated topics is that of balance: did a player win because of skill or because his race/hero is too strong ("overpowered")? Perhaps surprisingly, this type of question belongs to the field of causal discovery.
Despite the fact that nearly every player on the ladder has an outspoken opinion on the subject, 'imbalance' is surprisingly hard to define and to establish objectively from (even) a huge amount of game results, not least because the respective match-making systems always try to find opponents such that both teams have roughly an equal chance of winning.
In this project we look at how a large-scale ("Big Data") Bayesian ranking system works and how player skill can be modeled. We try to find an objective measure for balance and to determine when/how it can be inferred from matches. We implement and test our methods against results from (simulated) ladder seasons.
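To make the modeling idea concrete, here is a minimal Elo-style skill update in Python. Production ladder systems such as TrueSkill track full posterior distributions, but the core mechanism sketched here (adjust skill by the prediction error) is the same; the K-factor is an arbitrary choice.

    def expected_win(r_a, r_b):
        """Probability that player A beats player B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(r_a, r_b, a_won, k=32):
        """Shift both ratings by the prediction error, scaled by K."""
        delta = k * ((1.0 if a_won else 0.0) - expected_win(r_a, r_b))
        return r_a + delta, r_b - delta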
Literature:
Type of work: Depends on focus, but will always include a fair amount of theory, experiments, and programming.

Title: Predictive modeling of airfare data.
Supervisor: Syed Saiden Abbas <saiden.abbas@cs.ru.nl>
Description: The price of airline tickets fluctuates over time, and the price can change by anywhere from a few dollars to a few hundred dollars. The fluctuations are due to a number of factors, e.g., the number of seats sold, the time left until departure, the timing of the flight, etc. Data mining and machine learning techniques can be used to model this fluctuating behavior of ticket prices based on historic price data. The purpose of this project is to develop a predictive tool using the R programming language, which can be used by consumers in decision making while purchasing a ticket. The tool would be able to predict whether to buy a ticket now or to wait a few days.
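Although the project itself targets R, the hypothetical Python sketch below illustrates the intended decision rule: fit a simple model of price against days-to-departure and compare today's quote with the predicted price path. All names and the one-week horizon are placeholders.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def buy_or_wait(history_days, history_prices, today_days, today_price):
        """history_days: days-to-departure of past quotes, prices in same order."""
        X = np.array(history_days).reshape(-1, 1)
        model = LinearRegression().fit(X, history_prices)
        predicted_next_week = model.predict([[today_days - 7]])[0]
        return "buy" if today_price <= predicted_next_week else "wait"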
Literature:
  • T. Wohlfarth, S. Clemencon, F. Roueff, X. Casellato: A Data-Mining Approach to Travel Price Forecasting. Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on, vol. 1, pp. 84-89, 18-21 Dec. 2011
  • William Groves and Maria Gini: A regression model for predicting optimal purchase timing for airline tickets. October 18, 2011
  • Oren Etzioni, Rattapoom Tuchinda, Craig A. Knoblock, Alexander Yates: To buy or not to buy: mining airfare data to minimize ticket purchase price. KDD '03, 2003.

Title: Locales vs. encapsulates: A comparison of Isabelle/HOL and ACL2
Chosen by: s4070968, s4156889
Supervisor: Julien Schmaltz <Julien.Schmaltz@ou.nl>
Description: Isabelle/HOL and ACL2 are two popular theorem proving systems. They are both used for the verification of software and hardware designs, as well as for formalizations of mathematics. These two systems provide a way to define "constrained functions", that is, functions which are defined by their properties rather than by an explicit definition. Isabelle/HOL uses "locales" and ACL2 uses "encapsulates". The question you will answer is: how do these two systems compare? What are their common aspects? What are their differences? In which cases is one more appropriate than the other? Etc.
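As a taste of the idea in a third system, the hypothetical Lean sketch below introduces a function purely by a stated property, roughly what a locale or an encapsulate provides: anything proved from the interface alone holds for every function satisfying it.

    -- Hypothetical sketch: a "constrained function" known only through its
    -- property, much as an Isabelle locale or an ACL2 encapsulate would
    -- introduce a function by its axioms.
    structure ConstrainedFn where
      f    : Nat → Nat
      mono : ∀ m n, m ≤ n → f m ≤ f n

    -- Anything proved from the interface holds for every instance.
    theorem step_le (c : ConstrainedFn) (n : Nat) : c.f n ≤ c.f (n + 1) :=
      c.mono n (n + 1) (Nat.le_succ n)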
Literature: TBD
Type of work: In our research on formalizing communication networks and separation kernels, we use both ACL2 and Isabelle/HOL. Your task will be, using examples, to illustrate the differences and commonalities in the support provided for abstract functions. You will start by formalizing some parts of our formal theory of communication networks in Isabelle/HOL using locales. You will then compare this with our own formalisation done in ACL2 using encapsulates.

Title: Data Mining in Formal Mathematics.
Chosen by: s4290755, s0814938
Supervisor: Josef Urban <josef.urban@gmail.com> and Herman Geuvers <herman@cs.ru.nl>
Description: Some large mathematical theories are today fully understandable to computers. An example is the formal proof of the Kepler Conjecture (the Flyspeck project) by Thomas Hales and his team, written in the HOL Light proof assistant. Flyspeck consists of about 20 thousand named theorems and 100 million unnamed lemmas (atomic proof steps). Just writing out all the lemmas takes about 100GB. The lemmas form an inference graph, which is directed and acyclic. Some of the lemmas are used very often, and some of the lemmas are proved over and over. It would be good if we could detect the most useful and most interesting lemmas automatically. This would help the mathematicians who develop Flyspeck in organizing such large libraries of computer mathematics. A better-structured library would also be more useful when proving new theorems fully automatically over such a library, without human assistance. Such automated proving provides a rigorous metric for evaluating the usefulness of the selected lemmas as a whole.
There are several methods for "lemma-mining", summarized in the paper http://mws.cs.ru.nl/~urban/lm/lmhol.pdf. These methods are mostly still too slow to process the whole Flyspeck corpus in reasonable time. So far only PageRank has reasonable speed, but it needs to be combined with additional clustering ideas, as proposed for example in this paper: http://cseweb.ucsd.edu/~kspham/pub/sigir2008.pdf. Some of the more interesting methods have not been tried even on much smaller corpora containing only about one million lemmas.
The goal of the project would be to develop good and fast lemma-selecting methods for big corpora such as Flyspeck. The three initial approaches that can be explored are (i) combining fast distance-based clustering with PageRank, (ii) smarter (possibly imperfect, but fast) updates of the lemma-quality data structures after the selection of the best lemma, and (iii) applying the lemma-quality ideas implemented in the AGInTRater system (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.6840&rep=rep1&type=pdf) to smaller corpora consisting of 1-10M lemmas. The students are welcome to come up with their own research ideas and to evaluate them.
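Since PageRank is the one method noted above as fast enough, here is a minimal power-iteration sketch over the lemma inference graph. The `uses` representation and the handling of dangling lemmas are simplifying assumptions.

    def pagerank(uses, iters=30, d=0.85):
        """uses: dict mapping each lemma to the list of lemmas its proof uses."""
        nodes = list(uses)
        rank = {u: 1.0 / len(nodes) for u in nodes}
        for _ in range(iters):
            new = {u: (1 - d) / len(nodes) for u in nodes}
            for u in nodes:
                deps = uses[u]
                if not deps:
                    continue              # dangling mass ignored in this sketch
                share = d * rank[u] / len(deps)
                for v in deps:
                    new[v] += share       # credit flows to the lemmas used
            rank = new
        return rank

    toy = {"thm": ["lem1", "lem2"], "lem1": ["lem2"], "lem2": []}
    pr = pagerank(toy)
    print(max(pr, key=pr.get))            # lem2: the most reused lemma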
Literature: See links in the description text.

Title: Trusted interfaces for secure devices.
Chosen by: s4064453, s4058038
Supervisor: Wouter Lueks <lueks@cs.ru.nl> and Roland van Rijswijk <rijswijk@cs.ru.nl>
Description: Over the last decade there has been a trend to better protect online transactions of all types. To use your online bank account you typically use a secure token, your bank card, secure SMS services or a combination thereof to authenticate yourself and any transactions you make. In these scenarios it is important to see precisely which transactions you authenticate. Preferably, it should be difficult for an attacker to change this information. In the Digital Security group we have been working on the IRMA card. This smart card implements attribute-based credentials and allows you to reveal certain information about yourself in a privacy-friendly way. The IRMA card, like any other smart card, doesn't have a trusted interface. This means that you have to trust the reader of your card not to read too much information. We think this can be improved by adding a trusted interface. There are many different approaches to adding trusted interfaces, ranging from hardware-only to software-only. Banks typically use dedicated hardware that only works with their cards, but more general dedicated hardware interfaces also exist. Some mobile platforms are equipped with a trusted execution environment. This environment protects the software running the interface from interference and unauthorized access [1]. Finally, pure software solutions are possible as well.

In this project we would like you to do two things:
  1. Make a survey of the different approaches to trusted interfaces.
  2. Develop a prototype that demonstrates one of these approaches.
We wish to emphasize that the survey is an essential part of this project.
Literature:
  • van Rijswijk-Deij, R., and Poll, E.: Using trusted execution environments in two-factor authentication: comparing approaches. In Open Identity Summit 2013 (OID2013) (September 2013), K. Banz, Ed. To appear. http://www.cs.ru.nl/~rijswijk/pub/TrEE-oids13.pdf

Title: Privacy-friendly solutions for data aggregation and filtering in SmartGrids.
Supervisor: Bárbara Vieira <b.vieira@cs.ru.nl>
Description: C-DAX is an ongoing European project (http://cdax.eu) which proposes a cyber-secure data and control cloud for future power distribution networks, based on an information-centric networking (ICN) solution as an overlay of IP. The C-DAX solution will provide a distributed data cloud tailored to the specific needs of SmartGrids. In particular, it is intended to efficiently support the massive integration of renewables and to cope with a heterogeneous set of co-existing SmartGrid applications, running on devices and communicating over networks with widely varying capabilities when it comes to communication and computation speeds. Precursors to the C-DAX solution are overlay networking solutions developed at Bell Labs. In a nutshell, the C-DAX platform consists of two major components: the C-DAX middleware, which provides publisher-subscriber interfaces to clients hosting SmartGrid applications, and the C-DAX cloud, which consists of logically interconnected C-DAX nodes responsible for the resolution and delivery of messages exchanged between publishers and subscribers in a resilient, self-configurable, and scalable manner. The main idea of C-DAX is that, instead of applying host-centric and point-to-point communication, it supports group communication that is data-centric (i.e., its concepts are developed around the data being communicated) and topic-based (as the routing of data is based on topic identifiers).

C-DAX addresses several use cases, and in some of them each node (in the C-DAX cloud) can be abstracted as a message broker that receives (encrypted) topic data from the publishers (aggregator, DNO, prosumers, etc.) and forwards that (encrypted) data to the intended subscribers. Several tasks can be carried out at each node to enhance information awareness in the platform, namely reducing the number of topics, unnecessary network traffic, etc. Ideally, primitives such as data aggregation and filtering should be part of the functionality of the C-DAX cloud. However, especially in the retail market, customers' privacy is an issue; therefore all topic data has to be encrypted (and the C-DAX cloud just treats it as a black box: decryption and subsequent re-encryption are not possible). Performing computations on encrypted data has been a research challenge for several years. Solutions based on homomorphic encryption, searchable encryption and functional encryption are proposed in the literature. The goal of this research project is to investigate existing solutions to perform computations (namely aggregation and filtering) on encrypted data and how they can be adapted to the specific context of C-DAX, where a massive generation of data is expected from different measuring devices.
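To illustrate what privacy-friendly aggregation can look like, the sketch below implements a toy Paillier-style additively homomorphic scheme: a broker multiplies ciphertexts, which adds the underlying readings, without ever decrypting. The tiny primes are for illustration only; a real deployment needs a vetted library and proper key sizes.

    import math
    import random

    p, q = 293, 433                       # tiny demo primes (insecure!)
    n, n2 = p * q, (p * q) ** 2
    g = n + 1                             # standard choice g = n + 1
    lam = math.lcm(p - 1, q - 1)          # requires Python 3.9+

    def L(x):
        return (x - 1) // n

    mu = pow(L(pow(g, lam, n2)), -1, n)   # inverse of L(g^lam mod n^2) mod n

    def encrypt(m):
        r = random.randrange(1, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        return (L(pow(c, lam, n2)) * mu) % n

    # The broker multiplies ciphertexts; the plaintexts add up underneath,
    # so individual readings are never exposed.
    readings = [17, 5, 42]
    aggregate = 1
    for m in readings:
        aggregate = (aggregate * encrypt(m)) % n2
    assert decrypt(aggregate) == sum(readings)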
Literature:
  • Bengt Ahlgren, Christian Dannewitz, Claudio Imbrenda, Dirk Kutscher, and Börje Ohlman: A survey of information-centric networking. Communications Magazine, 50(7):26-36, 2012.
  • Dan Boneh, Giovanni Di Crescenzo, Rafail Ostrovsky, and Giuseppe Persiano: Public key encryption with keyword search. In Christian Cachin and Jan Camenisch, editors, EUROCRYPT, volume 3027 of Lecture Notes in Computer Science, pages 506-522. Springer, 2004.
  • Dan Boneh, Amit Sahai, and Brent Waters: Functional encryption: definitions and challenges. In Proceedings of the 8th conference on Theory of cryptography, TCC'11, pages 253-273, Berlin, Heidelberg, 2011. Springer-Verlag.
  • Y.J. Kim, J. Lee, G. Atkinson, H. Kim, and M. Thottan: SeDAX: A scalable, resilient, and secure platform for smart grid communications. Selected Areas in Communications, IEEE Journal on, 30(6):1119-1136, 2012.
  • R. Rivest, L. Adleman, and M. Dertouzos: On data banks and privacy homomorphisms. Pages 169-177. Academic Press, 1978.

Title: Predicting the effects of gene knockouts from observational data.
Chosen by: s4241746, s4244931
Supervisor: Joris Mooij <j.mooij@cs.ru.nl>
Description: Genes play a vital role in living organisms. In this project, the student will analyze gene expression data measured in yeast cells. The task is to predict which genes will have a strong effect on which other genes in knockout experiments. This will be done with a combination of state-of-the-art causal discovery methods. This project fits well into the "Big Data" theme: even though the dataset itself is not very big (64 samples and about 5000 genes), the causal discovery methods need to iterate through (subsets of) pairs, triples and even higher tuples of genes. Doing this naïvely is not tractable, and smart optimization and/or parallelization tricks will be needed. If successful, this work may very well lead to a publication.
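The core primitive these constraint-based methods iterate over is a conditional-independence test. Below is a minimal sketch, assuming Gaussian data, of such a test via partial correlation and a Fisher z-transform; the significance level is an arbitrary choice.

    import numpy as np
    from scipy import stats

    def ci_test(data, x, y, z, alpha=0.01):
        """True iff column x is independent of column y given columns z."""
        n_samples = data.shape[0]
        corr = np.corrcoef(data[:, [x, y] + list(z)], rowvar=False)
        prec = np.linalg.pinv(corr)                         # precision matrix
        r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
        r = np.clip(r, -0.9999, 0.9999)
        fisher_z = 0.5 * np.log((1 + r) / (1 - r))
        stat = np.sqrt(n_samples - len(z) - 3) * abs(fisher_z)
        return stat < stats.norm.ppf(1 - alpha / 2)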
Literature:
  • Tom Claassen and Tom Heskes: A logical characterization of constraint-based causal discovery. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 135-144, 2011.
  • Doris Entner, Patrik O. Hoyer, and Peter Spirtes: Data-driven covariate selection for nonparametric estimation of causal effects. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2013), volume 31 of Journal of Machine Learning Research Workshop and Conference Proceedings, 2013.
Type of work: The work to be performed is a combination of theory (30%), programming (40%) and experiments (30%). The student first familiarizes him-/herself with the current state-of-the-art causal discovery methods [1, 2] for this type of data, implements a suitable combination of these two methods in a suitable language, runs it on the biological data and analyzes the results.

Title: Can Google Trends predict outbreaks of influenza and trading behavior in financial markets (or something else)?
Chosen by: s4241754, s4191552
Supervisor: Tom Heskes <t.heskes@science.ru.nl>
Description: Google Trends reveals how often a particular search term is entered over time and across various regions of the world. In a seminal Nature paper [1], the authors showed that outbreaks of influenza were correlated with the numbers of Google searches for terms related to flu prevention and cure. Another, very recent paper claims that a trading strategy based on the search volume of "debt" would have increased the value of a financial portfolio by more than 300% [2]. Both studies indicate that there are correlations between search volume and (social) behavior, but establish this "after the fact" and hence may be susceptible to overfitting (if you search for correlations long enough, you will always find some that are "significant"). One research topic could be to check whether the suggested correlations have real predictive power, e.g., by checking whether they also apply to novel data published after the authors' study appeared. Another challenge could be to come up with another cool example of the predictive power of Google Trends.
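As a hypothetical starting point for the out-of-sample check, the sketch below implements one variant of the moving-average trading rule of [2]; `trends` and `returns` are weekly series you would fetch yourself, and the window length is arbitrary.

    import numpy as np

    def strategy_returns(trends, returns, window=3):
        """Go long when search volume dips below its trailing mean, else short."""
        out = []
        for t in range(window, len(returns)):
            signal = 1 if trends[t] < np.mean(trends[t - window:t]) else -1
            out.append(signal * returns[t])
        return np.array(out)

    # Evaluate on weeks published *after* the original study to test whether
    # the reported effect survives out of sample.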
Literature:
  • Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant: Detecting influenza epidemics using search engine query data. Nature, 457:1012-1014, 2009.
  • Tobias Preis, Helen Susannah Moat, and H. Eugene Stanley: Quantifying trading behavior in financial markets using Google Trends. Scientific Reports, 3:1684, 2013.

Title: Blazingly fast Random Forests
Chosen by: s4031253, s3030547
Supervisor: Tom Heskes <t.heskes@science.ru.nl>
Description: Random Forest (RF) is one of the most popular machine learning algorithms these days [1,2]. RF is a so-called ensemble method, which aggregates the predictions of randomized decision trees to obtain a predictive model that is both robust and accurate. RF has a number of attractive properties, making it easy to use across many different domains. This usability and adaptability have made RF the benchmark-of-choice for Kaggle competitions, both world-wide and locally, by students participating in the Machine Learning in Practice course...
There are various implementations of RF in Weka, R, and Python. The company wise.io now claims to have a version which is orders of magnitude faster than the existing ones, because of [a] "data set design and special implementation strategies that no one else had (or has yet) figured out" [3]. They appear to make quite some profit by selling the software to companies.
Can you take up the challenge and figure out how to speed up RF? Since they make their software (but not code) available, you may also be able to reverse engineer some of their tricks.
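Any speed-up claim needs a baseline; the sketch below times scikit-learn's implementation on synthetic data (the dataset sizes are arbitrary choices), against which a faster RF could be measured.

    import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
    for n_jobs in (1, -1):                # single core vs all cores
        clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs,
                                     random_state=0)
        t0 = time.time()
        clf.fit(X, y)
        print(f"n_jobs={n_jobs}: {time.time() - t0:.2f}s")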
Literature:

Title: Side channel analysis in the frequency domain.
Chosen by: s4335880, s3052869
Supervisor: Baris Ege <B.Ege@cs.ru.nl>
Description: Side channel analysis exploits unintended information leakage of devices (timing, power consumption, etc.) and is applicable to almost all platforms used in daily life. Most side channel attack papers in the literature apply their analysis in the time domain; however, leakages are known to be present in the frequency domain as well. Quite a few papers exploring the topic are available online, and it is a promising area of research attracting increasing attention from researchers all over the world.
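A minimal sketch of the preprocessing step involved, assuming you already have a matrix of acquired power traces: move each trace to the frequency domain and hand the magnitude spectra to a standard correlation attack.

    import numpy as np

    def to_frequency_domain(traces):
        """traces: (n_traces, n_samples) array of acquired power traces."""
        return np.abs(np.fft.rfft(traces, axis=1))  # per-trace magnitude spectrum

    # A correlation attack (CPA) then correlates each spectrum bin with a
    # key-dependent power model, exactly as it would with time samples.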
Literature: TBD
Type of work: You are expected first to do a literature search on the subject. You should then implement and demonstrate the attack either in mathematical software (Sage, Matlab, and the like) or in a programming language of your choosing.

Title: Power measurement acquisition from an FPGA board.
Supervisor: Baris Ege <B.Ege@cs.ru.nl>
Description: Side channel analysis exploits unintended information leakage of devices (timing, power consumption, etc.) and is applicable to almost all platforms used in daily life. One way of doing research in side channel analysis is by using dedicated hardware designed for research. SASEBO is one such board; it employs two FPGAs, one for communicating with the host computer, and the other to run the design under evaluation. The evaluation board present in the lab (a SASEBO prototype) is well documented but may require knowledge of basic physics and FPGA design.
Literature: TBD
Type of work: This topic is suitable for a student who is looking for a more hands-on (practical) project. The aim of the project is first to establish communication between the two FPGAs present on the evaluation board. Then, the students are expected to collect power measurements from an FPGA running AES.

Title: Tor vs. the NSA
Chosen by: s4062833, s3052273
Supervisor: Anna Krasnova <anna@mechanical-mind.org>
Description: With recent events proving the existence of global surveillance performed by the NSA, the burning question is: how well can existing anonymization tools actually protect privacy against such a powerful attacker? This project considers one of the most established tools for providing communication anonymity: Tor. Tor appeared in 2004 and has since been heavily studied, attacked and developed. Although not initially designed as an anti-censorship tool, Tor turned out to be often used for this purpose. This motivated many changes to Tor aimed at protecting users against powerful censors, like China. This experience of an arms race with censoring governments has shown that a powerful censor can go far beyond what was expected to be achievable. The purpose of this project is to analyse how well Tor can protect against a powerful attacker like the NSA, whose goal is to spy on individuals.
Literature:

Title: Causal discovery for ADHD data set
Supervisor: Elena Sokolova <e.sokolova@cs.ru.nl>
Description: Attention deficit-hyperactivity disorder (ADHD) is a frequent and highly heritable disorder affecting both children and adults. In order to better understand the correlation between personal characteristics (age, gender, IQ, etc.), brain functioning and disease, the global ADHD-200 competition was established. This competition collected a large data set describing 285 patients and 491 healthy controls. The results of the competition, applying standard classification techniques, are described in [1]. We propose an alternative method for analyzing the data, involving causal modeling. The Bayesian Constraint-based Causal Discovery (BCCD) algorithm is able to find direct and indirect dependencies between variables, determine the strength of the dependencies, and provide a graphical visualization to interpret the results. The idea of the project is to understand which factors influence ADHD by learning a causal model of the data.
Literature:
  • Matthew R. G. Brown, Gagan S. Sidhu, Russell Greiner, Nasimeh Asgarian, Meysam Bastani, Peter H. Silverstone, Andrew J. Greenshaw, and Serdar M. Dursun (2012): ADHD-200 Global Competition: diagnosing ADHD using personal characteristic data can outperform resting state fMRI measurements.
  • Claassen T., Heskes T., 2012. A Bayesian Approach to Constraint Based Causal Inference. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence.

Title: Learning state machines from big data
Chosen by: s4048911, s4053060
Supervisor: Sicco Verwer <siccoverwer@gmail.com>
Description: A central challenge in detecting and analyzing malicious behavior on computer networks is how to make sense of the vast amounts of data generated in monitoring such systems. The LEMMA project will develop automated tools to tackle this challenge. In our research, we investigate methods for learning state machines from data streams, such as network traffic. In this project, you will help with our investigation by reviewing the literature on specialized data structures for data stream mining, implementing a prototype stream learning algorithm for state machines, and reporting on experiments performed with this prototype.
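As a flavour of the data structures involved, here is a minimal sketch: a frequency-annotated prefix tree updated one sequence at a time, as a stream learner would. The merging heuristics that turn such a tree into a compact state machine are the actual research topic; the example symbols are placeholders.

    from collections import defaultdict

    class PrefixTreeNode:
        def __init__(self):
            self.count = 0
            self.children = defaultdict(PrefixTreeNode)

    root = PrefixTreeNode()

    def observe(sequence):
        """Fold one streamed sequence (e.g. a network flow) into the tree."""
        node = root
        node.count += 1
        for symbol in sequence:
            node = node.children[symbol]
            node.count += 1

    for flow in [("syn", "ack", "data"), ("syn", "ack", "fin")]:
        observe(flow)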
Literature:
  • B. Balle, J. Castro and R. Gavaldà: Bootstrapping and Learning PDFA in Data Streams. JMLR Workshop and Conference Proceedings Volume 21: ICGI 2012, 21:34-48.
  • J. Schmidt and S. Kramer: Online Induction of Probabilistic Real Time Automata. In: Proceedings of the 2012 IEEE International Conference on Data Mining (ICDM 2012).

Title: Big data analysis of control models
Supervisor: Sicco Verwer <siccoverwer@gmail.com>
Description: In the ITALIA project, we investigate and build tools for learning state machine descriptions of communication protocols of real-world software systems. In a recent study, we applied such tools to embedded control software in Océ industrial printers. These models are learned actively, by providing the software with inputs, reading the resulting outputs, and formulating the observed behavior into a hypothesis model. In this way, we obtained a large state machine description of a part of the communication to and from the controller in an Océ printer, which takes place on a central bus. We know this model is incomplete, but we have been unable to prove its incompleteness using model-based testing. You will investigate the uses of Big Data analysis for proving this incompleteness. An industrial printer continuously logs enormous amounts of data, including all traffic taking place on the central communication bus. This data contains statistical information that cannot be obtained by actively providing inputs. For instance, it can indicate which states are used more frequently, and by focusing our active learning algorithms on these frequent parts of the model, we hope to find errors more quickly. From the previous study, there already is an implementation connecting the active learner to the control software, as well as a learned model. Large data logs will be provided by Océ. You will combine the two in order to find errors in the model and/or to direct the active learning software.
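One way to use the logs, sketched below under simplifying assumptions: replay logged bus messages through the learned model, counting state visits and flagging messages the model cannot explain as candidate evidence of incompleteness. The `transitions` map is a placeholder for the learned state machine.

    from collections import Counter

    def replay(transitions, log, start="q0"):
        """transitions: dict mapping (state, message) -> next state."""
        visits, unexplained = Counter(), []
        state = start
        for msg in log:
            if (state, msg) not in transitions:
                unexplained.append((state, msg))   # candidate model error
                state = start                      # resynchronise naively
                continue
            state = transitions[(state, msg)]
            visits[state] += 1
        return visits, unexplained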
Literature:
  • W. Smeenk, D.N. Jansen and F.W. Vaandrager. Applying Automata Learning to Embedded Control Software.
  • Wouter Smeenk: Applying Automata Learning to Complex Industrial Software. Master's Thesis, Radboud University Nijmegen, September 2012.

Title: More data and simple models.
Chosen by: s3016439, s4386035
Supervisor: Sicco Verwer <siccoverwer@gmail.com>
Description: Why does more data make simple models more powerful? Intuitively, more data should make larger models and complex algorithms increasingly powerful as these theoretically require more data to converge. However, an important paper reporting on experiments done by Google claims the opposite is true. Given the huge amount of data and processing power available at Google, it is very hard to repeat these experiments. We wonder whether the scientific literature contains any experiments that confirm or contradict this result. You will perform this study of the recent Big Data literature.
Literature: A Google query on (for instance) "big data simple models" gives many references.

Title: Sudoconnect
Chosen by: s0721964, s4250559
Supervisor: Hans Zantema <h.zantema@tue.nl>
Description: Recently, a new variant of the well-known Sudoku puzzle has been invented in which, apart from the usual conditions, it is also required that the solution has an underlying graph that is connected. Instances of this new kind of puzzle can be generated based on SAT solving. In the background, many questions arise, for instance about the possible structure of the underlying graph, not only for standard 9x9 sudokus, but also for variants and general Latin squares. Both logical reasoning and experimenting with implementations will help to get answers to these questions.
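For a flavour of the SAT-based generation, the sketch below builds part of the standard Sudoku encoding: variable v(r,c,d) is true iff cell (r,c) holds digit d. It assumes the python-sat package as tooling; the connectivity condition of the new variant would be added as further clauses on top.

    from itertools import combinations
    from pysat.solvers import Glucose3    # pip install python-sat

    def v(r, c, d):
        """DIMACS id for 'cell (r, c) holds digit d' (all 0-based)."""
        return 81 * r + 9 * c + d + 1

    solver = Glucose3()
    for r in range(9):
        for c in range(9):
            solver.add_clause([v(r, c, d) for d in range(9)])    # >= 1 digit
            for d1, d2 in combinations(range(9), 2):             # <= 1 digit
                solver.add_clause([-v(r, c, d1), -v(r, c, d2)])
    for d in range(9):
        for r in range(9):                                       # digit once/row
            for c1, c2 in combinations(range(9), 2):
                solver.add_clause([-v(r, c1, d), -v(r, c2, d)])
    # Column and 3x3-box constraints are analogous and omitted here.
    print(solver.solve())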
Literature: TBD
Type of work: Programming and experimenting with these programs will be a substantial part of the project.

Title: Social login 4.0 & Using the privacy-friendly IRMA technology online with OpenID Connect
Supervisor: Gergely Alpar <gergely@cs.ru.nl> and Roland van Rijswijk-Deij <rijswijk@cs.ru.nl>
Description: It is becoming more and more common to see web sites where you can log in using your social identity (e.g. your Facebook, Google or Twitter account). Most of these login scenarios are based on OAuth, OAuth2 and - in the near future - OpenID Connect. The problem with many of these logins is that relying parties (the site you log in to) often request a lot of personal data. From a privacy perspective this is undesirable. The IRMA project, on the other hand, is "privacy-by-design". We differentiate between identifying and non-identifying information about a user (attributes) and put the user at the centre of all interactions. No data is revealed without the user's consent, and the system is built to facilitate selective and minimal disclosure of personal information. The goal of this student assignment is to investigate how we can couple IRMA's privacy-friendly approach with OpenID Connect. Students are challenged to analyse how IRMA fits in the OpenID Connect architecture and to build a prototype that demonstrates the use of IRMA credentials in an OpenID Connect identity provider.
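Purely as a hypothetical illustration of the coupling, disclosed IRMA attributes could surface as claims in an OpenID Connect ID token. Every claim name below is invented for illustration and is not part of any existing IRMA or OIDC profile.

    # Hypothetical ID-token claims an IRMA-backed OIDC provider might issue:
    # only attributes the user consented to disclose appear as claims.
    id_token_claims = {
        "iss": "https://irma-oidc-provider.example",   # placeholder issuer
        "sub": "pairwise-pseudonym-123",               # non-identifying subject
        "aud": "relying-party-client-id",
        "over18": "yes",                               # disclosed IRMA attribute
        # attributes the user did not reveal are simply absent
    }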
Literature: TBD
Type of work: Knowledge of OAuth2 and federated identity management helps, as well as good programming skills. We have OAuth2 software available in several programming languages (e.g. PHP and Java) that can be used as a starting point.


Slides and Course Material