Projects

A short description of some of the projects I worked on. All of them were great fun ;)
(Click on the images to enlarge them.)

Automatization

Classification of TDR-signals

POSITION: IT student, University of Applied Science in Wuerzburg, Germany
KEYWORDS: C, Ethernet, Time Domain Reflectrometry, Neural Networks
SUMMARY: Design and implementation of a software system to detect and localize taps, bridges, short-cuts and breaks in Ethernet network cables utilizing Time Domain Reflectrometry (TDR). TDR measures impedance discontinuities in electrical and optical transmission media by observing the reflected waveforms of test impulses. Artificial neural networks were applied to recognize the typical signal signatures of the aforementioned cable conditions within the reflected wave.

Robot simulation

POSITION: Software developer, Research & Development, Siemens AG, Erlangen, Germany
KEYWORDS: C, Robot, Simulation, Modeling, Windows
SUMMARY: Mathematical modeling of the kinematics and dynamics of multi-axle robots. Design and implementation of a robot simulator and an interpreter for a robot control language. Integration of an adaptive torque controller model based on a neural network within the simulation of a Manutec robot to improve trajectorial accuracy at high velocities. Within this framework also a small C-library for numerical linear algebra was implemented.

Evaluation of training algorithms for neural networks

POSITION: Software developer, Research & Development, Siemens AG, Erlangen, Germany
KEYWORDS: C, Nonlinear Optimization, Neural Networks, Windows
SUMMARY: Implementation and evaluation of numerous nonlinear optimization algorithms as training algorithms for artificial neural networks with the purpose to model and improve the dynamic behavior of tooling machines. This work included the development of a novel training algorithm that combines the advantages of Hooke-Jeeves and RPROP.

Friction compensation

POSITION: Software developer, Research & Development, Siemens AG, Erlangen, Germany
KEYWORDS: C, Simulation, Optimization, CMAC, Tooling Machines
SUMMARY: Development and implementation of a simulator for a tooling machine with an adaptive controller (CMAC) to compensate for path inaccuracies occurring at axes reversal. Varying friction conditions, as well as slackness and torsional effects are compensated using a pre-control approach.

DETAILS: Flatbed tooling machines operate similar to a plotter with two perpendicular axes but with a tool, e.g. a cutter, instead of a pen. While highly accurate in general, path inaccuracies occur when the axes are reversed, caused by varying friction conditions, as well as slackness and torsional effects. In the case of several axes interpolating with one another, this is especially obvious in the case of circular contours (see picture), where one axis moves with maximum translational speed when going from one quadrant to another, while the other axis changes the sign of its velocity.

Such path inaccuracies can be compensated for with the help of frictional pre-control through correction of the rotation speed reference value. When setting up a tooling machine this can be achieved by measuring the actual trajectory of the tool, comparing it to the planned trajectory and manually tuning a correction value. The manual process is, however, time consuming and error prone.

Here we employed a very simple neural network (essentially a distributed look-up-table) named CMCA (Cerebellar Model Articulation Computer) to automatically learn the dependence of the correction amplitude on the acceleration signal. During the learning phase the machine tool performs multiple of passages through zero, e.g. in a circular path. The CMAC optimizes the correction signal using the differences between the actual rotation speed values and the reference values. Ten to fifteen iterations were found to be sufficient for this approximation process.

The resulting contour error at the quadrant transitions is up to ten times smaller with frictional pre-control than without correction pulses. Furthermore, deviations caused by slackness and torsion are compensated for. In addition, greater application flexibility is achieved with neural frictional pre-control, because the shape of the characteristic curve can be adapted to the actual conditions more quickly and accurately than a human being would be capable of doing.

Welding spot diagnosis

POSITION: Software developer, Research & Development, Siemens AG, Nuermberg, Germany
KEYWORDS: C++, Image Processing, Signal Processing, Machine Learning, Windows
SUMMARY: Design and implementation of a quality control system for welding spot diagnosis in car manufacturing lines.

DETAILS: Resistance welding techniques such as spot welding are used in many manufacturing processes, namely the automotive industry. In spot welding, a nugget of weld metal is produced between two base metals at an electrode site. A coordinated application of electric current and mechanical pressure of the proper magnitudes and durations is needed to assure a suitable nugget is formed. For example, the current density and pressure must be high enough to form a nugget, but not so high that molten metal is expelled from the electrode site.

Spot welds are characterized by surface marking resulting from base metal shrinkage, which is caused by a combination of the heat of welding and electrode penetration into the surface of the base metal . The surface mark includes a circular ridge around a spot weld concavity centered around the nugget (see picture). The diameter of the nugget (the "inner diameter" of the spot weld) must meet given specifications.

It is, however, difficult to measure the inner diameter of the spot weld without pulling the weld apart and destroying it. A common method for estimating weld quality is a visual inspection by a human, which is subjective, inconsistent, and labor-intensive. Ultrasonic inspection has also been implemented but requires careful and accurate application of an ultrasonic sensor to the weld.

In this project we developed an automatic, non-destructive method to determine the quality of a spot weld based on image data. In the first step multiple images without spots are taken to provide a background image that is subtracted from the spot image and so allows to roughly separate background from welding spot. After binarization an erosion mask is applied to further clean the image and remove remainders of the background. A summation of the pixels of the welding spot results in a value, which correlates with the inner diameter and therefore the quality of the weld nugget. In general, we found that the darker and bigger the spot weld, the larger the inner diameter, and the lighter and smaller the spot weld, the smaller the inner diameter. The surface brightness and the size of the spot weld are good predictors of the inner diameter of the spot weld.

This method has the advantages of being simple and robust. There is no need to determine the center of the spot weld, its exact position or size in the image in order to compute complex reference characteristics.

Diagnosis of tiles

POSITION: Software developer, Research & Development, Siemens AG, Erlangen, Germany
KEYWORDS: C++, MFC, Image Processing, Signal Processing, Machine Learning, Windows
SUMMARY: Development of the software for an adaptive, general purpose diagnosis system applied to the quality control of tiles based on sound, vibration and surface images. This included the design of a hardware device to measure sound and vibration of tiles within a sound chamber utilizing directional microphones and laser interferometers.

DETAILS: Clay roof tiles are manufactured by firstly forming the raw clay into the desired shape using a two-piece form under high pressure. After drying and coloring the tiles are burnt in tunnel kilns under tightly controlled conditions. Nevertheless, the process of drying and burning can result in cracks, causing the tile to break when exposed to changing weather conditions such as rain, frost and heat. Cracks can be very small and largely within the core of the tile, making the crack almost invisible to the eye. Apart from a visual inspection, tiles therefore also undergo an acoustic test. A human operator hits the tile with a small wooden hammer and listens to the sound. The denser the tile is, the clearer the sound is. This manual acoustic test, however, is extremely tedious, subjective, inconsistent and requires extensive training of operators.

In this project we developed the hard- and software to perform automatic tile testing based on sound, vibration and image. The hardware prototype consisted of a short conveyor belt to transport the tile into a sound chamber equipped with an automatic hammer and sensors such as a microphone, a laser inferometer and a video camera.

The software side consisted of an algorithm that computed the Fast Fourier Transformation of the measured vibration signal and automatically identified frequency ranges most predictive for cracked tiles based on a sample of faultless and defect tiles. The recorded image of the tile was used to ensure the correct shape and dimensions according to specification.

Speaker, language and topic identification

POSITION: Software developer, MEDAV, Uttenreuth, Germany
KEYWORDS: C, Java, HMMs, GMMs, Windows/Unix
SUMMARY: Implementation of a software system for speaker, language and topic identification on audio signals utilizing Hidden Markov Models and Gaussian Mixture Models.

Bioinformatics

Promoter site recognition

POSITION: Research assistant, Queensland University of Technology, Brisbane, Australia
KEYWORDS: Java, Fortran, XML, Windows/Unix
SUMMARY: Development of a phylogenetic footprinting software for promoter site recognition in bacteria. This included the implementation of a library for the analysis of nucleotide and peptide sequences.

Genome browser and pattern description language

POSITION: Research assistant, Queensland University of Technology, Brisbane, Australia
KEYWORDS: Java, Swing, XML, Windows
SUMMARY: Design and implementation of a fast browser for genomic sequences (Diana-B). Evaluation of various description languages for biological patterns and development of a novel, XML-based description language (BioPatML). Implementation of a parser for BioPatML and integration into the genome browser to localize occurrences of interesting biological patterns.

Topology and localization prediction

POSITION: PhD Student Computer Science, University of Queensland, Brisbane, Australia
KEYWORDS: Java, JSP, HMMs, SVMs, CRFs, Windows
SUMMARY: Study of different approaches to model the topology of transmembrane proteins using Hidden Markov Models, Conditional Random Fields and Support Vector Machines. Implementation and evaluation of a predictor for the sub-cellular localization of transmembrane proteins, utilizing the aforementioned models. This work also included a predictor for cleavage site prediction in signal peptides. A web interface to the predictor, using Java Server Pages (JSP), was implemented.

Binding site and protein function prediction

POSITION: Research Officer, Institute for Molecular Bioscience, Brisbane, Australia
KEYWORDS: Python, Java, iGraph, PyGraph, Windows
SUMMARY: Study and evaluation of graph-theoretical centrality measures as predictive indicators for functional sites in protein-RNA interfaces and their correlation with protein function. Graphs were derived from the contacts of amino acids within the tertiary protein structure. Implementation of a web application to predict binding residues in protein-RNA complexes.

Visual framework for sequence analysis

POSITION: Research Officer, Institute for Molecular Bioscience, Brisbane, Australia
KEYWORDS: Python, Matplotlib, NumPy, Windows
SUMMARY: Development of an alignment-free and visual approach (Mosaic) to analyze sequence relationships utilizing n-grams to measure sequence similarity and spectral rearrangement to detect clusters of related sequences. Implementation of a software to visualize similarities of sequence data via affinity matrices and dot plots.

Inference of gene regulatory networks

POSITION: Research Officer, Institute for Molecular Bioscience, Brisbane, Australia
KEYWORDS: Scala, Python, Windows
SUMMARY: Evaluation of supervised, semi-supervised and unsupervised methods with respect to their accuracy in inferring gene regulatory networks from gene expression data.

DETAILS: Mapping the topology of gene regulatory networks is a central problem in systems biology. Experimental methods are capable of determining the nature of gene regulation in a given system, but are time-consuming, expensive and require antibodies for each transcription factor. Accurate computational methods to infer gene regulatory networks are urgently required not only to supplement empirical approaches but also, if possible, to explore data in new, more-integrative ways.

Many computational methods have been developed to infer regulatory networks from gene expression data, predominately employing unsupervised techniques. Several comparisons have been made of network inference methods, but a comprehensive evaluation that covers unsupervised, semi-supervised and supervised methods was lacking. We performed a comprehensive evaluation of supervised, semi-supervised and unsupervised methods and addressed fundamental questions, including which methods are suitable for what kinds of experimental data types, and how many samples these methods require.

The most-important observation from this evaluation is the large variance in prediction accuracies across all methods. We found that a large number of repeats on networks of varying size is required for reliable estimates of the prediction accuracy of a method. On average, unsupervised methods achieve very low prediction accuracies, and are considerably outperformed by supervised and semi-supervised methods. On multi-factorial data the supervised and semi-supervised methods achieved the highest accuracies (see picture); even with as few as 10% of known interactions, the semi-supervised methods still outperformed all unsupervised approaches. There was little difference in prediction accuracy for semi-supervised methods trained on positively labeled data only, compared to training on positive and negative samples.

Inference of protein-protein interaction networks

POSITION: Research Officer, Institute for Molecular Bioscience, Brisbane, Australia
KEYWORDS: Scala, Windows
SUMMARY: Development of a command line application (go2ppi) for the prediction of protein-protein interactions based on protein Gene Ontology annotation using machine learning methods (Random Forests, Support Vector Machines).

DETAILS: Protein-protein interactions (PPI) are essential for many biological processes and similarity in Gene Ontology (GO) annotation has been found to be one of their strongest indicators. GO driven algorithms for PPI inference can be divided into machine learning and semantic similarity approaches. In this project we introduced the concept of inducers as a method to integrate both approaches effectively, leading to superior prediction accuracies.

Protein-protein interactions can be inferred from various sources, ranging from sequence features such as n-gram composition, over phylogenetic relationships, e.g. Interologs, to shared GO annotation. The GO is a hierarchically organized, controlled vocabulary to characterize gene products. It is composed of three ontologies with terms identifying biological processes, cellular components, and molecular functions. Each ontology is represented by a directed acyclic graph, with nodes referring to GO terms and edges describing relationships between terms.

There are two fundamental approaches to exploit GO annotation for protein interaction prediction. The machine learning approach, where the GO annotation of a protein pair is encoded by a feature vector that indicates annotating or shared GO terms, which is evaluated by a machine learning classifier. And the semantic similarity measure approach, where a metric over the GO graph, e.g. the shortest path between terms, is applied to measure the similarity between GO terms. The former approach has the advantage that advanced classification algorithms such as Support Vector Machines or Random Forests can be applied but commonly ignores the information latent within the topology of the GO graph. The latter explicitly exploits topological relationships between GO terms but is typically limited to comparatively simple predictors, e.g. term probabilities.

Proteins annotated by a set of GO terms, describing their biological process, function and cellular location. Interaction between two proteins is essentially inferred from the similarity of their annotation sets. Inducers are mappings of the term sets that annotate two proteins to a feature vector that serves as input to a standard machine learning classifier. The power of the inducer comes from the fact that it takes the topology of the GO into account when performing this mapping. A simple example is the Shortest Path-inducer, which includes all terms along the shortest path or paths between two term sets. The best performing ULCA (Up to Lowest Common Ancestors, see picture) inducer, on the other hand, takes all terms from the annotating term sets up to the lowest common ancestor.

On a high-quality yeast PPI data set a peak prediction accuracy of 0.95 (AUC) was achieved by the ULCA inducer in combination with a Random Forest classifier. We also showed that – in contrast to sequence based inference methods – the cross-species prediction accuracy of the ULCA inducer is excellent. When comparing the three individual ontologies of the GO we found, in agreement with other studies, similarities in biological process and cellular component annotation consistently to be stronger indicators for protein interaction than molecular function annotation.

Inference of lateral genetic transfer networks

POSITION: Research Officer, Institute for Molecular Bioscience, Brisbane, Australia
KEYWORDS: Scala, Windows
SUMMARY: Development of a software (LGTNet) to infer networks of lateral genetic transfer (LGT) from sequence data. In contrast to traditional techniques based on multiple sequence alignments and phylogenetic trees LGTNet is an alignment- and tree-free method. As a consequence, LGTNet is very fast (more than 1000 times faster than a phylogenetic approach) but cannot infer when in time an LGT event happened. It only predicts, which species have exchanged genetic material at some point in time and approximately indicates the sequence regions involved.

DETAILS: Lateral genetic transfer (LGT) is a driving force in prokaryotic evolution and can be harmful to humans due to the passage of antibiotic resistance and pathogenic genes resulting in pathogenic bacteria and viruses. Fast and accurate methods to infer the network of LGT within bacterial communities are therefore of great interest. Current phylogenetic methods for the detection of LGT tend to be complex and time consuming.

In this project, we developed LGTNet, a tree- and alignment-free method to infer LGT networks from sequence data. A comparison with a phylogenetic and another alignment-free method on a large, simulated data set demonstrated that LGTNet is not only faster but also more accurate then those methods. We also applied the new method to a well-studied empirical data set with 27 Escherichia coli and Shigella strains and derived the underlying LGT network (see picture o n right, click to enlarge).

There are three main computational approaches employed to detect LGT. The most common approach is the phylogenetic in which topological incongruences between a gene trees and a reference tree are detected. Parametric methods are based on sequence compositional characteristics and local sequence similarity methods are based on sub-sequence comparison.

Phylogenetic methods require known phylogenies for comparison and are computationally expensive. Parametric methods, rely on parameters such as G+C content or codon usage, require no reference but have difficulty with innate variation such as heterogeneity within genomes and pseudo genes. Local sequence similarity methods require a (local) sequence alignment, which can be difficult to achieve in case of rampant LGT.

LGTNet computes frequency profiles of substring matches to infer LGT events. It collects all locations, where two sequences share a substring and approximates the frequency distribution of locations (of substring matches) by a histogram (frequency profile of matching substrings). Histograms are constructed by dividing sequences into non-overlapping intervals of a given width and counting the number of locations within those intervals.

In an comparison with a phylogenetic (T-REX) and another alignment-free (ALFY) method, LGTNet achived an AUC of 0.69 compared to 0.65 and 0.55 for T-REX and ALFY, respectively. However, LGTNet was more than 1000 times faster than T-REX and is of quadratic complexity.

The most striking characteristic of all methods evaluated was their large variance in prediction accuracy. AUCs range from perfect over not better than chance to perfectly inverted. In practice it remains an open question how accurate an inferred LGT nework actually is.

Characterizing cancer subtypes as attractors of Hopfield networks

POSITION: Research Officer, Institute for Molecular Bioscience, Brisbane, Australia
KEYWORDS: Scala, Python, Windows
SUMMARY: Development of a method to cluster cancer subtypes using Hopfield networks. Cancer is a heterogeneous disease caused by perturbations of the underlying gene regulatory network. Cancer types or subtypes can be distinguished by their gene-expression profiles. Traditionally common clustering methods such as k-means and others are utilized for this purpose. In this study we employed Hopfield networks, where attractors were associated with cancer subtypes.

DETAILS: Cellular function and development are controlled by networks that regulate the expression of genes. Perturbations of gene regulation are associated with many complex diseases including cancer. At any point in time, the current state of a cell can be described by its expression profile and the set of all possible states defines the state space of the cell. Most states are unstable, and the system will converge to a stable state of low energy — an attractor state — when perturbed. Waddington’s epigenetic landscape of cell development is an example of state space models of gene regulatory networks in which attractors represent stable low-energy states that emerge from the dynamics of the underlying network. In such models, attractor states correspond to normal developmental stages of the cell or disease states such as cancer. In this study, we proposed Hopfield networks as a novel approach to efficiently construct large attractor networks from static gene-expression data.

A Hopfield network is a graph composed of nodes and weighted but undirected edges between nodes. Nodes have ternary states and a energy function over the node states describes the dynamic behavior of the network. Attractors are local minima of the energy function E, and the system can be interpreted as relaxing from a higher energy state to a minimum energy state. States or patterns can be established (stored) as attractors of the network using Hebbian learning. We used Hopfield networks to characterize different cancer subtypes as network attractors.

We constructed networks from gene expression data using Hebbian learning and then recalled each sample until it reached its attractor. We did not label patterns by cancer subtype class but instead left it to the learning algorithm to construct attractors that capture the similarities between cancer subtypes - functioning as a clustering method.

Compared with other clustering algorithms, the Hopfield network was not necessarily the most accurate. However, it has other important advantages, including the unification of clustering, feature selection and network inference, and as a modeling framework for epigenetic landscapes. We presented methods to visualize the high-dimensional energy landscape of the model (see picture) and to prune the network to derive sparse networks that are more open to a biological interpretation than traditional clustering or feature-selection methods.

Others

Privacy protection in Google Street View

POSITION: Software Engineer, Google, Mountain View, USA
KEYWORDS: Infrastructure, MapReduce, Large-scale data, C++
SUMMARY: Simplification of legacy-code and complete re-implementation of the infrastructure for large-scale privacy protection in Google Street View. A paper describing the problem and the technology can be found here.

Web application for online assessment of medical students

POSITION: Research Officer, Faculty of Health Science, Medical School, Herston, Australia
KEYWORDS: Python, Django, MySQL, Windows
SUMMARY: Implementation of an online assessment tool for medical students utilizing the Django framework. Students are presented descriptions of clinical cases, select relevant clinical features and provide a diagnosis. The system performs a fuzzy matching of the given, free-text diagnoses to terms stored in the database and automatically assesses the student input.

Statistical analysis of stroke data

POSITION: Research Officer, School for Engineering and Information Technology, Brisbane, Australia
KEYWORDS: Python, Matplotlib, NumPy, MySQL, Windows
SUMMARY: Implementation of a framework in Python to perform numerous statistical analyses of a large stroke data set, maintained within an SQL database. The system utilizes Matplotlib to create various graphs and furthermore outputs tables in LaTeX format.

Mobile phone applications (Android)

POSITION: spare time project
KEYWORDS: Java, Google Android
SUMMARY: Design and implementation of various, educational mobile phone applications for the Google Android platform such as learning software for Japanese Kanji and Kana, the Judo Judo Nage-No-Kata, the Australian Citizenship test, the California Motorcycle permit and a zoomable calendar and organizer software (Ceye).

Zoomable presentation software

POSITION: spare time project
KEYWORDS: Scala, Java, Piccolo2D, Windows
SUMMARY: A zoomable presentation software (Infinity) that allows to evaluate code examples in various languages during the presentation. It also supports formulas in Latex, vector and bitmap graphics, hyper links and live-files. A novel type of user interface was implemented. Very useful for code-centric or scientific presentations.

Document retrieval system

POSITION: spare time project
KEYWORDS: Scala, Swing, Windows
SUMMARY: A little Swing application in Scala that allows the user to reference and annotate scanned documents and other files. The search interface of the application is fault-tolerant and accepts queries with spelling mistakes or arbitrary word order. Software is available at beagle on bitbucket.

Semantic Network library and query language

POSITION: spare time project
KEYWORDS: Scala, Windows
SUMMARY:
Semtric is a Semantic Network library designed for fast loading and saving of network data, for a simple description of semantic networks and a novel, pattern-based query language. Details can be found at Semtric on bitbucket.