Multiple Motif Matching Automata to Explore Genome Language Model

This project explores the genome sequence as a recognizable language. A parser identifies language parameters (word, sentence, and topic) to improve understanding of the rules that govern spontaneous polypeptide folding and functionality.

Student intake
This project is open to Master's and PhD students.
Project status
Potential
Contact
Contact name
Dr. Ayman Elgharabawy
Contact position
Postdoctoral Fellow (CSIRO)
Contact email

About

Project

General Overview: -

This project explores the genome sequence as a recognizable language. A parser identifies language parameters (word, sentence, and topic) to improve understanding of the rules that govern spontaneous polypeptide folding and functionality. The project also provides well-defined features that can be used in different machine-learning approaches.

Domain Overview: -

The genome parser will be implemented as a non-deterministic finite acceptor (NFA). The proposed NFA is based on a multiple motif matching algorithm and on a summation of the chemical and physical properties of amino acids; its transition function computes the accumulated properties. A regular expression is generated from the NFA, and a visualized grammar tree is then generated from the regular expression. The project uses the Arabidopsis reference genome.
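As a rough illustration of the idea above, the following Python sketch simulates an NFA that matches several motifs in parallel while summing one amino-acid property along each match. The function name `find_motifs`, the `x` wildcard, the toy motifs and sequence, and the choice of Kyte-Doolittle hydropathy values (for a handful of residues only) are assumptions made for this example; the project description does not specify which motifs or properties are used.

```python
from typing import Dict, List, Tuple

# Assumption: hydropathy (Kyte-Doolittle scale) stands in for "a chemical or
# physical property"; only a small subset of residues is listed here.
HYDROPATHY: Dict[str, float] = {
    "A": 1.8, "G": -0.4, "L": 3.8, "K": -3.9, "S": -0.8, "T": -0.7,
}


def find_motifs(
    sequence: str,
    motifs: List[str],
    prop: Dict[str, float] = HYDROPATHY,
) -> List[Tuple[str, int, float]]:
    """Simulate an NFA that tracks all motifs at once.

    Each active thread is (motif index, position in motif, start offset,
    accumulated property). Returns (motif, start offset, accumulated value)
    for every complete match, in the order the matches end.
    """
    hits: List[Tuple[str, int, float]] = []
    threads: List[Tuple[int, int, int, float]] = []
    for i, residue in enumerate(sequence):
        value = prop.get(residue, 0.0)  # residues missing from the table add 0
        # Non-determinism: besides advancing the live threads, a fresh thread
        # for every motif may start at the current residue.
        candidates = threads + [(m, 0, i, 0.0) for m in range(len(motifs))]
        threads = []
        for m, pos, start, acc in candidates:
            symbol = motifs[m][pos]
            if symbol == "x" or symbol == residue:  # 'x' matches any residue
                acc += value  # the transition accumulates the property
                if pos + 1 == len(motifs[m]):
                    hits.append((motifs[m], start, acc))  # accepting state
                else:
                    threads.append((m, pos + 1, start, acc))
    return hits


if __name__ == "__main__":
    for motif, start, total in find_motifs("GASKLT", ["AS", "KxT", "SK"]):
        print(motif, start, round(total, 1))
```

Each thread carries its own running total, so overlapping matches of different motifs are reported independently, each with the property sum over its own residues.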

Technical Overview: -

The project consists of the following phases:

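  • Lexical analysis: Define the alphabet and tokenize the sequence.
  • Syntactic analysis: Implement the genome parser as an NFA, using different parameters and transition functions. The NFA parameters are (Q, Σ, q0, F, δ), where:
      Σ is the finite alphabet.
      Q is the finite set of states; each state either represents an accumulated value of a chemical or physical property or has no value.
      q0 is the start state, F is the set of accepting states, and δ is the transition function, which computes the accumulated properties.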

  • Identifying Regular Expressions: Use one of the following methods:
      1. State elimination method
      2. Arden's theorem
  • Identifying Grammar Tree: Visualize the grammar tree induced from the regular expression.
  • Biological Inference: Label the sentences.
 

Members

Principal investigator

Ayman

Postdoctoral Fellow (CSIRO)

Partners & sponsors

CSIRO