eukaryotic gene prediction Flashcards

1
Q

position weight matrices

A
  • line up all possible nts at each position and score
    • common nt at that position → high score
  • search sequence with PWM to find highest scoring sequence
    • sum scores to get score for sequence as potential site
    • above threshold indicates functional site
  • MSA to create PWM
  • any functional site
  • species specific
  • low specificity (multiple transcripts and mechanisms unknown)
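
A minimal sketch of the build-and-scan procedure above, assuming a gapless alignment of known sites, a uniform background of 0.25 per nucleotide, and log-odds scoring. The toy sites, the pseudocount of 1, and the threshold of 5.0 are illustrative assumptions, not values from the cards.

```python
import math
from collections import Counter

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites."""
    pwm = []
    total = len(sites) + 4 * pseudocount
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        # Common nucleotides at this position get a high (positive) score.
        pwm.append({
            nt: math.log2(((counts[nt] + pseudocount) / total) / background)
            for nt in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position scores for one candidate site."""
    return sum(column[nt] for column, nt in zip(pwm, window))

def scan(pwm, seq, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        s = score(pwm, seq[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits

# Toy training sites (hypothetical), standing in for columns of an MSA.
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG"]
pwm = build_pwm(sites)
hits = scan(pwm, "GGCTATAAAGGC", threshold=5.0)
print(hits)
```

Because each position is scored independently, a PWM cannot represent dependencies between positions.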
2
Q

intron splice sites

A
  • PWM:
    • almost always GU...AG (intron starts with GU at the 5’ end and ends with AG at the 3’ end)
    • combine with surrounding patterns
    • generally similar consensus in vertebrates
  • polypyrimidine tracts
    • upstream of the 3’ splice site in higher eukaryotes
  • yeast introns:
    • invariant branch point sequence (UACUAAC) upstream of the 3’ splice site
3
Q

hidden markov models

A
  • like a PWM, but the probability at each position also depends on the preceding state
  • takes into account gaps
  • looks at overall pattern
  • at each position, probability of:
    • insertion, deletion, match (transition)
    • each base (output)
  • move through each position and multiply probabilities
  • pseudocounts
  • idea that sequences can have same function but still vary
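
A rough sketch of "move through each position and multiply probabilities", assuming a three-column toy profile with match (M), insert (I) and delete (D) states. The alignment, the transition probabilities, the uniform insert emissions, and treating the begin state like a match state are all hypothetical choices made for illustration; I-to-D and D-to-I transitions are simply left out.

```python
import math

def emission_probs(column, pseudocount=1.0):
    """Estimate P(base) for one alignment column; pseudocounts keep
    unseen bases from getting probability zero."""
    total = len(column) + 4 * pseudocount
    return {nt: (column.count(nt) + pseudocount) / total for nt in "ACGT"}

# Toy gapless alignment (hypothetical) used to train the match states.
alignment = ["ACG", "ACG", "ATG", "ACG"]
match_emit = [emission_probs([s[i] for s in alignment]) for i in range(3)]
insert_emit = {nt: 0.25 for nt in "ACGT"}  # assumed uniform

# Hypothetical transition probabilities between state types.
trans = {("M", "M"): 0.90, ("M", "I"): 0.05, ("M", "D"): 0.05,
         ("I", "M"): 0.70, ("I", "I"): 0.30,
         ("D", "M"): 0.80, ("D", "D"): 0.20}

def path_log_prob(path, seq):
    """Log-probability of seq along one state path.
    M and I emit a base; M and D advance to the next profile column."""
    logp, col, pos, prev = 0.0, 0, 0, "M"  # begin treated like a match state
    for state in path:
        logp += math.log(trans[(prev, state)])       # transition
        if state == "M":
            logp += math.log(match_emit[col][seq[pos]])  # output
            col += 1
            pos += 1
        elif state == "I":
            logp += math.log(insert_emit[seq[pos]])
            pos += 1
        else:  # "D": skip a column without emitting
            col += 1
        prev = state
    return logp

all_match = path_log_prob("MMM", "ACG")
with_insert = path_log_prob("MIMM", "ATCG")
with_delete = path_log_prob("MDM", "AG")
print(all_match, with_insert, with_delete)
```

Summing log-probabilities is equivalent to multiplying the probabilities and avoids underflow on long sequences. A real search would pick the best path (Viterbi) or sum over all paths (forward algorithm) rather than scoring one given path.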
4
Q

features predicted by HMM

A
  • gene structure
  • exon/intron lengths
  • nt composition
  • motifs
  • start/stop codons
  • splice sites
  • patterns of conservation
5
Q

genscan

A
  • uses known genes to create training sets
    • HMM based
  • species/taxonomic specific gene models
    • search for unknown query sequences
  • focus on GC content
    • gene density and exon/intron length
    • alter parameters depending on GC content
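
A small sketch of choosing a parameter set from GC content. The class boundaries (43%, 51%, 57%) and the set names are assumptions for illustration and should be checked against the GENSCAN documentation before being relied on.

```python
def gc_fraction(seq):
    """Fraction of G+C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(nt) for nt in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

# (upper bound, parameter-set name); boundaries and names are hypothetical.
GC_CLASSES = [
    (0.43, "low_gc"),
    (0.51, "mid_gc"),
    (0.57, "high_gc"),
    (1.01, "very_high_gc"),
]

def pick_parameter_set(seq):
    """Return the name of the first class whose upper bound exceeds the GC fraction."""
    gc = gc_fraction(seq)
    for upper, name in GC_CLASSES:
        if gc < upper:
            return name
    return GC_CLASSES[-1][1]

print(pick_parameter_set("ATATATGC"))
```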
6
Q

genscan HMM

A
  • start in intergenic region (N state)
  • then promoter (P)
  • 5’ UTR (F)
  • single exon gene (Esngl) or first exon of multi-exon gene (Einit)
  • 3’ UTR (T)
  • polyA signal (A)
  • return to N
  • forward and reverse strand
7
Q

genscan intron/exon states

A
  • 3 intron states follow Einit, depending on reading frame (phase)
  • 3 exon states follow intron states
  • probability of moving to each state based on training data
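
A short sketch of why three intron states are needed: the phase of an intron is the number of bases of an incomplete codon left at the end of the preceding exon. The function name is hypothetical, and the lengths are assumed to count coding bases only, starting at the start codon.

```python
def intron_phases(coding_exon_lengths):
    """Phase (0, 1 or 2) of the intron after each exon except the last.
    Phase = coding bases seen so far, modulo 3."""
    phases = []
    coding = 0
    for length in coding_exon_lengths[:-1]:
        coding += length
        phases.append(coding % 3)
    return phases

# Toy gene: three coding exons totalling 180 bases (60 codons).
print(intron_phases([100, 50, 30]))
```

The exon that follows an intron has to resume the codon in the same phase, which is why the model keeps one intron state and one internal exon state per phase.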
8
Q

exon size distribution

A
  • normal distribution
    • differs between initial, internal, terminal
  • internal exons: frequency drops off steeply above 300 bp
  • length distribution functions can be used
  • introns have a minimum size and geometric distribution
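
A minimal sketch of an intron length model with a minimum size and a geometric tail. The mean of 200 bp and minimum of 70 bp are hypothetical values chosen only to make the example concrete.

```python
def intron_length_prob(length, mean_length, min_length):
    """P(length) under a geometric distribution shifted to start at min_length."""
    if length < min_length:
        return 0.0
    # Per-base probability of ending the intron, chosen so the expected
    # length equals mean_length.
    p = 1.0 / (mean_length - min_length + 1)
    return p * (1.0 - p) ** (length - min_length)

MEAN, MINIMUM = 200, 70  # hypothetical parameters
probs = {n: intron_length_prob(n, MEAN, MINIMUM) for n in range(0, 5071)}
total = sum(probs.values())
expected = sum(n * pr for n, pr in probs.items())
print(round(total, 6), round(expected, 3))
```

A geometric length is what an ordinary HMM state with a self-loop produces. Exon lengths do not fit that shape, which is why explicit length distribution functions are used for them.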
9
Q

MDD

A
  • maximal dependence decomposition
  • captures interdependencies of non-adjacent nucleotides
    • splice sites
  • use dependencies to search sequence and match to MDD tree
    • move from position to next dependent position
    • probabilities of each position
10
Q

weight array model

A
  • weight matrix that captures interdependencies
  • only used for splice sites
    • all other features have WMM
    • all matrices combined for identification
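
A sketch of the difference from a plain weight matrix, assuming a first-order weight array model in which each position is conditioned on the nucleotide immediately before it. The training sites and the pseudocount are toy assumptions.

```python
import math
from collections import Counter

def train_wam(sites, pseudocount=1.0):
    """Return (first-position probabilities, per-position conditional tables)."""
    n = len(sites)
    first = Counter(s[0] for s in sites)
    first_probs = {nt: (first[nt] + pseudocount) / (n + 4 * pseudocount)
                   for nt in "ACGT"}
    cond = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((s[i - 1], s[i]) for s in sites)
        prev_counts = Counter(s[i - 1] for s in sites)
        # P(current base at i | previous base at i-1), with pseudocounts.
        cond.append({
            (prev, cur): (pair_counts[(prev, cur)] + pseudocount)
                         / (prev_counts[prev] + 4 * pseudocount)
            for prev in "ACGT" for cur in "ACGT"
        })
    return first_probs, cond

def wam_log_prob(model, seq):
    """Log-probability of seq under the weight array model."""
    first_probs, cond = model
    logp = math.log(first_probs[seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(cond[i - 1][(seq[i - 1], seq[i])])
    return logp

# Toy sites where position 2 depends entirely on position 1.
model = train_wam(["AG", "AG", "CT", "CT"])
print(wam_log_prob(model, "AG"), wam_log_prob(model, "AT"))
```

In this toy data G and T are equally frequent at the second position, so a position-independent matrix would score "AG" and "AT" the same. The weight array model scores "AG" higher because G only follows A in the training sites.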
11
Q

genscan promoter identification

A
  • 30% of promoters have no TATA box
    • split the prediction model accordingly and use weight matrix models (WMMs)
  • TATA:
    • 0.7 probability
    • 15bp TATA WMM and 8bp cap site WMM
  • TATA-less:
    • 0.3 probability
    • intergenic-null regions of 40bp
  • genscan doesn’t require promoter identification for gene prediction
12
Q

homology gene prediction

A
  • complements ab initio
  • BLAST search against swissprot and EST data from similar species
    • match to EST confirms exon
      • >90% identity required, because ESTs have high error rates
    • swissprot - additional confirmation
  • pool to get composite picture
    • genscan may find additional exons even if gene identified otherwise
13
Q

phylogenetic footprinting

A
  • can be used for promoter prediction
  • look at same gene in multiple species with common ancestry
    • expect conserved upstream region (promoter)
    • coexpression of genes in the same species (RNAseq)
  • better if more distantly related
    • greater mutation outside regulatory region
    • conservation more prominent
    • balance: too distantly related and the regulatory regions themselves may no longer align
14
Q

repeats

A
  • simple:
    • microsatellites
    • polypurine/pyrimidine tracts
  • complex:
    • LINEs/SINEs, LTRs, Alus
  • exact sequence is polymorphic - difficult to identify
  • ENCODE - unfinished
    • assign function to all human genome elements