database searching Flashcards Preview

Flashcards in database searching Deck (18)
1
Q

BLAST

A
  • very fast local search program
    • 50x faster than SW
  • local is important
    • may only have one section/domain that is homologous
  • finds ‘words’ in the query that have matches in the database
    • extend each word match in both directions to form high-scoring segment pairs (HSPs), stopping when the score can’t be improved
    • evaluate statistical significance of each HSP
  • heuristic
    • tradeoff between accuracy and speed
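The word-then-extend idea can be sketched in a few lines. This is a toy, not BLAST itself: it assumes exact word matches (protein BLAST also accepts similar "neighbourhood" words), a +1/-1 match/mismatch score, and ungapped extension. The names `find_seeds`, `extend_seed`, `w` and `drop` are made up for illustration.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Ungapped extension of a seed; stop once the score falls `drop` below the best seen.

    Returns (best_score, (query_start, query_end)) with query_end exclusive.
    """
    best = cur = w * match
    best_right = k = 0
    # Extend to the right of the seed.
    while i + w + k < len(query) and j + w + k < len(subject):
        cur += match if query[i + w + k] == subject[j + w + k] else mismatch
        k += 1
        if cur > best:
            best, best_right = cur, k
        elif best - cur >= drop:
            break
    # Extend to the left, starting from the best right-hand score.
    cur = best
    best_left = k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        cur += match if query[i - 1 - k] == subject[j - 1 - k] else mismatch
        k += 1
        if cur > best:
            best, best_left = cur, k
        elif best - cur >= drop:
            break
    return best, (i - best_left, i + w + best_right)


seeds = find_seeds("AAGCTTGG", "CCGCTTGA")
hsp_score, hsp_span = extend_seed("AAGCTTGG", "CCGCTTGA", *seeds[0])
```

The drop-off cutoff is what makes this a heuristic: extension gives up early rather than exploring every possible alignment, which is where the speed comes from.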
2
Q

search sensitivity

A
  • psiblast = iterative version of blast
    • increased sensitivity
    • accumulates similar sequences
  • protein searches more sensitive
    • nucleotide scoring has no notion of similar residues (bases either match or they don't)
3
Q

blast output

A
  • expect value and number of identities most important
  • gaps also
  • short alignments can have high identity but more likely to be due to chance
  • longer alignments tend to give higher bit scores
4
Q

bit score

A
  • re-scaled version of raw alignment score
  • independent of search space size
  • for a query sequence of length n and a database with total sequence length m, the search space is proportional to N = nm
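The re-scaling is usually written as S' = (λS − ln K) / ln 2, where λ and K are the Karlin-Altschul statistical parameters of the scoring system. A minimal sketch, with the function name and the example parameter values chosen purely for illustration:

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values only; real values depend on the
# substitution matrix and gap penalties used for the search.
example = bit_score(raw_score=100, lam=0.267, k=0.041)
```

Because λ and K absorb the details of the scoring system, bit scores from searches run with different matrices can be compared directly.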
5
Q

e value

A
  • expected number of matches with same bit score or better that would be produced by random chance
  • takes into account size of search space
    • bigger database increases chance of finding a match
  • the lower the E value, the stronger the evidence for a real evolutionary relationship between query and match
  • can be read as the expected number of false positives at that score
  • greater than 1 indicates no statistical support for the relationship
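Given a bit score S', the E value is commonly approximated as E = n · m · 2^(−S'). A small sketch under that assumption (real BLAST uses edge-corrected "effective" lengths, which are ignored here; the function name is hypothetical):

```python
def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score`: E = n * m * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same hit, same bit score, but a database ten times larger:
small_db = e_value(bit_score=40, query_len=300, db_len=1_000_000)
large_db = e_value(bit_score=40, query_len=300, db_len=10_000_000)
```

The E value scales linearly with database size, so the same alignment looks less significant in a larger database.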
6
Q

P value

A
  • probability of obtaining at least one match with that score or better by chance
    • different to E value (E can be greater than 1, P cannot)
    • nearly the same as E when E is small
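If chance hits are modelled as Poisson-distributed with mean E, the two quantities are related by P = 1 − e^(−E). A short sketch of that conversion (the function name is made up):

```python
import math


def p_value(e):
    """Probability of at least one chance hit, given an E value: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # expm1 keeps precision for very small E


small = p_value(0.001)  # P and E almost coincide here
large = p_value(10.0)   # E is well above 1, P stays below 1
```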
7
Q

multiple sequence alignments

A
  • proteins form families with related sequences, structures and functions
  • can learn more by aligning multiple sequences from related family members instead of pairwise alignments
  • patterns of conservation reflect structural and functional evolutionary constraints
    • e.g. loops - less important for function, less well conserved
  • key functional residues often show strong conservation
    • e.g. ser proteases - conserved triad of ser, asp, his
8
Q

MSA algorithms

A
  • more sequences so slower than pairwise alignments
  • use heuristic methods
    • e.g. clustal
9
Q

clustal

A
  • build guide tree
    • perform all pairwise alignments
    • group similar sequences
      • higher score → neighbouring branches
    • guides other pairwise alignments (more information)
  • align sequences progressively
    • most closely related pairs aligned first
    • align next most closely related sequences to existing alignments
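The guide-tree step can be illustrated with a toy clustering that decides the order in which sequences would be aligned. This is a sketch under loud assumptions: the distance is a simple mismatch fraction between equal-length sequences (Clustal derives distances from real pairwise alignments), the clustering is plain average linkage, and the alignment step itself is left out. All names are hypothetical.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Toy distance: fraction of differing positions (sequences assumed equal length)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_order(seqs):
    """Return the merge order of an average-linkage clustering over {name: sequence}."""
    dist = {frozenset(p): mismatch_fraction(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    clusters = {name: [name] for name in seqs}

    def average_distance(c1, c2):
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Most similar pair of clusters is joined (and would be aligned) first.
        c1, c2 = min(combinations(sorted(clusters), 2),
                     key=lambda p: average_distance(*p))
        merges.append((c1, c2))
        clusters[c1 + "+" + c2] = clusters.pop(c1) + clusters.pop(c2)
    return merges


order = guide_order({
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TTGTACCA",
    "D": "TTGAACCT",
})
```

Each tuple in `order` names the two groups that would be aligned at that step, closest pairs first.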
10
Q

MSA programs

A
  • clustal omega
  • T-coffee
    • slow, best for smaller alignments
    • align multiple MSAs
  • MUSCLE
    • fast
    • estimates sequence similarity using short sequence words of n-residues
    • align small words and build up
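Word-based similarity can be estimated without aligning anything, which is why it is fast. A rough sketch, assuming similarity is the fraction of shared k-residue words (this is not MUSCLE's exact formula, and the function names are invented for the example):

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Fraction of k-mers shared by two sequences, relative to the shorter one."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return shared / (min(len(a), len(b)) - k + 1)


same = kmer_similarity("ACDEFG", "ACDEFG")
close = kmer_similarity("ACDEFG", "ACDKFG")
unrelated = kmer_similarity("ACDEFG", "WWWWWW")
```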
11
Q

sequence profiles

A
  • weight profile or PSSM
  • captures information in MSA
  • table of residue frequencies at each position of the alignment
    • can include gaps
  • profile size n x 21 (n alignment positions; 20 amino acids plus gap)
    • number of sequences irrelevant
  • only considers information in that set of sequences
    • similar residues that do not appear in the alignment are ignored
  • can be combined with a substitution matrix such as BLOSUM
  • used by PSI-BLAST to search for more remote homologues
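A frequency profile is simple to build from an alignment. The sketch below assumes the alignment is a list of equal-length strings with "-" for gaps; `ALPHABET` and `frequency_profile` are names chosen for the example.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol


def frequency_profile(msa):
    """Return one row of 21 frequencies per alignment column."""
    n_seqs = len(msa)
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append([counts.get(symbol, 0) / n_seqs for symbol in ALPHABET])
    return profile


profile = frequency_profile(["ACD-", "ACE-", "GCDK"])
```

The table has one row per column of the alignment however many sequences went in, which is why the number of sequences does not affect its size.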
12
Q

problems with frequency scores

A
  • new sequence can differ from profile derived from existing set
  • doesn’t include evolutionary knowledge of conservative substitutions
13
Q

PSSM

A
  • position specific scoring matrix
  • substitution matrix (like BLOSUM) combined with observed frequencies
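A simplified way to see how a PSSM differs from raw frequencies is to compute log-odds scores with pseudocounts. Note the swap: this sketch uses flat pseudocounts and a uniform background as stand-ins, whereas a real PSSM draws that prior information from a substitution matrix such as BLOSUM. Function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_pssm(msa, pseudocount=1.0):
    """Per-column log2(observed / background) scores; gaps are ignored."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores for an ungapped sequence."""
    return sum(row[aa] for row, aa in zip(pssm, seq))


pssm = log_odds_pssm(["ACD", "ACE", "ACD"])
```

Thanks to the pseudocounts, a residue never seen in a column gets a finite negative score rather than an impossible one.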
14
Q

PSI-BLAST

A
  • position specific iterated blast
  • form PSSM from the MSA of hits from an initial blast search
    • closely related sequences only, to limit noise
    • search again with this profile
    • add more significant hits to refine profile
    • iterate until no more significant hits found (or for x iterations)
  • bridges sequence space to find more distantly related homologues
  • start with more conservative e value to ensure you get the right sequences
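The control flow of the iteration can be sketched without a real search engine. Everything here is a stand-in: `search` and `build_profile` are hypothetical callables supplied by the caller, the threshold is an illustrative value, and the toy database simply pretends that each new hit makes one more remote sequence findable.

```python
def iterated_search(query, search, build_profile, e_threshold=0.001, max_iterations=5):
    """Repeat search -> rebuild profile until no new significant hits appear."""
    included = set()
    profile = query  # the first round searches with the bare query
    rounds = 0
    for _ in range(max_iterations):
        rounds += 1
        hits = search(profile)  # hypothetical: returns {sequence_id: e_value}
        significant = {seq_id for seq_id, e in hits.items() if e <= e_threshold}
        if not (significant - included):
            break  # converged: nothing new to add
        included = significant
        profile = build_profile(query, included)
    return included, rounds


def toy_search(profile):
    """Made-up database: s2 is only found once s1 is in the profile, and so on."""
    known = profile if isinstance(profile, frozenset) else frozenset()
    hits = {"s1": 1e-10, "junk": 5.0}
    if "s1" in known:
        hits["s2"] = 1e-5
    if "s2" in known:
        hits["s3"] = 1e-4
    return hits


def toy_build_profile(query, included):
    """Stand-in for PSSM construction: just remember which sequences are included."""
    return frozenset(included)


found, rounds = iterated_search("query", toy_search, toy_build_profile)
```

The toy shows the "bridging" effect: s3 is never found from the query alone, only after s1 and s2 have been folded into the profile.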
15
Q

PSI-BLAST

low complexity regions

A
  • repeats of a few residues, often a coil
  • limited information - unrelated protein sequences brought in
  • need to be masked
    • SEG program in blast
    • run with and without - no clear default
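To get a feel for what masking does, low-complexity windows can be flagged with Shannon entropy. This is a simplified illustration and not the SEG algorithm; the window size and threshold are illustrative values, and the function names are invented.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in any low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


masked = mask_low_complexity("MKTAYIAKQRQI" + "S" * 12 + "VEHGLNDWFRKT")
```

Windows that overlap the repeat can also fall below the threshold, so a few flanking residues may be masked too; this is one reason results can differ with and without masking.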
16
Q

PSI-BLAST

drift

A
  • sequences found early on can disappear later
  • rogue matches can pollute the PSSM
  • watch output and stop earlier
  • or use a stricter E value
17
Q

PSI-BLAST

asymmetry

A
  • searching with sequence A may find B
  • searching with B won’t necessarily find A
18
Q

HMMs

A
  • statistically rigorous representation of protein families (via domains)
  • markov chain
    • series of states connected by a transition probability
    • probability of state i+1 depends only on state i
    • probability of a sequence of states = product of the probabilities along the path that generated them
      • transition and output probabilities
      • delete, match and insert states (transition)
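The product rule for a path can be made concrete with a toy model. The sketch assumes a two-state HMM with only match ("M") and insert ("I") states, both of which emit a symbol; delete states, which emit nothing, are left out, and all probabilities are made up.

```python
def path_probability(states, symbols, start, transition, emission):
    """Probability of a state path and its emitted symbols under a simple HMM."""
    prob = start[states[0]] * emission[states[0]][symbols[0]]
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        # Markov property: each step depends only on the previous state.
        prob *= transition[prev][cur] * emission[cur][sym]
    return prob


start = {"M": 1.0, "I": 0.0}
transition = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.5, "I": 0.5}}
emission = {"M": {"A": 0.8, "G": 0.2}, "I": {"A": 0.5, "G": 0.5}}

# Path M -> M -> I emitting A, G, A:
# 1.0 * 0.8 * (0.9 * 0.2) * (0.1 * 0.5)
p = path_probability(["M", "M", "I"], "AGA", start, transition, emission)
```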