database searching Flashcards

Question 1

Q

BLAST

Answer

A

very fast local search program
- 50x faster than Smith-Waterman (SW)
local is important
- may only have one section/domain that is homologous
finds ‘words’ in the query that have matches in the database
- extend each match to form high-scoring segment pairs (HSPs) that can’t be extended further
- evaluate statistical significance of each HSP
heuristic
- tradeoff between accuracy and speed
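
A minimal sketch of the word-match-then-extend idea above. It is not BLAST's actual implementation: it assumes exact word matches, a simple match/mismatch score, ungapped extension, and a hypothetical drop-off threshold `x_drop`. The function names and default values are illustrative only.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed in both directions without gaps.

    Extension stops once the running score falls x_drop below the best
    score seen so far; the segment is trimmed back to that best point.
    """
    def walk(qi, sj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    total = k * match + left_score + right_score
    q_segment = query[i - left_len:i + k + right_len]
    s_segment = subject[j - left_len:j + k + right_len]
    return total, q_segment, s_segment


query = "AAACGTACGTTT"
subject = "GGGCGTACGTCC"
seeds = find_seeds(query, subject)
hsp = extend(query, subject, *seeds[0])
```

The speed comes from only extending around word matches instead of filling a full dynamic programming matrix, which is also why a heuristic like this can miss alignments that contain no matching word.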

Question 2

Q

search sensitivity

Answer

A

PSI-BLAST = iterative version of BLAST
- increased sensitivity
- accumulates similar sequences
protein searches are more sensitive than nucleotide searches
- nucleotide searches have no notion of similarity (bases are either the same or not)

Question 3

Q

blast output

Answer

A

expect (E) value and number of identities are most important
gaps also matter
short alignments can have high identity but are more likely to be due to chance
longer alignments tend to give higher bit scores

Question 4

Q

bit score

Answer

A

re-scaled (normalised) version of the raw alignment score
independent of search space size
with query sequence length n and summed database sequence length m, the search space is proportional to N = nm
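
The re-scaling can be sketched with the standard Karlin-Altschul conversion, S' = (lambda * S - ln K) / ln 2. The lambda and K values used below are illustrative assumptions (they depend on the scoring matrix and gap penalties) and are not taken from these notes.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S'.

    S' = (lambda * S - ln K) / ln 2
    lam and k are the statistical parameters of the scoring system.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values only (assumed, not from the notes).
example_bits = bit_score(raw_score=100, lam=0.267, k=0.041)
```

Because lambda and K absorb the details of the scoring system, bit scores from searches run with different matrices can be compared directly.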

Question 5

Q

e value

Answer

A

expected number of matches with same bit score or better that would be produced by random chance
takes into account size of search space
- bigger database increases chance of finding a match
lower E value = stronger indication of a real evolutionary relationship between query and match
acts as an expected number of false positives
greater than 1 indicates no statistical support for the relationship
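
A small sketch of how search space enters the E value, using the standard relation E = n * m * 2^(-S') for bit score S', query length n and total database length m. Real BLAST uses effective (edge-corrected) lengths, which this sketch ignores; the function name is illustrative.

```python
def e_value(bits, query_len, db_len):
    """Expected number of chance matches scoring at least `bits`.

    E = n * m * 2 ** (-S'), where n * m is the search space.
    """
    return query_len * db_len * 2.0 ** (-bits)


small_db = e_value(bits=40, query_len=300, db_len=1_000_000)
large_db = e_value(bits=40, query_len=300, db_len=2_000_000)
```

Doubling the database size doubles E for the same bit score, and each extra bit of score halves it.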

Question 6

Q

P value

Answer

A

probability of obtaining at least one match with the same score or better by chance
- different to E value (E can be greater than 1, P cannot)
- nearly the same as E when E is small
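
The link between the two can be sketched with the usual Poisson assumption for chance matches, which gives P = 1 - exp(-E). The function name is illustrative.

```python
import math


def p_value(e):
    """Probability of at least one chance match, given E expected chance matches."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate for very small e.
    return -math.expm1(-e)


small = p_value(0.001)   # very close to E itself
large = p_value(5.0)     # E is 5, but P stays below 1
```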

Question 7

Q

multiple sequence alignments

Answer

A

proteins form families with related sequences, structures and functions
can learn more by aligning multiple sequences from related family members instead of pairwise alignments
patterns of conservation reflect structural and functional evolutionary constraints
- e.g. loops - less important for function, less well conserved
key functional residues often show strong conservation
- e.g. serine proteases - conserved catalytic triad of Ser, His, Asp

Question 8

Q

MSA algorithms

Answer

A

more sequences so slower than pairwise alignments
use heuristic methods
- e.g. clustal

Question 9

Q

clustal

Answer

A

build guide tree
- perform all pairwise alignments
- group similar sequences
  - higher score → neighbouring branches
- guides other pairwise alignments (more information)
align sequences progressively
- most closely related pairs aligned first
- align next most closely related sequences to existing alignments
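
A rough sketch of the guide-tree step: score all pairs, then repeatedly join the two most similar clusters to get the order in which alignments would be built. It is not Clustal's code. As stated assumptions, `difflib.SequenceMatcher` stands in for a real pairwise alignment score, and average linkage stands in for the tree-building method.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in for a pairwise alignment score (0 = unrelated, 1 = identical)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_order(seqs):
    """Return the list of cluster joins, most similar first.

    seqs maps a name to a sequence; each join is a pair of name tuples.
    """
    sim = {frozenset((a, b)): similarity(seqs[a], seqs[b])
           for a, b in combinations(seqs, 2)}

    def linkage(c1, c2):
        # Average similarity between all members of the two clusters.
        scores = [sim[frozenset((x, y))] for x in c1 for y in c2]
        return sum(scores) / len(scores)

    clusters = [(name,) for name in seqs]
    joins = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        joins.append((c1, c2))
    return joins


sequences = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQK",
    "C": "MKTGYLAKQR",
    "D": "GGSWWPLLNV",
}
joins = guide_order(sequences)
```

The join list gives the progressive order: the closest pair is aligned first, and more distant sequences are added to the existing alignment later.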

Question 10

Q

MSA programs

Answer

A

clustal omega
T-coffee
- slow, best for smaller alignments
- align multiple MSAs
MUSCLE
- fast
- estimates sequence similarity using short sequence words of n-residues
- align small words and build up
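
The word-based similarity estimate can be illustrated with a simple shared-word fraction. This is a simplification for illustration, not MUSCLE's actual distance measure; the word length `k` and the normalisation are assumptions.

```python
def words(seq, k=3):
    """Set of all overlapping k-residue words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def word_similarity(a, b, k=3):
    """Fraction of words shared between two sequences (no alignment needed)."""
    wa, wb = words(a, k), words(b, k)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))


close = word_similarity("MKTAYIAKQR", "MKTAYIAKQK")
distant = word_similarity("MKTAYIAKQR", "GGSWWPLLNV")
```

Counting shared words avoids running a full pairwise alignment for every pair, which is where the speed comes from.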

Question 11

Q

sequence profiles

Answer

A

weight profile or PSSM
captures information in MSA
table of residue frequencies at each position of the alignment
- can include gaps
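profile size n x 21 (n alignment positions; 20 amino acids plus gap)
- number of sequences irrelevant
only considers information in that set of sequences
- similar residues that do not appear in the set are not accounted for
can be combined with a substitution matrix such as BLOSUM
used by PSI-BLAST to search for more remote homologues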
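
A minimal sketch of the frequency table, assuming the MSA is given as equal-length strings with `-` for gaps. The column order in `SYMBOLS` is an arbitrary choice for illustration.

```python
from collections import Counter

SYMBOLS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap = 21 columns


def frequency_profile(msa):
    """Return an n x 21 table of symbol frequencies, one row per alignment position."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append([counts[s] / len(column) for s in SYMBOLS])
    return profile


msa = ["MKT-", "MRT-", "MKSA"]
profile = frequency_profile(msa)
```

The table has one row per alignment position regardless of how many sequences went in, which is why the number of sequences does not affect the profile size.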

Question 12

Q

problems with frequency scores

Answer

A

new sequence can differ from profile derived from existing set
doesn’t include evolutionary knowledge of conservative substitutions

Question 13

Q

PSSM

Answer

A

position specific scoring matrix
substitution matrix (like BLOSUM) combined with observed frequencies
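
A simplified sketch of turning observed frequencies into position-specific scores. As a loudly stated assumption, it uses a flat pseudocount and a uniform background frequency in place of the substitution-matrix information described above, so it only shows the log-odds shape of the calculation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    matrix = []
    for column in zip(*msa):
        residues = [c for c in column if c in AMINO_ACIDS]  # ignore gaps
        counts = Counter(residues)
        total = len(residues) + pseudocount
        row = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from scoring minus infinity.
            freq = (counts[aa] + pseudocount * background) / total
            row[aa] = math.log2(freq / background)
        matrix.append(row)
    return matrix


def score_sequence(seq, matrix):
    """Sum the position-specific scores for an ungapped sequence."""
    return sum(row[aa] for aa, row in zip(seq, matrix))


pssm = build_pssm(["MKT", "MRT", "MKS"])
```

Residues seen often at a position score positive and unseen residues score negative, but every unseen residue is penalised equally here. Replacing the flat pseudocount with substitution-matrix information is what lets a conservative substitution score better than a radical one.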

Question 14

Q

PSI-BLAST

Answer

A

position-specific iterated BLAST
form a PSSM from the MSA of the BLAST search hits
- include closely related sequences only, to limit noise
- search again with this profile
- add new significant hits to refine the profile
- iterate until no more significant hits are found (or for x iterations)
bridges sequence space to find more distantly related homologues
start with a more conservative (stricter) E value to ensure you include the right sequences
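
The iteration can be sketched as a loop with a stopping rule. The `search` and `build_profile` callables are hypothetical placeholders supplied by the caller, not real BLAST functions, and the toy data below is invented purely to exercise the loop.

```python
def iterate_search(query, search, build_profile, e_cutoff=0.001, max_iterations=5):
    """Repeat profile searches until no new significant hits appear.

    search(profile) must yield (name, e_value) pairs;
    build_profile(names) must turn the included set into a profile.
    Returns the included set and the number of rounds that added hits.
    """
    included = {query}
    rounds = 0
    for _ in range(max_iterations):
        profile = build_profile(included)
        hits = {name for name, e_value in search(profile) if e_value <= e_cutoff}
        new_hits = hits - included
        if not new_hits:
            break  # converged: nothing new passed the cutoff
        included |= new_hits
        rounds += 1
    return included, rounds


# Toy stand-in: each sequence only "detects" its direct neighbours.
NEIGHBOURS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}


def toy_search(profile):
    return [(hit, 1e-5) for member in profile for hit in NEIGHBOURS[member]]


family, rounds = iterate_search("A", toy_search, frozenset)
```

In the toy data A never matches D directly; D is only reached through B and C, which is the bridging effect described above.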

Question 15

Q

PSI-BLAST

low complexity regions

Answer

A

repeats of a few residues, often a coil
limited information - unrelated protein sequences brought in
need to be masked
- SEG program in blast
- run with and without - no clear default
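
To illustrate what masking does, here is a sketch that replaces low-information windows with X using Shannon entropy. This is not the SEG algorithm; the window size and threshold are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter


def entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every window whose entropy is below the threshold with X."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = "X" * window
    return "".join(masked)


seq = "MKTAYIAKQRQISFVK" + "SSSSSSSSSSSS" + "HFSRQLEERLGLIEVQ"
masked = mask_low_complexity(seq)
```

Masked positions can no longer seed or score matches, so a serine-rich stretch like the one above cannot pull in unrelated serine-rich proteins.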

Question 16

Q

PSI-BLAST

drift

Answer

A

sequences found early on can disappear later
rogue matches can pollute the PSSM
watch the output and stop earlier
or use a stricter E value

Question 17

Q

PSI-BLAST

asymmetry

Answer

A

searching with sequence A may find B
searching with B won’t necessarily find A

Question 18

Q

HMMs

Answer

A

statistically rigorous representation of protein families (via domains)
markov chain
- series of states connected by a transition probability
- probability of state i+1 depends only on state i
- probability of a sequence of states = product of the probabilities along the path that generated it
  - transition and output probabilities
  - delete, match and insert states (transition)
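
A small sketch of the path-probability product. For brevity it assumes only match ("M") and insert ("I") states, leaving out delete states, which emit no symbol. All probabilities below are made-up illustrative numbers.

```python
def path_probability(states, symbols, start, transition, emission):
    """Probability of emitting `symbols` along the given state path.

    Multiplies the start probability, each transition probability, and
    each output (emission) probability along the path.
    """
    prob = start[states[0]] * emission[states[0]][symbols[0]]
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        prob *= transition[prev][cur] * emission[cur][sym]
    return prob


# Illustrative parameters only (assumed, not from the notes).
start = {"M": 1.0, "I": 0.0}
transition = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.5, "I": 0.5}}
emission = {"M": {"A": 0.8, "G": 0.2}, "I": {"A": 0.5, "G": 0.5}}

p = path_probability(["M", "M", "I"], ["A", "G", "A"], start, transition, emission)
```

Each factor depends only on the current state and the one before it, which is the Markov property stated above.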

(18 cards)