sequence alignment Flashcards

1
Q

bioinformatics resources

A
  • algorithm
    • set of rules to perform an operation
    • same one can be used by different programs
  • program
    • code that implements an algorithm
    • can use stored data, but storing data is not its aim (e.g. PSI-BLAST)
  • database
    • organised searchable source of biological data
    • aim is to store data
2
Q

protein evolution

A
  • duplication can lead to divergent evolution
    • homologous proteins with related sequences and structures
    • often but not always related function
  • analyse by alignment
3
Q

alignment features

A
  • identity (:)
  • gap
    • insertion or deletion
  • substitution
    • can be conservative (.)
    • same characteristics e.g. hydrophobicity
  • end gap
    • one sequence longer than the other
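The `:` and `.` markers above can be generated from two already-aligned sequences. This is a minimal sketch: the residue groups in `GROUPS` are a hypothetical, simplified classification chosen for illustration, and `markup` is not taken from any particular alignment program.

```python
# Hypothetical grouping of residues by broad chemical character
# (an assumption for illustration, not a standard classification).
GROUPS = [set("AVLIMFWC"), set("STNQYGP"), set("DE"), set("KRH")]


def markup(seq_a: str, seq_b: str) -> str:
    """Return a markup line for two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            marks.append(" ")  # gap: insertion or deletion
        elif a == b:
            marks.append(":")  # identity
        elif any(a in g and b in g for g in GROUPS):
            marks.append(".")  # conservative substitution
        else:
            marks.append(" ")  # non-conservative substitution
    return "".join(marks)


top, bottom = "MKVL-DE", "MRILADK"
print(top)
print(markup(top, bottom))
print(bottom)
```

Here K/R and V/I are marked as conservative, the `-` column is a gap, and E/K is left unmarked because the two residues fall in different groups.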
4
Q

paralogs

A
  • homolog created by gene duplication within a species
  • can result in change of function
    • original copy can maintain function
    • second copy free to mutate and adopt novel function
5
Q

orthologs

A
  • homolog created by speciation
  • both species now have a single copy of the same gene
  • only one copy per species
    • less likely to change
    • function needs to be retained
6
Q

requirements of a pairwise protein sequence alignment

A
  • scoring scheme of residue similarity
  • algorithm to establish the alignment
  • aim to combine algorithm and scoring scheme to generate the best alignment in biological terms
  • potential to be extended to database searching
7
Q

scoring schemes

A
  • simplest would be 1 for identity and 0 for different
  • better to include similarity of residues
    • conservative substitutions indicate more recent changes
    • residues tend to retain chemical properties so that function is modified, not destroyed
  • gaps also indicate increased distance
8
Q

BLOSUM

A
  • blocks substitution matrix
  • aligned segments of protein families (blocks)
  • blosum62:
    • sequences within a block with pairwise identity ≥62% are clustered and counted as one, so the scores come from more divergent pairs
    • most widely used
9
Q

blosum62

A
  • substitution matrix
    • score for changing one residue to another
    • represents chemical similarity
  • e.g. Cys: disulfide formation means high conservation
    • presence in both sequences indicates similarity
    • high score (9)
  • negative score if properties change
    • e.g. hydrophobic to charged
  • empirical
  • gaps considered later
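To make the contrast with simple identity scoring concrete, here is a small sketch that scores an ungapped pair both ways. `BLOSUM62_EXCERPT` holds only a handful of BLOSUM62 entries copied by hand (the full matrix is 20 x 20), and the function names are hypothetical.

```python
# A few BLOSUM62 entries; keys are residue pairs in alphabetical order.
BLOSUM62_EXCERPT = {
    ("C", "C"): 9, ("W", "W"): 11, ("L", "L"): 4,
    ("I", "L"): 2, ("I", "V"): 3, ("D", "E"): 2, ("D", "L"): -4,
}


def blosum_score(a: str, b: str) -> int:
    """Look up a pair in the excerpt; the matrix is symmetric."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]


def identity_score(a: str, b: str) -> int:
    """Simplest scheme: 1 for identity, 0 for any difference."""
    return 1 if a == b else 0


def total(seq_a: str, seq_b: str, scorer) -> int:
    """Sum a per-position score over two ungapped, equal-length sequences."""
    return sum(scorer(a, b) for a, b in zip(seq_a, seq_b))


print(total("CLID", "CIVE", identity_score))  # only the Cys pair counts
print(total("CLID", "CIVE", blosum_score))    # conservative pairs also score
print(blosum_score("L", "D"))                 # hydrophobic vs charged
```

Identity scoring sees a single match, while the matrix also rewards the L/I, I/V and D/E pairs and penalises the L/D change.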
10
Q

affine gap penalty

A
  • penalise insertions/deletions
  • penalty = o + el
    • o = gap opening constant
    • e = gap extension constant
    • l = length of gap extension
  • o>e
    • gap introduction is the major event
    • extending the gap is minor
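The formula above can be sketched directly. The default values for `gap_open` and `gap_extend` are illustrative assumptions, and programs differ on whether the first gap position also pays the extension cost; this version applies the card's formula literally.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Penalty = o + e * l for a gap of the given length (0 if no gap)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * length


# One long gap costs less than two short gaps covering the same length,
# because the opening cost is paid once instead of twice.
print(affine_gap_penalty(4))
print(2 * affine_gap_penalty(2))
```

With o > e, the scheme favours a few long gaps over many short ones, matching the idea that opening a gap is the major event.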
11
Q

protein domains

A
  • protein sequences are formed of domains
  • each domain can originate from a different homologous family
  • domains are the evolutionary unit
  • methods need to take this into account
    • don’t have to align whole sequence
12
Q

local vs global alignment methods

A
  • different algorithms: global aligns the sequences end to end, local aligns only the best-matching segments
  • part or all of a query can match part or all of a database sequence
  • gaps may be needed to get a suitable alignment
13
Q

dotplot

A
  • used to assign identities
  • one sequence on each axis
  • place a dot wherever the two residues match; similar regions show up as diagonal runs of dots
  • best path has the highest number of dots
  • need closely related sequences
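A dotplot is simple enough to build by hand. The sketch below uses exact identity only, with no window or threshold filtering; `dotplot`, `render` and `longest_diagonal` are hypothetical helper names.

```python
def dotplot(seq_x: str, seq_y: str) -> list[list[bool]]:
    """Grid with seq_x along the columns and seq_y down the rows."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x: str, seq_y: str) -> str:
    """Draw the grid as text, '*' for a match and '.' otherwise."""
    lines = ["  " + seq_x]
    for b, row in zip(seq_y, dotplot(seq_x, seq_y)):
        lines.append(b + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


def longest_diagonal(grid: list[list[bool]]) -> int:
    """Length of the longest unbroken diagonal run of dots."""
    best = 0
    prev = [0] * (len(grid[0]) + 1) if grid else []
    for row in grid:
        cur = [0] * (len(row) + 1)
        for j, hit in enumerate(row, start=1):
            if hit:
                cur[j] = prev[j - 1] + 1  # extend the run from the upper left
                best = max(best, cur[j])
        prev = cur
    return best


print(render("GATTACA", "ATTAC"))
print(longest_diagonal(dotplot("GATTACA", "ATTAC")))
```

The shorter sequence appears as one unbroken diagonal, offset by one column because it starts at the second residue of the longer one.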
14
Q

needleman wunsch algorithm

A
  • dynamic programming
  • maximises similarity score to give maximum match
    • largest number of residues of one sequence that can be matched with another allowing for all possible insertions/deletions
  • finds best global alignment
  • iterative matrix method
    • 2D array of all possible pairs of residues (bases or amino acids)
    • one sequence on each axis
    • all possible alignments represented by paths through the array
15
Q

NW similarity values

A
  • Sij = numerical value assigned to every cell in the array
  • depends on similarity of the 2 residues
  • in the simplest scheme, a value of 1 indicates identity
16
Q

NW alignment construction

A
  • add Sij from left to right along a path through the array to get cumulative alignment score
  • best alignment has highest score = maximum match
    • maximum match always in outer row/column
    • work backwards from here to construct alignment
    • gaps can be inserted
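The matrix fill and traceback can be sketched as follows. Note the assumptions: this is the common textbook recurrence with a linear gap penalty, filled from the top-left corner with the final score in the last cell, rather than the original formulation the cards describe. The `match`, `mismatch` and `gap` values are illustrative.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align the two residues
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    # Work backwards from the last cell to rebuild one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

For `ACGT` against `AGT` the three shared residues are matched and the unmatched C is placed opposite a gap.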
17
Q

smith waterman algorithm

A
  • compares segments of all possible lengths instead of looking at each sequence as a whole
    • local alignment
  • choose whichever maximises similarity measure
  • allow for gaps
  • variation on NW
    • dynamic programming
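The local variant changes two things relative to the global method: cell scores are floored at zero, and the traceback starts at the highest-scoring cell and stops at a zero. A minimal sketch, assuming a linear gap penalty and illustrative `match`, `mismatch` and `gap` values:

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a fresh segment
                          H[i - 1][j - 1] + s,  # align the two residues
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTTT", "GGACGGG")
print(score)
print(top)
print(bottom)
```

Only the shared `ACG` segment is reported; the unrelated flanking residues are left out of the alignment instead of being forced into it.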