# download the sequences for motif 1B, then score them against the motif HMMs
wget -O 1B.fasta "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=1B"
../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm 1B.fasta | perl accuracy.pl 1B
# same thing as one pipeline, reading the sequences from stdin
wget -O- -q "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=1B" | ../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm - | perl accuracy.pl 1B
# print that pipeline for every motif in ../fa (pipe the output to sh to run it)
ls ../fa/*.fa | sed -e 's#.*/\(.*\)\.fa$#wget -O- -q "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=\1" | ../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm - | perl accuracy.pl \1#'
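the sed one-liner above builds one pipeline per motif file. a possible alternative, sketched here as a small shell function, may be easier to read and change. the URL, paths and options are copied from the commands above; the function name motif_cmd is hypothetical and not part of the original scripts.

```shell
#!/bin/sh
# motif_cmd FILE: print the download-and-score pipeline for one motif.
# The motif name is the file name without its .fa suffix (../fa/1B.fa -> 1B).
# motif_cmd is a hypothetical helper name used only for this sketch.
motif_cmd() {
    motif=$(basename "$1" .fa)
    printf 'wget -O- -q "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=%s" | ../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm - | perl accuracy.pl %s\n' "$motif" "$motif"
}

# Usage: print one pipeline per motif; append "| sh" to actually run them.
# for f in ../fa/*.fa; do motif_cmd "$f"; done
```

printing the commands first, as the sed version does, makes it easy to check them before anything is downloaded.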
people categorize proteins into families by sequence. (HHpred uses SCOP, PDB, SMART)
that's why only we can do this.
because of low sequence similarity, BLAST can't find the sequences that share a motif, but an HMM can.
testing using motif 3D, which has 12 sequences from 3 superfamilies:
sequence similarity between sequences in different superfamilies is lower than 4%.
we train the model using two superfamilies, and the model correctly identifies the sequence from the third superfamily.
this means:
1) we can correctly identify sequences with similar SSS but low sequence identity.
2) this method can find not only similar sequences but also similar structures.
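the 4% figure above depends on how similarity is measured. a minimal sketch of one common choice, percent identity over two rows of an alignment, is given below. it is an assumption that the figure was computed this way. the function name pct_identity, the choice to skip gapped columns, and the toy sequences in the usage line are all made up for illustration.

```shell
#!/bin/sh
# pct_identity ROW1 ROW2: percent of aligned columns holding the same residue.
# Both arguments are rows of one alignment (equal length, '-' marks a gap).
# Columns where either row has a gap are skipped; that denominator is an
# assumption, other definitions divide by the full alignment length.
pct_identity() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) {
            print "error: rows differ in length" > "/dev/stderr"
            exit 1
        }
        n = 0; same = 0
        for (i = 1; i <= length(a); i++) {
            x = substr(a, i, 1); y = substr(b, i, 1)
            if (x == "-" || y == "-") continue   # ignore gapped columns
            n++
            if (toupper(x) == toupper(y)) same++
        }
        if (n == 0) { print "0.0"; exit 0 }      # no ungapped columns
        printf "%.1f\n", 100 * same / n
    }'
}

# Usage with made-up toy rows (4 of 5 ungapped columns match):
# pct_identity "ACDE-G" "ACKEFG"    # prints 80.0
```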
for 1A: training set, 37 sequences from 1 family, 6 domains.
testing set, 1092 sequences; we got 1071 correct, which is 98% accuracy.
note: we loosened the threshold to an E-value cutoff of 300 (the -E 300 option in the commands above).
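the accuracy figure for 1A is just correct hits over test sequences. a small sketch of the arithmetic, using the counts from the notes above (the function name accuracy is hypothetical and is not the accuracy.pl script):

```shell
#!/bin/sh
# accuracy CORRECT TOTAL: print the percentage of correct hits, one decimal.
# Hypothetical helper for this sketch; it is not the accuracy.pl script.
accuracy() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * c / t }'
}

accuracy 1071 1092    # prints 98.1
```

that leaves 1092 - 1071 = 21 test sequences assigned to a different motif.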
error detection
1bih A:5-98 (3D) represents 1bih B:307-395,
but the HMM tells me 1bih B:307-395 is 3A.
so I checked the structure, and it does belong to 3A.
because 1bih B:307-395 has a longer loop and two of its strands are counted as one, it becomes 3A,
and because SCOP does not categorize proteins by SSS as we do, our categorization differs from theirs.
thus, we can only use 703 structures to test our method, not all of the sequences.
1 8 7 3 4
0' 2 6 5
1 3 7 6
2 9 8 4 5
even when it can't find the correct motif, it suggests several motifs that contain subsets of the strands of the real motif.
then we can use alignment and the idea of homology modeling to construct the correct motif.