kisterae: March 2007

note070314


ls | sed -e 's/^\(.*\).fa/..\/hmmer\/binaries\/hmmpfam motif.hmm \1.fa > ..\/result\/\1.result/' | sh
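The one-liner above rewrites each NAME.fa file name into an hmmpfam command line and hands the result to sh. A minimal Python sketch of the same command-generation step follows, useful for inspecting the generated lines before running them. The function name `hmmpfam_commands` is hypothetical; the paths are copied from the one-liner, and nothing is executed.

```python
import os

def hmmpfam_commands(filenames, hmm="motif.hmm"):
    """Build one hmmpfam shell command per *.fa file name (nothing is executed)."""
    commands = []
    for name in sorted(filenames):
        stem, ext = os.path.splitext(name)
        if ext != ".fa":
            continue  # skip anything that is not a FASTA file
        commands.append(
            f"../hmmer/binaries/hmmpfam {hmm} {stem}.fa > ../result/{stem}.result"
        )
    return commands
```

Calling `hmmpfam_commands(os.listdir("."))` and printing each entry gives the same lines that the sed expression emits before they reach sh.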

tar cf - whatever | ssh remotehost " ( cd /some/path ; tar xf - ) "
ssh -l ysc212 access3.cims.nyu.edu "( cd /tmp/mark ; tar cf - hmm ) " | tar xf -

tar cf - hmm | gzip | ssh -l ysc212 access3.cims.nyu.edu " ( cd /tmp/mark ; cat > hmm.tar.gz ) "
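The tar commands above copy a directory by writing the archive to a stream and unpacking it on the other side, so no intermediate archive file is needed. The sketch below shows only that stream idea in Python with the standard `tarfile` module; the ssh hop is not modelled, and the names `stream_copy` and `demo` are hypothetical.

```python
import io
import os
import tarfile
import tempfile

def stream_copy(src_dir, dest_dir):
    """Pack src_dir into an in-memory tar stream, then unpack it under dest_dir."""
    buffer = io.BytesIO()
    # Counterpart of `tar cf - hmm`: the archive goes to a stream, not a file.
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        archive.add(src_dir, arcname=os.path.basename(src_dir))
    buffer.seek(0)
    # Counterpart of `( cd dest ; tar xf - )`: the archive is read from the stream.
    with tarfile.open(fileobj=buffer, mode="r") as archive:
        archive.extractall(dest_dir)

def demo():
    """Round-trip a throwaway directory and return the copied file's content."""
    with tempfile.TemporaryDirectory() as src_root, \
            tempfile.TemporaryDirectory() as dest_root:
        src = os.path.join(src_root, "hmm")
        os.mkdir(src)
        with open(os.path.join(src, "example.txt"), "w") as handle:
            handle.write("placeholder\n")
        stream_copy(src, dest_root)
        with open(os.path.join(dest_root, "hmm", "example.txt")) as handle:
            return handle.read()
```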



ls | sed -e 's/^\(.*\).fa/http:\/\/localhost:81\/sssdb\/tools\/motifseq10foldtest.php?motif=1A&fold=0\/

ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/wget -O \1.fatr0 "http:\/\/192.168.1.100:81\/sssdb\/tools\/motifseq10foldtrain.php?motif=\1\&fold=0"/'
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/muscle\/muscle3.6\/muscle -in \1.fatr0 -out \1.fatr0a/'
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmbuild -g -n \1 -A motif.hmm \1.fatr0a/'
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmpfam motif.hmm \1.fate0/'

ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmpfam -E 300 motif.hmm \1.fate0 | perl ..\/fasta\/accuracy.pl \1/' | sh | perl ../fasta/totalaccuracy.pl
32/71

yes | rm *.fat*
rm motif.hmm
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/wget -O \1.fatr0 "http:\/\/192.168.1.100:81\/sssdb\/tools\/motifseq10foldtrain.php?motif=\1\&fold=0"/'
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/muscle\/muscle3.6\/muscle -in \1.fatr0 -out \1.fatr0a/'
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmbuild -g -n \1 -A motif.hmm \1.fatr0a/'
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmpfam -E 300 motif.hmm \1.fate0 | perl ..\/fasta\/accuracy.pl \1/' | sh | perl ../fasta/totalaccuracy.pl


for ($cx=0;$cx<10;$cx++){
  # remove the files and models left over from the previous fold
  `yes | rm *.fat*`;
  `rm motif.hmm`;
  # fetch the training and test sequences of every motif for fold $cx
  `ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/wget -q -O \\1.fatr$cx "http:\\/\\/192.168.1.100:81\\/sssdb\\/tools\\/motifseq10foldtrain.php?motif=\\1\\&fold=$cx"/' | sh`;
  `ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/wget -q -O \\1.fate$cx "http:\\/\\/192.168.1.100:81\\/sssdb\\/tools\\/motifseq10foldtest.php?motif=\\1\\&fold=$cx"/' | sh`;
  # align each training set, then append one model per motif to motif.hmm
  `ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/..\\/muscle\\/muscle3.6\\/muscle -quiet -in \\1.fatr$cx -out \\1.fatra$cx/' | sh`;
  `ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/..\\/hmmer\\/binaries\\/hmmbuild -g -n \\1 -A motif.hmm \\1.fatra$cx/' | sh`;
  # search every test set against all models and print the total accuracy
  print `ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/..\\/hmmer\\/binaries\\/hmmpfam -E 300 motif.hmm \\1.fate$cx | perl ..\\/fasta\\/accuracy.pl \\1/' | sh | perl ../fasta/totalaccuracy.pl`;
}
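The loop above runs five stages per fold: fetch the training set, fetch the test set, align, build, search. Each stage covers every motif before the next stage starts, so motif.hmm holds all models by the time hmmpfam runs. A minimal Python sketch of that ordering, as a dry run that only builds the command strings, is given below. The function name `fold_script` is hypothetical, the cleanup steps and the final totalaccuracy.pl pipe are left out, and the command templates are copied from the loop.

```python
def fold_script(motifs, fold, host="192.168.1.100:81"):
    """Return the shell commands for one fold, stage by stage (nothing is run)."""
    base = f"http://{host}/sssdb/tools"
    stages = [
        'wget -q -O {m}.fatr{k} "{base}/motifseq10foldtrain.php?motif={m}&fold={k}"',
        'wget -q -O {m}.fate{k} "{base}/motifseq10foldtest.php?motif={m}&fold={k}"',
        "../muscle/muscle3.6/muscle -quiet -in {m}.fatr{k} -out {m}.fatra{k}",
        "../hmmer/binaries/hmmbuild -g -n {m} -A motif.hmm {m}.fatra{k}",
        "../hmmer/binaries/hmmpfam -E 300 motif.hmm {m}.fate{k}"
        " | perl ../fasta/accuracy.pl {m}",
    ]
    # Stage-major order: every motif finishes a stage before the next one starts.
    return [t.format(m=m, k=fold, base=base) for t in stages for m in motifs]
```

Printing `fold_script(["1A", "3D"], 0)` line by line shows the ten commands of fold 0 for two motifs in the order the loop would execute them.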

10fold
354/723
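The two results in this note (32/71 for the single fold-0 run and 354/723 for the 10-fold run) read as correct/total tallies. A small sketch of pooling such tallies into one percentage follows. The "correct/total" string format and the function name `pool` are assumptions; the actual output format of accuracy.pl and totalaccuracy.pl is not shown in these notes.

```python
def pool(tallies):
    """Sum 'correct/total' strings and return (correct, total, percent)."""
    correct = total = 0
    for tally in tallies:
        c, n = tally.split("/")
        correct += int(c)
        total += int(n)
    return correct, total, 100.0 * correct / total

single_fold = pool(["32/71"])
ten_fold = pool(["354/723"])
```

Under that reading, the single-fold run is about 45% correct and the 10-fold run about 49%.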

suggests 1B=1A
database wrong: g6rlx.2
d1dyza_ suggests we should also train on alpha helices.
suggests d1nbqa2 should be 1F instead of 10C
these suggest how the HMM works: d1dyza_ & d1smaa1 & d1ea9c1
d1bf5a2 suggests we should treat 7&8 as one (from the sequence alone, it knows 5&8 are h-bonded together?)
three sheets like: d1ikna2(7F), d1nfia2(4F)
it can also suggest a way of transformation.
although the HMM correctly identifies d1tfpa_ as 38A, it also suggests d1tfpa_ is close to 7J, as I commented in our database
Model    Description                                    Score    E-value  N
-------- -----------                                    -----    ------- ---
38A.fa                                                  357.4   4.8e-106   1
7J.fa                                                   276.8      9e-82   1
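The hit table above is what makes the "close to 7J" observation visible: the runner-up model scores well too. A minimal sketch for pulling the hits out of such a table and ranking them by score is given below. It assumes the last three columns are always score, E-value and N, as in the rows shown; the function name `parse_hits` is hypothetical and this is not a full hmmpfam output parser.

```python
def parse_hits(text):
    """Parse rows of an hmmpfam hit table into (model, score, evalue, n) tuples."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(fields) < 4 or fields[0] == "Model" or set(fields[0]) == {"-"}:
            continue
        model = fields[0]
        score, evalue, n = fields[-3], fields[-2], fields[-1]
        hits.append((model, float(score), float(evalue), int(n)))
    # Highest score first, so hits[0] is the call and hits[1] the runner-up.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

table = """\
Model    Description                                    Score    E-value  N
-------- -----------                                    -----    ------- ---
38A.fa                                                  357.4   4.8e-106   1
7J.fa                                                   276.8      9e-82   1
"""
hits = parse_hits(table)
```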


detects 3 sheets and suggests g1pvc.1 is 6E, even though g1pvc.1 and d1du5a_ belong to different folds:
NC: g1pvc.1: 6E.fa  346.5   9.2e-103        1
6E: d1du5a_, d1kwna_

notes 070313

wget -O 1B.fasta "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=1B"
../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm 1B.fasta | perl accuracy.pl 1B

wget -O- -q "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=1B" | ../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm - | perl accuracy.pl 1B

ls ../fa/*.fa | sed -e 's/.*\/\(.*\).fa/wget -O- -q "http:\/\/sssdb.no-ip.info:81\/sssdb\/tools\/motifseqs.php?motif=\1" \| ..\/hmmer\/binaries\/hmmpfam -E 300 ..\/fa\/motif.hmm - \| perl accuracy.pl \1/'

people categorize proteins by sequence into families. (HHpred uses SCOP, PDB, SMART)
that's why only we can do this.

because of low sequence similarity, blast can't find sequences in the same motif, but the HMM can.
tested using motif 3D, which has 12 sequences from 3 superfamilies;
sequence similarity between sequences in different superfamilies is lower than 4%.
we train the model using two superfamilies, and the model correctly identifies the sequences from the third superfamily.
it means,
1) we can correctly identify sequences with similar SSS but low sequence identity.
2) this method can find not only similar sequences but similar structures.
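The motif 3D test above is a leave-one-superfamily-out split: train on two superfamilies, test on the third. A minimal sketch of producing such splits follows. The sequence IDs and superfamily labels in the usage example are hypothetical placeholders, and rotating over all three superfamilies is an assumption, since the notes only describe holding out one.

```python
def leave_one_superfamily_out(sequences):
    """Yield (held_out, train_ids, test_ids) for each superfamily in turn.

    sequences is a list of (sequence_id, superfamily) pairs.
    """
    superfamilies = sorted({superfamily for _, superfamily in sequences})
    for held_out in superfamilies:
        train = [sid for sid, sf in sequences if sf != held_out]
        test = [sid for sid, sf in sequences if sf == held_out]
        yield held_out, train, test

# Hypothetical placeholder data, not the real motif 3D members.
example = [("s1", "A"), ("s2", "A"), ("s3", "B"), ("s4", "C")]
splits = list(leave_one_superfamily_out(example))
```

Each training list would be aligned and passed to hmmbuild, and the held-out list searched with hmmpfam, as in the commands elsewhere in these notes.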

for 1A, the training set has 37 sequences from 1 family, 6 domains.
the testing set has 1092 sequences; we got 1071 correct, which is 98% accuracy.
notes: we relax the E-value threshold to 300 (hmmpfam -E 300)

error detection
1bih A:5-98 (3D) represents 1bih B:307-395,
but the HMM tells me 1bih B:307-395 is 3A.
so I checked the structure, and it does belong to 3A.
because 1bih B:307-395 has a longer loop and two strands are considered as one, it becomes 3A,
and because SCOP does not categorize proteins by SSS as we do, our categorizations differ.
thus, we can only use the 703 structures to test our method, not all of the sequences.

1 8 7 3 4
0' 2 6 5

1 3 7 6
2 9 8 4 5

even when it can't find the correct motif, it suggests several motifs that contain subsets of the strands of the real motif.
then we can use alignment and the idea of homology modeling to construct the correct motif.

d1orsa1: strand 7 should be 7&7'

found by using HMM,

d12e8l1 should be 1A, but the HMM says 1B; after checking, it's because 1B should be 1A.

HMM is correct, we are wrong.

note 070312

NBTree:86%/77%
J48:95%/79%
ADTree:74%/
VotedPerceptron:67%/
bayesnet:70%/
VFI:68%/
OneR:66%/
DecisionTable:79%/
IB1:/78%
KStar:99.9%/81.45%
LWL:69%/
IBk:/79%

Structure prediction server (gives you MultipleAlignment, SSE, R-R Contact, 3D)
http://www.soe.ucsc.edu/research/compbio/SAM_T06/T06-query.html
Warning: takes at least 2 hours.

emboss in umdnj
http://siriusc.umdnj.edu/emboss/

Muscle: Good Alignment Server (better than ClustalW)
JalView: Good Alignment Editor

5 4 8 9 2
6 7 3 1

E,J,J,NJ,E,NJ,J,NE
1 3 7 6
2 9 8 4 5

adtree => excel
sum>2.5

1F 27

TP=14=14/27=52%
FN=13=13/27
TN=653=653/693
FP=13=13/693

correct=14+653=667/693=96%
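The tallies above can be checked with a few lines of arithmetic. The sketch reads them as 14 of the 27 1F members found, 13 members missed, 653 non-members rejected and 13 non-members wrongly flagged. That reading is an assumption, chosen because those four counts add up to the 693 used in the total. The function name `summarize` is hypothetical.

```python
def summarize(tp, fn, tn, fp):
    """Return total, accuracy, recall and precision for one motif's tallies."""
    total = tp + fn + tn + fp
    return {
        "total": total,
        "accuracy": (tp + tn) / total,  # fraction of all calls that are right
        "recall": tp / (tp + fn),       # fraction of true members that are found
        "precision": tp / (tp + fp),    # fraction of flagged items that are members
    }

stats = summarize(tp=14, fn=13, tn=653, fp=13)
```

The overall accuracy comes out at about 96% while recall is only about 52%, because non-members dominate the set; the per-motif hit rate is the more informative number here.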

TP=14/(27-8sao)=74%
: -0.142
(1)ENE = NE: 0.12
(2)looplen <>= 2.5: -0.132
(3)nENE = NE: -0.124
(3)nENE = E: 0.767
(1)ENE = E: -0.732
(4)str2len <>= 3.5: -0.104
Legend: -ve = JP, +ve = NJP
