kisterae

Location: New York, New York, United States

My name is 江奕賢.

Thursday, July 23, 2009

emboss on adcluster issue

Issue: On adcluster, the EMBOSS programs showdb and preg run fine when started directly, but fail when submitted with qsub.

Reason: A job submitted with qsub runs on another machine, and on that machine the databases directory is not linked to the proper location on the file system. The link should be:
/opt/Bio/EMBOSS/share/EMBOSS/data/databases -> /ifs/data/c2b2/af_lab/shares/emboss

Solution:
In the user's home directory, put a .embossrc file containing the following:

DB swissprot [
type: P
comment: "SWISSPROT protein sequences"
method: gcg
format: swiss
dbalias: sw
dir: /ifs/data/c2b2/af_lab/shares/emboss/gcgswissprot
file: swissprot*
]

This points the swissprot database to the proper location,
so showdb lists that database and preg runs against the correct data.
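
To confirm the diagnosis, it may help to run a small check both directly and through qsub and compare the output. The sketch below only uses the Python standard library; the helper name describe_path is made up for illustration, and the two paths are the ones quoted in this post.

```python
import os

def describe_path(path):
    """Report where a path resolves to and whether it ends at a real directory."""
    return {
        "path": path,
        "is_symlink": os.path.islink(path),
        "resolved": os.path.realpath(path),    # target after following symlinks
        "is_directory": os.path.isdir(path),   # False for a missing or broken link
    }

if __name__ == "__main__":
    # Paths taken from the post; on a node without the link,
    # is_directory is expected to come back False for the first one.
    for p in ("/opt/Bio/EMBOSS/share/EMBOSS/data/databases",
              "/ifs/data/c2b2/af_lab/shares/emboss"):
        print(describe_path(p))
```

If the first path is not a directory on the compute node, the .embossrc entry above is the workaround, since it names the data directory explicitly.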


Sunday, April 29, 2007

some analysis of barrels with the sandwich HMMs

g1rin.1: 13B.fa 121.5 5.1e-35 1
looks more like a sandwich to me than a barrel.

g1ghw.1: 1F.fa 121.2 6.2e-35 1
looks like a barrel, but has an interlock.

d1eu3a1: 20B.fa -46.2 1.8e+02 1
a barrel that looks like a sandwich.
but I agree it's a barrel, since I didn't see a sandwich from this angle.

d3vub__: 20B.fa -49.6 1.8e+02 1
two barrels share one wall, like an "8" shape

d1jb3a_: 20B.fa -42.4 1.8e+02 1
barrel
1 2 3 5 4

4 1 2'
5 3 2

d3chbd_: 20B.fa -44.5 1.8e+02 1
motif 20B has only one structure, which doesn't look like a regular sandwich.
d3chbd_ looks like a distorted barrel, which also looks like a distorted sandwich.

d1jzua_: 20B.fa -36.6 1.8e+02 1
jelly roll; probably because of the strand lengths, it almost forms a sandwich-like structure.

3V
a motif that doesn't have an interlock.

d1qb5d_: 3V.fa -44.2 1.8e+02 1
there's one sheet which looks like a sandwich's sheet.

d1g7sa1: 3V.fa -45.3 1.8e+02 1
probably because of the additional sheets in the barrel

some of these structures have alpha helices, so our HMM probably can't recognize them perfectly.

d1ewna_: 2G.fa -22.0 1.8e+02 1
has lots of alpha helices

d1ja1a1: 2G.fa -10.2 1.8e+02 1
has lots of alpha helices in this domain

d1f20a1: 2G.fa -13.9 1.8e+02 1
has lots of alpha helices in this domain

d1i4ua_: 2G.fa -29.0 1.8e+02 1

the training set used for the HMMs had already been modified before grouping, but the sequences we run through them here have not.

Friday, April 27, 2007

predict barrel structures using sandwich motif HMMs

d1ep3b1: 2G.fa -40.7 1.8e+02 1
d1et9a1: 2G.fa -35.3 1.8e+02 1
d1eu3a1: 20B.fa -46.2 1.8e+02 1
d1bw3__: 2G.fa -39.8 1.8e+02 1
d1ne8a_: 2G.fa -33.6 1.8e+02 1
d1nnxa_: 2G.fa -41.6 1.8e+02 1
d1iiua_: 2G.fa -34.5 1.8e+02 1
d1kt7a_: 2G.fa -36.1 1.8e+02 1
d1rbp__: 2G.fa -36.6 1.8e+02 1
d1l6ma_: 2G.fa -30.8 1.8e+02 1
d1ewna_: 2G.fa -22.0 1.8e+02 1
d1nrga_: 2G.fa -40.3 1.8e+02 1
d1ow1a_: 2G.fa -39.1 1.8e+02 1
d1uapa_: 2G.fa -42.5 1.8e+02 1
d1ueab_: 2G.fa -33.1 1.8e+02 1
d3vub__: 20B.fa -49.6 1.8e+02 1
d1m1fa_: 2G.fa -46.8 1.8e+02 1
d1ub4a_: 2G.fa -38.5 1.8e+02 1
d1qoia_: 2G.fa -37.9 1.8e+02 1
d1a33__: 2G.fa -38.9 1.8e+02 1
d1qnga_: 2G.fa -42.3 1.8e+02 1
d1ista_: 2G.fa -29.6 1.8e+02 1
d1ihga2: 2G.fa -37.3 1.8e+02 1
d2cpl__: 2G.fa -37.1 1.8e+02 1
d1dywa_: 2G.fa -33.8 1.8e+02 1
d1h0pa_: 2G.fa -35.2 1.8e+02 1
d2rmca_: 2G.fa -31.4 1.8e+02 1
d1cyna_: 2G.fa -37.8 1.8e+02 1
d1liua1: 2G.fa -46.1 1.8e+02 1
d1sr3a_: 2G.fa -37.9 1.8e+02 1
d1jb3a_: 20B.fa -42.4 1.8e+02 1
d1xe1a_: 24A.fa -42.4 1.8e+02 1
g1rin.1: 13B.fa 121.5 5.1e-35 1
d1k0ha_: 2G.fa -51.4 1.8e+02 1
g1ghw.1: 1F.fa 121.2 6.2e-35 1
d3chbd_: 20B.fa -44.5 1.8e+02 1
d1ssxa_: 2G.fa -38.9 1.8e+02 1
d1rz0a_: 2G.fa -29.8 1.8e+02 1
d1i8da1: 2G.fa -45.6 1.8e+02 1
d1ogia1: 2G.fa -33.5 1.8e+02 1
d1jb9a1: 2G.fa -37.7 1.8e+02 1
d1gawa1: 2G.fa -37.7 1.8e+02 1
d1fnc_1: 2G.fa -38.6 1.8e+02 1
d1qfza1: 2G.fa -31.1 1.8e+02 1
d1sm4a1: 2G.fa -32.0 1.8e+02 1
d1qb5d_: 3V.fa -44.2 1.8e+02 1
d1ja1a1: 2G.fa -10.2 1.8e+02 1
d1f20a1: 2G.fa -13.9 1.8e+02 1
d1mabb2: 2G.fa -47.3 1.8e+02 1
d1w0jd2: 2G.fa -45.2 1.8e+02 1
d1kmha2: 2G.fa -48.2 1.8e+02 1
d1skyb2: 2G.fa -51.1 1.8e+02 1
d1w0ja2: 2G.fa -52.3 1.8e+02 1
d1skye2: 2G.fa -38.8 1.8e+02 1
d1kmhb2: 2G.fa -47.0 1.8e+02 1
d1o65a_: 2G.fa -38.1 1.8e+02 1
d1i0ra_: 2G.fa -43.2 1.8e+02 1
d1eova1: 2G.fa -34.8 1.8e+02 1
d1n0ua1: 2G.fa -44.0 1.8e+02 1
d1ci0a_: 3V.fa -36.7 1.8e+02 1
d1f60a2: 2G.fa -45.8 1.8e+02 1
d1kzla1: 2G.fa -34.1 1.8e+02 1
d1n08a_: 2G.fa -47.2 1.8e+02 1
d1nb0a_: 2G.fa -38.3 1.8e+02 1
d1epaa_: 2G.fa -36.7 1.8e+02 1
d1jzua_: 20B.fa -36.6 1.8e+02 1
d1a1x__: 20B.fa -51.2 1.8e+02 1
d1jnpa_: 2G.fa -44.9 1.8e+02 1
d1i4ua_: 2G.fa -29.0 1.8e+02 1
d1kxoa_: 2G.fa -33.8 1.8e+02 1
d1nf3c_: 2G.fa -39.8 1.8e+02 1
d1qz8a_: 2G.fa -44.5 1.8e+02 1
d1r2ma_: 2G.fa -40.1 1.8e+02 1
d1g7sa1: 3V.fa -45.3 1.8e+02 1
d1qfja1: 2G.fa -47.3 1.8e+02 1
d1kjwa1: 2G.fa -44.8 1.8e+02 1
d1jj2b_: 2G.fa -22.2 1.8e+02 1
d1qvca_: 2G.fa -35.2 1.8e+02 1
d1qx4a1: 2G.fa -45.3 1.8e+02 1
d2cnd_1: 2G.fa -37.1 1.8e+02 1
d1fx7a3: 2G.fa -46.3 1.8e+02 1
d1g3wa3: 2G.fa -43.7 1.8e+02 1
d1vl7a_: 2G.fa -42.2 1.8e+02 1
d1prtf_: 2G.fa -46.1 1.8e+02 1
d1krha1: 2G.fa -44.7 1.8e+02 1
d1a8p_1: 2G.fa -35.0 1.8e+02 1
d2qila1: 2G.fa -44.2 1.8e+02 1
d1pc0a_: 2G.fa -44.6 1.8e+02 1
d1skqa1: 2G.fa -47.9 1.8e+02 1
d1skqa2: 2G.fa -34.5 1.8e+02 1
d1d1na_: 2G.fa -47.7 1.8e+02 1
d1qnua_: 2G.fa -44.2 1.8e+02 1
d1cqxa2: 2G.fa -36.1 1.8e+02 1
d1udza_: 2G.fa -35.7 1.8e+02 1
d1c0aa1: 2G.fa -39.1 1.8e+02 1
d1n9wa1: 2G.fa -43.7 1.8e+02 1
d1l0wa1: 2G.fa -46.1 1.8e+02 1
d1kk1a1: 2G.fa -34.9 1.8e+02 1
d1kk1a2: 2G.fa -37.1 1.8e+02 1
d1d2ea2: 2G.fa -45.4 1.8e+02 1
d1efca2: 2G.fa -43.2 1.8e+02 1
d1eft_2: 2G.fa -37.1 1.8e+02 1
d1exma2: 2G.fa -39.7 1.8e+02 1
d1bbua1: 2G.fa -37.9 1.8e+02 1
d1e1oa1: 2G.fa -33.3 1.8e+02 1
d1fdr_1: 2G.fa -33.6 1.8e+02 1
d1fmta1: 2G.fa -41.2 1.8e+02 1
d1gvha2: 2G.fa -36.3 1.8e+02 1
d1v9ta_: 2G.fa -37.5 1.8e+02 1
d1dar_1: 2G.fa -46.7 1.8e+02 1
d1ixra2: 2G.fa -48.4 1.8e+02 1
d1usca_: 2G.fa -27.0 1.8e+02 1
d1dnla_: 2G.fa -40.3 1.8e+02 1
d1d2ea1: 2G.fa -40.9 1.8e+02 1
d1efca1: 2G.fa -37.6 1.8e+02 1
d1eft_1: 2G.fa -39.7 1.8e+02 1
d1exma1: 2G.fa -39.7 1.8e+02 1
d1cuk_3: 2G.fa -46.8 1.8e+02 1
d1ddga1: 2G.fa -29.3 1.8e+02 1
total: 119

next step: check why the barrels are similar to the sandwich motif 2G.
simple findings:
2G*107
20B*6
24A*1
13B!
1F!
3V*3
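
The counts above can be reproduced from the result lines instead of by hand. This is a minimal sketch that assumes every line has the shape "domain: model.fa score E-value N" as in the list; the function names and the 1e-3 E-value cutoff are my own choices for illustration, not part of HMMER.

```python
from collections import Counter

def parse_hit(line):
    """Split 'd1ep3b1: 2G.fa -40.7 1.8e+02 1' into (domain, motif, score, evalue)."""
    parts = line.split()
    if len(parts) != 5 or not parts[0].endswith(":"):
        return None  # blank line or something not in the expected shape
    domain = parts[0][:-1]
    motif = parts[1][:-3] if parts[1].endswith(".fa") else parts[1]
    return domain, motif, float(parts[2]), float(parts[3])

def tally_motifs(lines):
    """Count how many domains were assigned to each motif model."""
    hits = [h for h in map(parse_hit, lines) if h is not None]
    return Counter(motif for _, motif, _, _ in hits)

def confident_hits(lines, max_evalue=1e-3):
    """Domains whose hit has an E-value at or below the (assumed) cutoff."""
    hits = [h for h in map(parse_hit, lines) if h is not None]
    return [domain for domain, _, _, evalue in hits if evalue <= max_evalue]

# A few lines copied from the list above.
sample = [
    "d1ep3b1: 2G.fa -40.7 1.8e+02 1",
    "d1eu3a1: 20B.fa -46.2 1.8e+02 1",
    "g1rin.1: 13B.fa 121.5 5.1e-35 1",
    "g1ghw.1: 1F.fa 121.2 6.2e-35 1",
    "d1qb5d_: 3V.fa -44.2 1.8e+02 1",
]
print(tally_motifs(sample))
print(confident_hits(sample))
```

Run over the full list, confident_hits should single out only the two entries with very small E-values, the 13B and 1F hits marked with "!" above.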

Sunday, April 01, 2007

another paper that predicts EP/NEP

In this paper they try to predict edge and non-edge strands:
β Edge strands in protein structure prediction and aggregation
Their prediction accuracy is 78%.
The attributes they use are (1) bulge score, (2) charge score, (3) hydrophobicity, (4) hydrophobic moment, (5) pattern of polarity, and (6) strand length.

Wednesday, March 14, 2007

note070314


ls | sed -e 's/^\(.*\).fa/..\/hmmer\/binaries\/hmmpfam motif.hmm \1.fa > ..\/result\/\1.result/' | sh

tar cf - whatever | ssh remotehost "( cd /some/path ; tar xf - )"
ssh -l ysc212 access3.cims.nyu.edu "( cd /tmp/mark ; tar cf - hmm )" | tar xf -

tar cf - hmm | gzip | ssh -l ysc212 access3.cims.nyu.edu "( cd /tmp/mark ; cat > hmm.tar.gz )"



ls | sed -e 's/^\(.*\).fa/http:\/\/localhost:81\/sssdb\/tools\/motifseq10foldtest.php?motif=1A&fold=0\/

ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/wget -O \1.fatr0 "http:\/\/192.168.1.100:81\/sssdb\/tools\/motifseq10foldtrain.php?motif=\1\&fold=0"/' | sh
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/muscle\/muscle3.6\/muscle -in \1.fatr0 -out \1.fatr0a/' | sh
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmbuild -g -n \1 -A motif.hmm \1.fatr0a/' | sh
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmpfam motif.hmm \1.fate0/' | sh

ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmpfam -E 300 motif.hmm \1.fate0 | perl ..\/fasta\/accuracy.pl \1/' | sh | perl ../fasta/totalaccuracy.pl
32/71

yes | rm *.fat*
rm motif.hmm
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/wget -O \1.fatr0 "http:\/\/192.168.1.100:81\/sssdb\/tools\/motifseq10foldtrain.php?motif=\1\&fold=0"/' | sh
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/muscle\/muscle3.6\/muscle -in \1.fatr0 -out \1.fatr0a/' | sh
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmbuild -g -n \1 -A motif.hmm \1.fatr0a/' | sh
ls ../fa/*.fa | sed -e 's/^..\/fa\/\(.*\).fa/..\/hmmer\/binaries\/hmmpfam -E 300 motif.hmm \1.fate0 | perl ..\/fasta\/accuracy.pl \1/' | sh | perl ../fasta/totalaccuracy.pl


# 10-fold loop: for each fold, fetch the train/test sets, align, rebuild motif.hmm, then score the test set
for ($cx=0;$cx<10;$cx++){
# clean up the previous fold's files and model
`yes | rm *.fat*`;
`rm motif.hmm`;
`ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/wget -q -O \\1.fatr$cx "http:\\/\\/192.168.1.100:81\\/sssdb\\/tools\\/motifseq10foldtrain.php?motif=\\1\\&fold=$cx"/' | sh`;
`ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/wget -q -O \\1.fate$cx "http:\\/\\/192.168.1.100:81\\/sssdb\\/tools\\/motifseq10foldtest.php?motif=\\1\\&fold=$cx"/' | sh`;
`ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/..\\/muscle\\/muscle3.6\\/muscle -quiet -in \\1.fatr$cx -out \\1.fatra$cx/' | sh`;
`ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/..\\/hmmer\\/binaries\\/hmmbuild -g -n \\1 -A motif.hmm \\1.fatra$cx/' | sh`;
print `ls ../fa/*.fa | sed -e 's/^..\\/fa\\/\\(.*\\).fa/..\\/hmmer\\/binaries\\/hmmpfam -E 300 motif.hmm \\1.fate$cx | perl ..\\/fasta\\/accuracy.pl \\1/' | sh | perl ../fasta/totalaccuracy.pl`;

}
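
The sed-generated commands are hard to read, so here is the same per-fold sequence written out as plain argument lists. It is only a sketch: it builds the commands and does not run them. The tool paths, flags, file suffixes and URL are copied from the loop above, while the function names are invented for illustration.

```python
import os

TOOLS_URL = "http://192.168.1.100:81/sssdb/tools"  # host as written in the notes

def motif_names(fasta_paths):
    """'../fa/1A.fa' -> '1A', mirroring the sed capture group."""
    return [os.path.splitext(os.path.basename(p))[0] for p in fasta_paths]

def fold_commands(motif, fold):
    """Commands for one motif and one fold, in the order the loop runs them."""
    train = f"{motif}.fatr{fold}"     # training sequences
    test = f"{motif}.fate{fold}"      # held-out test sequences
    aligned = f"{motif}.fatra{fold}"  # MUSCLE alignment of the training set
    return [
        ["wget", "-q", "-O", train,
         f"{TOOLS_URL}/motifseq10foldtrain.php?motif={motif}&fold={fold}"],
        ["wget", "-q", "-O", test,
         f"{TOOLS_URL}/motifseq10foldtest.php?motif={motif}&fold={fold}"],
        ["../muscle/muscle3.6/muscle", "-quiet", "-in", train, "-out", aligned],
        # -A appends this motif's model to the shared motif.hmm file
        ["../hmmer/binaries/hmmbuild", "-g", "-n", motif, "-A", "motif.hmm", aligned],
        ["../hmmer/binaries/hmmpfam", "-E", "300", "motif.hmm", test],
    ]

for cmd in fold_commands("1A", 0):
    print(" ".join(cmd))
```

Note that in the real loop all models are built into motif.hmm before any test set is searched, so the last command of each list would run only after the first four have finished for every motif.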

10fold
354/723

suggests 1B = 1A
the database is wrong for g6rlx.2
d1dyza_ suggests we should also train on alpha helices.
suggests d1nbqa2 should be 1F instead of 10C
these show how the HMM works: d1dyza_ & d1smaa1 & d1ea9c1
d1bf5a2 suggests we should treat 7 & 8 as one (from the sequence alone, it knows 5 & 8 are hydrogen-bonded together?)
three sheets, like d1ikna2 (7F), d1nfia2 (4F)
it can also suggest a way of transformation.
although the HMM correctly identifies d1tfpa_ as 38A, it also suggests d1tfpa_ is close to 7J, as I commented in our database
Model    Description    Score    E-value    N
-------- -----------    -----    --------   ---
38A.fa                  357.4    4.8e-106   1
7J.fa                   276.8    9e-82      1


detects 3 sheets; suggests g1pvc.1 is 6E, even though g1pvc.1 and d1du5a_ belong to different folds:
NC: g1pvc.1: 6E.fa 346.5 9.2e-103 1
6E: d1du5a_, d1kwna_

Tuesday, March 13, 2007

notes 070313

wget -O 1B.fasta "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=1B"
../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm 1B.fasta | perl accuracy.pl 1B

wget -O- -q "http://sssdb.no-ip.info:81/sssdb/tools/motifseqs.php?motif=1B" | ../hmmer/binaries/hmmpfam -E 300 ../fa/motif.hmm - | perl accuracy.pl 1B

ls ../fa/*.fa | sed -e 's/.*\/\(.*\).fa/wget -O- -q "http:\/\/sssdb.no-ip.info:81\/sssdb\/tools\/motifseqs.php?motif=\1" \| ..\/hmmer\/binaries\/hmmpfam -E 300 ..\/fa\/motif.hmm - \| perl accuracy.pl \1/'

people categorize proteins into families by sequence (HHpred uses SCOP, PDB, SMART).
that's why only we can do this.

because of low sequence similarity, blast can't find sequences in the same motif, but an HMM can.
tested using motif 3D, which has 12 sequences from 3 superfamilies;
sequence similarity between sequences in different superfamilies is lower than 4%.
we trained the model using two superfamilies, and the model correctly identified the sequences from the third superfamily.
it means,
1) we can correctly identify sequences with similar SSS but low sequence identity.
2) this method can find not only similar sequences but also similar structures.
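
The test described above is a leave-one-superfamily-out split. A minimal sketch of that split follows; the sequence ids and superfamily labels in the example are placeholders, not the real motif 3D members.

```python
def leave_one_superfamily_out(records):
    """records: list of (sequence_id, superfamily) pairs.

    Yields (held_out, train_ids, test_ids): train on every superfamily
    except one, then test on the one that was held out.
    """
    superfamilies = sorted({sf for _, sf in records})
    for held_out in superfamilies:
        train_ids = [sid for sid, sf in records if sf != held_out]
        test_ids = [sid for sid, sf in records if sf == held_out]
        yield held_out, train_ids, test_ids

# Placeholder data: three superfamilies, ids invented for illustration.
example = [("seqA1", "sf1"), ("seqA2", "sf1"),
           ("seqB1", "sf2"), ("seqC1", "sf3")]
for held_out, train_ids, test_ids in leave_one_superfamily_out(example):
    print(held_out, "train:", train_ids, "test:", test_ids)
```

Each training list would then go through muscle and hmmbuild, and the held-out list through hmmpfam, the same way as in the other notes.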

for 1A: training set, 37 sequences from 1 family, 6 domains.
testing set: 1092 sequences; we got 1071 correct, which is 98% accuracy.
note: we relaxed the hmmpfam E-value threshold to 300 (-E 300)

error detection
1bih A:5-98 (3D) represents 1bih B:307-395,
but the HMM tells me 1bih B:307-395 is 3A.
so I checked the structure, and it does belong to 3A.
because 1bih B:307-395 has a longer loop and two strands are considered as one, it becomes 3A,
and because SCOP doesn't categorize proteins by SSS as we do, we have a different categorization.
thus, we can only use the 703 structures to test our method, not all of the sequences.

1 8 7 3 4
0' 2 6 5

1 3 7 6
2 9 8 4 5

even when it can't find the correct motif, it suggests several motifs which contain subsets of the strands of the real motif.
then we can use alignment and the idea of homology modeling to construct the correct motif.

d1orsa1: strand 7 should be 7&7'

found by using the HMM:

d12e8l1 should be 1A, but the HMM says 1B; after checking, it's because 1B should be 1A.

the HMM is correct, we are wrong.

Monday, March 12, 2007

note 070312

NBTree:86%/77%
J48:95%/79%
ADTree:74%/
VotedPerceptron:67%/
bayesnet:70%/
VFI:68%/
OneR:66%/
DecisionTable:79%/
IB1:/78%
KStar:99.9%/81.45
LWL:69%/
IBk:/79%

Structure prediction server (gives you multiple alignment, SSE, R-R contact, 3D)
http://www.soe.ucsc.edu/research/compbio/SAM_T06/T06-query.html
Warning: takes at least 2 hours.

emboss at umdnj
http://siriusc.umdnj.edu/emboss/

Muscle: good alignment server (better than ClustalW)
JalView: good alignment editor


5 4 8 9 2
6 7 3 1

E,J,J,NJ,E,NJ,J,NE
1 3 7 6
2 9 8 4 5

adtree => excel
sum>2.5

1F 27

TP=14=14/27=52%
FP=13=13/27
TN=653=653/693
FN=13=13/693

correct=14+653=667/693=96%
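
The same numbers as a quick calculation, reading the four counts as a standard confusion matrix over all 693 structures. That reading is an assumption on my part; under it TP+FN and TP+FP both come to 27, so precision and recall are both 14/27.

```python
def summarize(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "total": total,
        "accuracy": (tp + tn) / total,   # (14 + 653) / 693
        "precision": tp / (tp + fp),     # of predicted 1F, how many are right
        "recall": tp / (tp + fn),        # of real 1F, how many are found
    }

stats = summarize(tp=14, fp=13, tn=653, fn=13)
for name, value in stats.items():
    print(name, round(value, 3))
```

The high accuracy comes mostly from the large number of true negatives; for a small class like 1F, precision and recall say more about how well the tree finds it.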

TP=14/(27-8sao)=74%
: -0.142
(1)ENE = NE: 0.12
(2)looplen <>= 2.5: -0.132
(3)nENE = NE: -0.124
(3)nENE = E: 0.767
(1)ENE = E: -0.732
(4)str2len <>= 3.5: -0.104
Legend: -ve = JP, +ve = NJP