Biological sequence compression challenges:


1. Amino acid (Protein) sequence compression challenge:

This benchmark ranks open-source compressors on the amino acid (protein) sequence corpus by total compressed size. The corpus consists of 9 files, four of which come from data-compression.info. The files have the following characteristics:

FileName  Species                   Size        Cardinality
BT        Bos taurus                12,845,466  24
EC        Escherichia coli          1,308,765   21
EP        Enterococcus phage        4,184       20
HI        Haemophilus influenzae    509,519     20
HS        Homo sapiens              3,295,751   19
LC        Lactobacillus casei       809,301     20
MJ        Methanococcus jannaschii  448,779     20
SA        Staphylococcus aureus     796,785     20
SC        Saccharomyces cerevisiae  2,900,352   20
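
The two numeric columns can be recomputed for any sequence file with a few lines of Python. This is only a sketch under stated assumptions: that Size is the number of symbols in the sequence, that Cardinality is the number of distinct symbols, and that any line breaks in the file are not part of the sequence. The function name `describe_sequence` is hypothetical and not part of the challenge.

```python
from collections import Counter


def describe_sequence(data: bytes) -> tuple[int, int]:
    """Return (size, cardinality) of a raw sequence.

    Assumption: line breaks are layout, not sequence symbols, so they
    are removed before counting.
    """
    symbols = data.replace(b"\n", b"").replace(b"\r", b"")
    counts = Counter(symbols)  # one entry per distinct byte value
    return len(symbols), len(counts)


if __name__ == "__main__":
    # Toy protein fragment, not taken from the corpus.
    size, cardinality = describe_sequence(b"MKVLA\nMKVLG\n")
    print(f"size={size} cardinality={cardinality}")
```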

To enter, send an open-source compressor and decompressor to [pratas@ua.pt]. The programs must be able to efficiently encode and decode the complete corpus. Each file is compressed individually.
Current benchmark for open-source programs (bits per amino acid):

Programs  HI     MJ     HS     SC     EC     EP     BT     LC     SA
Gzip      4.671  4.587  4.605  4.639  4.679  4.686  4.521  4.655  4.646
Bzip2     4.324  4.269  4.255  4.299  4.324  4.485  4.254  4.302  4.300
7za       4.293  4.206  4.028  4.130  4.267  4.592  3.212  4.273  4.258
lzma -9   4.238  4.141  4.021  4.029  4.229  4.422  3.208  4.188  4.197
paq8h -8  4.118  4.015  3.922  3.957  4.092  4.328  3.170  4.091  4.080
paq8l -8  4.104  3.999  3.901  3.942  4.077  4.300  3.144  4.076  4.061
AC        4.100  3.997  3.785  3.876  4.037  4.323  3.055  4.055  4.056
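
The unit in the table above, bits per amino acid, is presumably the compressed size in bits divided by the number of symbols in the file. The sketch below computes it with Python's standard-library `gzip`, `bz2` and `lzma` modules. These are stand-ins for the command-line tools listed in the table and should not be expected to reproduce its figures exactly, since container headers and settings differ. The helper names are hypothetical. For reference, a fixed-length code over a 20-symbol alphabet needs log2(20) ≈ 4.32 bits per symbol.

```python
import bz2
import gzip
import lzma
import math


def bits_per_symbol(compressed_bytes: int, n_symbols: int) -> float:
    """Compressed size in bits divided by the number of sequence symbols."""
    return 8.0 * compressed_bytes / n_symbols


def uniform_baseline(cardinality: int) -> float:
    """Bits per symbol of an ideal fixed-length code over the alphabet."""
    return math.log2(cardinality)


def benchmark(data: bytes) -> dict[str, float]:
    """Bits per symbol for three general-purpose compressors.

    Assumption: every byte of `data` is one sequence symbol.
    """
    n = len(data)
    return {
        "gzip": bits_per_symbol(len(gzip.compress(data, compresslevel=9)), n),
        "bz2": bits_per_symbol(len(bz2.compress(data, compresslevel=9)), n),
        "lzma": bits_per_symbol(len(lzma.compress(data, preset=9)), n),
    }


if __name__ == "__main__":
    # Toy input, not taken from the corpus; real files would be read from disk.
    sample = b"MKVLAAGIVGLLLAQPSWAMKTAYIAKQRQISFVKSHFSRQ" * 50
    for name, bps in benchmark(sample).items():
        print(f"{name}: {bps:.3f} bits per symbol")
    print(f"uniform baseline (20 symbols): {uniform_baseline(20):.3f}")
```

Because each file is compressed individually, a full run would call `benchmark` once per file and report one value per column.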



2. DNA sequence compression challenge:

This benchmark ranks open-source compressors on the DNA sequence corpus by total compressed size. The corpus consists of the 15 files listed below, with the following characteristics:

FileName  Species                      Type                           Size         Cardinality
OrSa      Oryza sativa                 Eukaryota, plant (rice)        43,262,523   4
HoSa      Homo sapiens                 Eukaryota, animalia            189,752,667  4
GaGa      Gallus gallus                Eukaryota, animalia (chicken)  148,532,294  4
DaRe      Danio rerio                  Eukaryota, animalia (fish)     62,565,020   4
DrMe      Drosophila miranda           Eukaryota, animalia (insect)   32,181,429   4
EnIn      Entamoeba invadens           Eukaryota, amoebozoa           26,403,087   4
ScPo      Schizosaccharomyces pombe    Eukaryota, fungi               10,652,155   4
PlFa      Plasmodium falciparum        Eukaryota, protozoan           8,986,712    4
EsCo      Escherichia coli             Bacteria                       4,641,652    4
HePi      Helicobacter pylori          Bacteria                       1,667,825    4
AeCa      Aeropyrum camini             Archaea                        1,591,049    4
HaHi      Haloarcula hispanica         Archaea                        3,890,005    4
YeMi      Yellowstone lake mimivirus   Virus, mimivirus               73,689       4
BuEb      Bundibugyo ebolavirus        Virus                          18,940       4
AgPh      Aggregatibacter phage S1249  Virus, phage                   43,970       4

To enter, send an open-source compressor and decompressor to [pratas@ua.pt]. The programs must be able to efficiently encode and decode the complete corpus.
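
Since every DNA file has cardinality 4, the natural reference point is a fixed 2 bits per base. The sketch below shows a naive encoder and decoder pair at that rate, together with the lossless round-trip check any submission would have to pass. It is an illustrative baseline only, not a competitive method and not one of the benchmarked programs. The names `pack` and `unpack`, the 8-byte length header, and the assumption that the input contains only the symbols A, C, G and T are all choices made for this example.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Encode a DNA string at 2 bits per base, after an 8-byte length header."""
    out = bytearray(len(seq).to_bytes(8, "big"))
    acc = 0
    for i, base in enumerate(seq):
        acc = (acc << 2) | CODES[base]
        if i % 4 == 3:  # four bases fill one byte
            out.append(acc)
            acc = 0
    rem = len(seq) % 4
    if rem:  # left-align the final partial byte
        out.append(acc << (2 * (4 - rem)))
    return bytes(out)


def unpack(blob: bytes) -> str:
    """Invert pack(): read the length header, then 4 bases per byte."""
    n = int.from_bytes(blob[:8], "big")
    bases = []
    for byte in blob[8:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])  # drop padding bases from the last byte


if __name__ == "__main__":
    seq = "ACGTTGCAAC"  # toy input, not taken from the corpus
    blob = pack(seq)
    assert unpack(blob) == seq  # decoding must be lossless
    print(f"{len(seq)} bases -> {len(blob)} bytes including header")
```

A submitted compressor is only interesting if it goes below this 2-bit rate on the corpus files while still decoding them exactly.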