Biological sequences compression challenges:
1. Amino acid (Protein) sequence compression challenge:
This benchmark ranks open-source compressors on the amino acid (protein) sequence corpus by total compressed size. The corpus consists of 9 files, four of which are from data-compression.info. The characteristics of the files are as follows:
FileName | Species | Size | Cardinality |
---|---|---|---|
BT | Bos taurus | 12,845,466 | 24 |
EC | Escherichia coli | 1,308,765 | 21 |
EP | Enterococcus phage | 4,184 | 20 |
HI | Haemophilus influenzae | 509,519 | 20 |
HS | Homo sapiens | 3,295,751 | 19 |
LC | Lactobacillus casei | 809,301 | 20 |
MJ | Methanococcus jannaschii | 448,779 | 20 |
SA | Staphylococcus aureus | 796,785 | 20 |
SC | Saccharomyces cerevisiae | 2,900,352 | 20 |
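The Cardinality column can be related to a simple baseline. The sketch below assumes that cardinality means the number of distinct symbols in a file, and that a naive fixed-length code would spend log2(cardinality) bits on every symbol. The function names are illustrative and are not part of the benchmark.

```python
import math


def cardinality(sequence: str) -> int:
    """Number of distinct symbols in the sequence (assumed meaning of 'Cardinality')."""
    return len(set(sequence))


def fixed_length_bits(card: int) -> float:
    """Bits per symbol of a naive fixed-length code over `card` distinct symbols."""
    if card < 1:
        raise ValueError("cardinality must be at least 1")
    return math.log2(card)


if __name__ == "__main__":
    toy = "MKVLAAGMK"  # toy protein fragment, not taken from the corpus
    print(cardinality(toy))
    print(round(fixed_length_bits(20), 3))  # alphabet of 20 amino acids
    print(fixed_length_bits(4))             # alphabet of 4 nucleotides
```

Under these assumptions, a 20-symbol protein alphabet needs about 4.32 bits per symbol before any modelling, while a 4-symbol DNA alphabet needs 2 bits, so a compressor is only useful on such data when it gets below that baseline.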
Provide an open-source compressor and decompressor (send it to: [pratas@ua.pt]) that can efficiently encode and decode the complete corpus. Each file is compressed individually.
Current benchmark for open-source programs (bits per amino acid):
Programs | HI | MJ | HS | SC | EC | EP | BT | LC | SA |
---|---|---|---|---|---|---|---|---|---|
Gzip | 4.671 | 4.587 | 4.605 | 4.639 | 4.679 | 4.686 | 4.521 | 4.655 | 4.646 |
Bzip2 | 4.324 | 4.269 | 4.255 | 4.299 | 4.324 | 4.485 | 4.254 | 4.302 | 4.300 |
7za | 4.293 | 4.206 | 4.028 | 4.130 | 4.267 | 4.592 | 3.212 | 4.273 | 4.258 |
lzma -9 | 4.238 | 4.141 | 4.021 | 4.029 | 4.229 | 4.422 | 3.208 | 4.188 | 4.197 |
paq8h -8 | 4.118 | 4.015 | 3.922 | 3.957 | 4.092 | 4.328 | 3.170 | 4.091 | 4.080 |
paq8l -8 | 4.104 | 3.999 | 3.901 | 3.942 | 4.077 | 4.300 | 3.144 | 4.076 | 4.061 |
AC | 4.100 | 3.997 | 3.785 | 3.876 | 4.037 | 4.323 | 3.055 | 4.055 | 4.056 |
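The bits per amino acid figures in the table above can be reproduced in spirit with a short script. This is a minimal sketch, assuming each file stores one byte per symbol with no headers or line breaks, and that the metric is 8 × compressed bytes divided by the number of symbols; the page does not state the exact measurement procedure. The helper names `bits_per_symbol` and `benchmark` are hypothetical, and Python's standard-library codecs will not necessarily match the command-line tools and options listed in the table.

```python
import bz2
import gzip
import lzma


def bits_per_symbol(compressed_size_bytes: int, num_symbols: int) -> float:
    """Average number of bits spent per sequence symbol."""
    if num_symbols <= 0:
        raise ValueError("num_symbols must be positive")
    return 8 * compressed_size_bytes / num_symbols


def benchmark(data: bytes) -> dict:
    """Compress the content of one file with three stdlib codecs.

    Each corpus file would be passed separately, since the files are
    compressed individually.
    """
    sizes = {
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        "lzma": len(lzma.compress(data, preset=9)),
    }
    # Assumption: one byte of input corresponds to one symbol.
    return {name: bits_per_symbol(size, len(data)) for name, size in sizes.items()}


if __name__ == "__main__":
    # Toy input, not a corpus file: a highly repetitive protein-like string.
    sample = b"ACDEFGHIKLMNPQRSTVWY" * 500
    for name, bps in benchmark(sample).items():
        print(f"{name}: {bps:.3f} bits per symbol")
```

To score a real file, read it in binary mode and pass its bytes to `benchmark`; any newline or header bytes would have to be stripped first for the result to be comparable under the one-byte-per-symbol assumption.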
2. DNA sequence compression challenge:
This benchmark ranks open-source compressors on the DNA sequence corpus by total compressed size. The corpus consists of 17 files. The characteristics of the files are as follows:
FileName | Species | Type | Size | Cardinality |
---|---|---|---|---|
OrSa | Oryza sativa | Eukaryota, plant (rice) | 43,262,523 | 4 |
HoSa | Homo sapiens | Eukaryota, animalia | 189,752,667 | 4 |
GaGa | Gallus gallus | Eukaryota, animalia (chicken) | 148,532,294 | 4 |
DaRe | Danio rerio | Eukaryota, animalia (fish) | 62,565,020 | 4 |
DrMe | Drosophila miranda | Eukaryota, animalia (insect) | 32,181,429 | 4 |
EnIn | Entamoeba invadens | Eukaryota, amoebozoa | 26,403,087 | 4 |
ScPo | Schizosaccharomyces pombe | Eukaryota, fungi | 10,652,155 | 4 |
PlFa | Plasmodium falciparum | Eukaryota, protozoan | 8,986,712 | 4 |
EsCo | Escherichia coli | Bacteria | 4,641,652 | 4 |
HePi | Helicobacter pylori | Bacteria | 1,667,825 | 4 |
AeCa | Aeropyrum camini | Archaea | 1,591,049 | 4 |
HaHi | Haloarcula hispanica | Archaea | 3,890,005 | 4 |
YeMi | Yellowstone lake mimivirus | Virus, mimivirus | 73,689 | 4 |
BuEb | Bundibugyo ebolavirus | Virus | 18,940 | 4 |
AgPh | Aggregatibacter phage S1249 | Virus, phage | 43,970 | 4 |
Provide an open-source compressor and decompressor (send it to: [pratas@ua.pt]) that can efficiently encode and decode the complete corpus.