A list of file formats commonly used in molecular biology:

Name  File Ending  Description
AB1  Chromatogram files produced by DNA sequencing instruments from Applied Biosystems
ACE  A sequence assembly format
ASN.1  Abstract Syntax Notation One is a data representation format, standardized by the International Organization for Standardization (ISO), used to achieve interoperability between platforms. NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, and PubMed records.
BAM  Binary compressed SAM format
BCF  Binary compressed VCF format
BED  The Browser Extensible Data format is used for describing genes and other features of DNA sequences
CAF  Common Assembly Format for sequence assembly
EMBL  The flatfile format used by the EMBL to represent database records for nucleotide and peptide sequences from EMBL databases
FASTA *.FASTA The FASTA file format, for sequence data. Sometimes also given as FNA or FAA (FASTA Nucleic Acid or FASTA Amino Acid).
FASTQ *.FASTQ The FASTQ file format, for sequence data with quality. Sometimes also given as QUAL.
GCPROJ  The Genome Compiler project format, an advanced file format for designing, sharing and visualizing genetic data.
GenBank *.gb The flatfile format used by the NCBI to represent database records for nucleotide and peptide sequences from the GenBank and RefSeq databases
GFF  The General Feature Format is used for describing genes and other features of DNA, RNA and protein sequences
GTF  The Gene Transfer Format is used to hold information about gene structure.
NCBI ASN.1  Structured ASN.1 format used at the National Center for Biotechnology Information for DNA and protein data
NEXUS  The Nexus file encodes mixed information about genetic sequence data in a block-structured format.
NeXML  XML format for phylogenetic trees
NWK  The Newick tree format represents graph-theoretical trees with edge lengths using parentheses and commas. It is commonly used to hold phylogenetic trees.
PDB  Structures of biomolecules deposited in the Protein Data Bank. Also used for exchanging protein/nucleic acid structures.
PHD  Phred output, from the basecalling software Phred
SAM  Sequence Alignment/Map format, a text format for sequence reads aligned to a reference, adopted by the 1000 Genomes Project for releasing its alignment results.
SBML  The Systems Biology Markup Language is used to store biochemical network computational models
SCF  Staden chromatogram files used to store data from DNA sequencing
SFF  Standard Flowgram Format
SRA  Format used by the National Center for Biotechnology Information Sequence Read Archive (formerly the Short Read Archive) to store high-throughput DNA sequence data
Stockholm  The Stockholm format for representing multiple sequence alignments
Swiss-Prot  The flatfile format used to represent database records for protein sequences from the Swiss-Prot database
VCF  Variant Call Format, a standard created by the 1000 Genomes Project for listing and annotating sequence variants. The project used it for its entire collection of human variants (with the exception of approximately 1.6 million variants).
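
To make the Newick entry above more concrete, here is a minimal sketch of how a tree written with parentheses, commas and edge lengths could be read in Python. The function names `parse_newick` and `leaf_names` and the example tree are made up for illustration. The sketch assumes unquoted labels and no bracketed comments, so it covers only a subset of the format.

```python
# Minimal Newick reader: a sketch, not a full implementation.
# Handles nested parentheses, commas, node names and ":length" suffixes.
# Assumption: no quoted labels and no bracketed comments in the input.

def parse_newick(text):
    """Parse a Newick string into nested dicts (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        # Optional node name: everything up to the next delimiter.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional edge length after ":".
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Return the names of all leaves below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


# Hypothetical example tree: leaves A, B, C and one named inner node AB.
tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4);")
print(leaf_names(tree))
```

Each pair of parentheses becomes one inner node, each comma separates siblings, and the number after a colon is the length of the edge leading to that node. For real data, an established phylogenetics library is likely a better choice than a hand-written parser.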

© Text 2015 Wikipedia

About the author

Jérôme Lutz from Berlin & Munich, Germany

I like to share the great things I discover daily while researching and working in the field of Synthetic Biology.

When I talk to people about it, they often refer to science fiction. However, when I send them links to this wiki and they read through those pages, they start to understand that this is real and it's happening right now.