Querying Genomes

Getting a Genome

We use the species common name to get the genome instance we want.

>>> import os
>>> from ensembldb3 import HostAccount, Genome
>>> account = HostAccount(*os.environ['ENSEMBL_ACCOUNT'].split())
>>> human = Genome(species='human', release=85, account=account)
>>> print(human)
Genome(species='Homo sapiens'; release='85')

Note

The positions employed on Ensembl’s web-site, and in their MySQL database differ from those used internally by ensembldb3 – Ensembl starts counting from 1, ensembldb3 starts counting from 0. In all cases where you are querying ensembldb3 objects directly inputting nucleotide positions you can indicate you are using Ensembl coordinates by setting ensembl_coord=True. If you are explicitly passing in a ensembldb3 region, that argument has no effect.

Getting a Gene

Note

A Gene is a type of region. All region types have .location and .seq attributes.

By biotype

We can search for all genes of a specific biotype. We will use the limit argument to reduce the amount of results to be processed.

>>> genes = human.get_genes_matching(biotype="protein_coding", limit=2)
>>> for gene in genes:
...     print(gene)
Gene(species='Homo sapiens'; biotype='protein_coding'; description='RING1 and YY1...'; stableid='ENSG00000281766'; status='KNOWN'; symbol='RYBP')
Gene(species='Homo sapiens'; biotype='protein_coding'; description='forkhead box O6...'; stableid='ENSG00000281518'; status='KNOWN'; symbol='FOXO6')

By gene symbol

We query for the BRCA2 gene for humans.

>>> genes = human.get_genes_matching(symbol='BRCA2')
>>> for gene in genes:
...     if gene.symbol == 'BRCA2':
...         print(gene)
Gene(species='Homo sapiens'; biotype='protein_coding'; description='BRCA2, DNA repair...'; stableid='ENSG00000139618'; status='KNOWN'; symbol='BRCA2')
Gene(species='Homo sapiens'; biotype='LRG_gene'; description='BRCA2, DNA repair...'; stableid='LRG_293'; status='KNOWN'; symbol='BRCA2')

By Ensembl Stable ID

We use the stable ID for BRCA2.

>>> gene = human.get_gene_by_stableid(stableid='ENSG00000139618')
>>> print(gene)
Gene(species='Homo sapiens'; biotype='protein_coding'; description='BRCA2, DNA repair...'; stableid='ENSG00000139618'; status='KNOWN'; symbol='BRCA2')

Matching a description

We look for breast cancer related genes that are estrogen induced.

>>> genes = human.get_genes_matching(description='breast cancer anti-estrogen')
>>> for gene in genes:
...     print(gene)
Gene(species='Homo sapiens'; biotype='protein_coding'; description='breast cancer anti-estrogen...'; stableid='ENSG00000137936';...

We can also require that an exact (case insensitive) match to the word(s) occurs within the description by setting like=False.

>>> genes = human.get_genes_matching(description='breast cancer anti-estrogen',
...                                  like=False)
>>> for gene in genes:
...     print(gene)
Gene(species='Homo sapiens'; biotype='protein_coding'; description='breast cancer anti-estrogen...'; stableid='ENSG00000137936'; status='KNOWN'; symbol='BCAR3')...

Gene attributes

Stable ID, Symbol, Biotype, etc..

>>> print(gene.stableid, gene.symbol, gene.biotype, gene.status)
ENSG00000262117 BCAR4 lincRNA KNOWN

Location

>>> gene.location
Coordinate(Human,chro...,16,11819828-11828845,-1)
>>> print(gene.location)
Homo sapiens:chromosome:16:11819828-11828845:-1
>>> print(gene.location.coord_name)
16
>>> print(gene.location.start)
11819828
>>> print(gene.location.strand)
-1

The gene sequence

This is an attribute of the gene.

>>> gene.seq
DnaSequence(GATTCTT... 9017)

Getting a Transcript

This is done via the Gene.

Canonical transcript

We get the canonical transcripts for BRCA2.

>>> brca2 = human.get_gene_by_stableid(stableid='ENSG00000139618')
>>> transcript = brca2.canonical_transcript
>>> print(transcript)
Transcript(species='Homo sapiens'; coord_name='13'; start=32315473; end=32400266; length=84793; strand='+')

Get the CDS for a transcript

>>> transcript = brca2.canonical_transcript
>>> cds = transcript.cds
>>> cds
DnaSequence(ATGCCTA... 10257)
>>> print(cds)
ATGCCTATTGGATCCAAAGAGAGGCCA...

Look at all transcripts for a gene

Done via the Gene.transcripts attribute

>>> for transcript in brca2.transcripts:
...     print(transcript)
Transcript(species='Homo sapiens'; coord_name='13'; start=32315473; end=32400266; length=84793; strand='+')
Transcript(species='Homo sapiens'; coord_name='13'; start=32315504; end=32333291; length=17787; strand='+')...

Transcript exons

We show just for the canonical transcript.

>>> print(brca2.canonical_transcript.exons[0])
Exon(stableid=ENSE00001184784, rank=1)

Get the introns for a transcript

We show just for the canonical transcript.

>>> for intron in brca2.canonical_transcript.introns:
...     print(intron)
Intron(TranscriptId=ENST00000380152, rank=1)
Intron(TranscriptId=ENST00000380152, rank=2)
Intron(TranscriptId=ENST00000380152, rank=3)...

Other region types

Getting a generic genomic region

Genomic regions can be obtained just using by coordinates. They can also be used to get features that lay within themselves.

>>> region = human.get_region(coord_name='1', start=1000000, end=1010000)
>>> print(region)
GenericRegion(species='Homo sapiens'; coord_name='1'; start=1000000; end=1010000; length=10000; strand='+')
>>> region.seq
DnaSequence(GTGGAGC... 10000)
>>> repeats = region.get_features('repeat', limit=2)
>>> for repeat in repeats:
...     print(repeat)
Repeat(coord_name='1'; start=1000268; end=1000292; length=24; strand='+', Score=24.0)
Repeat(coord_name='1'; start=1002182; end=1002202; length=20; strand='-', Score=0.0)

Get repeat elements in a genomic interval

We query the genome for repeats within a specific coordinate range on chromosome 13.

>>> repeats = human.get_features(coord_name='13', start=32305473, end=32315473, feature_types='repeat', limit=2)
>>> for repeat in repeats:
...     print(repeat.repeat_class)
...     print(repeat)
SINE/Alu
Repeat(coord_name='13'; start=32305225; end=32305525; length=300; strand='-', Score=2770.0)
SINE/Alu
Repeat(coord_name='13'; start=32305225; end=32305525; length=300; strand='-', Score=2770.0)

Get CpG island elements in a genomic interval

We query the genome for CpG islands within a specific coordinate range on chromosome 11.

>>> islands = human.get_features(coord_name='11', start=2129111, end=2149604, feature_types='cpg', limit=2)
>>> for island in islands:
...     print(island)
CpGisland(coord_name='11'; start=2137721; end=2141254; length=3533; strand='-', Score=3254.0)
CpGisland(coord_name='11'; start=2143905; end=2144442; length=537; strand='-', Score=652.0)