Getting started

WGAbed is a package that converts of a whole genome alignment file in MAF format to a reference ordered BED format containing all the information in the MAF file but in a more accessible way. The package contains a main script, maf_to_bed.py, which takes care of the file conversion, along with a number of utility scripts for manipulating the resulting .wga.bed file.

Generating a WGAbed file

You can generate a wga.bed file from a .maf file as follows:

./maf_to_bed.py -i data/test.maf.gz -r Greattit -c chr8 | sort -k1,1 -k2,2n | bgzip -c > data/test.wga.bed.gz

Note, you need to specifiy a reference species from those contained in the .maf file and a chromosome to run on.

In this example we then pipe the output to sort to sort by chromosome and then position, before piping the output to bgzip allowing us to index the file with tabix.

tabix -pbed data/test.wga.bed.gz

The WGAbed format

The top of our example file data/test.wga.bed.gz looks like this:

zcat data/test.wga.bed.gz | head
chr8	30876342	30876343	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876342,27029971,6146725,4089858	A,a,A,A	+,-,-,+	170727.0
chr8	30876343	30876344	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876343,27029972,6146726,4089859	A,a,A,A	+,-,-,+	170727.0
chr8	30876344	30876345	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876344,27029973,6146727,4089860	A,t,A,A	+,-,-,+	170727.0
chr8	30876345	30876346	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876345,27029974,6146728,4089861	A,a,A,A	+,-,-,+	170727.0
chr8	30876346	30876347	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876346,27029975,6146729,4089862	T,t,T,T	+,-,-,+	170727.0
chr8	30876347	30876348	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876347,27029976,6146730,4089863	C,a,C,C	+,-,-,+	170727.0
chr8	30876348	30876349	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876348,27029977,6146731,4089864	C,t,C,C	+,-,-,+	170727.0
chr8	30876349	30876350	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876349,27029978,6146732,4089865	A,a,A,A	+,-,-,+	170727.0
chr8	30876350	30876351	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876350,27029979,6146733,4089866	C,a,C,C	+,-,-,+	170727.0
chr8	30876351	30876352	+	Greattit,Chicken,Flycatcher,Zebrafinch	chr8,chr8,chr8,chr8	30876351,27029980,6146734,4089867	T,C,T,T	+,-,-,+	170727.0

The first four columns are as in a conventional BED file, and are followed by the additional columns containing the alignment information. These are described in the table below.

index content description
0 chromosome The chromosome in the reference species genome
1 start position 0-based start position for the site in the reference species genome
2 end position 0-based end position (end of range) in the reference species genome
3 strand The strand for the reference species sequence
4 species A comma delimited list of species in the alignment, ordered reference first and then alphabetically. This column defines the species order of the subsequent columns
5 aligned chromosomes A comma delimited list of chromosomes aligned at the position for each species
6 aligned positions 0-based positions for all species. A ‘?’ is used when a non-reference species is not present within a MAF block and ‘-‘ for INDELS in the alignment
7 sequences The nucleotides aligned at the position in each species. A ‘?’ is used when a non-reference species is not present within a MAF block
8 strands The strands for the nucleotides listed in the sequences column
9 score The score for the MAF alignment block the position is in

Utility scripts

wga_bed_indels.py

This script takes the piped output of maf_to_bed.py and extracts the INDELs. It is possible to specify a number of filtering options, see usage below.

Usage

$ ./wga_bed_indels.py -h
usage: wga_bed_indels.py [-h] [-max_length MAX_LENGTH]
                         [-min_coverage MIN_COVERAGE] [-ref_specific]

optional arguments:
  -h, --help            show this help message and exit
  -max_length MAX_LENGTH
                        Maximum INDEL length to extract
  -min_coverage MIN_COVERAGE
                        Minimum species coverage
  -ref_specific         Restricts output to INDELs only found in reference
                        species, and conserved in non reference species

Example and output

$ ./maf_to_bed.py -i data/test.maf.gz -r Zebrafinch -c chr8 | sort -k1,1 -k2,2n | ./wga_bed_indels.py -min_coverage 4 -max_length 10 -ref_specific 
chr8	4090041	4090042	+	Zebrafinch,Chicken,Flycatcher,Greattit	chr8,chr8,chr8,chr8	4090041,27030121,6146884,30876501	T-,TG,TG,TG	+,-,-,+	170727.0
chr8	4090219	4090220	+	Zebrafinch,Chicken,Flycatcher,Greattit	chr8,chr8,chr8,chr8	4090219,27030264,6147064,30876680	G--,GAG,GAG,GAG	+,-,-,+	170727.0
chr8	4090314	4090315	+	Zebrafinch,Chicken,Flycatcher,Greattit	chr8,chr8,chr8,chr8	4090314,27030361,6147161,30876770	C-,TC,TT,TT	+,-,-,+	170727.0
chr8	4090374	4090375	+	Zebrafinch,Chicken,Flycatcher,Greattit	chr8,chr8,chr8,chr8	4090374,27030423,6147222,30876831	G--,CCA,GAA,GTA	+,-,-,+	170727.0
chr8	4090379	4090380	+	Zebrafinch,Chicken,Flycatcher,Greattit	chr8,chr8,chr8,chr8	4090379,27030430,6147229,30876838	G-------,GTGCTAAT,GTGTTAAT,GTGTTAAT	+,-,-,+	170727.0

non_ref_intersect.py

This script takes a piped wga.bed file and extracts a subset of sets specified by an associated bed file with coordinates from a non-reference species in the alignment. NOTE output is still written as a reference ordered bed.

Usage

$ ./non_ref_intersect.py -h
usage: non_ref_intersect.py [-h] -b BED_NON_REF -q QUERY_SPECIES -c CHROMOSOME

optional arguments:
  -h, --help            show this help message and exit
  -b BED_NON_REF, --bed_non_ref BED_NON_REF
                        .bed file with non refernce spp coordinates to
                        intersect with wga.bed file
  -q QUERY_SPECIES, --query_species QUERY_SPECIES
                        species that the query bed file has coordinates for
  -c CHROMOSOME, --chromosome CHROMOSOME
                        target chromosome

Example and output

$ cat data/test_indel_data.bed | ./non_ref_intersect.py -b data/test_query.bed.gz -q Flycatcher -c chr8
chr8    30876342        30876343        +       Greattit,Chicken,Flycatcher,Zebrafinch  chr8,chr8,chr8,chr8     30876342,27029971,6146725,4089858       A,a,A,A +,-,-,+ 170727.0
chr8    30876343        30876344        +       Greattit,Chicken,Flycatcher,Zebrafinch  chr8,chr8,chr8,chr8     30876343,27029972,6146726,4089859       A,a,A,A +,-,-,+ 170727.0
chr8    30876344        30876345        +       Greattit,Chicken,Flycatcher,Zebrafinch  chr8,chr8,chr8,chr8     30876344,27029973,6146727,4089860       A,t,A,A +,-,-,+ 170727.0
chr8    30876345        30876346        +       Greattit,Chicken,Flycatcher,Zebrafinch  chr8,chr8,chr8,chr8     30876345,27029974,6146728,4089861       A,a,A,A +,-,-,+ 170727.0
chr8    30876346        30876347        +       Greattit,Chicken,Flycatcher,Zebrafinch  chr8,chr8,chr8,chr8     30876346,27029975,6146729,4089862       T,t,T,T +,-,-,+ 170727.0

maf_extract_ref_chr.py

This script extracts all blocks containing a given reference chromosome.

Usage

$ ./maf_extract_ref_chr.py -h
usage: maf_extract_ref_chr.py [-h] -c CHROMOSOME [-H]

Extract a reference chromosome from a WGA MAF file

optional arguments:
  -h, --help            show this help message and exit
  -c CHROMOSOME, --chromosome CHROMOSOME
                        Specify which chromosome to extract
  -H, --header          If specified will print header in output

Example and output

$ zgrep -v ^6 data/test.maf.gz | ./maf_extract_ref_chr.py -c chr8
a score=170727.0
s Chicken.chr8     2932491 551 + 29963013 CAATGGAATTTAGTAGATAAGACTCAGAATCAGATTCTGGGTTTCAAAGTTGAATG-----CAAGGAAGGAAAAAATCCCCACAAGTATATTAGCACAATGTGGTATAAATCCATCATCTATAATGTACTGGATTGCTTTTAACTCTCAGTCTGAAACCAAATCGATTTCAACAGTTGGTGCTGGACAGAGAACAATTACTAACCTTCTTTTTTCTCTTTCCTGCTGCTTAACAGGGCTACAAATAACAACAGAAAATTACTCTCAAAGATCTACCCATGT-------------GAAACCAAAGTAACTTAAATGTTCAAATATTATTTGGATTGCAATGTGTAATGTCTCTCAAAT----AGACAAGAAAAATATGGGCAA-------------------AATA-GTCTATAGTCAGTTTAAGAGTCCCTTCTACATTTGCAATCATGTAAAATGAT---TTAGGTATCAATTTCAAAAC--GTTCAAATACAGGAAAGAAAGTTCCTCAGT------------------------AT-TATTAACATATTGTAGAACATTCGCCATCATCCTGGCATTTCAAAACG---TTAAGTTTAATGAGCTAGCCAAGTAGttatatatt
s Zebrafinch.chr8 23902959 610 - 27993427 CACTGGGAGATGCTGAAAGAGCCCCGGGATGGGAGTGTGGGGT-TAAAGCTGAGTGTGACACAGGGCAGGCCACGAGCCCCTGAGTCCC-------CAGTG--CCACAGATCCAGC-CCTGCAATGCCCTGCCTGCCTTTCACCTCCCAGCCTGCAGCCAAACC-GTGTCAACACTTTGTGCTGGACACAGACCAATTACTAATCTTCTTTCTCTTCTTTCCTGCTGCTTAACAGGCCTACAAATAATGCTTTAAAATTA--CTCTGAGACCTACCCACATAACTTTCAGAGCAGAAACCAAAGTAACTTAAATTATCAAAAATCATTTCTATTTCAAGGTGAAATGTCTTTGAAATACACAGAAAATAAAAATCCTGGTTAAAATTAATCCAGAAATACTAATG-TATTGAGGTCACTGCAAGGATCCTATCTGGATTTG-AATCACGTAAAAGAATCAATTAAGAGTCAAATTTAAAAGAAATTCAAAAACAGGAAAGATAATTTCTCAGTAGGTATTCAAAAGAATTCATACAAATGTATTTAGGTGTTGCAGAACACTTACCATCACTCTAGCACTACCAAACAACTATAAATCTATTTAACTAACCTGTTAAGTGGATTTT
s Flycatcher.chr8 25953491 600 + 32100816 CACTGGAAAACGCTAAAAAAGGCTCAGGATGGGGCCATGGGGG-TGGAGCTGAGTGCCAAGCAGGGCAGGTGAGGAGCCCCTGGCGTCCATTAACACAATGTTCCACAGATCCATC-CCTGTAATGCACTGCCTGCCTTTTACCTCGCAGCCTGAAGCCAAATCAATTTCAACAGTTTGTGCTGGACACAGACCAATTACTaatcttctttctcctctttcctgctgcttAACAGGGCTACAAATAATGGTTTAAAATTACTCTCTGAGACCTACCCACATGACTTTCAGAGCAGAAACCAAAGTAACTTGAATGATCAAAAATCGTTTATACTTCAAGGTGAAATGTCtttgaaattcacagaaaagaaaaatcctggctaaaataaatccagaaataCCAACGTTTTTGTGGTGTCTGCAAGGATCCTATCTGGATTTGCAATCATGTGAAAGAATCAAATAGGagtcaaatttaaaataaattcaaaaacaGGAAAGGGAACTTCTCAGT------------------------ATGTATTCAGGTGTTGCAGAACACTTACCATCATTCTATCACTGCCAAACAACTGTAAATCTATTTAACCAAACAATTGAGTGGATTTT
s Greattit.chr8     447231 593 - 31324166 CACTGGAATATGCTAAAAAAGAATCAGGATGGGACCATGGGGGTTAAAGGTGAGTGTGATACAGGGAAGATGAAGAGCCCCTGGTGTCGATTAACACAATGTACCATAGATCCATC-GCTCTAATGCATTGTCTGGCTTTTACCTCGCAGCCTGAAGCCAAATCAATTTCAACAGTTTGTGCTGGACAGACACCAATTACAAA-------TTACCTCTTTCCTGCTGCTTAACAGGGCTACAAAAAATGGTTTAAAATTACTCTCTAAGACCTACCCACATGACTTTTAGAGCAGAAACCAACGTAACTTAAATGATCAAAAACTGTTTCTATTTCAAGGTGAAATGTCTTGGAAATTCATGGAAAAGAAAAATCCATGCTAAAATTAATCCAGAAATAGAAATG-TTTTCTGGTCACTGTAAGGAACTTAACTGGATTTGCAATCATAAAAAAGAATCAATTAGGAGTCAAATTTAAAATAAATTTAAAAATAGGAAAAGGAACTTCTCAGT------------------------ATGTATTCAGGTATTGCAGAACACTTAACATCATTCTGGCACTGCCAAACAACTATACGTCTATTTAACTAACTAACTAAGTGGATTTT

polarise_wga_ref_indels.py

This script returns off wga bed lines that contain the specified INDEL type.

Usage

$ ./polarise_wga_ref_indels.py -h
usage: polarise_wga_ref_indels.py [-h] -indel_type
                                  {non_indel,ambig_indel,deletion,insertion}

optional arguments:
  -h, --help            show this help message and exit
  -indel_type {non_indel,ambig_indel,deletion,insertion}
                        type of indel to output

Example and output

To extract all ref specific insertions:

$ ./maf_to_bed.py -i data/test.maf.gz -r Zebrafinch -c chr8 | sort -k1,1 -k2,2n | ./polarise_wga_ref_indels.py -indel_type insertion
chr8    4089946 4089971 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4089946,27030055,6146813,30876430       TTTGTATGAATTCTTTTGAATACCT,T------------------------,T------------------------,T------------------------   +,-,-,+ 170727.0

To extract all ref specific deletions:

$ ./maf_to_bed.py -i data/test.maf.gz -r Zebrafinch -c chr8 | sort -k1,1 -k2,2n | ./polarise_wga_ref_indels.py -indel_type deletion
chr8    4090041 4090042 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090041,27030121,6146884,30876501       T-,TG,TG,TG     +,-,-,+ 170727.0
chr8    4090219 4090220 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090219,27030264,6147064,30876680       G--,GAG,GAG,GAG +,-,-,+ 170727.0
chr8    4090314 4090315 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090314,27030361,6147161,30876770       C-,TC,TT,TT     +,-,-,+ 170727.0
chr8    4090374 4090375 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090374,27030423,6147222,30876831       G--,CCA,GAA,GTA +,-,-,+ 170727.0
chr8    4090379 4090380 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090379,27030430,6147229,30876838       G-------,GTGCTAAT,GTGTTAAT,GTGTTAAT       +,-,-,+ 170727.0

To extract all ambiguous INDELs:

$ ./maf_to_bed.py -i data/test.maf.gz -r Zebrafinch -c chr8 | sort -k1,1 -k2,2n | ./polarise_wga_ref_indels.py -indel_type ambig_indel
chr8    4089892 4089896 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4089892,27030005,6146759,30876376       TAGT,A---,CAGT,TAGT     +,-,-,+ 170727.0
chr8    4089943 4089945 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4089943,27030053,6146810,30876427       AC,A-,AC,AC     +,-,-,+ 170727.0
chr8    4090000 4090003 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090000,27030085,6146843,30876460       TTT,C--,ttt,TTT +,-,-,+ 170727.0
chr8    4090022 4090026 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090022,27030105,6146865,30876482       ATTG,A---,TTTG,ATTG     +,-,-,+ 170727.0
chr8    4090076 4090077 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090076,27030157,6146920,30876537       A-,C-,AA,A-     +,-,-,+ 170727.0
chr8    4090080 4090100 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090080,27030161,6146925,30876541       TAGTATTTCTGGATTAATTT,T-------------------,TGGtatttctggatttattt,TTCTATTTCTGGATTAATTT       +,-,-,+ 170727.0
chr8    4090120 4090125 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090120,27030182,6146965,30876581       TGTGT,T----,tgtga,CATGA +,-,-,+ 170727.0
chr8    4090187 4090201 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090187,27030245,6147032,30876648       CTGCTCTGAAAGTT,C-------------,CTGCTCTGAAAGTC,CTGCTCTAAAAGTC       +,-,-,+ 170727.0
chr8    4090269 4090277 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090269,27030316,6147116,30876732       AAAGAAGA,AAAGAAGG,aaagaaga,A-------       +,-,-,+ 170727.0
chr8    4090361 4090362 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090361,27030409,6147209,30876818       G-,AT,G-,C-     +,-,-,+ 170727.0
chr8    4090407 4090413 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090407,27030465,6147264,30876873       GTGTCA,G-----,GCTTGG,GTATCA     +,-,-,+   170727.0
chr8    4090424 4090425 +       Zebrafinch,Chicken,Flycatcher,Greattit  chr8,chr8,chr8,chr8     4090424,27030477,6147281,30876890       A-,GA,A-,AA     +,-,-,+ 170727.0