Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

classification table from jplace files

Hi Lucas,

Thanks for your help on our previous questions. :) I have something a bit different this time. Its my understanding that the .jplace files produced by EPA and pplacer contains multiple placements for each sequence. Is it possible to have genesis produce a table containing the likelihoods of all placements for each sequence from a .jplace file? Just a simple 3 column table containing sequence ID, placement, and likelihood.

Comments

  • One more thing, would it be possible to add another column for taxonomy? I suppose taxonomy can be pulled from the tree via the placement node?

    And I guess likelihood weight ratio would be more informative here than likelihood...
  • Hi Charles,

    that is a simple task. Maybe you want to try it on your own - would be a good exercise to start working with the genesis API ;-)
    A good starting point is the placement tutorial: http://doc.genesis-lib.org/placement.html
    There, the information you want is written with a bit more surrounding text, but it is easy to output a table instead.

    If you want, I can also do it - it's just a few lines. Just a couple of questions:
    • The column "placement" is supposed to hold the edge id, right? Do you want to use the same ids as the previous program (http://support.genesis-lib.org/discussion/7/creating-lists-of-taxa-names#latest) uses?
    • Do  you want the table to be tab-separated, or which format?
    • As for the taxonomy, what exactly do you want to achieve and in which format do you have the taxonomy? Currently, I am working on a representation of taxonomies like "Animalia;Vertebrata;Mammalia;Carnivora", but this is not yet ready for releasing.

    So long
    Lucas

  • Hi Lucas,

    Thanks for the quick reply. I looked into coding this myself based on the placement tutorial, but since I've never coded C++ before, the farthest I got was reading the jplace file... It should be a simple script though so I think I would learn a lot from just reading the code.

    In regards to your questions
    • Yes, placement would hold the edge ID. I think just the number itself would be fine.
    • A tab-delimited text file is fine.
    • Ideally, the taxonomy you described would be great. However, just the name of the sequence closest to each placement within the tree would suffice as well.
    Thanks for your help Lucas!
  • edited June 2016
    Hi Charles,

    all right, then here is your chance to learn some C++ ;-)

    You find the code here: https://gist.github.com/lczech/24c12a912e4eb0ed38d517d887c42db5
    Works as always, put it in the "apps" directory and call "make update".

    The program expects a jplace filename and an output filename as parameters. The tab-separated columns of the output file then contain:
    • The first name of the Pquery (if there are none, a default name).
    • A counter telling you which Placement of this Pquery it is (e.g., for a Pquery with 5 placement positions, this value will go from 1 to 5 in consecutive lines).
    • The edge index (not the edge_num property of the jplace file, but an internal id - this is the same number as used the other code from a couple of weeks ago).
    • The like_weight_ratio value of that placement. They are sorted, so that the more likely ones come first.
    • The name of the leaf node of the reference tree that is closest to the placement. If there are multiple leaves with the same distance, only one of them is written. The distance is simply measured as the number of nodes between the placement and the leaf node - that is, the branch lengths are ignored. If you want branch length instead, let me know.
    • The distance (in nodes) between the placement and that leaf node of the tree. Example: A placement that sits on an edge with a leaf node has a value of 0. Placements on inner branches then have higher numbers.

    If you do not need certain columns, simply comment out their lines in the code.

    Let me know whether this works as expected.

    So long
    Lucas


    PS: I recently updated the demo for extracting placements in clades so that the file format for the clades is simpler now. See here: http://support.genesis-lib.org/discussion/4/files-and-syntax-for-extract-clade-placements#latest
    Maybe you want to switch to that.

  • Fantastic. This is exactly what I need. Thanks a lot Lucas! I am sure C++ will find its way into my head... eventually. ;)

    In regards to extracting clade placements, we are finding that many edges contain only very few placements, which becomes very inefficient for post OTU clustering. In fact, we are getting many more OTUs than if we cluster everything at the beginning due to oversplitting by edges. So I was wondering if it would be possible to set a threshold for the minimum number of placements before a clade is extracted? If an edge does not contain enough placements to satisfy the threshold number, the closest edge would be added until the threshold is satisfied. I know this request is probably a lot more complicated to code than the ones before...
  • Hi Charles,

    yes, if you do not have that many query sequences, I'd expect oversplitting if you use EPA as a form of OTU clustering. We are actually currently working on a comparison of exactly this approach (including a downstream analysis with PTP - http://sco.h-its.org/exelixis/web/software/PTP/) and SWARM (https://github.com/torognes/swarm). I'll keep you updated. You might also want to look into SWARM for OTU clustering - it's pretty cool!

    As for your suggestion to use a threshold and adding/merging the closest edge: This is not straightforward. For example, what do you do if two neighbouring edges have one placement each? Which one do you add to which? What you are proposing is basically a clustering approach in itself - this would not just be a bit more coding, but a whole new algorithm...

    But why do you want to use the branch-wise approach in the first place?
    Didn't the extraction of queries in clades work for you? That
    way, you'd avoid the oversplitting, while getting meaningful assignments
    of the queries to clades.

    Best
    Lucas
  • Hi Lucas,

    We were originally hoping to use placement as a coarse sorting method prior to OTU clustering in the hopes that we could reduce the number of OTUs. The extraction of queries in clades worked, but many clades contained few placements, which could only then yield a minimum of a single OTU. Its great that you guys are working on something like this. I'll leave this idea on the backburner and focus on other things then. :)
  • Hi Charles,

    your idea sounds interesting. However, evolutionary placement was not developed as a method for OTU clustering, so it doesn't surprise me that you got unsatisfying results. Usually, you'd want to use some OTU clustering method first, and then do placement on the OTUs. I'd highly recommend using SWARM for this purpose - this will drastically reduce your data set size.

    By the way, why did you have problems with data set size in the first place? How many reads/amplicons/sequences are you working with? And what questions are you trying to answer (that is, what form of result do you want to obtain)?

    Lucas
  • Hi Lucas,

    Yes, our original pipeline involves OTU clustering first followed by taxonomic assignment. We were hoping to try and implement a pipeline describe in Zhang et al. 2013, which involves placement of sequence reads onto a reference tree first followed by species delimitation of query sequences placed at each branch of the tree using the Poisson tree processes (PTP) model.

    I have tried SWARM on a few of our data sets but CROP (98% similarity threshold) produces less OTUs. But even with CROP, we observe oversplitting of OTUs since we still have many OTUs being assigned the same taxonomy. Our largest data set has about 700,000 reads. We are metabarcoding vertebrate DNA collected by insects and there are obvious errors in the taxonomic assignments so we are working on various ways to improve the pipeline.

    Zhang, J., Kapli, P., Pavlidis, P. & Stamatakis, A. (2013) A general species delimitation method with applications to phylogenetic placements. Bioinformatics, 29, 2869–2876.
  • edited June 2016
    Hi Charles,

    well, speaking from a computer science perspective, if your data tells you that you have that many OTUs, then you should not try to artificially enforce fewer. In particular, SWARM gives you "natural" OTUs. That means, it clusters all reads that are microvariations of each other (due to sequencing errors etc), and separates the ones that are more distant (thus, probably a different organism).

    So, making your goal "to have as few OTUs as possible" is probably not a good idea. There can be many organisms in your data that get the same taxonomic assignment in the end - but if you first put all of them into one OTU, you will miss all their diversity!

    Lastly, if you really have to enforce fewer OTUs, you can try the SWARM option "d" with a higher value (2 or 3 maybe).

    Lucas

    PS: Fun fact, I am sitting here with the inventor of SWARM, and he confirmed this.
  • Hi Charles and Doug,

    we recently released a significantly faster and more accurate version of
    our species delimitation tool PTP:

    Preprint: http://biorxiv.org/content/early/2016/07/14/063875
    Web-Service: http://mptp.h-its.org/
    Documentation & Download: https://github.com/Pas-Kapli/mptp

    Cheers,
    Lucas
  • We just tried mPTP on a dataset of ~4000 COI sequences from beetles (Sanger seqs, full barcode length of 658 bp, collected over a whole year.).  Both GMYC and PTP were giving us around 1000 OTUs.  mPTP gives us around 560 OTUs.  Quite a difference.  

    It's such a difference that we've decided to go the other way for now and place all 4000 haplotypes on our tree of the beetles (will probably take a week, using raxml-epa).  

    This means that when we calculate PD between samples, it will be weighted for abundance.  We'll also analyse with the mPTP OTUs.

  • Hi Doug,

    I just spoke with the mPTP guys in our group. It is expected that it gives you fewer counts than GMYC and PTP. mPTP penalizes if there are too many parameters, i.e., too many species - thus it yields more robust, but lower numbers.

    If your goal is to compare samples, then either of the tools should be fit for the job. They might give you different absolute numbers, but when using those numbers to compare different samples, the relative number is informative. Example: PTP might give you a number twice as high as mPTP, but it does so consistently over all samples. So, when comparing those number to each other, this should give similar results.

    What I do not quite understand: If you used those tools, that means you already have a tree for all of them. Now, if you want to run EPA with your reads, which reference tree do you use for that?

    Lucas

  • Hi Lucas,

    We have an ML tree of the beetles (built from complete and partial mitogenomes) with 16001 tips.  this is from colleagues.  Happily, all the sequences have the barcode region of the COI gene, and our query seqs are COI barcodes.  

    This is the first time i try to do this kind of analysis, so i'm happy to have a decent tree to work with.  Since I last wrote, we successfully ran raxml-epa and pplacer with this dataset.  We notice some different placements, so we have to explore why this is the case.  

    Doug
  • Hi Doug,

    that's interesting. How long are the query sequences you are using? If they are short (<150bps), you might get sequences that change position if the phylogenetic signal is not strong enough for placing them. Also, EPA and pplace work slightly different. So, the placements won't be exactly the same, but at least should be on nearby branches for the same sequence (that is, if the the signal is strong enough).

    If you are interested, I can prepare some scripts for genesis for comparing placement files, i.e., how many placements ended up on different branches and how far those branches are away from each other. Let me know if this would help.

    Lucas
  • Hi Lucas,

    I need to set up a personal system for monitoring these pages!  our queries are full barcodes (658 bp).  i imagine that they should have pretty good signal, and my first guess is that the differences derive from different parameter choices.  we have not done any kind of systematic comparison between the epa and pplacer results, so it would be very interesting to have an 'app' to compare the placements.  it would be useful.  thank you in advance.

    doug
  • Hi Doug,

    I just released genesis v0.10.0, which also contains a demo program for comparing two jplace files: http://doc.genesis-lib.org/demos_compare_jplace_files.html

    It should cover your case. However, you probably don't need to worry about the differences between EPA and pplacer too much - unless you find huge differences in their results. In that case, let me know, so that we can investigate.

    So long
    Lucas

    PS: In the forum settings, you can select to get an email notification for new posts. Maybe this helps ;-)
  • Hi Lucas

    Thanks very much for this.  (I set my notification preferences to get emails, but it didn't work, and nothing is in my spam folder.  i will keep checking directly. )

    doug
Sign In or Register to comment.