placement masses vs number

Hello!

I'm using the visualize_placements demo, and I would like to know what "placement mass" exactly means. The first part of the code seems to count the number of placements per edge, so basically this is what I expected. However, the output seems to show the accumulated likelihood weight ratio (ALWR) instead. Which one is correct?

Thank you very much!


Alicia

Comments

  • Hi Alicia,

    the demo visualizes the distribution of the likelihood weight ratios, as this is the more "detailed" view of the data. If by "the first part of the code" you mean this function: https://github.com/lczech/genesis/blob/master/doc/demos/visualize_placements.cpp#L65
    then this is exactly where this summation is done.

    If you simply want to visualize the "number of placements" per edge, you probably mean that you just want to show where the most probable placement is, right?
    This basically means that you get rid of all the extra information provided by the like_weight_ratio. Thus, in order to get this visualization, you can filter out all but the most probable placement per pquery and then set its mass to 1, so that it counts as exactly 1 placement on 1 branch.

    This can be achieved by some filtering of the samples after reading them: After reading the jplace file (https://github.com/lczech/genesis/blob/master/doc/demos/visualize_placements.cpp#L276), add those lines:

    // Remove all but the most probable placement.
    filter_n_max_weight_placements( sample, 1 );

    // Normalize the mass, so that the remaining placement gets a mass of 1.
    normalize_weight_ratios( sample );

    Is that what you were looking for?

    So long
    Lucas
  • Hi Lucas, 

    I agree that the distribution of likelihood weight ratios is very informative, but it might lead to misinterpretations. For example, the branch with the highest LWR might be formed by tons of placements with an LWR of 0.00001 each. So, I prefer to filter out the lowest placements for each pquery (done via the --keep-factor option in pplacer) and then see which branches contain the highest number of placements.

    For that reason, I would like to visualize the "number of placements", but all the placements of a pquery, not only the most probable one. So, if a pquery has 7 placements, each of them would count as 1.

    I'm not sure whether this approach is the best one, because I like the idea of using the likelihood. The problem is that a branch with a high accumulated LWR might not be the most probable one in the end, because it might be formed by hundreds of placements with very low LWRs, right?

    A. 


  • edited October 2016
    Hi Alicia,

    okay, it seems I should go a bit more into detail ;-)

    You said "a branch with the highest LWR might be formed by tons of placements with 0.00001 LWR". Well, let's assume "a ton" of placements are 1000. Then, this branch would only get a total weight of 1000 x 0.00001 = 0.01. Assuming that your reference tree is well suited for your query sequences, you will (hopefully) find that for many sequences, the most probable placement position has a likelihood weight ratio (LWR) of greater than 0.9 (maximum is 1.0). So, a single one of those "well placed" sequences (i.e., with high confidence / high LWR) already puts 90 times more mass on the branch than those 1000 low-LWR placements.
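    This back-of-the-envelope arithmetic can be checked with a tiny standalone snippet (plain C++, independent of genesis; the numbers are the hypothetical ones from the example above):

    ```cpp
    #include <cassert>
    #include <cmath>

    int main()
    {
        // A "ton" (1000) of low-confidence placements on one branch...
        double const low_mass  = 1000 * 0.00001;

        // ...versus a single well-placed sequence on another branch.
        double const high_mass = 0.9;

        // The 1000 low-LWR placements together only amount to 0.01 mass,
        // so the single confident placement carries 90 times more.
        assert( std::abs( low_mass - 0.01 ) < 1e-9 );
        assert( std::abs( high_mass / low_mass - 90.0 ) < 1e-6 );
    }
    ```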

    Using the `--keep-factor` option in pplacer is mostly useful to keep your output files small. Those low-weight placements do not contribute to most of the downstream analyses anyway, so it is okay to filter them out. In other words: When running the placement algorithm (pplacer), there is a total mass of 1.0 for each sequence to distribute over all the branches of the tree. It is a probability distribution of how likely the sequence sits on each branch. So, the LWRs of one sequence over all branches of the tree always sum up to 1.0. If you just keep the more probable ones, you are throwing away a bit of that probability mass, so you will end up with a sum of something like 0.95 instead - but again, this is okay, because the 0.95 still represents the most probable placement positions.
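    To illustrate the effect of such a cutoff on the probability mass, here is a sketch (plain C++, not genesis code; the LWR values and the absolute cutoff are made up, and pplacer's actual --keep-factor is relative to the best placement, so the exact semantics differ):

    ```cpp
    #include <cassert>
    #include <cmath>
    #include <vector>

    int main()
    {
        // Hypothetical LWRs of one pquery over all branches; they sum to 1.0.
        std::vector<double> const lwrs = { 0.70, 0.20, 0.05, 0.03, 0.01, 0.005, 0.005 };

        // Keep only placements above some cutoff (a simplified stand-in for
        // pplacer's --keep-factor, which is relative to the best placement).
        double const cutoff = 0.02;
        double kept_mass = 0.0;
        std::size_t kept_count = 0;
        for( auto lwr : lwrs ) {
            if( lwr >= cutoff ) {
                kept_mass += lwr;
                ++kept_count;
            }
        }

        // Only 4 of the 7 placements survive, but they still carry
        // 0.98 of the total probability mass.
        assert( kept_count == 4 );
        assert( std::abs( kept_mass - 0.98 ) < 1e-9 );
    }
    ```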

    So, if a pquery "has 7 placements", all this means is that the algorithm threw away all other possible placements because of their low probability. It does not automatically mean that the sequence has 7 likely placement positions. It could well be that only one or two are actually probable (as indicated by their LWR), and the rest have a low probability/LWR. If you now decide to "pump up" all those 7 positions to a weight of 1.0, you are heavily overestimating those low probabilities. So, a pquery where 7 placements were kept will count more than twice as much as one where only 3 were kept - although both represent only one sequence. You would thus simply measure how many placements were kept by the algorithm for a pquery - which is kind of arbitrary and does not represent its "distribution" over the tree.
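    The overcounting can be made concrete with a small sketch (plain C++; the two pqueries and their LWR values are hypothetical):

    ```cpp
    #include <cassert>
    #include <cmath>
    #include <numeric>
    #include <vector>

    int main()
    {
        // Two pqueries, each representing exactly one sequence.
        // For pquery A, 7 placements were kept, most of them improbable.
        std::vector<double> const a = { 0.40, 0.20, 0.15, 0.10, 0.08, 0.04, 0.03 };
        // For pquery B, only 3 placements were kept.
        std::vector<double> const b = { 0.90, 0.06, 0.04 };

        // Counting each kept placement as 1 ("pumping up" to full weight)
        // makes A weigh more than twice as much as B...
        assert( a.size() > 2 * b.size() );

        // ...although by total LWR mass, both represent one sequence.
        double const a_mass = std::accumulate( a.begin(), a.end(), 0.0 );
        double const b_mass = std::accumulate( b.begin(), b.end(), 0.0 );
        assert( std::abs( a_mass - 1.0 ) < 1e-9 );
        assert( std::abs( b_mass - 1.0 ) < 1e-9 );
    }
    ```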

    In that light, I think your final question does not really make sense. What is "the most probable" branch, if not the one with the most placement mass?

    Does this help? Let me know whether I understood your questions correctly and whether you need more clarification.

    Best
    Lucas
  • Hi Lucas, 

    Thank you soooo much for this reply.

    Another PhD student and I were having so many discussions about whether to consider likelihoods or only number of placements... 
    I know exactly what you mean now and yes, we should definitely consider placement mass. Your example about tons of placements vs 7 positions was very clarifying :)

    Thank you very muuuch, and sorry for this annoyance. 


    A. 
  • Glad I could help :-)

    Lucas