
I'm using the visualize_placements demo, and I would like to know what "placement mass" means exactly. The first part of the code seems to count the number of placements per edge, which is basically what I expected. However, the output seems to show the accumulated likelihood weight ratio (ALWR) instead. Which one is correct?

Thank you very much!

Alicia

## Comments

The demo visualizes the distribution of the likelihood weight ratios, as this is the more "detailed" view on the data. If by "the first part of the code" you mean this function: https://github.com/lczech/genesis/blob/master/doc/demos/visualize_placements.cpp#L65

then this is exactly where the likelihood weight ratios are summed up per edge.

If you simply want to visualize the "number of placements" per edge, you probably mean that you just want to show where the most probable placement is, right?

This basically means that you get rid of all the extra information provided by the like_weight_ratio. Thus, in order to get this visualization, you can filter out all but the most probable placement per pquery and then set its mass to 1, so that it counts as exactly 1 placement on 1 branch.

This can be achieved by filtering the samples after reading them. After reading the jplace file (https://github.com/lczech/genesis/blob/master/doc/demos/visualize_placements.cpp#L276), add these lines:

filter_n_max_weight_placements( sample, 1 );

// Normalize the mass, so that the remaining placement gets a mass of 1.

normalize_weight_ratios( sample );
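To make the effect of these two calls more concrete, here is a minimal stand-alone sketch that mimics them on plain structs. It deliberately does not use the genesis API: `Placement`, `Pquery`, `keep_best_placement` and `normalize` are hypothetical names made up for this illustration, and the real functions work on a whole sample rather than a single pquery.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not genesis types.
struct Placement {
    int    edge_num;          // branch of the reference tree
    double like_weight_ratio; // LWR of this placement
};

using Pquery = std::vector<Placement>;

// Keep only the placement with the highest LWR and drop all others.
void keep_best_placement( Pquery& pquery )
{
    assert( !pquery.empty() );
    Placement const best = *std::max_element(
        pquery.begin(), pquery.end(),
        []( Placement const& a, Placement const& b ) {
            return a.like_weight_ratio < b.like_weight_ratio;
        }
    );
    pquery.assign( 1, best );
}

// Rescale the remaining LWRs so that they sum to 1.0 again.
void normalize( Pquery& pquery )
{
    double sum = 0.0;
    for( auto const& p : pquery ) {
        sum += p.like_weight_ratio;
    }
    assert( sum > 0.0 );
    for( auto& p : pquery ) {
        p.like_weight_ratio /= sum;
    }
}
```

For a pquery with LWRs 0.7, 0.2 and 0.05, this leaves a single placement on the branch that had 0.7, now with a mass of 1.0, so it counts as exactly one placement on one branch.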

Is that what you were looking for?

So long

Lucas

Okay, it seems I should go into a bit more detail ;-)

You said "a branch with the highest LWR might be formed by tons of placements with 0.00001 LWR". Well, let's assume that "a ton" of placements is 1000. Then this branch would only get a total weight of 1000 x 0.00001 = 0.01. Assuming that your reference tree is well suited for your query sequences, you will (hopefully) find that for many sequences, the most probable placement position has a likelihood weight ratio (LWR) greater than 0.9 (the maximum is 1.0). So a single one of those "well placed" sequences (i.e., with high confidence / high LWR) already puts at least 90 times more mass on the branch than those 1000 low-LWR placements combined.
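The arithmetic can be checked with a few lines of code. This is only a sketch with plain numbers; `branch_mass` is a hypothetical helper for this example, not a genesis function.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Total placement mass on one branch: the sum of the LWRs placed there.
double branch_mass( std::vector<double> const& lwrs )
{
    for( double lwr : lwrs ) {
        assert( lwr >= 0.0 && lwr <= 1.0 ); // an LWR is a probability
    }
    return std::accumulate( lwrs.begin(), lwrs.end(), 0.0 );
}
```

Calling it with 1000 values of 0.00001 gives roughly 0.01, while a branch holding a single placement with an LWR of 0.9 gets 0.9, which is about 90 times as much.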

Using the `--keep-factor` option in pplacer is mostly useful to keep your output files small. Those low-weight placements do not contribute to most of the downstream analyses anyway, so it is okay to filter them out. In other words: When running the placement algorithm (pplacer), there is a total mass of 1.0 for each sequence to distribute over all the branches of the tree. It is a probability distribution of how likely the sequence sits on each branch. So, the LWRs of one sequence over all branches of the tree always sum up to 1.0. If you just keep the more probable ones, you are throwing away a bit of that probability mass, so you will end up with a sum of something like 0.95 instead - but again, this is okay, because the 0.95 still represents the most probable placement positions.
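Here is a small sketch of that effect. It assumes that the threshold is relative to the best LWR of the sequence (that is, a placement is dropped if its LWR is below the keep factor times the best LWR), which is how I understand `--keep-factor`; please check the pplacer documentation for the exact rule. `apply_keep_factor` and `remaining_mass` are hypothetical names for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Keep only the LWRs that are at least keep_factor times the best LWR.
// Assumption: this relative threshold mirrors pplacer's --keep-factor.
std::vector<double> apply_keep_factor(
    std::vector<double> const& lwrs, double keep_factor
) {
    assert( !lwrs.empty() );
    double const best = *std::max_element( lwrs.begin(), lwrs.end() );

    std::vector<double> kept;
    for( double lwr : lwrs ) {
        if( lwr >= keep_factor * best ) {
            kept.push_back( lwr );
        }
    }
    return kept;
}

// Probability mass that is still present after filtering.
double remaining_mass( std::vector<double> const& kept )
{
    return std::accumulate( kept.begin(), kept.end(), 0.0 );
}
```

With LWRs of 0.80, 0.15, 0.03, 0.015 and 0.005 (summing to 1.0) and a keep factor of 0.05, only 0.80 and 0.15 survive, so the remaining mass is about 0.95.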

So, if a pquery "has 7 placements", all this means is that the algorithm threw away all other possible placements because of their low probability. It does not automatically mean that the sequence has 7 likely placement positions. It could well be that only one or two are actually probable (as indicated by their LWR), and the rest have a low probability/LWR. If you now decide to "pump up" all those 7 positions to a weight of 1.0, you are heavily overestimating those low probabilities. So, a pquery where 7 placements were kept will count more than twice as much as one where only 3 were kept, although both represent only one sequence. You will thus simply measure how many placements were kept by the algorithm for a pquery, which is kind of arbitrary and does not represent its "distribution" over the tree.
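The difference between the two ways of counting can be sketched as follows. Both functions are hypothetical helpers for this example, and each vector holds the LWRs of the placements that were kept for one pquery.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Contribution of one pquery if every kept placement is "pumped up" to 1.0.
double count_based_mass( std::vector<double> const& lwrs )
{
    return static_cast<double>( lwrs.size() );
}

// Contribution of the same pquery if each placement counts with its LWR.
double lwr_based_mass( std::vector<double> const& lwrs )
{
    double const sum = std::accumulate( lwrs.begin(), lwrs.end(), 0.0 );
    assert( sum <= 1.0 + 1e-9 ); // the LWRs of one sequence sum to at most 1.0
    return sum;
}
```

For a pquery with 7 kept placements (say 0.90, 0.04, 0.02 and four times 0.01) and one with 3 kept placements (0.95, 0.03, 0.02), the count-based version gives 7 versus 3, whereas the LWR-based version gives about 1.0 for both, as each of them is a single sequence.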

In that light, I think your final question does not really make sense. What is "the most probable" branch, if not the one with the most placement mass?

Does this help? Let me know whether I understood your questions correctly and whether you need more clarification.

Best

Lucas