In my previous Hacker News EDA we looked at how words could be embedded in two dimensions.
This time, we implement a bootstrapping simulator to see the impact of posting time on the number of comments a post receives.
Examining the dataset
To get an idea of what keywords are popular at different times of the day, we will be using the Hacker News Posts dataset from Kaggle. If you haven’t seen my previous EDA, here’s what the dataset looks like:
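If you'd like to follow along, loading the dump with pandas is enough to get a preview (the filename matches the archive used later in this post; the column names follow the Kaggle schema and are an assumption on my part):

```python
import pandas as pd

# Kaggle "Hacker News Posts" dump, one row per post
_X = pd.read_csv("../archive/HN_posts_year_to_Sep_26_2016.csv")

print(_X.shape)
print(_X.head())  # id, title, url, num_points, num_comments, author, created_at
```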
We’ll make the simplifying assumptions that
- the popularity of a post can be judged by the number of comments, and
- the effect of post time is consistent among users.
Whereas the first assumption, that post popularity follows naturally from the number of comments, is fairly innocuous, the second would typically not suffice for legitimate analysis. Nonetheless, given that a real application would likely have access to user location from website tracking data, we’ll make do here with indiscriminate Eastern Standard Time.
To guide our EDA, consider an objective:
Determine the time of day that a post containing “X” in its title receives the greatest number of comments.
Given the limited set of information
Title, Likes, Number of comments, Time, Username (ambiguous)
it is impossible to appropriately model the number of comments that a post receives. To see why, look at the distribution of comments:
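A minimal sketch of that look at the distribution, assuming the num_comments column from the Kaggle schema:

```python
import matplotlib.pyplot as plt

# more than half of all posts never get a comment
print("zero-comment share:", (_X["num_comments"] == 0).mean())

# log-scaled counts expose the long right tail of outliers
plt.hist(_X["num_comments"], bins=100, log=True)
plt.xlabel("number of comments")
plt.ylabel("number of posts (log scale)")
plt.show()
```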
The data are heavily skewed by a few outliers receiving huge numbers of comments. And yet, more than half of the posts receive 0 comments.
A natural model here is a Poisson distribution: given a time period of observation, we could treat the comment count as a series of “post was commented on” events. Already, this should seem problematic: posts with identical keywords may follow different distributions and have different comment rates. Somebody with many followers receives more comments than a poster who just opened their account.

Without conditioning on a specific prior for the kind of post we’d like to model, these factors are indistinguishable in the supplied data, and in any case we lack enough examples of highly-commented posts to estimate them.
Instead, while keeping the original goal of modeling comment counts in mind, we’ll revise the objective (a bit of feature engineering):

Determine the time of day that a post containing “X” in its title is most likely to receive 1 or more comments.
Response variable
Given that most posts in the dataset belong to users who receive no comments, we’ll keep our analysis scoped to the average user. Each post can be considered as either (A) receiving a comment or (B) receiving none, and we label the entries accordingly.

Should we filter the posts, target specific users, or transform our response variable? Only if leaving things as they are would introduce unmanageable bias.

Luckily, this label’s average is a consistent estimator of an average user’s comment rate. Remember, most posts received 0 comments, and if there were a measurable bias imparted by the top-commented posters, it would act identically on each time bin. So besides removing NaN entries and lowercasing all strings, we’ll proceed without further modification.
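A sketch of that cleaning and labeling step (the commented column name is my own):

```python
# drop rows with missing titles and lowercase them,
# then attach the binary response: did the post receive any comments?
_X = _X.dropna(subset=["title"]).copy()
_X["title"] = _X["title"].str.lower()
_X["commented"] = (_X["num_comments"] > 0).astype(int)
```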
Transforming the data
We’ve encoded the Time field ordinally, dropping everything but HH:MM and indexing each post by the integer minute + 60 * hour.
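In code, assuming the timestamps live in a created_at column in the Kaggle format, the encoding and the ten bins used below amount to:

```python
# keep only HH:MM and encode ordinally: t = minute + 60 * hour
ts = pd.to_datetime(_X["created_at"])
_X["t"] = ts.dt.minute + 60 * ts.dt.hour

# ten equal bins over the day: 1440 minutes / 10 = 144 minutes per bin,
# so bin 0 starts at midnight and bin 5 starts at noon
_X["timebin"] = _X["t"] // 144
```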
There appear to be more posts between t = 800 and t = 1200 than at any other time of day, despite the dataset being time-zone agnostic.
Modeling comment likelihood
Now that we’ve acquired a distinct set of post statistics for each of the ten time bins (0 = 12am, 5 = 12pm), we’ll look at how the comments are distributed:
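In lieu of the figure, a quick numeric look at the per-bin comment rates (using the commented label from earlier):

```python
# empirical P(at least one comment) and sample size per time bin
print(_X.groupby("timebin")["commented"].agg(["mean", "size"]))
```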
There is no apparent connection between comments and time for the dataset at large. Observing the marginal distribution for a niche topic like “programming” doesn’t give any obvious clues either.
We haven’t looked at much data yet, but what we would like to see is whether commenters are time-sensitive. And for that we’ll need a statistical test.
Sampling
We can model the commenting behavior very simply with a Bernoulli likelihood: each post in a time slot either has received at least one comment ($X = 1$) or it hasn’t ($X = 0$). From a sample of $n$ postings, count the $k$ that have received at least one comment and estimate the probability as

$$\hat{p} = \frac{k}{n}.$$

Better yet, we’ll consider the log-odds

$$\hat{\theta} = \log\frac{\hat{p}}{1 - \hat{p}} = \log\frac{k}{n - k},$$

which for large $n$ is an approximately unbiased, asymptotically normal estimator of the log-odds of the Bernoulli parameter $p$.
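As a minimal sketch of the estimator (the 0.5 continuity correction, which guards against $k = 0$ or $k = n$ blowing up the log, is my addition):

```python
import numpy as np

def log_odds(sample):
    """Estimate the Bernoulli parameter and its log-odds from a 0/1 sample."""
    n, k = len(sample), int(np.sum(sample))
    p_hat = k / n
    theta_hat = np.log((k + 0.5) / (n - k + 0.5))  # continuity-corrected
    return p_hat, theta_hat
```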
Bootstrap Simulation
Through simulation, we’ll construct a t-distribution for each time slot, centered at the observed log-odds estimator, with variance determined through bootstrapping.
We define a bootstrap sampling routine for our DataFrame queries, extending the multiprocess bootstrapping module:
```python
from df_query_stats import *
import numpy as np

def do_bootstrap(_X, words):
    """
    Construct bootstrap samples of commented likelihoods for each
    category (hours of day / 10 segments).
    """
    # bootstrapped statistics go in word_stats, baseline_stats
    word_stats, baseline_stats = [], []
    # determine confidence/dof of the t test by counting rows in the
    # dataset; these go into word_sizes, baseline_sizes
    word_sizes, baseline_sizes = [], []
    # size of the bootstrap
    _IN, _OUT = 20, 10
    for u in words:
        bootstrap_estimates_per_t, word_sizes_u = [], []
        # ten categories; run the bootstrap outer loop for each
        for j in range(10):
            DF_handler = df_query_stats(
                _X, "title.str.contains('" + u + "') & timebin == " + str(j))
            estimates, sizes = DF_handler.bootstrap_mean_var(
                _OUT, _IN, "num_comments>0", nthr=6)
            bootstrap_estimates_per_t.append(estimates)
            word_sizes_u.append(sizes)
            del DF_handler
        word_sizes.append(word_sizes_u)
        # the baseline bootstrap simulation (mean across all categories)
        DF_handler = df_query_stats(_X, "title.str.contains('" + u + "')")
        _foo = []
        estimates, sizes = DF_handler.bootstrap_mean_var(
            _OUT, _IN, "num_comments>0", nthr=6)
        _foo.append(estimates)
        baseline_sizes.append(sizes)
        del DF_handler
        # collapse each bin's bootstrap replicates into (mean, variance)
        Mean_Var_per_t = np.array(
            [np.array(list(zip(*bootstrap_estimates_per_t[i]))).mean(axis=1)
             for i in range(len(bootstrap_estimates_per_t))])
        Mean_Var_per_t_base = np.array(
            [np.array(list(zip(*_foo[i]))).mean(axis=1)
             for i in range(len(_foo))])
        word_stats.append(Mean_Var_per_t)
        baseline_stats.append(Mean_Var_per_t_base)
    return Mean_Var_per_t, word_stats, baseline_stats

# where we've previously defined
# _X = pd.read_csv("../archive/HN_posts_year_to_Sep_26_2016.csv")
# -> ... -> cleaning :-> _X
Mean_Var_per_t, word_stats, baseline_stats = do_bootstrap(_X, words)
```
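If you don’t have the df_query_stats module handy, the core of bootstrap_mean_var is plain resampling with replacement; here is a self-contained stand-in (the interpretation of _OUT as the number of resamples and _IN as the resample size is my assumption):

```python
import numpy as np

def bootstrap_mean_var_sketch(values, n_outer, n_inner, rng=None):
    """Draw n_outer resamples of size n_inner with replacement from a
    0/1 array and return the (mean, variance) of the resampled means."""
    rng = np.random.default_rng() if rng is None else rng
    means = [rng.choice(values, size=n_inner, replace=True).mean()
             for _ in range(n_outer)]
    return float(np.mean(means)), float(np.var(means, ddof=1))

# e.g. for one word/time-bin slice:
# mask = _X["title"].str.contains("programming") & (_X["timebin"] == 3)
# vals = (_X.loc[mask, "num_comments"] > 0).to_numpy()
# m, v = bootstrap_mean_var_sketch(vals, n_outer=10, n_inner=20)
```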
From our earlier dimensionality reduction, let’s pick a few words to look at:
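For instance, the four words that will be compared at the end of this post:

```python
words = ["programming", "iphone", "lessons", "law"]
```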
With these words in mind, we execute the sampling routine. It will also help to compute effect sizes and the unnormalized displacement-from-mean for each time point, in case we later want to train a decision-making algorithm for the best posting time.
After bootstrapping is complete …
```python
DF_handler = df_query_stats(
    _X, "title.str.contains('" + u + "') & timebin == " + str(j))
estimates, sizes = DF_handler.bootstrap_mean_var(
    _OUT, _IN, "num_comments>0", nthr=6)
bootstrap_estimates_per_t.append(estimates)
word_sizes_u.append(sizes)
```
… we can finally compare the words by their t-distributions. I picked the words somewhat arbitrarily, so there are surely better illustrations of time preference to be found in the dataset. Nonetheless, the t curves give an inviting cursory look at how posting time could be chosen with a keyword in mind:
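For reference, a sketch of how such curves can be drawn from the bootstrap output: center each t density at the bootstrapped mean, scale it by the bootstrap standard error, and take the degrees of freedom from the slice’s row count (the indexing into word_stats and word_sizes is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_t_curve(mean, var, n, label):
    """t density centered at the bootstrap mean with bootstrap s.e."""
    se = np.sqrt(var)
    x = np.linspace(mean - 4 * se, mean + 4 * se, 200)
    plt.plot(x, stats.t.pdf(x, df=n - 1, loc=mean, scale=se), label=label)

# e.g. one curve per time bin for the first word:
# for j, (m, v) in enumerate(word_stats[0]):
#     plot_t_curve(m, v, n=word_sizes[0][j], label=f"bin {j}")
# plt.legend(); plt.show()
```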
Deciding what time to post
We truly don’t have much information yet, but we’ll nonetheless attempt a simple decision algorithm: given my post with keyword “X” in the title, when do I post it?
So far we’ve computed
- t-distribution for commentability in each time slot for each keyword
- effect size and difference in likelihood ratios
Any machine-learning algorithm using this information alone would be unwieldy at best, but we’ll try it, and at the least learn how our classification error responds to inaccuracies in our estimators of effect size and likelihood ratios.
Consider a simple filter. With the estimated effect $e_t$ and displacement-from-mean $d_t$ for a keyword at each time point $t$, let

$$s_t = w_e\, e_t + w_d\, d_t, \qquad t^* = \operatorname*{arg\,max}_t\, s_t,$$

where $\operatorname{arg\,max}$ provides the index of the time point corresponding to the greatest value over all coordinates of $s$. That is, we compute a weighted average of effect $e_t$ and displacement $d_t$; since these are estimators of likelihood, the most likely decision is obtained by taking the greatest value in $s$.
By assigning more weight to the effect, we’re choosing to favor the time points that we’ve simply observed more frequently. Likewise, assigning more weight to the displacement favors the time points whose estimated likelihood stands out most from the keyword’s mean.
This can be thought of as updating a Bayesian prior: if there is some unknown distribution of best posting/commenting times, then our most likely time choice is the one predicted by our confidence intervals, weighted according to how often we’d observed each time point. And if the unknown distribution deviates from what we expect, its deviation is, after all, measured against the center/mean we’ve estimated.
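As a minimal sketch of the filter (array and weight names are illustrative; effect and displacement are the per-bin estimates computed during bootstrapping):

```python
import numpy as np

def best_time(effect, displacement, w_e=0.7, w_d=0.3):
    """Score each time bin by a weighted average of effect size and
    displacement-from-mean, then return the argmax bin index."""
    s = w_e * np.asarray(effect) + w_d * np.asarray(displacement)
    return int(np.argmax(s))
```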
Simulation
Taking a mixture of our observed best-comment-time distributions for the four words “programming,” “iphone,” “lessons,” and “law,” we find, as expected, that the error for a simulated “unknown distribution” fares better when more weight is given to effect size. This simulates the bias-variance tradeoff for a single observation set and a single unknown probability distribution.
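One way to set such a check up in code; the construction of the unknown distribution and every name here are my own simplification:

```python
import numpy as np

def simulate_error(effect, displacement, best_time_pmf, w_e,
                   n_trials=2000, rng=None):
    """effect, displacement: length-10 arrays for one keyword.
    best_time_pmf: a simulated 'unknown' distribution over the ten time
    bins (e.g. a mixture of the observed per-word distributions; must
    sum to 1). Returns how often the filter's pick misses a draw."""
    rng = np.random.default_rng() if rng is None else rng
    s = w_e * np.asarray(effect) + (1.0 - w_e) * np.asarray(displacement)
    pick = int(np.argmax(s))
    draws = rng.choice(10, size=n_trials, p=best_time_pmf)
    return float(np.mean(draws != pick))
```

Sweeping w_e over [0, 1] and plotting the returned error traces out the tradeoff described above.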
Identifying the mode of each word’s time distribution is equivalent to maximizing the likelihood of obtaining a comment.
[Figure: classification error as a function of the filter weights. Left: x-axis = weight given to effect, y-axis = weight given to displacement. Right: x-axis = weight given to effect with the displacement weight fixed, y-axis = error.]