In my previous Hacker News EDA we looked at how words could be embedded in two dimensions.
This time, we implement a bootstrapping simulator to see the impact of posting time on the number of comments a post receives.
Examining the dataset
To get an idea of what keywords are popular at different times of the day, we will be using the Hacker News Posts dataset from Kaggle. If you haven’t seen my previous EDA, here’s what the dataset looks like:
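If you'd like to follow along, loading the dump with pandas is enough to get a preview (the filename matches the archive used later in this post; the column names follow the Kaggle schema and are an assumption on my part):

```python
import pandas as pd

# Kaggle "Hacker News Posts" dump, one row per post
_X = pd.read_csv("../archive/HN_posts_year_to_Sep_26_2016.csv")

print(_X.shape)
print(_X.head())  # id, title, url, num_points, num_comments, author, created_at
```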
We’ll make the simplifying assumptions that
- the popularity of a post can be judged by the number of comments, and
- the effect of post time is consistent among users.
Whereas the first assumption, that post popularity follows naturally from the number of comments, is fairly innocuous, the second would typically not suffice for legitimate analysis. Nonetheless, given that a real application would likely have access to user location from website tracking data, we’ll make do here with indiscriminate Eastern Standard Time.
To guide our EDA, consider an objective:
Determine the time of day that a post containing “X” in its title receives the greatest number of comments.
Given the limited set of information
Title, Likes, Number of comments, Time, Username (ambiguous)
it is impossible to appropriately model the number of comments that a post receives. To see why, look at the distribution of comments:
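A minimal sketch of that look at the distribution, assuming the num_comments column from the Kaggle schema:

```python
import matplotlib.pyplot as plt

# more than half of all posts never get a comment
print("zero-comment share:", (_X["num_comments"] == 0).mean())

# log-scaled counts expose the long right tail of outliers
plt.hist(_X["num_comments"], bins=100, log=True)
plt.xlabel("number of comments")
plt.ylabel("number of posts (log scale)")
plt.show()
```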
The data are heavily skewed by a few outliers receiving huge numbers of comments. And yet, more than half of the posts receive 0 comments.
A natural model here is a Poisson distribution: given a time period of observation, we could treat the comment count as a series of “post was commented on” events. Already, this should seem problematic: posts with identical keywords may follow different distributions and have different comment rates. Somebody with many followers receives more comments than a poster who just opened their account.

Without conditioning on a specific prior for the kind of post we’d like to model, these factors are indistinguishable in the supplied data, and in any case we lack enough examples of highly-commented posts to estimate them.
Instead, while keeping the original goal of modeling comment counts in mind, we’ll revise the objective (a bit of feature engineering):

Determine the time of day that a post containing “X” in its title is most likely to receive 1 or more comments.
Response variable
Given that most posts in the dataset belong to users who receive no comments, we’ll keep our analysis scoped to the average user. Each post can be considered as either (A) receiving a comment or (B) receiving none, and we label the entries accordingly.

Should we filter the posts, target specific users, or transform our response variable? Only if leaving things as they are would introduce unmanageable bias.

Luckily, this label’s average is a consistent estimator of an average user’s comment rate. Remember, most posts received 0 comments, and if there were a measurable bias imparted by the top-commented posters, it would act identically on each time bin. So besides removing NaN entries and lowercasing all strings, we’ll proceed without further modification.
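A sketch of that cleaning and labeling step (the commented column name is my own):

```python
# drop rows with missing titles and lowercase them,
# then attach the binary response: did the post receive any comments?
_X = _X.dropna(subset=["title"]).copy()
_X["title"] = _X["title"].str.lower()
_X["commented"] = (_X["num_comments"] > 0).astype(int)
```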
Transforming the data
We’ve encoded the Time field ordinally, dropping everything but HH:MM and indexing each post by the integer minute + 60 * hour.
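In code, assuming the timestamps live in a created_at column in the Kaggle format, the encoding and the ten bins used below amount to:

```python
# keep only HH:MM and encode ordinally: t = minute + 60 * hour
ts = pd.to_datetime(_X["created_at"])
_X["t"] = ts.dt.minute + 60 * ts.dt.hour

# ten equal bins over the day: 1440 minutes / 10 = 144 minutes per bin,
# so bin 0 starts at midnight and bin 5 starts at noon
_X["timebin"] = _X["t"] // 144
```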
There appear to be more posts between t = 800 and t = 1200 than at any other time of day, despite the dataset being time-zone agnostic.
Modeling comment likelihood
Now that we’ve acquired a distinct set of post statistics for each of the ten time bins (0 = 12am, 5 = 12pm), we’ll look at how the comments are distributed:
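In lieu of the figure, a quick numeric look at the per-bin comment rates (using the commented label from earlier):

```python
# empirical P(at least one comment) and sample size per time bin
print(_X.groupby("timebin")["commented"].agg(["mean", "size"]))
```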
There is no apparent connection between comments and time for the dataset at large. Observing the marginal distribution for a niche topic like “programming” doesn’t give any obvious clues either.
We haven’t looked at much data yet, but what we would like to see is whether commenters are time-sensitive. And for that we’ll need a statistical test.
Sampling
We can model the commenting behavior very simply with a Bernoulli likelihood: each post in a time slot either has received at least one comment ($X = 1$) or it hasn’t ($X = 0$). From a sample of $n$ postings, count the $k$ that have received at least one comment and estimate the probability as

$$\hat{p} = \frac{k}{n}.$$

Better yet, we’ll consider the log-odds

$$\hat{\theta} = \log\frac{\hat{p}}{1 - \hat{p}} = \log\frac{k}{n - k},$$

which for large $n$ is an approximately unbiased, asymptotically normal estimator of the log-odds of the Bernoulli parameter $p$.
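As a minimal sketch of the estimator (the 0.5 continuity correction, which guards against $k = 0$ or $k = n$ blowing up the log, is my addition):

```python
import numpy as np

def log_odds(sample):
    """Estimate the Bernoulli parameter and its log-odds from a 0/1 sample."""
    n, k = len(sample), int(np.sum(sample))
    p_hat = k / n
    theta_hat = np.log((k + 0.5) / (n - k + 0.5))  # continuity-corrected
    return p_hat, theta_hat
```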
Bootstrap Simulation
Through simulation, we’ll construct a t-distribution for each time slot, centered at the observed log-odds estimator, with variance determined through bootstrapping.
We define a bootstrap sampling routine for our DataFrame queries, extending the multiprocess bootstrapping module:
```python
from df_query_stats import *
import numpy as np

def do_bootstrap(_X, words):
    """
    Construct bootstrap samples of commented likelihoods for each
    category (hours of day / 10 segments).
    """
    # bootstrapped statistics go in word_stats, baseline_stats
    word_stats, baseline_stats = [], []
    # determine confidence/dof of the t test by counting rows in the
    # dataset; these go into word_sizes, baseline_sizes
    word_sizes, baseline_sizes = [], []
    # size of the bootstrap
    _IN, _OUT = 20, 10
    for u in words:
        bootstrap_estimates_per_t, word_sizes_u = [], []
        # ten categories; run the bootstrap outer loop for each
        for j in range(10):
            DF_handler = df_query_stats(
                _X, "title.str.contains('" + u + "') & timebin == " + str(j))
            estimates, sizes = DF_handler.bootstrap_mean_var(
                _OUT, _IN, "num_comments>0", nthr=6)
            bootstrap_estimates_per_t.append(estimates)
            word_sizes_u.append(sizes)
            del DF_handler
        word_sizes.append(word_sizes_u)
        # the baseline bootstrap simulation (mean across all categories)
        DF_handler = df_query_stats(_X, "title.str.contains('" + u + "')")
        _foo = []
        estimates, sizes = DF_handler.bootstrap_mean_var(
            _OUT, _IN, "num_comments>0", nthr=6)
        _foo.append(estimates)
        baseline_sizes.append(sizes)
        del DF_handler
        # collapse each bin's bootstrap replicates into (mean, variance)
        Mean_Var_per_t = np.array(
            [np.array(list(zip(*bootstrap_estimates_per_t[i]))).mean(axis=1)
             for i in range(len(bootstrap_estimates_per_t))])
        Mean_Var_per_t_base = np.array(
            [np.array(list(zip(*_foo[i]))).mean(axis=1)
             for i in range(len(_foo))])
        word_stats.append(Mean_Var_per_t)
        baseline_stats.append(Mean_Var_per_t_base)
    return Mean_Var_per_t, word_stats, baseline_stats

# where we've previously defined
# _X = pd.read_csv("../archive/HN_posts_year_to_Sep_26_2016.csv")
# -> ... -> cleaning :-> _X
Mean_Var_per_t, word_stats, baseline_stats = do_bootstrap(_X, words)
```
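If you don’t have the df_query_stats module handy, the core of bootstrap_mean_var is plain resampling with replacement; here is a self-contained stand-in (the interpretation of _OUT as the number of resamples and _IN as the resample size is my assumption):

```python
import numpy as np

def bootstrap_mean_var_sketch(values, n_outer, n_inner, rng=None):
    """Draw n_outer resamples of size n_inner with replacement from a
    0/1 array and return the (mean, variance) of the resampled means."""
    rng = np.random.default_rng() if rng is None else rng
    means = [rng.choice(values, size=n_inner, replace=True).mean()
             for _ in range(n_outer)]
    return float(np.mean(means)), float(np.var(means, ddof=1))

# e.g. for one word/time-bin slice:
# mask = _X["title"].str.contains("programming") & (_X["timebin"] == 3)
# vals = (_X.loc[mask, "num_comments"] > 0).to_numpy()
# m, v = bootstrap_mean_var_sketch(vals, n_outer=10, n_inner=20)
```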
From our earlier dimensionality reduction, let’s pick a few words to look at:
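For instance, the four words that will be compared at the end of this post:

```python
words = ["programming", "iphone", "lessons", "law"]
```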
With these words in mind, we execute the sampling routine. It will also help to compute effect sizes and the unnormalized displacement-from-mean for each time point, in case we later want to train a decision-making algorithm for the best posting time.
After bootstrapping is complete …
```python
DF_handler = df_query_stats(
    _X, "title.str.contains('" + u + "') & timebin == " + str(j))
estimates, sizes = DF_handler.bootstrap_mean_var(
    _OUT, _IN, "num_comments>0", nthr=6)
bootstrap_estimates_per_t.append(estimates)
word_sizes_u.append(sizes)
```
… we can finally compare the words by their t-distributions. I picked the words somewhat arbitrarily, so there are surely better illustrations of time preference to be found in the dataset. Nonetheless, the t curves give an inviting cursory look at how posting time could be chosen with a keyword in mind:
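For reference, a sketch of how such curves can be drawn from the bootstrap output: center each t density at the bootstrapped mean, scale it by the bootstrap standard error, and take the degrees of freedom from the slice’s row count (the indexing into word_stats and word_sizes is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_t_curve(mean, var, n, label):
    """t density centered at the bootstrap mean with bootstrap s.e."""
    se = np.sqrt(var)
    x = np.linspace(mean - 4 * se, mean + 4 * se, 200)
    plt.plot(x, stats.t.pdf(x, df=n - 1, loc=mean, scale=se), label=label)

# e.g. one curve per time bin for the first word:
# for j, (m, v) in enumerate(word_stats[0]):
#     plot_t_curve(m, v, n=word_sizes[0][j], label=f"bin {j}")
# plt.legend(); plt.show()
```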
Deciding what time to post
We truly don’t have much information yet, but we’ll nonetheless attempt a simple decision algorithm: given my post with keyword “X” in the title, when do I post it?
So far we’ve computed
- t-distribution for commentability in each time slot for each keyword
- effect size and difference in likelihood ratios
Any machine-learning algorithm using this information alone would be unwieldy at best, but we’ll try it, and at the least learn how our classification error responds to inaccuracies in our estimators of effect size and likelihood ratios.
Consider a simple filter. With the estimated effect $e_t$ and displacement-from-mean $d_t$ for a keyword at each time point $t$, let

$$s_t = w_e\, e_t + w_d\, d_t, \qquad t^* = \operatorname*{arg\,max}_t\, s_t,$$

where $\operatorname{arg\,max}$ provides the index of the time point corresponding to the greatest value over all coordinates of $s$. That is, we compute a weighted average of effect $e_t$ and displacement $d_t$; since these are estimators of likelihood, the most likely decision is obtained by taking the greatest value in $s$.
By assigning more weight to the effect, we’re choosing to favor the time points that we’ve simply observed more frequently. Likewise, assigning more weight to the displacement favors the time points whose estimated likelihood stands out most from the keyword’s mean.
This can be thought of as updating a Bayesian prior: if there is some unknown distribution of best posting/commenting times, then our most likely time choice is the one predicted by our confidence intervals, weighted according to how often we’d observed each time point. And if the unknown distribution deviates from what we expect, its deviation is, after all, measured against the center/mean we’ve estimated.
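As a minimal sketch of the filter (array and weight names are illustrative; effect and displacement are the per-bin estimates computed during bootstrapping):

```python
import numpy as np

def best_time(effect, displacement, w_e=0.7, w_d=0.3):
    """Score each time bin by a weighted average of effect size and
    displacement-from-mean, then return the argmax bin index."""
    s = w_e * np.asarray(effect) + w_d * np.asarray(displacement)
    return int(np.argmax(s))
```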
Simulation
Taking a mixture of our observed best-comment-time distributions for the four words “programming,” “iphone,” “lessons,” and “law,” we find, as expected, that the error for a simulated “unknown distribution” fares better when more weight is given to effect size. This simulates the bias-variance tradeoff for a single observation set and a single unknown probability distribution.
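One way to set such a check up in code; the construction of the unknown distribution and every name here are my own simplification:

```python
import numpy as np

def simulate_error(effect, displacement, best_time_pmf, w_e,
                   n_trials=2000, rng=None):
    """effect, displacement: length-10 arrays for one keyword.
    best_time_pmf: a simulated 'unknown' distribution over the ten time
    bins (e.g. a mixture of the observed per-word distributions; must
    sum to 1). Returns how often the filter's pick misses a draw."""
    rng = np.random.default_rng() if rng is None else rng
    s = w_e * np.asarray(effect) + (1.0 - w_e) * np.asarray(displacement)
    pick = int(np.argmax(s))
    draws = rng.choice(10, size=n_trials, p=best_time_pmf)
    return float(np.mean(draws != pick))
```

Sweeping w_e over [0, 1] and plotting the returned error traces out the tradeoff described above.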
Identifying the mode of each word’s time distribution is equivalent to maximizing the likelihood of obtaining a comment.
[Figure: classification error as a function of the filter weights. Left: x-axis = weight given to effect, y-axis = weight given to displacement. Right: x-axis = weight given to effect with the displacement weight fixed, y-axis = error.]