The Hacker News Posts dataset from Kaggle contains an entry for each post made on Hacker News around the year 2016. Hacker News is a social media site where, much like on Reddit, users share URLs, write posts, upvote, and leave comments.
In this EDA we examine the words used in post titles, learning embeddings that place similar words near each other by training a neural network with a special loss function.
The words “sunny” and “weather” seem more likely to cooccur in a post title than “sunny” and “England,” and we will capture this statistical behavior in our word embedding model. To do this, we build a cooccurrence matrix of the 500 most frequently used title words: a symmetric 500×500 matrix of pairwise cooccurrence counts (closely related to, though not the same as, the words’ correlation matrix).
$C$ is a symmetric matrix, and we compute it from the matrix $A$ of word counts over all post titles:
Each row and column of $C$ is identified with a word.
Each entry $C_{ij}$ contains the number of times the words identified by row $i$ and column $j$ appear together in a post title.
So if there are $N$ post titles and 500 words, we produce the $N \times 500$ matrix $A$, and the resulting cooccurrence matrix is proportional to
$$C = A^\top A,$$
with $A_{ij}$ the number of times the $j$'th word appears in the $i$'th post, and $C_{ij}$ the number of times the $i$'th word appears in the same post title as the $j$'th word.
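As a toy illustration with two made-up titles, “sunny weather” and “sunny England,” and the three-word vocabulary (sunny, weather, England), we would get

$$
A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix},
\qquad
A^\top A = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix},
$$

so “sunny” cooccurs once each with “weather” and “England,” “weather” and “England” never share a title, and the diagonal counts each word’s total appearances.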
We do this with scikit-learn's CountVectorizer, SciPy's sparse matrix library, and TensorFlow:
import numpy as np
import tensorflow as tf
from scipy.sparse import coo_matrix
from sklearn.feature_extraction.text import CountVectorizer
# ... Imap = the CountVectorizer vocabulary: the 500 most frequent title words
countme = CountVectorizer(vocabulary=Imap)
# A: one row per post title, one column per word, entries are word counts
TransformedWords = countme.fit_transform(_X.title).astype(np.int32)
TransformedWords = TransformedWords.tocoo()
TransformedWords_0 = tf.sparse.SparseTensor(
    indices=np.vstack([TransformedWords.row, TransformedWords.col]).T,
    values=tf.constant(TransformedWords.data),
    dense_shape=[279256, 500])  # (number of post titles, vocabulary size)
TransformedWords_0T = tf.sparse.transpose(TransformedWords_0)
# C = A^T A: the 500x500 cooccurrence matrix
ATA = tf.sparse.sparse_dense_matmul(TransformedWords_0T, tf.sparse.to_dense(TransformedWords_0))
ATA = np.log1p(ATA).astype(np.float32)  # log-scale the counts
coo = coo_matrix(ATA)
Neural network model
Train data
With 500 words and a 500×500 cooccurrence matrix, there are at most 500×500 cooccurrence counts that we would like to model. We'll construct a neural network that takes each cooccurrence as a data point and produces a 2d vector representing each word. The network's inputs are one-hot word vectors: 500-dimensional unit vectors with a one in the i'th position for the i'th word.
Given the cooccurrence information between any two words, we'll train the network on a loss function that pulls the embeddings of frequently cooccurring words together and pushes the others apart. We obtain the training rows train_x by concatenating, for each nonzero entry of the cooccurrence sparse matrix (taken in the order they appear), the two one-hot word vectors and the log-scaled cooccurrence count:
# Each training row: [one-hot(word i), one-hot(word j), log cooccurrence count]
Z = list(zip(self.coo.row, self.coo.col, self.coo.data))
train_x = np.array(
    [np.hstack([self.getk(z[0]), self.getk(z[1]), z[2]]) for z in Z]
)
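The loss function itself isn't reproduced here, so as a rough, non-authoritative sketch only: a GloVe-style stand-in that regresses the dot product of the two 2d word embeddings onto the log-scaled cooccurrence count would give frequently cooccurring words similar vectors. The layer shapes, optimizer, and training call below are my assumptions, not the original model.

import tensorflow as tf

VOCAB, DIM = 500, 2                       # vocabulary size and embedding dimension

# Shared linear layer: applied to a one-hot vector it returns that word's 2d embedding.
embed = tf.keras.layers.Dense(DIM, use_bias=False)

x_i = tf.keras.Input(shape=(VOCAB,))      # one-hot vector of the first word
x_j = tf.keras.Input(shape=(VOCAB,))      # one-hot vector of the second word

# Predict the log cooccurrence count from the dot product of the two embeddings.
dot = tf.keras.layers.Dot(axes=1)([embed(x_i), embed(x_j)])
model = tf.keras.Model([x_i, x_j], dot)
model.compile(optimizer="adam", loss="mse")

# train_x rows are [one-hot i | one-hot j | log count]; split them back apart:
# model.fit([train_x[:, :VOCAB], train_x[:, VOCAB:2 * VOCAB]], train_x[:, -1:],
#           epochs=50, batch_size=256)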
2-dimensional word embedding
As the model is trained, it produces a 2d embedding for each word based on cooccurrences. By our choice of loss function, words that frequently occur in the same post title are placed close together, while words that don't occur together are pushed toward opposite ends of the 2d space they occupy.
After training, the neural network gives embeddings like the following:
If an embedding is of any value, it should be reproducible: other embeddings we produce should be fairly similar, with similar words falling into the same clusters in each embedding we produce.
To verify this, we can try the following method:
Obtain an embedding and clustering. This will be the baseline.
Fix one word {W} per cluster, assuming that each describes the others in its group to some extent.
Randomly generate new embeddings and clusters in the same fashion as the baseline. For each new embedding:
Locate the words {W} in the new embedding and, for each, take the new cluster containing it.
Compute the fraction F of that new cluster's words that also belonged to the corresponding baseline cluster. If this fraction is high, then we've managed to consistently put words into similar categories.
For three clusters of equal size, we should expect an arbitrary new set of clusters to have fraction F=1/3 of words that fall into the same clusters. If we were to assign greater weight to meaningful words (business, economy as opposed to noise like they, or), we should observe a greater fraction.
In the clustering that we found, let’s fix the words programming, business, and government as baseline centers of their respective clusters.
from association_model import *

# Train 200 embedding models; each is later clustered with KMeans.
models = []
for i in range(200):
    AM = association_model()
    AM.build_run()      # build and train the neural network
    AM.predict()        # produce the 2d word embedding
    B1 = AM.cluster()   # cluster the embedding into word groups
    models.append(AM)
In the above code we produce 200 neural network models; afterwards, we select the first 30 for which KMeans returns clusters of roughly equal size.
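The balance criterion itself isn't shown; as a sketch under my own assumptions (the roughly_balanced helper, the 20% tolerance, and cluster() returning three word lists are all hypothetical), the selection could look like:

# Hypothetical balance check: keep models whose three KMeans groups have
# roughly equal sizes (tolerance chosen arbitrarily for illustration).
def roughly_balanced(clusters, tolerance=0.2):
    sizes = [len(c) for c in clusters]
    return max(sizes) <= (1 + tolerance) * min(sizes)

selected = [AM for AM in models if roughly_balanced(AM.cluster())][:30]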
Implementing the outlined scoring routine,
def score_cluster(base, target):
    """For each of the words 'programming', 'business', 'government', find the
    cluster containing it in the baseline and in the target clustering, and
    return the fraction of the target cluster's words that also appear in the
    corresponding baseline cluster."""
    words = ['programming', 'business', 'government']
    scores = []
    for w in words:
        base_cluster = next(c for c in base if w in c)      # baseline cluster containing w
        target_cluster = next(c for c in target if w in c)  # comparison cluster containing w
        overlap = sum(g in base_cluster for g in target_cluster)
        scores.append(overlap / len(target_cluster))
    return np.array(scores)
import pickle as pk

# baseline clustering
ONE = pk.load(open('appendix/appendix.1661655443.8803668.pk', 'rb'))
# comparison
TWOs = [pk.load(open('appendix/appendix.' + f + '.pk', 'rb')) \
for f in ['1661647722.3773346', '1661647632.295023', \
'1661648112.5537024', '1661648233.6341639', \
'1661648112.5537024', '1661648233.6341639', \
'1661648294.2745764', '1661648354.2067013', \
'1661648414.5664623', '1661648474.9807565', \
'1661648505.3888621', '1661647209.181138', \
'1661648597.4234366', '1661648657.3324254', \
'1661648716.7604432' , '1661648687.7048774',
'1661648746.0174868', '1661648566.415309', \
'1661652967.4215994', '1661653870.3439283', \
'1661654950.4271033', '1661655066.0586948', \
'1661655095.67823', '1661655124.6171143', \
'1661655474.1003916', '1661655621.7354438', \
'1661655740.991688', '1661655800.070665', \
'1661657652.6165714', '1661657711.1495018']]
we find that the new random clusters are on average 33.4% similar to the baseline, which is no better than the 1/3 we would expect from chance alone.
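For reference, that average can be computed from the objects above with a short loop, assuming ONE and every entry of TWOs have the three-word-list format that score_cluster expects:

# Average the three per-word overlap fractions across all comparison clusterings.
scores = np.array([score_cluster(ONE, TWO) for TWO in TWOs])
print(scores.mean())    # reported above as roughly 33.4%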
With further modeling of word importance and/or elimination of noise words like and, or, and the, however, we may be able to assign more meaningful weights to keywords, and our clustering may yet prove meaningful. While I'll leave further cleaning and weighting of keywords for another day, you can check out the new word groupings that were produced. If some of the patterns here make sense to the human eye, our next task would be to teach them to a machine:
This concludes our exploration of words appearing in the Hacker News dataset, EDA I. In Hacker News EDA II, we implement a bootstrapping simulator to examine the impact of posting time on the number of comments a post receives.