Training a Neural Network for Word Separation

The Hacker News Posts dataset from Kaggle contains an entry for each post made on Hacker News in the year leading up to September 26, 2016. Hacker News is a social media site where, much like on Reddit, users share URLs, write posts, upvote, and leave comments.

In this EDA we examine the words used in post titles, identifying embeddings of similar words by training a neural network with a special loss function.

Here’s what the dataset looks like:

(embedded notebook head.ipynb: a preview of the first rows of the dataset)

Hacker News: EDA I

Extracting word features

The words “sunny” and “weather” seem more likely to co-occur in a post title than “sunny” and “England,” and we will capture this statistical behavior in our word embedding model. To do this, we build a co-occurrence matrix K over the 500 most frequently used title words, a 500×500 matrix that plays a role analogous to the words’ correlation matrix.

K = A^TA is a symmetric matrix, and we compute it after compiling the document-term matrix A of all post titles:

  • Each row of A corresponds to a post title, and each column to one of the 500 words.
  • Each entry of A counts how many times the column’s word appears in the row’s post title.

So if there are N post titles and 500 words, we produce the matrix

    \[A^TA = K\]

    \[= \begin{bmatrix} n_{1,1}  & n_{1,2} & \cdots & n_{1,500} \\  n_{2,1}  & n_{2,2} & \cdots & n_{2,500} \\ & & \vdots & \\ n_{N,1}  & n_{N,2} & \cdots & n_{N,500}\end{bmatrix} ^T \begin{bmatrix} n_{1,1}  & n_{1,2} & \cdots & n_{1,500} \\  n_{2,1}  & n_{2,2} & \cdots & n_{2,500} \\ & & \vdots & \\ n_{N,1}  & n_{N,2} & \cdots & n_{N,500}\end{bmatrix}\]

where the resulting matrix K \in \mathbb{R}^{500\times 500} is proportional to

    \[\begin{bmatrix}n'_{1,1} & \cdots & n'_{1,500}\\ & \vdots & \\n'_{500,1} & \cdots & n'_{500,500}\\\end{bmatrix}\]

with n_{i,j},\; 1\leq i\leq N,\; 1\leq j \leq 500, the number of times the j’th word appears in the i’th post title, and n'_{i,j},\; 1\leq i,j\leq 500, the number of times the i’th word appears in the same post title as the j’th word.
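
To make the construction concrete, here is a toy sketch with two made-up titles over a three-word vocabulary; the titles are purely illustrative and not drawn from the dataset.

import numpy as np

# toy vocabulary: ["sunny", "weather", "england"]
# title 1: "sunny weather"        -> counts [1, 1, 0]
# title 2: "sunny sunny england"  -> counts [2, 0, 1]
A = np.array([[1, 1, 0],
              [2, 0, 1]])

K = A.T @ A
# K[i, j] accumulates how often word i and word j appear in the same title:
# [[5, 1, 2],
#  [1, 1, 0],
#  [2, 0, 1]]
print(K)

The diagonal counts each word against itself, and off-diagonal entries weight repeated occurrences within a title, which is why K above is described only as proportional to the co-occurrence counts n'.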

This is one way of obtaining an inter-dependence embedding of natural language, and it is closely related to the approach used by Stanford’s GloVe word-representation learning algorithm, which also starts from a word co-occurrence matrix.

(figure: word co-occurrence matrix A^TA)

We do this with scikit-learn’s CountVectorizer, SciPy’s sparse matrix library, and TensorFlow:

import numpy as np
import tensorflow as tf
from scipy.sparse import coo_matrix
from sklearn.feature_extraction.text import CountVectorizer

# ... Imap holds the vocabulary of the 500 most frequent title words
countme = CountVectorizer(vocabulary=Imap)

# document-term matrix A: one row per post title, one column per word
TransformedWords = countme.fit_transform(_X.title).astype(np.int32)
TransformedWords = TransformedWords.tocoo()
TransformedWords_0 = tf.sparse.SparseTensor(
    indices=np.vstack([TransformedWords.row, TransformedWords.col]).T.reshape(-1, 2),
    values=tf.constant(TransformedWords.data),
    dense_shape=[279256, 500])
TransformedWords_0T = tf.sparse.transpose(TransformedWords_0)

# K = A^T A, followed by a log(1 + x) squash to tame the largest counts
ATA = tf.sparse.sparse_dense_matmul(TransformedWords_0T, tf.sparse.to_dense(TransformedWords_0))
ATA = np.log1p(ATA)
ATA = ATA.astype(np.float32)
coo = coo_matrix(ATA)

Neural network model

Train data

With 500 words and a 500×500 co-occurrence matrix, there are at most 500×500 co-occurrences that we would like to model. We’ll construct a neural network model that takes each co-occurrence as a data point and outputs a 2-d vector representing each word. We feed the network words encoded as 500-dimensional one-hot unit vectors, (0,0,\ldots, 1, \ldots, 0) with a one in the i’th position for the i’th word.
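
A small helper produces these one-hot encodings; it is the same getk helper that appears in the full class at the end of the post.

import numpy as np

def getk(k):
    """ return the 500-dimensional one-hot vector with a 1 in position k """
    return (np.arange(500) == k).astype(np.float64)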

Given the co-occurrence information between any two words w_x, w_y, we’ll train the network on the loss function \phi,

def embed_loss_plain_association(x, rho):
    return tf.keras.backend.sum(
        tf.keras.backend.pow((x[0] - x[1]) * rho, 2)
    )

    \[\phi_{\rho}(\mathbf x, \mathbf y) = \left\| \left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} y_1  \\ y_2 \end{bmatrix}\right)  \rho \right\|^2 = \sum_{i=1,2} \left( x_i - y_i \right)^2  \rho^2\]

where \rho is the co-occurrence between w_x, w_y, and \mathbf x, \mathbf y are the transformed word encodings produced by the neural network, which maps \mathbb{R}^{500}\rightarrow \mathbb{R}^2.

Define a Keras functional model,

D = 10**-10   # dropout rate
R = 10**-5    # l2 regularization strength

# input: [one-hot word x | one-hot word y | co-occurrence], 500 + 500 + 1 = 1001 values
inputs = tf.keras.layers.Input(shape=(1001,))
x = inputs[:, :-1]                                  # drop the appended co-occurrence
x1 = tf.keras.layers.Reshape((500,))(x[:, :500])    # first word
x2 = tf.keras.layers.Reshape((500,))(x[:, 500:])    # second word

# the same stack of Dense layers is shared by both words
D1 = tf.keras.layers.Dense(250, kernel_regularizer=tf.keras.regularizers.l2(R), activity_regularizer=tf.keras.regularizers.l2(R))
x1 = D1(x1); x2 = D1(x2)

D1d = tf.keras.layers.Dropout(D)
x1 = D1d(x1); x2 = D1d(x2)

D2 = tf.keras.layers.Dense(120, kernel_regularizer=tf.keras.regularizers.l2(R), activity_regularizer=tf.keras.regularizers.l2(R))
D2d = tf.keras.layers.Dropout(D)
x1 = D2(x1); x1 = D2d(x1)
x2 = D2(x2); x2 = D2d(x2)

D3 = tf.keras.layers.Dense(60, kernel_regularizer=tf.keras.regularizers.l2(R), activity_regularizer=tf.keras.regularizers.l2(R))
D3d = tf.keras.layers.Dropout(D)
x1 = D3(x1); x1 = D3d(x1)
x2 = D3(x2); x2 = D3d(x2)

# final 2-unit layer: the 2-d embedding of each word
D4 = tf.keras.layers.Dense(2, kernel_regularizer=tf.keras.regularizers.l2(R), activity_regularizer=tf.keras.regularizers.l2(R))
x1 = D4(x1); x2 = D4(x2)

# stack the two embeddings side by side as the model output
R1 = tf.keras.layers.Reshape((2, 1))
x1 = R1(x1); x2 = R1(x2)
y = tf.keras.layers.Concatenate(axis=-1)([x1, x2])

Model = tf.keras.models.Model(inputs=inputs, outputs=y)
Model.compile(loss=embed_loss_plain_association,
              optimizer=tf.keras.optimizers.Adam(
                  tf.keras.optimizers.schedules.ExponentialDecay(1., 50, .5, staircase=False)))
Model.fit(train_x.reshape(-1, 1001, 1), coo.data.reshape(-1, 1), epochs=30, batch_size=4096)

obtaining the training inputs train_x by concatenating, for each entry of the co-occurrence sparse matrix, the two words’ one-hot vectors together with the co-occurrence value itself (giving the 1001 input values), in the same order as the entries appear:

Z = list(zip(coo.row, coo.col, coo.data))
train_x = np.array(
    [np.hstack([getk(z[0]), getk(z[1]), z[2]]) for z in Z]
)

2-dimensional word embedding

As the model is trained, it produces a 2-d embedding for each word based on co-occurrences. By our choice of loss function \phi:

words that occur frequently in the same post title are placed close together, and those that rarely occur together are pushed toward opposite ends of the 2-d space they occupy.
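
Once trained, each word’s embedding is read back out by feeding the network that word’s one-hot vector with the remaining 501 input values zeroed. This is a condensed version of the predict() step from the full class below, with getk and Model as defined above.

# one row per word: its one-hot vector, padded with zeros for the second word and the co-occurrence slot
retrieve_x = np.array([np.hstack([getk(k), np.zeros(501)]) for k in range(500)])

YY = Model.predict(retrieve_x.reshape(-1, 1001, 1))   # shape (500, 2, 2)
embedding = YY[:, :, 0]                               # 2-d coordinates of each of the 500 words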

The resulting embeddings look like the following:

(figure: scatter plot of the 2-d word embedding)

Here is what some other training runs look like:

(figure: example word separations in training a neural network)

With 2-d word representations, it is much easier to identify clusters of words by applying KMeans.
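
Here is a condensed sketch of the clustering step; it mirrors the cluster() method in the full code at the end of the post, using the embedding array YY computed above.

from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

# cluster the 500 words in the 2-d embedding space
KM = KMeans(n_clusters=3)
labels = KM.fit_predict(YY[:, :, 0])

# scatter the embedding, colored by cluster label
plt.figure(figsize=(16, 10))
plt.scatter(YY[:, 0, 0], YY[:, 1, 0], c=labels)
plt.show()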

(figure: KMeans clustering of the 2-d word embedding)

It may be helpful to shade by number of occurrences as well:

You can try looking for these words in the above charts:

(embedded notebook: list of example words to look for)

How reproducible are these embeddings?

If an embedding is of any value, it should be reproducible: other embeddings we produce in the same way should be fairly similar. In particular, similar words should consistently fall into the same clusters in each of the embeddings we produce.

To verify this, we can try the following method:

  • Obtain an embedding and clustering. This will be the baseline.
  • Fix one word per cluster, giving a set of words {W}, and assume that each word describes the others in its group to some extent.
  • Randomly generate new embeddings and clusterings in the same fashion as the baseline. For each new embedding:
    • Locate each word of {W} in the new embedding and collect the words of the new cluster it falls in.
    • Take the fraction F of these words that also belonged to the corresponding baseline cluster. If this fraction is high, then we’ve managed to consistently put words into similar categories.

For three clusters of equal size, we should expect an arbitrary new set of clusters to have a fraction F = 1/3 of words that fall into the same clusters. If we were to assign greater weight to meaningful words (business, economy as opposed to noise like they, or), we should observe a greater fraction.
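
As a quick sanity check on that 1/3 figure, here is a small illustrative simulation, not part of the pipeline: assign 500 words to three clusters at random twice, anchor on one word, and measure the fraction F described above.

import numpy as np

rng = np.random.default_rng(0)
fracs = []
for _ in range(1000):
    base = rng.integers(0, 3, size=500)   # baseline cluster of each word
    new = rng.integers(0, 3, size=500)    # an unrelated random clustering
    w = 0                                 # index of the fixed anchor word
    in_new_cluster = np.where(new == new[w])[0]
    # fraction of the anchor's new cluster that shares its baseline cluster
    fracs.append(np.mean(base[in_new_cluster] == base[w]))
print(np.mean(fracs))   # approximately 1/3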


In the clustering that we found, let’s fix the words programming, business, and government as baseline centers of their respective clusters.

from association_model import *

models = []
for i in range(200):
    AM = association_model()
    AM.build_run()
    AM.predict()
    AM.cluster()
    models.append(AM)

In the above code, we produce 200 more neural network models. Afterwards, we select the first 30 for which KMeans returns groups that are sufficiently equal in size.
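
One possible way to implement that selection is sketched below; cluster() returns the three cluster sizes, and the 100-word spread threshold here is an assumed value for illustration rather than the one used in the original run.

# cluster sizes for each model (this re-runs KMeans on the stored embeddings)
all_counts = [AM.cluster(show=False)[0] for AM in models]

# keep the first 30 models whose three clusters are roughly balanced in size
balanced = [AM for AM, counts in zip(models, all_counts)
            if max(counts) - min(counts) < 100][:30]   # assumed threshold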

Implementing the outlined scoring routine,

def score_cluster(base, target):
    """ compare a clustering against the baseline, anchored on
        the words programming, business, government """
    words = ['programming', 'business', 'government']
    scores = []
    for w in words:
        # index of the cluster containing w in the BASELINE and in the COMPARISON
        A0 = int(np.arange(3)[[w in k for k in base]])
        A1 = int(np.arange(3)[[w in k for k in target]])
        # fraction of the comparison cluster's words that also appear in the baseline cluster
        scores.append(sum(g in base[A0] for g in target[A1]) / len(target[A1]))
    return np.array(scores)

# baseline
ONE = pk.load(open('appendix/appendix.1661655443.8803668.pk', 'rb'))
# comparison
TWOs = [pk.load(open('appendix/appendix.' + f + '.pk', 'rb')) \
        for f in ['1661647722.3773346', '1661647632.295023', \
                  '1661648112.5537024', '1661648233.6341639', \
                 '1661648112.5537024', '1661648233.6341639', \
                 '1661648294.2745764', '1661648354.2067013', \
                 '1661648414.5664623', '1661648474.9807565', \
                  '1661648505.3888621', '1661647209.181138', \
                 '1661648597.4234366', '1661648657.3324254', \
                  '1661648716.7604432' , '1661648687.7048774',
                  '1661648746.0174868', '1661648566.415309', \
                 '1661652967.4215994', '1661653870.3439283', \
                  '1661654950.4271033', '1661655066.0586948', \
                  '1661655095.67823', '1661655124.6171143', \
                  '1661655474.1003916', '1661655621.7354438', \
                 '1661655740.991688', '1661655800.070665', \
                  '1661657652.6165714', '1661657711.1495018']]
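Averaging the scores over the comparison runs is then a one-liner; this sketch assumes the score_cluster function and the ONE/TWOs objects loaded above.

# average agreement with the baseline, over all comparison runs and all three anchor words
scores = np.array([score_cluster(ONE, TWO) for TWO in TWOs])
print(scores.mean())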

We find that the new clusterings are on average 33.4% similar to the baseline. Compared directly with the 1/3 expectation above, this is no better than raw luck.

With further modeling of word importance and/or elimination of noise words like “and,” “or,” and “the,” however, we may be able to assign more meaningful weights to keywords, and our clustering may turn out to be measurably useful after all. While I’ll leave further cleaning and weighting of keywords for another day, you can check out the new word groupings that were produced. If some of the patterns make sense to the human eye here, our next task would be to teach them to a machine:

(embedded notebook: the new word groupings)

This concludes our exploration of the words appearing in the Hacker News dataset, EDA I. In Hacker News EDA II, we implement a bootstrapping simulator to examine the impact of posting time on the number of comments received.

Full code on github:

"""
association_model() class-> module for word clusterings
"""

import re
import pickle as pk
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from scipy.sparse import coo_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.cluster import KMeans
from time import time

# association_model class
class association_model():
    """
    association_model()
    construct and train a word association model
    usage:
    AM = association_model()
    AM.build_run()
    AM.predict()
    AM.cluster()
    """
    def __init__(self):
        self.TransformedWords = None

    # getk, helper fct
    def getk(self,k):
        """ getk(k), helper function to produce a k-index unit vector in R1 """
        return (np.arange(500) == k).astype(np.float64)

    # create_ATA, generate the word co-occurrence (association) matrix
    def create_ATA(self, from_file=True):
        """ create_ATA, generate the cooccurrence matrix """
        # do some minor cleaning, ie. nan's
        _X = pd.read_csv("./workspace/archive/HN_posts_year_to_Sep_26_2016.csv")
        _X.title = _X.title.apply(lambda s: s.lower())
        _X = _X.dropna()
        all_times = list(_X['created_at'])
        # label each post by time of day (60*hour + minute), binned into 10 bins below
        Times = \
            [sum(np.array([60,1]) * np.array(
                [_.split('/') for _ in x.split(' ')][1][0].split(':'))\
                 .astype(np.float64))
             for x in all_times]
        Times=np.array(Times)
        binme = KBinsDiscretizer(10,encode="ordinal",strategy="uniform")
        binme.fit(Times.reshape(-1,1))
        Times_ = binme.transform(Times.reshape(-1,1))
        _X['timebin'] = Times_
        _X['log_comments'] = np.log1p(_X['num_comments'])

        # scan the titles 1000 entries at a time
        bigun = set()   # unique words seen so far
        listem = []     # every word occurrence, for counting
        for t in range(0, _X.shape[0], 1000):
            s = min(t + 1000, _X.shape[0])
            Q = _X.title.loc[_X.index[t:s]].str.lower().str.split()
            # keep alphabetic characters only
            l = []
            for q in Q:
                for b in q:
                    x = re.sub("[^a-zA-Z]", ' ', b)
                    l.extend(x.split())
            listem.extend(l)
            bigun.update(l)

        # get the counts
        wordcounts = {}.fromkeys(bigun,0)
        for l in listem:
            wordcounts[l]+=1
        wordcounts = \
            {k: v for k, v in sorted(wordcounts.items(),
                                     key=lambda item: item[1],reverse=True)}
        # top 500 words
        I100 = list(wordcounts.keys())[:500]
        self.Itup = [(i,wordcounts[i]) for i in I100]

        # vocabulary handed to CountVectorizer: the 500 most frequent title words
        # (CountVectorizer assigns the column indices itself)
        Imap = set(I100)
        # sklearn ---> counts per word
        self.countme = CountVectorizer(vocabulary=Imap)
        self.word_legend = self.countme.get_feature_names_out()
        self.ATA = 0

        TransformedWords = self.countme.fit_transform(_X.title)
        self.TransformedWords = TransformedWords.toarray().copy()
        if not from_file:
            """ enable the commented segment below to regenerate the ATA matrix """
            # TransformedWords = TransformedWords.tocoo()
            # TransformedWords = self.countme.fit_transform(_X.title)
            # TransformedWords_0 = \
            #     tf.sparse.SparseTensor(indices=np.vstack(
            #         [TransformedWords.row,TransformedWords.col]).T\
            #             .reshape(-1,2),
            #             values=tf.constant(TransformedWords.data),
            #             dense_shape=[279256,500])
            # TransformedWords_0T = tf.sparse.transpose(TransformedWords_0)
            # ATA = tf.sparse.sparse_dense_matmul(
            #     TransformedWords_0T, tf.sparse.to_dense(TransformedWords_0))
            # ATA = np.log1p(ATA)
            # ATA = ATA.astype(np.float32)
            """ pk.dump(ATA, open('500x500Association.pk', 'wb'))"""
            assert True
            with open("./workspace/clustering/500x500Association.pk", 'rb') as file:
                self.ATA = pk.load(file)
        else:
            with open("./workspace/clustering/500x500Association.pk", 'rb') as file:
                self.ATA = pk.load(file)
        # generate sparse representation
        self.coo = coo_matrix(self.ATA).astype(np.float32)

    def define_model(self):
        """
        define_model()
        define the neural network
        """
        # Dense layer Neural Network encoder
        D = 10**-10
        R = 10**-5
        inputs = tf.keras.layers.Input(shape=(1001,))
        x = inputs[:,:-1]
        x1 = tf.keras.layers.Reshape((500,))(x[:,:500])
        x2 = tf.keras.layers.Reshape((500,))(x[:,500:])

        D1 = tf.keras.layers.Dense(
            250, kernel_regularizer=tf.keras.regularizers.l2(R),
            activity_regularizer=tf.keras.regularizers.l2(R))
        x1 = D1(x1)
        x2 = D1(x2)

        D1d = tf.keras.layers.Dropout(D)
        x1 = D1d(x1)
        x2 = D1d(x2)

        D2 = tf.keras.layers.Dense(
            120, kernel_regularizer=tf.keras.regularizers.l2(R),
            activity_regularizer=tf.keras.regularizers.l2(R))
        D2d = tf.keras.layers.Dropout(D)
        x1 = D2(x1); x1 = D2d(x1)
        x2 = D2(x2); x2 = D2d(x2)

        D3 =  tf.keras.layers.Dense(
            60, kernel_regularizer=tf.keras.regularizers.l2(R),
            activity_regularizer=tf.keras.regularizers.l2(R))
        D3d = tf.keras.layers.Dropout(D)
        x1 = D3(x1); x1 = D3d(x1)
        x2 = D3(x2); x2 = D3d(x2)

        D4 = tf.keras.layers.Dense(
            2, kernel_regularizer=tf.keras.regularizers.l2(R),
            activity_regularizer=tf.keras.regularizers.l2(R))
        x1 = D4(x1)
        x2 = D4(x2)

        R1 = tf.keras.layers.Reshape((2,1))
        x1 = R1(x1); x2 = R1(x2)
        y = tf.keras.layers.Concatenate(axis=-1)([x1, x2])

        Model = tf.keras.models.Model(inputs = inputs, outputs= y)
        return Model

    def build_run(self, verbose=False):
        """
        build_run( verbose bool)
        build and run the neural network to produce embeddings (and graphics)
        """
        def embed_loss_plain_association(x,y):
            return tf.keras.backend.sum(
                tf.keras.backend.pow((x[0]-x[1]) * y,2))

        self.create_ATA()
        self.Model = self.define_model()
        self.Model.build(input_shape=(1001,))
        if verbose:
            self.Model.summary()

        self.Model.compile(loss=embed_loss_plain_association,
                           optimizer=tf.keras.optimizers.Adam(
                            tf.keras.optimizers.schedules.ExponentialDecay(
                                1.,50,.5,staircase=False)))

        # generate training data
        Z = list(zip(self.coo.row, self.coo.col, self.coo.data))
        train_x = np.array(
            [np.hstack([self.getk(z[0]), self.getk(z[1]), z[2]])\
            for z in Z])

        self.Model.fit(train_x.reshape(-1,1001,1), self.coo.data.reshape(-1,1),
                       epochs=30, batch_size=4096, verbose=verbose)

    def predict(self):
        """
        predict()
        generate predictions
        """
        # one row per word: its one-hot vector, padded with zeros for the rest of the input
        retrieve_x = np.array(
            [np.hstack([self.getk(z), np.zeros(501)]) for z in range(500)]
        )
        self.YY = self.Model.predict(retrieve_x.reshape(-1,1001,1))
        return self.YY

    def cluster(self,show=True):
        """
        cluster( bool show)
        produce clusterings
        """
        # visualize_test predictions
        KM = KMeans(n_clusters=3)
        YY = self.YY
        Itup = self.Itup
        labels = KM.fit_predict(YY[:,:,0])

        score = KM.score(YY[:,:,0])
        STAMP  = str(time())
        print(STAMP)
        if show:
            _ = np.log(np.array(Itup[:])[:,1].astype(np.float32))
            _ = _/np.max(_); _ = _**1.5
            plt.figure(figsize=(16,10))
            plt.scatter(YY[:,0,0], YY[:,1,0],c=labels,alpha=_**1.3)
            plt.colorbar(ticks=np.arange(4))
            plt.show()
        labeled500Words = pd.DataFrame(np.squeeze([self.word_legend,labels]).T,
                                       columns=['word', 'label'])

        labeled500Words = labeled500Words.groupby('label').apply(np.array)

        appendix_of_words =\
            [labeled500Words[i][:,0] for i in range(len(labeled500Words))]

        with open("appendix/appendix." + STAMP + ".pk",'wb') as file:
            pk.dump(appendix_of_words,file)

        log_word_count_in_corpus = \
            np.log(np.array(Itup[:])[:,1].astype(np.float32))

        DF = pd.DataFrame(np.vstack([YY[:,0,0],YY[:,1,0], self.word_legend,
            labels, log_word_count_in_corpus]).T,
            columns=['x','y','word', 'cluster', 'log_count'])

        with open('appendix/predicted_DF.' + STAMP + '.pk', 'wb') as file:
            pk.dump(DF, file)

        if show:
            print(appendix_of_words)

        counts = [len(k) for k in appendix_of_words]
        print(counts)

        return counts, score, appendix_of_words

A co-occurrence (association) matrix like this one can be generated from large datasets using structured calls to a matrix operation library. Check out my simple implementation above, which uses a TensorFlow sparse matmul call.


By Alexander Wei

BA, MS Mathematics, Tufts University
