Saturday, November 5, 2022

KeplerMapper & NLP examples

 (.env) boris@UbuntuLTS:~/TDANALYSIS$ cat Newsgroups20.py

import kmapper as km
from kmapper import Cover, jupyter
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.preprocessing import MinMaxScaler

# Load the raw training posts, their integer labels and the label names.
newsgroups = fetch_20newsgroups(subset='train')
X, y, target_names = np.array(newsgroups.data), np.array(newsgroups.target), np.array(newsgroups.target_names)
print("SAMPLE",X[0])
print("SHAPE",X.shape)
print("TARGET",target_names[y[0]])

mapper = km.KeplerMapper(verbose=2)

# Lens pipeline: char n-gram TF-IDF -> 100 SVD components -> 2D Isomap,
# with only the last step rescaled to [0, 1].
projected_X = mapper.fit_transform(X,
    projection=[TfidfVectorizer(analyzer="char",
                                ngram_range=(1,6),
                                max_df=0.83,
                                min_df=0.05),
                TruncatedSVD(n_components=100,
                             random_state=1729),
                Isomap(n_components=2,
                       n_jobs=-1)],
    scaler=[None, None, MinMaxScaler()])

print("SHAPE",projected_X.shape)

# Cluster inside each hypercube of the cover.
# Note: scikit-learn >= 1.2 renames affinity= to metric=.
graph = mapper.map(projected_X,
                   X=None,
                   clusterer=AgglomerativeClustering(n_clusters=3,
                                                     linkage="complete",
                                                     affinity="cosine"),
                   cover=Cover(perc_overlap=0.33))

# Word-level TF-IDF used only to label nodes with readable features.
vec = TfidfVectorizer(analyzer="word",
                      strip_accents="unicode",
                      stop_words="english",
                      ngram_range=(1,3),
                      max_df=0.97,
                      min_df=0.02)

interpretable_inverse_X = vec.fit_transform(X).toarray()
# get_feature_names() is deprecated since scikit-learn 1.0 and removed in 1.2;
# newer versions need get_feature_names_out() here.
interpretable_inverse_X_names = vec.get_feature_names()

print("SHAPE", interpretable_inverse_X.shape)
print("FEATURE NAMES SAMPLE", interpretable_inverse_X_names[:400])

_ = mapper.visualize(graph,
                        X=interpretable_inverse_X,
                        X_names=interpretable_inverse_X_names,
                        path_html="output/newsgroups20.html",
                        lens=projected_X,
                        lens_names=["ISOMAP1", "ISOMAP2"],
                        title="Newsgroups20: Latent Semantic Char-gram Analysis with Isometric Embedding",
                        custom_tooltips=np.array([target_names[ys] for ys in y]),
                        color_values=y,
                        color_function_name='target')

(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ python3 Newsgroups20.py

SAMPLE From: lerxst@wam.umd.edu (where's my thing)

Subject: WHAT car is this!?

Nntp-Posting-Host: rac3.wam.umd.edu

Organization: University of Maryland, College Park

Lines: 15


 I was wondering if anyone out there could enlighten me on this car I saw

the other day. It was a 2-door sports car, looked to be from the late 60s/

early 70s. It was called a Bricklin. The doors were really small. In addition,

the front bumper was separate from the rest of the body. This is 

all I know. If anyone can tellme a model name, engine specs, years

of production, where this car is made, history, or whatever info you

have on this funky looking car, please e-mail.


Thanks,

- IL

   ---- brought to you by your neighborhood Lerxst ----

SHAPE (11314,)

TARGET rec.autos

KeplerMapper(verbose=2)

..Composing projection pipeline of length 3:

Projections: TfidfVectorizer(analyzer='char', max_df=0.83, min_df=0.05, ngram_range=(1, 6))

TruncatedSVD(n_components=100, random_state=1729)

Isomap(n_jobs=-1)

Distance matrices: False

False

False

Scalers: None

None

MinMaxScaler()

..Projecting on data shaped (11314,)


..Projecting data using: 

TfidfVectorizer(analyzer='char', max_df=0.83, min_df=0.05, ngram_range=(1, 6))



..Created projection shaped (11314, 13967)

..Projecting on data shaped (11314, 13967)


..Projecting data using: 

TruncatedSVD(n_components=100, random_state=1729)


..Projecting on data shaped (11314, 100)


..Projecting data using: 

Isomap(n_jobs=-1)


/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/sklearn/manifold/_isomap.py:348: UserWarning: The number of connected components of the neighbors graph is 2 > 1. Completing the graph to fit Isomap might be slow. Increase the number of neighbors to avoid this issue.

  self._fit_transform(X)

/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/scipy/sparse/_index.py:103: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.

  self._set_intXint(row, col, x.flat[0])


..Scaling with: MinMaxScaler()


SHAPE (11314, 2)

Mapping on data shaped (11314, 2) using lens shaped (11314, 2)


Minimal points in hypercube before clustering: 3

Creating 100 hypercubes.

   > Found 3 clusters in hypercube 0.

   > Found 3 clusters in hypercube 1.

   > Found 3 clusters in hypercube 2.

   > Found 3 clusters in hypercube 3.

Cube_4 is empty.


   > Found 3 clusters in hypercube 5.

   > Found 3 clusters in hypercube 6.

   > Found 3 clusters in hypercube 7.

   > Found 3 clusters in hypercube 8.

   > Found 3 clusters in hypercube 9.

   > Found 3 clusters in hypercube 10.

   > Found 3 clusters in hypercube 11.

   > Found 3 clusters in hypercube 12.

   > Found 3 clusters in hypercube 13.

   > Found 3 clusters in hypercube 14.

   > Found 3 clusters in hypercube 15.

   > Found 3 clusters in hypercube 16.

   > Found 3 clusters in hypercube 17.

   > Found 3 clusters in hypercube 18.

   > Found 3 clusters in hypercube 19.

   > Found 3 clusters in hypercube 20.

   > Found 3 clusters in hypercube 21.

   > Found 3 clusters in hypercube 22.

   > Found 3 clusters in hypercube 23.

   > Found 3 clusters in hypercube 24.

   > Found 3 clusters in hypercube 25.

   > Found 3 clusters in hypercube 26.

   > Found 3 clusters in hypercube 27.

   > Found 3 clusters in hypercube 28.

   > Found 3 clusters in hypercube 29.

   > Found 3 clusters in hypercube 30.

   > Found 3 clusters in hypercube 31.

   > Found 3 clusters in hypercube 32.

   > Found 3 clusters in hypercube 33.

   > Found 3 clusters in hypercube 34.

   > Found 3 clusters in hypercube 35.

   > Found 3 clusters in hypercube 36.

   > Found 3 clusters in hypercube 37.

   > Found 3 clusters in hypercube 38.

   > Found 3 clusters in hypercube 39.

   > Found 3 clusters in hypercube 40.

   > Found 3 clusters in hypercube 41.

   > Found 3 clusters in hypercube 42.

   > Found 3 clusters in hypercube 43.

   > Found 3 clusters in hypercube 44.

   > Found 3 clusters in hypercube 45.

   > Found 3 clusters in hypercube 46.

   > Found 3 clusters in hypercube 47.

   > Found 3 clusters in hypercube 48.

   > Found 3 clusters in hypercube 49.

   > Found 3 clusters in hypercube 50.

   > Found 3 clusters in hypercube 51.

   > Found 3 clusters in hypercube 52.

   > Found 3 clusters in hypercube 53.

   > Found 3 clusters in hypercube 54.

   > Found 3 clusters in hypercube 55.

   > Found 3 clusters in hypercube 56.

   > Found 3 clusters in hypercube 57.

   > Found 3 clusters in hypercube 58.

   > Found 3 clusters in hypercube 59.

   > Found 3 clusters in hypercube 60.

   > Found 3 clusters in hypercube 61.

   > Found 3 clusters in hypercube 62.

   > Found 3 clusters in hypercube 63.

   > Found 3 clusters in hypercube 64.

   > Found 3 clusters in hypercube 65.

   > Found 3 clusters in hypercube 66.

   > Found 3 clusters in hypercube 67.

   > Found 3 clusters in hypercube 68.

   > Found 3 clusters in hypercube 69.

   > Found 3 clusters in hypercube 70.

   > Found 3 clusters in hypercube 71.

   > Found 3 clusters in hypercube 72.

   > Found 3 clusters in hypercube 73.

   > Found 3 clusters in hypercube 74.

   > Found 3 clusters in hypercube 75.

   > Found 3 clusters in hypercube 76.

Cube_77 is empty.


Cube_78 is empty.


   > Found 3 clusters in hypercube 79.

   > Found 3 clusters in hypercube 80.

   > Found 3 clusters in hypercube 81.

   > Found 3 clusters in hypercube 82.

   > Found 3 clusters in hypercube 83.


Created 618 edges and 243 nodes in 0:00:01.169844.

/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

  warnings.warn(msg, category=FutureWarning)

SHAPE (11314, 947)

FEATURE NAMES SAMPLE ['00', '000', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '19', '1992', '1993', '1993apr15', '20', '200', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '408', '41', '42', '43', '44', '45', '49', '50', '500', '60', '70', '80', '90', '92', '93', 'able', 'ac', 'ac uk', 'accept', 'access', 'according', 'acs', 'act', 'action', 'actually', 'add', 'address', 'advance', 'advice', 'ago', 'agree', 'air', 'al', 'allow', 'allowed', 'america', 'american', 'andrew', 'answer', 'anti', 'anybody', 'apparently', 'appears', 'apple', 'application', 'apply', 'appreciate', 'appreciated', 'apr', 'apr 1993', 'apr 93', 'april', 'area', 'aren', 'argument', 'article', 'article 1993apr15', 'ask', 'asked', 'asking', 'assume', 'att', 'att com', 'au', 'available', 'average', 'avoid', 'away', 'bad', 'base', 'baseball', 'based', 'basic', 'basically', 'basis', 'bbs', 'believe', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'bitnet', 'black', 'blue', 'board', 'bob', 'body', 'book', 'books', 'bought', 'box', 'break', 'brian', 'bring', 'brought', 'btw', 'build', 'building', 'built', 'bus', 'business', 'buy', 'ca', 'ca lines', 'california', 'called', 'came', 'canada', 'car', 'card', 'cards', 'care', 'carry', 'cars', 'case', 'cases', 'cause', 'cc', 'center', 'certain', 'certainly', 'chance', 'change', 'changed', 'cheap', 'check', 'chicago', 'children', 'chip', 'choice', 'chris', 'christ', 'christian', 'christians', 'church', 'city', 'claim', 'claims', 'class', 'clear', 'clearly', 'cleveland', 'clinton', 'clipper', 'close', 'cmu', 'cmu edu', 'code', 'college', 'color', 'colorado', 'com', 'com organization', 'com writes', 'come', 'comes', 'coming', 'comment', 'comments', 'common', 'communications', 'comp', 'company', 'complete', 'completely', 'computer', 'computer science', 'computing', 'condition', 'consider', 'considered', 'contact', 'continue', 'control', 'copy', 'corp', 'corporation', 
'correct', 'cost', 'couldn', 'country', 'couple', 'course', 'court', 'cover', 'create', 'created', 'crime', 'cs', 'cso', 'cso uiuc', 'cso uiuc edu', 'cup', 'current', 'currently', 'cut', 'cwru', 'cwru edu', 'data', 'date', 'dave', 'david', 'day', 'days', 'dead', 'deal', 'death', 'decided', 'defense', 'deleted', 'department', 'dept', 'design', 'designed', 'details', 'development', 'device', 'did', 'didn', 'die', 'difference', 'different', 'difficult', 'directly', 'disclaimer', 'discussion', 'disk', 'display', 'distribution', 'distribution na', 'distribution na lines', 'distribution usa', 'distribution usa lines', 'distribution world', 'distribution world nntp', 'distribution world organization', 'division', 'dod', 'does', 'does know', 'doesn', 'doing', 'don', 'don know', 'don think', 'don want', 'dos', 'doubt', 'dr', 'drive', 'driver', 'drivers', 'early', 'earth', 'easily', 'east', 'easy', 'ed', 'edu', 'edu article', 'edu au', 'edu david', 'edu organization', 'edu organization university', 'edu reply', 'edu subject', 'edu writes', 'effect', 'email', 'encryption', 'end', 'engineering', 'entire', 'error', 'especially', 'evidence', 'exactly', 'example', 'excellent', 'exist', 'exists', 'expect', 'experience', 'explain', 'expressed', 'extra', 'face', 'fact', 'faith', 'family', 'fan', 'faq', 'far', 'fast', 'faster', 'fax', 'federal', 'feel', 'figure', 'file', 'files', 'final', 'finally', 'fine', 'folks', 'follow', 'following', 'force', 'forget', 'form', 'frank', 'free', 'friend', 'ftp', 'future', 'game', 'games', 'gave', 'general', 'generally', 'germany', 'gets', 'getting', 'given', 'gives', 'giving', 'gmt', 'god', 'goes', 'going', 'gone', 'good', 'got', 'gov', 'government', 'graphics', 'great', 'greatly', 'ground', 'group', 'groups', 'guess', 'gun', 'guns', 'guy', 'half', 'hand', 'happen', 'happened', 'happens', 'happy', 'hard', 'hardware', 'haven', 'having', 'head', 'hear', 'heard', 'heart', 'hell']

Wrote visualization to: output/newsgroups20.html
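The FutureWarning in the log above comes from vec.get_feature_names(), which scikit-learn deprecated in 1.0 in favour of get_feature_names_out(). A minimal sketch of a version-tolerant helper follows. The helper name feature_names and the two stand-in classes are hypothetical; the stand-ins exist only to exercise both branches without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vectorizer's vocabulary as a plain list of strings.

    Prefers get_feature_names_out() (scikit-learn >= 1.0) and falls back
    to the older get_feature_names() when the new method is missing.
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return [str(name) for name in vectorizer.get_feature_names_out()]
    return [str(name) for name in vectorizer.get_feature_names()]


# Hypothetical stand-ins that mimic the two API generations.
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["car", "engine"]


class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ("car", "engine")  # the real method returns a NumPy array


print(feature_names(OldStyleVectorizer()))
print(feature_names(NewStyleVectorizer()))
```

In Newsgroups20.py the assignment would then read interpretable_inverse_X_names = feature_names(vec), assuming the helper is defined earlier in the script.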

















Another sample: plot_cat.py, a Mapper built from a 3D cat point cloud

(.env) boris@UbuntuLTS:~/TDANALYSIS$ cat plot_cat.py

"""
3D Cat Data
============

This example generates a Mapper built from a point-cloud sampled from a 3D model of a cat.

`Visualization of the cat mapper <../../_static/cat.html>`_
"""

import matplotlib.pyplot as plt
import numpy as np
# Import the submodule explicitly: a bare "import sklearn" does not
# guarantee that sklearn.cluster is loaded.
import sklearn.cluster
import kmapper as km

# Load the (x, y, z) point cloud.
data = np.genfromtxt("./cat-reference.csv", delimiter=",")

mapper = km.KeplerMapper(verbose=2)

# With no projection argument the lens is the row-wise sum,
# rescaled by MinMaxScaler (see the run log).
lens = mapper.fit_transform(data)

graph = mapper.map(
    lens,
    data,
    clusterer=sklearn.cluster.DBSCAN(eps=0.1, min_samples=5),
    cover=km.Cover(n_cubes=15, perc_overlap=0.2),
)

mapper.visualize(graph, path_html="output/cat.html")

km.draw_matplotlib(graph)
plt.show()

(.env) boris@UbuntuLTS:~/TDANALYSIS$ python3 plot_cat.py

KeplerMapper(verbose=2)

..Composing projection pipeline of length 1:

Projections: sum

Distance matrices: False

Scalers: MinMaxScaler()

..Projecting on data shaped (7207, 3)


..Projecting data using: sum


..Scaling with: MinMaxScaler()


Mapping on data shaped (7207, 3) using lens shaped (7207, 1)


Minimal points in hypercube before clustering: 5

Creating 15 hypercubes.

   > Found 2 clusters in hypercube 0.

   > Found 2 clusters in hypercube 1.

   > Found 2 clusters in hypercube 2.

   > Found 1 clusters in hypercube 3.

   > Found 2 clusters in hypercube 4.

   > Found 2 clusters in hypercube 5.

   > Found 1 clusters in hypercube 6.

   > Found 1 clusters in hypercube 7.

   > Found 1 clusters in hypercube 8.

   > Found 1 clusters in hypercube 9.

   > Found 1 clusters in hypercube 10.

   > Found 1 clusters in hypercube 11.

   > Found 1 clusters in hypercube 12.

   > Found 1 clusters in hypercube 13.

   > Found 1 clusters in hypercube 14.


Created 19 edges and 20 nodes in 0:00:00.078696.

Wrote visualization to: output/cat.html
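The "Creating 15 hypercubes." line in the cat log follows from km.Cover(n_cubes=15, perc_overlap=0.2) applied to a one-dimensional lens. The sketch below builds an overlapping interval cover of [0, 1] in plain Python to show the idea. It is a simplified illustration, not KeplerMapper's actual implementation: the function names interval_cover and members, and the exact way the overlap widens each interval, are assumptions. The two-dimensional case assumes Cover's default of 10 intervals per axis, which is consistent with the 100 hypercubes reported in the newsgroups log.

```python
from itertools import product


def interval_cover(n_cubes, perc_overlap, lo=0.0, hi=1.0):
    """Cover [lo, hi] with n_cubes equal intervals.

    Centers are evenly spaced; each interval is widened so that
    neighbouring intervals share perc_overlap of their width.
    """
    step = (hi - lo) / n_cubes
    radius = step / (2.0 * (1.0 - perc_overlap))
    centers = [lo + step * (i + 0.5) for i in range(n_cubes)]
    return [(c - radius, c + radius) for c in centers]


def members(points, interval):
    """Indices of the 1-D lens values that fall inside one interval."""
    low, high = interval
    return [i for i, value in enumerate(points) if low <= value <= high]


# One-dimensional lens, as in plot_cat.py: 15 intervals.
cat_cover = interval_cover(n_cubes=15, perc_overlap=0.2)
print(len(cat_cover))

# Two-dimensional lens with 10 intervals per axis (assumed default):
# the hypercubes are the Cartesian product of the per-axis intervals.
axis = interval_cover(n_cubes=10, perc_overlap=0.33)
print(len(list(product(axis, axis))))

# A lens value near an interval boundary lands in two intervals.
lens = [0.01, 0.066, 0.5]
print([members(lens, iv) for iv in cat_cover[:2]])
```

Points that fall into two neighbouring intervals are what later connect clusters from adjacent hypercubes in the Mapper graph.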



To be able to reproduce any sample from https://kepler-mapper.scikit-tda.org/en/latest/examples.html, reinstall numpy pinned to version 1.21.5:

(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ pip install --force-reinstall numpy==1.21.5
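Before rerunning the examples it can help to confirm that the pin took effect. The following is a small sketch using only the standard library; the helper names installed_version and matches_pin are hypothetical, and 1.21.5 is simply the version used in the command above.

```python
from importlib import metadata


def matches_pin(installed, pinned):
    """True when the installed version string equals the pinned one."""
    return installed.strip() == pinned.strip()


def installed_version(package):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("numpy")
if found is None:
    print("numpy is not installed in this environment")
elif matches_pin(found, "1.21.5"):
    print("numpy matches the pin:", found)
else:
    print("numpy", found, "differs from the pinned 1.21.5")
```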

(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ python3 plot_horse.py 

KeplerMapper(verbose=2)

..Composing projection pipeline of length 1:

Projections: sum

Distance matrices: False

Scalers: MinMaxScaler()

..Projecting on data shaped (8431, 3)


..Projecting data using: sum


..Scaling with: MinMaxScaler()


Mapping on data shaped (8431, 3) using lens shaped (8431, 1)


Minimal points in hypercube before clustering: 5

Creating 30 hypercubes.

   > Found 1 clusters in hypercube 0.

   > Found 2 clusters in hypercube 1.

   > Found 3 clusters in hypercube 2.

   > Found 3 clusters in hypercube 3.

   > Found 3 clusters in hypercube 4.

   > Found 3 clusters in hypercube 5.

   > Found 3 clusters in hypercube 6.

   > Found 1 clusters in hypercube 7.

   > Found 2 clusters in hypercube 8.

   > Found 2 clusters in hypercube 9.

   > Found 2 clusters in hypercube 10.

   > Found 3 clusters in hypercube 11.

   > Found 2 clusters in hypercube 12.

   > Found 2 clusters in hypercube 13.

   > Found 1 clusters in hypercube 14.

   > Found 1 clusters in hypercube 15.

   > Found 1 clusters in hypercube 16.

   > Found 1 clusters in hypercube 17.

   > Found 1 clusters in hypercube 18.

   > Found 1 clusters in hypercube 19.

   > Found 1 clusters in hypercube 20.

   > Found 1 clusters in hypercube 21.

   > Found 1 clusters in hypercube 22.

   > Found 1 clusters in hypercube 23.

   > Found 1 clusters in hypercube 24.

   > Found 1 clusters in hypercube 25.

   > Found 1 clusters in hypercube 26.

   > Found 1 clusters in hypercube 27.

   > Found 1 clusters in hypercube 28.

   > Found 1 clusters in hypercube 29.


Created 48 edges and 48 nodes in 0:00:00.072752.

Wrote visualization to: output/horse.html
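The "Created ... edges and ... nodes" summaries in the logs above count clusters (nodes) and pairs of clusters that share at least one data point (edges). The sketch below shows that construction on toy data in plain Python. The cluster memberships are made up for illustration and the function name nerve_edges is hypothetical, so this is an outline of the idea rather than KeplerMapper's code.

```python
from itertools import combinations


def nerve_edges(nodes, min_intersection=1):
    """Connect two nodes when their member sets share enough points."""
    edges = []
    for (name_a, pts_a), (name_b, pts_b) in combinations(sorted(nodes.items()), 2):
        if len(set(pts_a) & set(pts_b)) >= min_intersection:
            edges.append((name_a, name_b))
    return edges


# Toy clusters: keys name the cube and cluster, values list point indices.
nodes = {
    "cube0_cluster0": [0, 1, 2],
    "cube1_cluster0": [2, 3, 4],   # shares point 2 with cube0_cluster0
    "cube1_cluster1": [7, 8],      # shares no points, stays isolated
    "cube2_cluster0": [4, 5, 6],   # shares point 4 with cube1_cluster0
}

edges = nerve_edges(nodes)
print(len(nodes), "nodes,", len(edges), "edges")
print(edges)
```

Shared points can only arise where hypercubes overlap, which is why perc_overlap in the cover controls how connected the resulting graph is.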














REFERENCES

1. https://kepler-mapper.scikit-tda.org/en/latest/notebooks/KeplerMapper-Newsgroup20-Pipeline.html

2. https://pypi.org/project/scikit-tda/

3. https://kepler-mapper.scikit-tda.org/en/latest/generated/gallery/plot_cat.html





