(.env) boris@UbuntuLTS:~/TDANALYSIS$ cat Newsgroups20.py
import kmapper as km
from kmapper import Cover, jupyter
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.preprocessing import MinMaxScaler
newsgroups = fetch_20newsgroups(subset='train')
X, y, target_names = np.array(newsgroups.data), np.array(newsgroups.target), np.array(newsgroups.target_names)
print("SAMPLE",X[0])
print("SHAPE",X.shape)
print("TARGET",target_names[y[0]])
mapper = km.KeplerMapper(verbose=2)
projected_X = mapper.fit_transform(X,
                                   projection=[TfidfVectorizer(analyzer="char",
                                                               ngram_range=(1, 6),
                                                               max_df=0.83,
                                                               min_df=0.05),
                                               TruncatedSVD(n_components=100,
                                                            random_state=1729),
                                               Isomap(n_components=2,
                                                      n_jobs=-1)],
                                   scaler=[None, None, MinMaxScaler()])
print("SHAPE",projected_X.shape)
from sklearn import cluster
graph = mapper.map(projected_X,
                   X=None,
                   clusterer=cluster.AgglomerativeClustering(n_clusters=3,
                                                             linkage="complete",
                                                             affinity="cosine"),
                   cover=Cover(perc_overlap=0.33))
vec = TfidfVectorizer(analyzer="word",
strip_accents="unicode",
stop_words="english",
ngram_range=(1,3),
max_df=0.97,
min_df=0.02)
interpretable_inverse_X = vec.fit_transform(X).toarray()
interpretable_inverse_X_names = vec.get_feature_names()
print("SHAPE", interpretable_inverse_X.shape)
print("FEATURE NAMES SAMPLE", interpretable_inverse_X_names[:400])
_ = mapper.visualize(graph,
                     X=interpretable_inverse_X,
                     X_names=interpretable_inverse_X_names,
                     path_html="output/newsgroups20.html",
                     lens=projected_X,
                     lens_names=["ISOMAP1", "ISOMAP2"],
                     title="Newsgroups20: Latent Semantic Char-gram Analysis with Isometric Embedding",
                     custom_tooltips=np.array([target_names[ys] for ys in y]),
                     color_values=y,
                     color_function_name='target')
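For readers who want the lens outside of kmapper, roughly the same projection can be spelled out as an ordinary scikit-learn Pipeline with the parameters from the script above. This is only a sketch, not part of Newsgroups20.py; the names lens_pipeline and projected_X_alt are illustrative, and the trailing MinMaxScaler mirrors scaler=[None, None, MinMaxScaler()], i.e. scaling is applied only to the final Isomap embedding.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.preprocessing import MinMaxScaler

# Char-gram TF-IDF -> truncated SVD (latent semantic analysis) -> 2-D Isomap,
# then scale the final embedding to [0, 1].
lens_pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 6), max_df=0.83, min_df=0.05),
    TruncatedSVD(n_components=100, random_state=1729),
    Isomap(n_components=2, n_jobs=-1),
    MinMaxScaler(),
)
# projected_X_alt = lens_pipeline.fit_transform(X)   # expected shape: (11314, 2)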
(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ python3 Newsgroups20.py
SAMPLE From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
SHAPE (11314,)
TARGET rec.autos
KeplerMapper(verbose=2)
..Composing projection pipeline of length 3:
Projections: TfidfVectorizer(analyzer='char', max_df=0.83, min_df=0.05, ngram_range=(1, 6))
TruncatedSVD(n_components=100, random_state=1729)
Isomap(n_jobs=-1)
Distance matrices: False
False
False
Scalers: None
None
MinMaxScaler()
..Projecting on data shaped (11314,)
..Projecting data using:
TfidfVectorizer(analyzer='char', max_df=0.83, min_df=0.05, ngram_range=(1, 6))
..Created projection shaped (11314, 13967)
..Projecting on data shaped (11314, 13967)
..Projecting data using:
TruncatedSVD(n_components=100, random_state=1729)
..Projecting on data shaped (11314, 100)
..Projecting data using:
Isomap(n_jobs=-1)
/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/sklearn/manifold/_isomap.py:348: UserWarning: The number of connected components of the neighbors graph is 2 > 1. Completing the graph to fit Isomap might be slow. Increase the number of neighbors to avoid this issue.
self._fit_transform(X)
/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/scipy/sparse/_index.py:103: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_intXint(row, col, x.flat[0])
..Scaling with: MinMaxScaler()
SHAPE (11314, 2)
Mapping on data shaped (11314, 2) using lens shaped (11314, 2)
Minimal points in hypercube before clustering: 3
Creating 100 hypercubes.
> Found 3 clusters in hypercube 0.
> Found 3 clusters in hypercube 1.
> Found 3 clusters in hypercube 2.
> Found 3 clusters in hypercube 3.
Cube_4 is empty.
> Found 3 clusters in hypercube 5.
> Found 3 clusters in hypercube 6.
> Found 3 clusters in hypercube 7.
> Found 3 clusters in hypercube 8.
> Found 3 clusters in hypercube 9.
> Found 3 clusters in hypercube 10.
> Found 3 clusters in hypercube 11.
> Found 3 clusters in hypercube 12.
> Found 3 clusters in hypercube 13.
> Found 3 clusters in hypercube 14.
> Found 3 clusters in hypercube 15.
> Found 3 clusters in hypercube 16.
> Found 3 clusters in hypercube 17.
> Found 3 clusters in hypercube 18.
> Found 3 clusters in hypercube 19.
> Found 3 clusters in hypercube 20.
> Found 3 clusters in hypercube 21.
> Found 3 clusters in hypercube 22.
> Found 3 clusters in hypercube 23.
> Found 3 clusters in hypercube 24.
> Found 3 clusters in hypercube 25.
> Found 3 clusters in hypercube 26.
> Found 3 clusters in hypercube 27.
> Found 3 clusters in hypercube 28.
> Found 3 clusters in hypercube 29.
> Found 3 clusters in hypercube 30.
> Found 3 clusters in hypercube 31.
> Found 3 clusters in hypercube 32.
> Found 3 clusters in hypercube 33.
> Found 3 clusters in hypercube 34.
> Found 3 clusters in hypercube 35.
> Found 3 clusters in hypercube 36.
> Found 3 clusters in hypercube 37.
> Found 3 clusters in hypercube 38.
> Found 3 clusters in hypercube 39.
> Found 3 clusters in hypercube 40.
> Found 3 clusters in hypercube 41.
> Found 3 clusters in hypercube 42.
> Found 3 clusters in hypercube 43.
> Found 3 clusters in hypercube 44.
> Found 3 clusters in hypercube 45.
> Found 3 clusters in hypercube 46.
> Found 3 clusters in hypercube 47.
> Found 3 clusters in hypercube 48.
> Found 3 clusters in hypercube 49.
> Found 3 clusters in hypercube 50.
> Found 3 clusters in hypercube 51.
> Found 3 clusters in hypercube 52.
> Found 3 clusters in hypercube 53.
> Found 3 clusters in hypercube 54.
> Found 3 clusters in hypercube 55.
> Found 3 clusters in hypercube 56.
> Found 3 clusters in hypercube 57.
> Found 3 clusters in hypercube 58.
> Found 3 clusters in hypercube 59.
> Found 3 clusters in hypercube 60.
> Found 3 clusters in hypercube 61.
> Found 3 clusters in hypercube 62.
> Found 3 clusters in hypercube 63.
> Found 3 clusters in hypercube 64.
> Found 3 clusters in hypercube 65.
> Found 3 clusters in hypercube 66.
> Found 3 clusters in hypercube 67.
> Found 3 clusters in hypercube 68.
> Found 3 clusters in hypercube 69.
> Found 3 clusters in hypercube 70.
> Found 3 clusters in hypercube 71.
> Found 3 clusters in hypercube 72.
> Found 3 clusters in hypercube 73.
> Found 3 clusters in hypercube 74.
> Found 3 clusters in hypercube 75.
> Found 3 clusters in hypercube 76.
Cube_77 is empty.
Cube_78 is empty.
> Found 3 clusters in hypercube 79.
> Found 3 clusters in hypercube 80.
> Found 3 clusters in hypercube 81.
> Found 3 clusters in hypercube 82.
> Found 3 clusters in hypercube 83.
Created 618 edges and 243 nodes in 0:00:01.169844.
/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
warnings.warn(msg, category=FutureWarning)
SHAPE (11314, 947)
FEATURE NAMES SAMPLE ['00', '000', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '19', '1992', '1993', '1993apr15', '20', '200', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '408', '41', '42', '43', '44', '45', '49', '50', '500', '60', '70', '80', '90', '92', '93', 'able', 'ac', 'ac uk', 'accept', 'access', 'according', 'acs', 'act', 'action', 'actually', 'add', 'address', 'advance', 'advice', 'ago', 'agree', 'air', 'al', 'allow', 'allowed', 'america', 'american', 'andrew', 'answer', 'anti', 'anybody', 'apparently', 'appears', 'apple', 'application', 'apply', 'appreciate', 'appreciated', 'apr', 'apr 1993', 'apr 93', 'april', 'area', 'aren', 'argument', 'article', 'article 1993apr15', 'ask', 'asked', 'asking', 'assume', 'att', 'att com', 'au', 'available', 'average', 'avoid', 'away', 'bad', 'base', 'baseball', 'based', 'basic', 'basically', 'basis', 'bbs', 'believe', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'bitnet', 'black', 'blue', 'board', 'bob', 'body', 'book', 'books', 'bought', 'box', 'break', 'brian', 'bring', 'brought', 'btw', 'build', 'building', 'built', 'bus', 'business', 'buy', 'ca', 'ca lines', 'california', 'called', 'came', 'canada', 'car', 'card', 'cards', 'care', 'carry', 'cars', 'case', 'cases', 'cause', 'cc', 'center', 'certain', 'certainly', 'chance', 'change', 'changed', 'cheap', 'check', 'chicago', 'children', 'chip', 'choice', 'chris', 'christ', 'christian', 'christians', 'church', 'city', 'claim', 'claims', 'class', 'clear', 'clearly', 'cleveland', 'clinton', 'clipper', 'close', 'cmu', 'cmu edu', 'code', 'college', 'color', 'colorado', 'com', 'com organization', 'com writes', 'come', 'comes', 'coming', 'comment', 'comments', 'common', 'communications', 'comp', 'company', 'complete', 'completely', 'computer', 'computer science', 'computing', 'condition', 'consider', 'considered', 'contact', 'continue', 'control', 'copy', 'corp', 'corporation', 'correct', 'cost', 'couldn', 'country', 'couple', 'course', 'court', 'cover', 'create', 'created', 'crime', 'cs', 'cso', 'cso uiuc', 'cso uiuc edu', 'cup', 'current', 'currently', 'cut', 'cwru', 'cwru edu', 'data', 'date', 'dave', 'david', 'day', 'days', 'dead', 'deal', 'death', 'decided', 'defense', 'deleted', 'department', 'dept', 'design', 'designed', 'details', 'development', 'device', 'did', 'didn', 'die', 'difference', 'different', 'difficult', 'directly', 'disclaimer', 'discussion', 'disk', 'display', 'distribution', 'distribution na', 'distribution na lines', 'distribution usa', 'distribution usa lines', 'distribution world', 'distribution world nntp', 'distribution world organization', 'division', 'dod', 'does', 'does know', 'doesn', 'doing', 'don', 'don know', 'don think', 'don want', 'dos', 'doubt', 'dr', 'drive', 'driver', 'drivers', 'early', 'earth', 'easily', 'east', 'easy', 'ed', 'edu', 'edu article', 'edu au', 'edu david', 'edu organization', 'edu organization university', 'edu reply', 'edu subject', 'edu writes', 'effect', 'email', 'encryption', 'end', 'engineering', 'entire', 'error', 'especially', 'evidence', 'exactly', 'example', 'excellent', 'exist', 'exists', 'expect', 'experience', 'explain', 'expressed', 'extra', 'face', 'fact', 'faith', 'family', 'fan', 'faq', 'far', 'fast', 'faster', 'fax', 'federal', 'feel', 'figure', 'file', 'files', 'final', 'finally', 'fine', 'folks', 'follow', 'following', 'force', 'forget', 'form', 'frank', 'free', 'friend', 'ftp', 'future', 'game', 'games', 'gave', 'general', 
'generally', 'germany', 'gets', 'getting', 'given', 'gives', 'giving', 'gmt', 'god', 'goes', 'going', 'gone', 'good', 'got', 'gov', 'government', 'graphics', 'great', 'greatly', 'ground', 'group', 'groups', 'guess', 'gun', 'guns', 'guy', 'half', 'hand', 'happen', 'happened', 'happens', 'happy', 'hard', 'hardware', 'haven', 'having', 'head', 'hear', 'heard', 'heart', 'hell']
Wrote visualization to: output/newsgroups20.html
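The script imports kmapper.jupyter but never uses it. If you run the same pipeline from a notebook instead of a shell, the generated HTML can be embedded inline; a minimal sketch, assuming the same output path as above:
from kmapper import jupyter
# Render the saved Mapper visualization inside a Jupyter notebook cell.
jupyter.display(path_html="output/newsgroups20.html")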
Another sample: the 3D cat point-cloud example from the kepler-mapper gallery.
(.env) boris@UbuntuLTS:~/TDANALYSIS$ cat plot_cat.py
"""
3D Cat Data
============
This example generates a Mapper built from a point-cloud sampled from a 3D model of a cat.
`Visualization of the cat mapper <../../_static/cat.html>`_
"""
import numpy as np
import sklearn.cluster
import kmapper as km
data = np.genfromtxt("./cat-reference.csv", delimiter=",")
mapper = km.KeplerMapper(verbose=2)
lens = mapper.fit_transform(data)
graph = mapper.map(
    lens,
    data,
    clusterer=sklearn.cluster.DBSCAN(eps=0.1, min_samples=5),
    cover=km.Cover(n_cubes=15, perc_overlap=0.2),
)
mapper.visualize(graph, path_html="output/cat.html")
km.draw_matplotlib(graph)
import matplotlib.pyplot as plt
plt.show()
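Note that fit_transform is called here without arguments; as the verbose log below confirms, kmapper then falls back to the default sum projection scaled with a MinMaxScaler. The explicit equivalent is roughly the following sketch:
from sklearn.preprocessing import MinMaxScaler
# Explicit form of the default lens: sum each point's 3 coordinates, then scale to [0, 1].
lens = mapper.fit_transform(data, projection="sum", scaler=MinMaxScaler())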
(.env) boris@UbuntuLTS:~/TDANALYSIS$ python3 plot_cat.py
KeplerMapper(verbose=2)
..Composing projection pipeline of length 1:
Projections: sum
Distance matrices: False
Scalers: MinMaxScaler()
..Projecting on data shaped (7207, 3)
..Projecting data using: sum
..Scaling with: MinMaxScaler()
Mapping on data shaped (7207, 3) using lens shaped (7207, 1)
Minimal points in hypercube before clustering: 5
Creating 15 hypercubes.
> Found 2 clusters in hypercube 0.
> Found 2 clusters in hypercube 1.
> Found 2 clusters in hypercube 2.
> Found 1 clusters in hypercube 3.
> Found 2 clusters in hypercube 4.
> Found 2 clusters in hypercube 5.
> Found 1 clusters in hypercube 6.
> Found 1 clusters in hypercube 7.
> Found 1 clusters in hypercube 8.
> Found 1 clusters in hypercube 9.
> Found 1 clusters in hypercube 10.
> Found 1 clusters in hypercube 11.
> Found 1 clusters in hypercube 12.
> Found 1 clusters in hypercube 13.
> Found 1 clusters in hypercube 14.
Created 19 edges and 20 nodes in 0:00:00.078696.
Wrote visualization to: output/cat.html
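For a quick sanity check without opening the HTML, the summary in the log above can be recomputed from the returned graph, which map() returns as a plain dictionary. The key layout assumed here ("nodes" mapping node ids to member row indices, "links" mapping node ids to adjacent node ids) follows the kepler-mapper documentation, so treat this as a sketch:
# Recompute the "Created 19 edges and 20 nodes" summary from the graph dict.
n_nodes = len(graph["nodes"])
n_edges = sum(len(neighbours) for neighbours in graph["links"].values())
print(f"{n_edges} edges and {n_nodes} nodes")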
In the same way, any of the samples at https://kepler-mapper.scikit-tda.org/en/latest/examples.html can be reproduced.
(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ pip install --force-reinstall numpy==1.21.5
(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ python3 plot_horse.py
KeplerMapper(verbose=2)
..Composing projection pipeline of length 1:
Projections: sum
Distance matrices: False
Scalers: MinMaxScaler()
..Projecting on data shaped (8431, 3)
..Projecting data using: sum
..Scaling with: MinMaxScaler()
Mapping on data shaped (8431, 3) using lens shaped (8431, 1)
Minimal points in hypercube before clustering: 5
Creating 30 hypercubes.
> Found 1 clusters in hypercube 0.
> Found 2 clusters in hypercube 1.
> Found 3 clusters in hypercube 2.
> Found 3 clusters in hypercube 3.
> Found 3 clusters in hypercube 4.
> Found 3 clusters in hypercube 5.
> Found 3 clusters in hypercube 6.
> Found 1 clusters in hypercube 7.
> Found 2 clusters in hypercube 8.
> Found 2 clusters in hypercube 9.
> Found 2 clusters in hypercube 10.
> Found 3 clusters in hypercube 11.
> Found 2 clusters in hypercube 12.
> Found 2 clusters in hypercube 13.
> Found 1 clusters in hypercube 14.
> Found 1 clusters in hypercube 15.
> Found 1 clusters in hypercube 16.
> Found 1 clusters in hypercube 17.
> Found 1 clusters in hypercube 18.
> Found 1 clusters in hypercube 19.
> Found 1 clusters in hypercube 20.
> Found 1 clusters in hypercube 21.
> Found 1 clusters in hypercube 22.
> Found 1 clusters in hypercube 23.
> Found 1 clusters in hypercube 24.
> Found 1 clusters in hypercube 25.
> Found 1 clusters in hypercube 26.
> Found 1 clusters in hypercube 27.
> Found 1 clusters in hypercube 28.
> Found 1 clusters in hypercube 29.
Created 48 edges and 48 nodes in 0:00:00.072752.
Wrote visualization to: output/horse.html
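plot_horse.py itself is not listed above. Judging from the verbose output (an 8431 x 3 point cloud, the default sum lens, 30 hypercubes, minimum 5 points per cube), it presumably mirrors plot_cat.py with the horse reference point cloud from the same gallery. A hedged sketch only; the CSV path, the DBSCAN eps, and the perc_overlap value are assumptions, not taken from the run above:
import numpy as np
import sklearn.cluster
import kmapper as km

# Assumed to parallel plot_cat.py; file name and some parameters are guesses.
data = np.genfromtxt("./horse-reference.csv", delimiter=",")
mapper = km.KeplerMapper(verbose=2)
lens = mapper.fit_transform(data)          # default "sum" lens with MinMaxScaler
graph = mapper.map(
    lens,
    data,
    clusterer=sklearn.cluster.DBSCAN(eps=0.1, min_samples=5),
    cover=km.Cover(n_cubes=30, perc_overlap=0.2),   # 30 cubes matches the log above
)
mapper.visualize(graph, path_html="output/horse.html")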
REFERENCES
1. https://kepler-mapper.scikit-tda.org/en/latest/notebooks/KeplerMapper-Newsgroup20-Pipeline.html
2. https://pypi.org/project/scikit-tda/
3. https://kepler-mapper.scikit-tda.org/en/latest/generated/gallery/plot_cat.html