(.env) boris@UbuntuLTS:~/TDANALYSIS$ cat Newsgroups20.py
import kmapper as km
from kmapper import Cover, jupyter
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.preprocessing import MinMaxScaler
newsgroups = fetch_20newsgroups(subset='train')
X, y, target_names = np.array(newsgroups.data), np.array(newsgroups.target), np.array(newsgroups.target_names)
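print("SAMPLE", X[0])
print("SHAPE", X.shape)
print("TARGET", target_names[y[0]])
mapper = km.KeplerMapper(verbose=2)
# Three-stage projection pipeline; the TF-IDF and SVD settings are the ones echoed in the run log below.
projected_X = mapper.fit_transform(X,
    projection=[TfidfVectorizer(analyzer="char", ngram_range=(1, 6), max_df=0.83, min_df=0.05),
                TruncatedSVD(n_components=100, random_state=1729),
                # Isomap settings are not echoed in the log; n_components=2 matches the (11314, 2) output
                Isomap(n_components=2)],
    scaler=[None, None, MinMaxScaler()])
print("SHAPE", projected_X.shape)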
from sklearn import cluster
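# n_clusters=3 matches "Found 3 clusters" in the log below; any further
# clusterer arguments are not recoverable from it.
graph = mapper.map(projected_X,
                   clusterer=cluster.AgglomerativeClustering(n_clusters=3))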
vec = TfidfVectorizer(analyzer="word",
interpretable_inverse_X = vec.fit_transform(X).toarray()
interpretable_inverse_X_names = vec.get_feature_names()
print("SHAPE", interpretable_inverse_X.shape)
print("FEATURE NAMES SAMPLE", interpretable_inverse_X_names[:400])
_ = mapper.visualize(graph,
                     path_html="output/newsgroups20.html",
                     lens_names=["ISOMAP1", "ISOMAP2"],
                     title="Newsgroups20: Latent Semantic Char-gram Analysis with Isometric Embedding",
                     custom_tooltips=np.array([target_names[ys] for ys in y]))
(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ python3 Newsgroups20.py
SAMPLE From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
- IL
---- brought to you by your neighborhood Lerxst ----
SHAPE (11314,)
TARGET rec.autos
..Composing projection pipeline of length 3:
Projections: TfidfVectorizer(analyzer='char', max_df=0.83, min_df=0.05, ngram_range=(1, 6))
TruncatedSVD(n_components=100, random_state=1729)
Distance matrices: False
Scalers: None
..Projecting on data shaped (11314,)
..Projecting data using:
TfidfVectorizer(analyzer='char', max_df=0.83, min_df=0.05, ngram_range=(1, 6))
..Created projection shaped (11314, 13967)
..Projecting on data shaped (11314, 13967)
..Projecting data using:
TruncatedSVD(n_components=100, random_state=1729)
..Projecting on data shaped (11314, 100)
..Projecting data using:
/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/sklearn/manifold/_isomap.py:348: UserWarning: The number of connected components of the neighbors graph is 2 > 1. Completing the graph to fit Isomap might be slow. Increase the number of neighbors to avoid this issue.
/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/scipy/sparse/_index.py:103: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_intXint(row, col, x.flat[0])
..Scaling with: MinMaxScaler()
SHAPE (11314, 2)
Mapping on data shaped (11314, 2) using lens shaped (11314, 2)
Minimal points in hypercube before clustering: 3
Creating 100 hypercubes.
> Found 3 clusters in hypercube 0.
> Found 3 clusters in hypercube 1.
Cube_77 is empty.
Cube_78 is empty.
> Found 3 clusters in hypercube 79.
> Found 3 clusters in hypercube 80.
> Found 3 clusters in hypercube 81.
> Found 3 clusters in hypercube 82.
> Found 3 clusters in hypercube 83.
Created 618 edges and 243 nodes in 0:00:01.169844.
/home/boris/SCIKIT-TDA/.env/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
warnings.warn(msg, category=FutureWarning)
SHAPE (11314, 947)
FEATURE NAMES SAMPLE ['00', '000', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '19', '1992', '1993', '1993apr15', '20', '200', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '408', '41', '42', '43', '44', '45', '49', '50', '500', '60', '70', '80', '90', '92', '93', 'able', 'ac', 'ac uk', 'accept', 'access', 'according', 'acs', 'act', 'action', 'actually', 'add', 'address', 'advance', 'advice', 'ago', 'agree', 'air', 'al', 'allow', 'allowed', 'america', 'american', 'andrew', 'answer', 'anti', 'anybody', 'apparently', 'appears', 'apple', 'application', 'apply', 'appreciate', 'appreciated', 'apr', 'apr 1993', 'apr 93', 'april', 'area', 'aren', 'argument', 'article', 'article 1993apr15', 'ask', 'asked', 'asking', 'assume', 'att', 'att com', 'au', 'available', 'average', 'avoid', 'away', 'bad', 'base', 'baseball', 'based', 'basic', 'basically', 'basis', 'bbs', 'believe', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'bitnet', 'black', 'blue', 'board', 'bob', 'body', 'book', 'books', 'bought', 'box', 'break', 'brian', 'bring', 'brought', 'btw', 'build', 'building', 'built', 'bus', 'business', 'buy', 'ca', 'ca lines', 'california', 'called', 'came', 'canada', 'car', 'card', 'cards', 'care', 'carry', 'cars', 'case', 'cases', 'cause', 'cc', 'center', 'certain', 'certainly', 'chance', 'change', 'changed', 'cheap', 'check', 'chicago', 'children', 'chip', 'choice', 'chris', 'christ', 'christian', 'christians', 'church', 'city', 'claim', 'claims', 'class', 'clear', 'clearly', 'cleveland', 'clinton', 'clipper', 'close', 'cmu', 'cmu edu', 'code', 'college', 'color', 'colorado', 'com', 'com organization', 'com writes', 'come', 'comes', 'coming', 'comment', 'comments', 'common', 'communications', 'comp', 'company', 'complete', 'completely', 'computer', 'computer science', 'computing', 'condition', 'consider', 'considered', 'contact', 'continue', 'control', 'copy', 'corp', 'corporation', 
'correct', 'cost', 'couldn', 'country', 'couple', 'course', 'court', 'cover', 'create', 'created', 'crime', 'cs', 'cso', 'cso uiuc', 'cso uiuc edu', 'cup', 'current', 'currently', 'cut', 'cwru', 'cwru edu', 'data', 'date', 'dave', 'david', 'day', 'days', 'dead', 'deal', 'death', 'decided', 'defense', 'deleted', 'department', 'dept', 'design', 'designed', 'details', 'development', 'device', 'did', 'didn', 'die', 'difference', 'different', 'difficult', 'directly', 'disclaimer', 'discussion', 'disk', 'display', 'distribution', 'distribution na', 'distribution na lines', 'distribution usa', 'distribution usa lines', 'distribution world', 'distribution world nntp', 'distribution world organization', 'division', 'dod', 'does', 'does know', 'doesn', 'doing', 'don', 'don know', 'don think', 'don want', 'dos', 'doubt', 'dr', 'drive', 'driver', 'drivers', 'early', 'earth', 'easily', 'east', 'easy', 'ed', 'edu', 'edu article', 'edu au', 'edu david', 'edu organization', 'edu organization university', 'edu reply', 'edu subject', 'edu writes', 'effect', 'email', 'encryption', 'end', 'engineering', 'entire', 'error', 'especially', 'evidence', 'exactly', 'example', 'excellent', 'exist', 'exists', 'expect', 'experience', 'explain', 'expressed', 'extra', 'face', 'fact', 'faith', 'family', 'fan', 'faq', 'far', 'fast', 'faster', 'fax', 'federal', 'feel', 'figure', 'file', 'files', 'final', 'finally', 'fine', 'folks', 'follow', 'following', 'force', 'forget', 'form', 'frank', 'free', 'friend', 'ftp', 'future', 'game', 'games', 'gave', 'general', 'generally', 'germany', 'gets', 'getting', 'given', 'gives', 'giving', 'gmt', 'god', 'goes', 'going', 'gone', 'good', 'got', 'gov', 'government', 'graphics', 'great', 'greatly', 'ground', 'group', 'groups', 'guess', 'gun', 'guns', 'guy', 'half', 'hand', 'happen', 'happened', 'happens', 'happy', 'hard', 'hardware', 'haven', 'having', 'head', 'hear', 'heard', 'heart', 'hell']
Wrote visualization to: output/newsgroups20.html
Another sample
(.env) boris@UbuntuLTS:~/TDANALYSIS$ cat plot_cat.py
"""
3D Cat Data

This example generates a Mapper built from a point-cloud sampled from a 3D model of a cat.

`Visualization of the cat mapper <../../_static/cat.html>`_
"""
import numpy as np
import sklearn
import kmapper as km
data = np.genfromtxt("./cat-reference.csv", delimiter=",")
mapper = km.KeplerMapper(verbose=2)
lens = mapper.fit_transform(data)
graph = mapper.map(
    lens,
    data,
    clusterer=sklearn.cluster.DBSCAN(eps=0.1, min_samples=5),
    cover=km.Cover(n_cubes=15, perc_overlap=0.2),
)
mapper.visualize(graph, path_html="output/cat.html")
import matplotlib.pyplot as plt
(.env) boris@UbuntuLTS:~/TDANALYSIS$ python3 plot_cat.py
..Composing projection pipeline of length 1:
Projections: sum
Distance matrices: False
Scalers: MinMaxScaler()
..Projecting on data shaped (7207, 3)
..Projecting data using: sum
..Scaling with: MinMaxScaler()
Mapping on data shaped (7207, 3) using lens shaped (7207, 1)
Minimal points in hypercube before clustering: 5
Creating 15 hypercubes.
> Found 2 clusters in hypercube 0.
Created 19 edges and 20 nodes in 0:00:00.078696.
Wrote visualization to: output/cat.html
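The log above reports a `sum` projection scaled with `MinMaxScaler()`, followed by a cover of 15 hypercubes. The sketch below illustrates those two steps in plain Python, assuming a 1-D lens on [0, 1] and equal-width intervals whose neighbours share a fixed fraction of their width. It is an illustration of the idea only: the function names and the tiny point cloud are hypothetical, and KeplerMapper's `Cover` may place its intervals differently.

```python
def sum_lens(points):
    """Project each point to the sum of its coordinates, then min-max scale to [0, 1]."""
    raw = [sum(p) for p in points]
    lo, hi = min(raw), max(raw)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in raw]
    return [(v - lo) / span for v in raw]


def interval_cover(n_cubes, perc_overlap):
    """Cover [0, 1] with n_cubes equal intervals; neighbours share perc_overlap of their width.

    Illustrative construction only, not KeplerMapper's exact formula.
    """
    step = 1.0 / n_cubes                 # distance between interval centres
    width = step / (1.0 - perc_overlap)  # widened so that neighbours overlap
    half = width / 2.0
    return [((i + 0.5) * step - half, (i + 0.5) * step + half) for i in range(n_cubes)]


def members(lens, cover):
    """For each interval, list the indices of the points whose lens value falls inside it."""
    return [[i for i, v in enumerate(lens) if lo <= v <= hi] for lo, hi in cover]


# Tiny hypothetical point cloud standing in for cat-reference.csv
points = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
lens = sum_lens(points)
cover = interval_cover(n_cubes=15, perc_overlap=0.2)
print(len(cover), [len(m) for m in members(lens, cover)])
```

In Mapper, the points falling into each interval are then clustered (with DBSCAN in plot_cat.py), and two clusters are joined by an edge when they share points, which can only happen inside the overlap regions.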
To be able to reproduce any sample from https://kepler-mapper.scikit-tda.org/en/latest/examples.html, force-reinstall NumPy 1.21.5:
(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ pip install --force-reinstall numpy==1.21.5
(.env) boris@UbuntuLTS:~/SCIKIT-TDA$ python3 plot_horse.py
..Composing projection pipeline of length 1:
Projections: sum
Distance matrices: False
Scalers: MinMaxScaler()
..Projecting on data shaped (8431, 3)
..Projecting data using: sum
..Scaling with: MinMaxScaler()
Mapping on data shaped (8431, 3) using lens shaped (8431, 1)
Minimal points in hypercube before clustering: 5
Creating 30 hypercubes.
> Found 1 clusters in hypercube 0.
Created 48 edges and 48 nodes in 0:00:00.072752.
Wrote visualization to: output/horse.html
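The "Creating N hypercubes" lines in the three logs are consistent with a regular grid cover: the number of intervals per lens axis raised to the lens dimension. A small sketch of that count follows. Only `n_cubes=15` for the cat appears in a listing above; the per-axis values 10 (Newsgroups20) and 30 (horse) are assumptions inferred from the logs, and `hypercube_ids` is a hypothetical helper, not a KeplerMapper function.

```python
from itertools import product


def hypercube_ids(n_cubes, lens_dim):
    """Enumerate the grid cells of a cover with n_cubes intervals along each lens axis."""
    return list(product(range(n_cubes), repeat=lens_dim))


# (label, intervals per axis, lens dimension)
# 15 comes from plot_cat.py; 10 and 30 are assumptions inferred from the logs.
runs = [("newsgroups20", 10, 2), ("cat", 15, 1), ("horse", 30, 1)]
for label, n_cubes, lens_dim in runs:
    print(label, len(hypercube_ids(n_cubes, lens_dim)))
```

Under these assumptions the counts come out as 100, 15 and 30, matching the logs. Cells that contain too few points are skipped, as the "Cube_77 is empty." lines in the Newsgroups20 run show.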
1. https://kepler-mapper.scikit-tda.org/en/latest/notebooks/KeplerMapper-Newsgroup20-Pipeline.html
2. https://pypi.org/project/scikit-tda/
3. https://kepler-mapper.scikit-tda.org/en/latest/generated/gallery/plot_cat.html