Superset - Data Exploration Platform | Data Visualization Platform

Overview

Website: https://superset.apache.org/
Docs: https://superset.apache.org/docs/intro
Repository: https://github.com/apache/superset

Superset is a modern data exploration and data visualization platform. Superset can replace or augment proprietary business intelligence tools for many teams. Superset integrates well with a variety of data sources.

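Superset provides data exploration and visualization features. Users can connect it to a range of data sources, including databases such as MySQL, PostgreSQL, and SQLite. Once connected, they can create interactive dashboards, explore datasets through SQL queries or a user-friendly interface, and build visualizations such as charts, tables, and maps.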

Composition

Components | Principles

According to the superset/docker-compose.yml file, it contains the nginx, redis, db, superset, superset-websocket, superset-init, superset-node, superset-worker, superset-worker-beat, and superset-tests-worker components.

According to the Architecture documentation, it consists of these components:
1) The Superset application itself
2) A metadata database
3) A caching layer (optional, but necessary for some features)
4) A worker & beat (optional, but necessary for some features)

The optional components must be enabled for the Alerts and Reports, Caching, Async Queries, and Dashboard Thumbnails features.

The Superset Application

This is the core application. Superset operates like this:

1) A user visits a chart or dashboard
2) That triggers a SQL query to the data warehouse holding the underlying dataset
3) The resulting data is served up in a data visualization

The Superset application consists of the Python (Flask) backend application (server), the API layer, the React frontend built via Webpack, and the static assets needed for the application to work.

Metadata Database

This is where chart and dashboard definitions, user information, logs, etc. are stored. Superset is tested to work with PostgreSQL and MySQL databases as the metadata database (not to be confused with a data source like your data warehouse, which could be any of a much wider range of options such as Snowflake, Redshift, etc.).

Some installation methods, like our Quickstart and PyPI, come configured by default to use an on-disk SQLite database. In a Docker Compose installation, the data is stored in a PostgreSQL container volume. Neither of these cases is recommended for production instances of Superset.

For production, a properly-configured, managed, standalone database is recommended. No matter what database you use, you should plan to back it up regularly.
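
As a rough sketch of pointing Superset at a standalone metadata database: `SQLALCHEMY_DATABASE_URI` is the setting Superset reads from `superset_config.py`, while the host, database name, user, and password below are placeholders and `build_metadata_uri` is a hypothetical helper that is not part of Superset.

```python
from urllib.parse import quote_plus


def build_metadata_uri(user, password, host, port, database):
    """Build a SQLAlchemy URI for a PostgreSQL metadata database."""
    # URL-encode the credentials so characters such as '@' or '/'
    # cannot be mistaken for URI separators.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values; substitute your own managed database.
SQLALCHEMY_DATABASE_URI = build_metadata_uri(
    "superset", "p@ss/word", "db.example.internal", 5432, "superset"
)
```

Encoding the password matters mostly for generated passwords, which often contain characters that would otherwise break the URI.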

Caching Layer

The caching layer serves two main functions:

Store the results of queries to your data warehouse so that when a chart is loaded twice, it pulls from the cache the second time, speeding up the application and reducing load on your data warehouse.

Act as a message broker for the worker, enabling the Alerts & Reports, async queries, and thumbnail caching features.

Most people use Redis for their cache, but Superset supports other options too. See the cache docs for more.
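
A minimal sketch of a Redis-backed cache in `superset_config.py`, assuming a Redis instance at a made-up local address. `CACHE_CONFIG` and `DATA_CACHE_CONFIG` are Superset settings that take Flask-Caching style dictionaries; `redis_cache`, the key prefixes, and the timeouts are illustrative choices, not recommended values.

```python
# Hypothetical Redis location; replace with your own instance.
REDIS_URL = "redis://localhost:6379/1"


def redis_cache(prefix, timeout_seconds):
    """Return a Flask-Caching style config dict backed by Redis."""
    return {
        "CACHE_TYPE": "RedisCache",
        "CACHE_DEFAULT_TIMEOUT": timeout_seconds,
        "CACHE_KEY_PREFIX": prefix,
        "CACHE_REDIS_URL": REDIS_URL,
    }


# Cache for Superset metadata (illustrative 5-minute timeout).
CACHE_CONFIG = redis_cache("superset_metadata_", 300)
# Cache for chart query results (illustrative 24-hour timeout).
DATA_CACHE_CONFIG = redis_cache("superset_data_", 86400)
```

Distinct key prefixes keep the two caches from colliding when they share one Redis database.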

Worker and Beat

This is one or more workers that execute tasks such as running async queries or taking snapshots of reports and sending emails, and a “beat” that acts as the scheduler and tells workers when to perform their tasks. Most installations use Celery for these components.
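
A sketch of how the worker and beat are typically wired up in `superset_config.py`, assuming Redis as the broker at a made-up address. `CELERY_CONFIG` is the Superset setting; the schedule here uses a plain `timedelta`, which Celery accepts, instead of the `crontab` objects used in the Superset docs, so the snippet needs no Celery import.

```python
from datetime import timedelta

# Hypothetical Redis location used as both broker and result backend.
REDIS_URL = "redis://localhost:6379/0"


class CeleryConfig:
    broker_url = REDIS_URL
    result_backend = REDIS_URL
    # Modules the worker imports so that Superset's tasks are registered.
    imports = ("superset.sql_lab", "superset.tasks.scheduler")
    # The beat process reads this schedule and enqueues tasks for workers.
    beat_schedule = {
        "reports.scheduler": {
            "task": "reports.scheduler",
            "schedule": timedelta(minutes=1),  # check for due reports every minute
        },
    }


CELERY_CONFIG = CeleryConfig
```

The beat only enqueues tasks; at least one worker has to be running for anything to execute.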

Other components

Other components can be incorporated into Superset. The best place to learn about additional configurations is the Configuration page. For instance, you could set up a load balancer or reverse proxy to implement HTTPS in front of your Superset application, or specify a Mapbox URL to enable geospatial charts, etc.

Superset won’t even start without certain configuration settings established, so it’s essential to review that page.

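Properties

It provides data filtering, sorting, and aggregation, so users can quickly draw insights from large volumes of data. Dashboards can be customized and shared with others, which makes it a useful tool for data analysts, business intelligence teams, and decision makers.

Superset is highly extensible: developers can add custom plugins and integrations to meet specific business needs. With an active community and ongoing development, it continues to evolve and improve.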

A no-code interface for building charts quickly
A powerful, web-based SQL Editor for advanced querying
A lightweight semantic layer for quickly defining custom dimensions and metrics
Out-of-the-box support for nearly any SQL database or data engine
A wide array of beautiful visualizations to showcase your data, ranging from simple bar charts to geospatial visualizations
Lightweight, configurable caching layer to help ease database load
Highly extensible security roles and authentication options
An API for programmatic customization
A cloud-native architecture designed from the ground up for scale

Recording and Collecting Data

Can Superset record data directly? (answer from Doubao)

Note that Superset itself is not designed to record data directly.

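Nature of the data sources: Superset is primarily a data visualization and analysis tool. It focuses on querying, visualizing, and analyzing data that already exists. Its data usually comes from external sources such as relational databases (MySQL, PostgreSQL, etc.); non-relational stores (MongoDB, etc.) or cloud storage (Amazon S3, etc.) generally need a SQL-capable query engine in front of them. Superset is designed to connect to these sources and fetch data with SQL, not to record new data directly.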

Possible ways to record data indirectly:

Recording indirectly through the data source's interface:

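To record data in a Superset environment, use the interface of the data source it is connected to. For example, if Superset is connected to a MySQL database, external code (such as a Python script) can use a MySQL client library (such as pymysql) to insert new records. The new data can then be visualized and analyzed in Superset by refreshing or re-running the query.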
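
The idea can be sketched as follows. To keep the example runnable without a server, it uses the standard-library `sqlite3` module as a stand-in for pymysql; both follow the Python DB-API, though pymysql uses `%s` placeholders where sqlite3 uses `?`. The `events` table and `record_event` helper are hypothetical, and a real script would connect to the database Superset is attached to, not an in-memory one.

```python
import sqlite3


def record_event(conn, name, value):
    """Insert one row into a hypothetical `events` table."""
    # Using the connection as a context manager commits on success
    # and rolls back if the insert raises.
    with conn:
        conn.execute(
            "INSERT INTO events (name, value) VALUES (?, ?)", (name, value)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT NOT NULL, value REAL NOT NULL)")

record_event(conn, "signup", 1.0)
record_event(conn, "purchase", 19.9)

# This is the kind of query Superset would issue when a chart is refreshed.
rows = conn.execute("SELECT name, value FROM events ORDER BY rowid").fetchall()
```

Passing values as parameters instead of formatting them into the SQL string avoids SQL injection.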

Recording data together with an ETL tool (data pipeline approach):

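Superset can be combined with an ETL (Extract, Transform, Load) tool. For example, with Apache Airflow as the ETL tool, the pipeline first extracts data from the various sources, then applies the necessary transformations (data cleaning, format adjustments, and so on), and finally loads the data into the data source Superset is connected to. This is how data gets recorded and updated. In Airflow, these steps can be defined as a DAG (Directed Acyclic Graph).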
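
The three stages can be sketched in plain Python. The Airflow wiring is deliberately left out: in a real pipeline each function would become one task in a DAG. The sample CSV, the `weather` table, and the in-memory SQLite target are made up for illustration.

```python
import csv
import io
import sqlite3

# Made-up raw input with stray whitespace and a missing reading.
RAW_CSV = "city,temp_c\nBerlin, 21.5\nParis,\nRome,28\n"


def extract(text):
    """Extract: parse the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, drop rows without a reading, cast to float."""
    cleaned = []
    for row in rows:
        temp = row["temp_c"].strip()
        if temp:
            cleaned.append((row["city"].strip(), float(temp)))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the table that Superset would query."""
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)")
        conn.executemany("INSERT INTO weather VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT city, temp_c FROM weather ORDER BY city").fetchall()
```

Keeping each stage a separate function makes it easy to retry or test one stage on its own, which is also how a scheduler would run them.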

Build

Deployment (Formation) | Operations (Governance)

Choosing a version

https://github.com/apache/superset/releases

Deployment method

We deploy with the Helm Chart

with Helm on Kubernetes

https://superset.apache.org/docs/installation/kubernetes

We use the Redis / PostgreSQL instances bundled with the chart

helm repo add superset https://apache.github.io/superset
helm search repo superset
helm pull superset/superset --version x.x.x

helm show values superset/superset > superset.helm-values.yaml

vim superset.helm-values.yaml
# PostgreSQL: change the storage StorageClass; adjust the volume size via size;
# PostgreSQL: set the database password;
# Redis: change the storage StorageClass; adjust the volume size via size;
# SECRET_KEY: see https://superset.apache.org/docs/installation/kubernetes#security-settings
# Ingress configuration: ...
# Superset configuration: one place, the Superset connection

# Use the values file edited above (superset.helm-values.yaml)
helm upgrade --install --namespace superset-ha --create-namespace \
    superset-ha ./superset-0.12.11.tgz -f superset.helm-values.yaml
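
For the SECRET_KEY step above, a random value can be generated with a short script like the one below. It is a sketch roughly equivalent to running `openssl rand -base64 42`; `generate_secret_key` is a hypothetical helper, and where the value goes in the values file is covered by the security settings page linked above.

```python
import base64
import secrets


def generate_secret_key(num_bytes=42):
    """Return a base64-encoded random key from the OS's secure random source."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


SECRET_KEY = generate_secret_key()
```

Generate the key once and store it safely: changing it later affects anything Superset has already encrypted with the old key.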

Usage

Creating Your First Dashboard
https://superset.apache.org/docs/using-superset/creating-your-first-dashboard

Exploring Data in Superset
https://superset.apache.org/docs/using-superset/exploring-data

Preset.io maintains an updated set of end-user documentation at docs.preset.io.
https://docs.preset.io/

Improvement

https://superset.apache.org/docs/intro#get-involved

4.1 Undertakings and Revisions

Issue Code Reference | Common Errors
https://superset.apache.org/docs/using-superset/issue-codes