? ???? ???? ??? ??? ?? ???? ?????:
- ???? ??? ??: ???? ??? ??: RAPIDS cuDF? ??? ? ?? ??? ????? RAPIDS cuDF? ??? ??? ??? ??? ???? ??? ?????.
- ? ?????? pandas ?????? Python?? ????? ???? ??? ??? ???? ??? ?? ?????.
?? ???, ??, ??, ???? ??? ???? ?? ?? ?? ??? ???? ???? ????. IDC? ??? 2025??? 2020?? 64ZB? ?? 180ZB? ???? ??? ???, ? ?? ???? ????? ???? ?? ??? ??? ???? ??? ????.
- ??? ? ?? ???? ?? ???? ???, ??, ?? ??? ???? ??? ???? ?? ??? ???? ? ??? ?? ????.
- ?? ???? ? ??? ???? ??? ? ???? ? ???? ? ?? ????? ???? ??? ??? ?? ?????.
- ??? ?? ? ???? ???? ????? ??? ?? ? ?? ???? ???? ????.
NVIDIA? ??? ??????? ????? ??? ???? ? ?? ?????? ???? GPU?? ??? ? ??? ?? ?? ????? ????? ? API? RAPIDS ???? ?????. ???? ?? ? ??? ????? ?? ???? ??? ?? ??? DataFrame API? ?????: RAPIDS cuDF.
???? ??? ?? ??????? ?? 40??? ??? ???? ???? ??? ??? ??? ???? ?? ?? ??? ?? ??? ?? ? ?? ?? ??? ??? ?????.
?? ??? ??? ??? ???? ??, ? ?????? RAPIDS cuDF? ??? ??? ?? ??? ??(EDA) ???? ???????.
?? Spark 3.0? ??? ??? ??? ??? ???? ????, Spark? RAPIDS ???? ?????. ?? RAPIDS? ??? ??? ????, RAPIDS? ?? ?? ??????? ???? ?? ??? ???? ?????.
? EDA? RAPIDS cuDF???
??? ???? ??? ??????? EDA ??? ?? ??? ?? ? ??? ?? Python? ???? ??? ?? ?? ?? ?? ????? ?????? pandas? ?????. ???? ???? ??? ??? ??? ??? ???? ???? ??? ???? ?? ????? ??? ? ??? ??????, ?? EDA? ?? ??? ?????? ?????.
??? pandas? ?? ???? ????? ???? ??? ??? 1~2GB? ???? ??? ???? ???? ?? ???? ??? ????. ??? ??? 10~20GB ??? ???? ??, Dask? Apache Spark? ?? ?? ??? ??? ???? ???. ??? ??? ?? ???? ??? ??? ??? ? ? ??? ????.
2~10GB? ?? ???? RAPIDS cuDF? ??? Goldilocks ??????.
RAPIDS cuDF? GPU? ?? ??? ???? ?????, ??? ??? API? ????? ????. pandas?? ?? ?? ?? ???? ?? ??? ??? cuDF ??? ?? cuDF? pandas?? ????? ? ????. RAPIDS? ?? pandas ????? ?? ?? ??? ????? ???? ???.
RAPIDS cuDF? ??? EDA
? ?????? EDA ?? ??? ?? ? cuDF? ??? ?? ??? ? ??? ?????.
? ????? ??? ??? ?? pandas? ??? ?? ???? cuDF? ??? ? ?? ??? 15? ???? ?? ??????. ? ??? ??? ?? ?????? ??? ? ??? ???? ? ??? ? ? ????.
??? ?? ??? ?? ??? ??? RAPIDS GitHub ?????? ?? ?? ???, cuDF? ??? ??? ??? ??? ?????. ?? ?????? ???? ??? ??? 15?? ?????.
??? ??
Meteonet? 2016??? 2018??? ?? ??? ??? ?? ???? ???? ??? ?? ??? ?????. ????? ???? ?? ???? ?? ???? ??? ?????.
?? ?? ??
? ?????? ??? ??????? ? ?? ???? ?? ? ??? ????? ??? ?????. ? ?? ??? ???, ?? ?? ?? ?? ?? ?? ??? ?? ??? ???? ? ??? ? ?? ??????.
?? ??? ?? ???? ???????? ??? ??????:
- ?? ????.
- ??? ??? ?? ??.
- ?? ?? ?? ??.
1??. ?? ????
?? cuDF ?????? ???? ??? ??? ??????. ??? ???? 2016?, 2017?, 2018? ???? ??? NW.csv? ?????? ???? cuDF? ??? ??? ??? ?? ???? ?????.
# Import cuDF and CuPy
import cudf
import cupy as cp
# Read the data file into the DataFrame used throughout the rest of the walkthrough
GS_cudf = cudf.read_csv('./NW.csv')
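The later steps call `.dt` on the `date` column, and the `.info()` output lists it as `datetime64[ns]`, so the column has to be parsed as a datetime at some point. The following is a minimal sketch of one way to do that at read time. It uses pandas so that it runs without a GPU; the two-row sample and its timestamp format are hypothetical, and the assumption is that `cudf.read_csv` accepts the same `parse_dates` argument.

```python
import io

import pandas as pd  # assumption: pandas stands in for cudf so the sketch runs on CPU

# Hypothetical two-row sample that reuses a few of the dataset's column names.
sample_csv = io.StringIO(
    "number_sta,lat,lon,height_sta,date,t\n"
    "14066001,49.33,-0.43,2.0,2016-01-01 00:00:00,279.85\n"
    "14066001,49.33,-0.43,2.0,2016-01-01 00:06:00,279.75\n"
)

# parse_dates converts the named column to a datetime dtype while reading.
df = pd.read_csv(sample_csv, parse_dates=["date"])

# With a datetime column, the .dt accessor used in later steps is available.
years = df["date"].dt.year
print(df.dtypes)
```

If the real file stores timestamps in a format the parser does not infer, converting the column explicitly after loading is the usual fallback.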
??? ??? ???? ?? ?? ??? ?????. ?? ??????? ??? ???? ??? ???? ? ??? ???. ????? pandas ??? ???? ????.
pandas? ??? .info ??? ???? ?? ??? ??? ??? ?????:
GS_cudf.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 22034571 entries, 0 to 22034570
Data columns (total 12 columns):
# Column Dtype
--- ------ -----
0 number_sta int64
1 lat float64
2 lon float64
3 height_sta float64
4 date datetime64[ns]
5 dd float64
6 ff float64
7 precip float64
8 hu float64
9 td float64
10 t float64
11 psl object
dtypes: datetime64[ns](1), float64(9), int64(1), object(1)
memory usage: 6.5+ GB
????? 12?? ??? ?????, ??? ???? ? ? ????.
???? ?? ? ??? ????? ??? ???? ???? ??(?? ?)? ?????.
# Checking the DataFrame dimensions. Millions of rows by 12 columns.
GS_cudf.shape
(65826837, 12)
? ? ??? ?? ??? ? ? ??? ? ??? ??? ??? ? ? ????. 65,826,837?? ?? ???? ?? ??? ??? ? ?? ???? ? ????.
??, ??? ???? ?? ?? ?? ???? ??? ??? ?? ??? ?????:
# Display the first five rows of the DataFrame to examine details
GS_cudf.head()
 | number_sta | lat | lon | height_sta | date | dd | ff | precip | hu | td | t | psl |
0 | 14066001 | 49.33 | -0.43 | 2.0 | 2016-01-01 | 210.0 | 4.4 | 0.0 | 91.0 | 278.45 | 279.85 | <NA> |
1 | 14126001 | 49.15 | 0.04 | 125.0 | 2016-01-01 | <NA> | <NA> | 0.0 | 99.0 | 278.35 | 278.45 | <NA> |
2 | 14137001 | 49.18 | -0.46 | 67.0 | 2016-01-01 | 220.0 | 0.6 | 0.0 | 92.0 | 276.45 | 277.65 | 102360.0 |
3 | 14216001 | 48.93 | -0.15 | 155.0 | 2016-01-01 | 220.0 | 1.9 | 0.0 | 95.0 | 278.25 | 278.95 | <NA> |
4 | 14296001 | 48.80 | -1.03 | 339.0 | 2016-01-01 | <NA> | <NA> | 0.0 | <NA> | <NA> | 278.35 | <NA> |
? 1. ?? ??
?? ? ?? ??? ??? ??? ? ????. ??? ??? ? ????? ? ?? ???? ???? ????? ?? ?????:
# How many weather stations are covered in this dataset?
# Call nunique() to count the distinct elements along a specified axis.
number_stations = GS_cudf['number_sta'].nunique()
print("The full dataset is composed of {} unique weather stations.".format(number_stations))
?? ??? ??? 287?? ??? ?? ???? ???? ????. ? ?? ???? ??? ?? ???? ????????
## Investigate the frequency of one specific station's data
## date column is datetime dtype, and the diff() function calculates the delta time
## TimedeltaProperties.seconds can help get the delta seconds between each record, divide by 60 seconds to see the minutes difference.
delta_mins = GS_cudf['date'].diff().dt.seconds.max()/60
print(f"The data is recorded every {delta_mins} minutes")
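The comments above describe checking one station's recording frequency, while `diff()` on the whole column compares consecutive rows regardless of station. A hedged sketch of a per-station variant follows. It runs on pandas with a hypothetical six-row sample; the assumption is that the same `groupby(...).diff()` call carries over to cuDF. It also uses `total_seconds()` rather than `.seconds`, because `.seconds` returns only the seconds component of a timedelta and ignores whole days.

```python
import pandas as pd  # assumption: pandas stands in for cudf so the sketch runs on CPU

# Hypothetical records for two stations; station 2 skips one reading.
records = pd.DataFrame({
    "number_sta": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 00:06", "2016-01-01 00:12",
        "2016-01-01 00:00", "2016-01-01 00:06", "2016-01-01 00:18",
    ]),
})

# Sort so that each station's readings are in time order before differencing.
records = records.sort_values(["number_sta", "date"])

# Gap between consecutive readings of the same station, in minutes.
gaps_min = records.groupby("number_sta")["date"].diff().dt.total_seconds() / 60

# The most frequent gap is a reasonable estimate of the nominal interval;
# the maximum gap would instead report the longest outage.
typical_interval = gaps_min.mode().iloc[0]
print(f"Typical recording interval: {typical_interval} minutes")
```

Taking the mode rather than the maximum is a design choice: a single long outage would otherwise dominate the result.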
???? 6.0??? ?????. ?? ???? ? ??? 10?? ??? ?????.
?? ??? ?? ??? ??? ??????:
- ??? ??
- ??? ??? ??
- ??? ??? ???? ??? ?
- ??? ?? ???? ??
??? ? ???? ????? ??? ??? ???? ?? ???? ? ??? ??? ???? ???. ??? ??? ? ???? ? ??? ??? ? ?? ??? ??? ? ??? ??? ??? ????.
2??. ?? ??
??? ?? ????? ?? ?? ??? ????. ??? ??? ?? ??? ? ? ?? ??? ? ?? ???? ??? ? ????. ?? ??? ???? ?????, ???? ?? ?? ???? ???? ? ? ?? ?? ??? ?? ??? ? ????.
? ??? ??? ???? ????? ??? ???? ??? ???? ?? ??? ?? ?????.
??? ???? ??? ????? ?? ??? ?? ?? ??? ?????.
??? ???? 271?? ??? ????? ???? ???, ??? 10?? ???? ?????(6??? ???). ??? ????? ??? ??? ? ???? ??? ??? ?? ?? 271 x 10 x 24 x 365 = 23,739,600????. ??? ?? .shape ???? ??? ??? ??? ?? 22,034,571?? ?????.
# Theoretical number of records is...
theoretical_nb_records = number_stations * (60 / delta_mins) * 365 * 24
actual_nb_of_rows = GS_cudf.shape[0]
missing_record_ratio = 1 - (actual_nb_of_rows/theoretical_nb_records)
print("Percentage of missing records of the NW dataset is: {:.1f}%".format(missing_record_ratio * 100))
print("Theoretical total number of values in dataset is: {:d}".format(int(theoretical_nb_records)))
Percentage of missing records of the NW dataset is: 12.7%
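The arithmetic behind the theoretical record count can be written as two small helpers. This is only a sketch: the function names are hypothetical, and the toy figures (2 stations, 30-minute interval, 1 day, 84 rows) are made up for illustration. The final call reuses the station count and 6-minute interval quoted in the text.

```python
def expected_records(n_stations, interval_min, days=365):
    """Number of rows expected if every station reported at every interval."""
    readings_per_day = (24 * 60) // interval_min
    return n_stations * readings_per_day * days


def missing_ratio(actual_rows, expected_rows):
    """Fraction of expected rows that are absent from the dataset."""
    return 1 - actual_rows / expected_rows


# Toy example: 2 stations, one reading every 30 minutes, for a single day.
toy_expected = expected_records(2, 30, days=1)
toy_missing = missing_ratio(84, toy_expected)
print(f"{toy_expected} expected rows, {toy_missing:.1%} missing")

# Figures quoted in the text: 271 stations, one reading every 6 minutes, one year.
year_expected = expected_records(271, 6)
print(year_expected)
```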
1? ??? ????? 12.7%? ?? ? 19.8?? ???? ??? ?? ?????.
36? ? ? 5??? ???? ??? ???? ???? ?? ? ? ????. ?? ??? ??? ?? ?? ??? ?? ??? ??? ?? ??? ?? ??? ?? ??? ???? ????. ??? ??? ?? ?? ?? ?? ? ??? ? ? ???, ???? NA? ???? ????.
??? ?? NA ???? ??? ????? ?? ?? ??? NA ???? ??? ?????. ? ??? ??? ?? ???? ?? ???? ??? ?????. ??? ? ????? 2018? ???? ????? ??? ??? ????.
# Finding which items have NA value(s) during year 2018
NA_sum = GS_cudf[GS_cudf['date'].dt.year==2018].isna().sum()
NA_data = NA_sum[NA_sum>0]
NA_data.index
StringIndex(['dd' 'ff' 'precip' 'hu' 'td' 't' 'psl'], dtype='object')
NA_data
dd 8605703
ff 8598613
precip 1279127
hu 8783452
td 8786154
t 2893694
psl 17621180
dtype: int64
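Raw NA counts are easier to compare once they are expressed as a share of the rows. A minimal sketch, using pandas and a hypothetical four-row sample; the assumption is that `isna()` and `mean()` behave the same way on a cuDF DataFrame.

```python
import pandas as pd  # assumption: pandas stands in for cudf so the sketch runs on CPU

# Hypothetical sample reusing three of the dataset's column names.
toy = pd.DataFrame({
    "t": [279.85, None, 277.65, 278.95],
    "psl": [None, None, 102360.0, None],
    "precip": [0.0, 0.0, 0.0, 0.0],
})

# The mean of a boolean mask is the fraction of True values, i.e. the NA share.
na_pct = toy.isna().mean() * 100

# Keep only the columns that have missing values, worst first.
na_pct = na_pct[na_pct > 0].sort_values(ascending=False)
print(na_pct)
```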
PSL(??? ??)? ??? ??? ?? ??? ?? ? ? ????. ?? ???? ? 80%? ???? ????. ???? ???? ?? ???? ? 6%? ?? ???, ?? ??? ????? ?? ?????.
? ? ?? ??? ??? ??? ?? ??? ???, ?? ?? ?? ?? ??? ???? ??? ???? ? ??? ???. ?? ??? ???? ??? ????? ?? ???? ??? ???? ?????.
?? ???? ??? ??? ? ?????? ?? ??? ??? ? ? ?? ?? ?? ??? ??? ? ????.
3??. ?? ?? ?? ??
?? ML ??????? ?? ???? ?? ?? ??????? ?????. ?? ?? ??? ????? ?? ??? ??? ??? ? ???? ???? ???.
? ???? ?? ????? ??? ???? ?? ?? ?????. ?? 3? ??? ??? ?? ???? ????? ???? ???? ? ???? ?????.
# Only analyze meteorological columns
Meteo_series = ['dd', 'ff', 'precip' ,'hu', 'td', 't', 'psl']
Meteo_df = GS_cudf[Meteo_series]
Meteo_corr = Meteo_df.dropna().corr()
# Check the items with correlation value > 0.7
Meteo_corr[Meteo_corr>0.7]
 | dd | ff | precip | hu | td | t | psl |
dd | 1.0 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
ff | <NA> | 1.0 | <NA> | <NA> | <NA> | <NA> | <NA> |
precip | <NA> | <NA> | 1.0 | <NA> | <NA> | <NA> | <NA> |
hu | <NA> | <NA> | <NA> | 1.0 | <NA> | <NA> | <NA> |
td | <NA> | <NA> | <NA> | <NA> | 1.0 | 0.840558357 | <NA> |
t | <NA> | <NA> | <NA> | <NA> | 0.840558357 | 1.0 | <NA> |
psl | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 1.0 |
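A masked correlation matrix still shows every pair twice, plus the diagonal. The sketch below lists each strongly correlated pair once by keeping only the upper triangle. It uses pandas and NumPy on synthetic data (the columns named `t`, `td` and `ff` are random numbers, not the real measurements), and whether the NumPy mask works unchanged on a cuDF DataFrame is an assumption that would need checking.

```python
import numpy as np
import pandas as pd  # assumption: pandas stands in for cudf so the sketch runs on CPU

# Synthetic data: "td" is built to track "t" closely, "ff" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "t": base,
    "td": base + rng.normal(scale=0.1, size=200),
    "ff": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the cells strictly above the diagonal so each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Pairs whose correlation exceeds the same 0.7 threshold used above.
strong = pairs[pairs > 0.7]
print(strong)
```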
???? td(???)? t(??) ???? ???? ????? ????. ?? ??? ?? ??? ?????? ???? ?? ????? ???? ?? ? ?? ?????. ???? ?? ??? ???? ???? ??? ??? ??? ???? ?? ???? ??? ??? ? ????.
???? ??? ?? ??? ?? ???? ??? ??? ???? ?? ???? ??? ??????. ? ?? EDA ??? ???, ? ???? ??? ??? ??? ???????? ??? ?? ?? ?? ???? ????? ??? ???? ?????. ????? ?? ????, cuDF? pandas ??? ?? ?????.
EDA? ?? RAPIDS cuDF ??????
?? ?????, ? ???? ?? ???? ?? ?????? ?? ??? ?????. ? ?????? 9.55?? ?? ??? ??????. ??? ??? NVIDIA A6000 GPU?? ???????.

? 3? ? ???? ?? ???? ??? ??? ?? ??????.
Full Notebook (15x speed-up) | Pandas on CPU (Intel Core i7-7800X CPU) | user: 1 min 15 sec, sys: 14.3 sec, total: 1 min 30 sec |
Full Notebook (15x speed-up) | RAPIDS cuDF on NVIDIA A6000 | user: 3.92 sec, sys: 2.03 sec, total: 5.95 sec |
?? ??
??? ??? ?? ????? ????? ???? ?? ????? ?? ???? ?? ??, ??? ??, ????? ????.
??? 15?? ?? ???? ? ?? ??? ????? ?? ??? ??? ??? ? ????. EDA ??? ???? ? 1??? ??? ??, 4? ?? ??? ? ????. ??? ???? ??? ?? ??? ????, ??? ??? ????, ? ??? ??? 1?? ???? ???? ??? ?????, ?? ??? ?? ??? ??? ?????? ? ?? 56?? ????. ????? ??? ?? ??? ? ????.
??? ??? ???? cuDF? ?? ? ??? ????? ??? ??? cuDF? ??? ??? ??? ??? ??? ???. NVIDIA GTC 2023? ??? ?? ??? ???? ??? ??????.
??? ???? cuDF? ???? ??? ??? ???? ??? ??: RAPIDS cuDF? ? ??? ??? ????? ?????.
??? ?
Meiran Peng, David Taubenheim, Sheng Luo, Jay Rodge? ? ???? ??? ?????.
? ???? ??? SDK? ???? ?? ???, ?? ???, ?? ??, ??, ?? ??, ???? NVIDIA ??? ???? ??? ??? ??? ??? ? ????. ?? ??? ???? NVIDIA? ?? ????? ???? ? ??? ??? ??? ?????? ???? ??? ??? ???.