? ???? ???? ??? ??? ?? ???? ?????:
- ???? ??? ??: ???? ??? ??: RAPIDS cuDF? ??? ?? ?? ?????? pandas ?????? ??? Python?? ????? ???? ??? ??? ?????? ?? ?????.
- ? ?????? RAPIDS cuDF? ??? ??? ??? ??? ???? ??? ?????.
????? ?? ??? ???? ??? ?? ??? ??? ??(EDA) ???????? pandas? ??? ?????? ?? ???? ??? ?? ?????? RAPIDS cuDF? ??? ?? ???? ??? ?? ? ????. ??? ???? ?????? ??? ???? ??? ?? ??? ??? ??? ?? ??? ??? ?? ???, RAPIDS? ???? ? ?? ??? ?? ?????.
RAPIDS cuDF? ???? ?? ??? ??? ?? ‘????’ ??? ??? ?? ??? ?? ??? ?? ? ????. ??? ??? ??? ???? ?????? Apache Spark? Dask? ?? ??? ?? ??? ??? ???? ????.
??? ???? ??????
? ????? ??? ???? ???? ?? ??(ML) ?? ??? ???? ??? ??? ???? ?? ??? ?? ?????.
??? ???? ???? ????. ?????? ?? ??, ?? ?? ??, ?? ?? ?? ? ??? ??? ??? ???? ??? ?????.
?????? ??? ??? ????? ? ??? ??? ? ?? ??? ???? ?????. ????? ???? ??? ???? ???? ??? ???? ?? ?? ??? ????? ??? ??? ? ?? ???.
??? ?? ???? ML ?? ??? ??? ???? ?? ????, ?? ?? ?? ??? ???????:
- ?? ??? ??? ?? ?? ?? ??
- ?? ??? ?? ??
- ?? ??? ?? ?? ??
- ??? ??? ?? ?? ???
??? ??? ?? ???? ???? ?? ?? ?? ???? ?? ???? ??? ??? ??? ??? ???? ? ?? ??? ??? ???? ???? ???? ?? ??? ????. ??? ???? ??? ?? ?? ??? ??? ??? ?? ???? ???? ??????, ??? ???? ???? ?? ???? ????? ?? ??? ????.
pandas? ??? ??? ??? ? ?? ???? ???? ??? ??? ?????, ?? ??? ??? ??? ????? ?? ?? ???? ? ??? ?? ?? ????? ??? ????. ?? ?? ??? ?? ?? ??? ?? ?? ??? ?? ??? ??? ??? ??? ???? ?? ??? ??? ? ????. ??? ???? ??? ??? ???? ??? ???? ??? ? ??? ???? ??? ??? ? ????.
??? ??? ????? ??? ??? ??? cuDF? ?? ???? ????. ??? ??? API? ???? ?? ?????? ???? ?? 40? ?? ??? ??? ? ???? ?? ??? ?????? ?? ??? ??? ??? ??? ? ????.
RAPIDS cuDF? ??? ???
? ?????? RAPIDS cuDF? ???? ??? ? ???? ??? ?? ?? ??? ? ?? ??? ????? ?? ??? ??? ???? ??? ?? ??? ?? ??? ?????. ? ???? ????? ?? ??? ?? ?? ?? ??? ??? ?? ??? ??? ????, RAPIDS GitHub ??????? ??? ? ????.
?? ???? RAPIDS cuDF? 13?? ??? ???????(??? ??? ? ???? ???? ?? ???? ?? ??). ????? ?? ?????? ????? ??? ?????.
?? ??? ???? ??? ?? ??? ???? ??? ????. 1??? ??? ??? 5? ??? ??? ? ??? ??? ??? ?? ?? ?? ? ????.
??? ??
Meteonet? 2016??? 2018??? ?? ??? ??? ?? ???? ???? ???? ?? ?? ??? ????, ????? ???? ?? ???? ???? ????. ??? ? 12.5GB???.
?? ?? ??
? ?????? ? ?? ???? ?? ?? ??? ?? ??? ?? ???? ?? ??? ????? ??? ?????. ???? ?? ??? ??, ??? ?? ?? ??? ??? ? ? ?? ??????.
? ???? ????? ???? ??? ??? ??? ???? ??????? ??? ????. ? ??? ?? ??? ????? ????? ????:
- ??? ???? ??? ?????.
- ???? ?? ??????.
- ?? ??? ??? ?????.
? ?????? ????? ??? ????? ?????, cuDF? ??? ??? ??? ???? ??? ? ?? ??? ???? ?????.
1??. ??? ??? ?? ??
?? ?? ??? ???? ? ??? ??? ???? ?????:
# Import the necessary packages
import cudf
import cupy as cp
import pandas as pd
???? CSV ???? ?????.
## Read in data
gdf = cudf.read_csv('./SE_data.csv')
?? ?? ?? ????? ??, ??, ??? ??? ??? ??? ?????.?
gdf = gdf.drop(columns=['dd','precip','td','psl'])
?? ?? ????? ??? ? ??? ?? ??? ?????. ?? ?? ??/?? ??? ???? ???? ? ?? ??? ?????. ?? ?? ?? 5?? ?? ???? ?? ?? ??? ????? ? ?? ??? ??? ??? ?????.
# Change the date column to the datetime data type. Look at the DataFrame info
gdf['date'] = cudf.to_datetime(gdf['date'])
gdf.head()
gdf.shape
 | number_sta | lat | lon | height_sta | date | ff | hu | t |
0 | 1027003 | 45.83 | 5.11 | 196.0 | 2016-01-01 | <NA> | 98.0 | 279.05 |
1 | 1033002 | 46.09 | 5.81 | 350.0 | 2016-01-01 | 0.0 | 99.0 | 278.35 |
2 | 1034004 | 45.77 | 5.69 | 330.0 | 2016-01-01 | 0.0 | 100.0 | 279.15 |
3 | 1072001 | 46.20 | 5.29 | 260.0 | 2016-01-01 | <NA> | <NA> | 276.55 |
4 | 1089001 | 45.98 | 5.33 | 252.0 | 2016-01-01 | 0.0 | 95.0 | 279.55 |
??
??? ??? ??(127515796, 8)? 127,515,796?? ?? 8?? ?? ?????. ?? ??? ??? ??? ??? ????? ???? ??? ??? ???? ?? ? ? ???? ??? ??? ? ????.
## Investigate the sampling frequency: diff() gives the time difference between
## consecutive rows, dt.seconds extracts the seconds component of each difference,
## and max() returns the largest gap, which is divided by 60 to get minutes.
delta_mins = gdf['date'].diff().dt.seconds.max()/60
print(f"The dataset collection covers from {gdf['date'].min()} to {gdf['date'].max()} with {delta_mins} minute sampling interval")
The dataset collection covers from 2016-01-01T00:00:00.000000000 to 2018-12-31T23:54:00.000000000 with 6.0 minute sampling interval
?? ??? ??? ??? ??? ????? ?????.
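The sampling-interval check can be tried on a small scale. The sketch below uses pandas rather than cuDF so that it runs without a GPU, and the six timestamps are hypothetical, not rows from the Meteonet data. It assumes, as in the table printed earlier, that several stations report at the same timestamp and that rows are sorted by time.

```python
import pandas as pd

# Hypothetical toy series: two stations reporting every 6 minutes,
# sorted by time (not taken from the Meteonet dataset).
dates = pd.Series(pd.to_datetime([
    "2016-01-01 00:00", "2016-01-01 00:00",
    "2016-01-01 00:06", "2016-01-01 00:06",
    "2016-01-01 00:12", "2016-01-01 00:12",
]))

# diff() gives the gap to the previous row (NaT for the first row);
# dt.seconds extracts the seconds component of each gap.
gaps_seconds = dates.diff().dt.seconds

# The largest gap between consecutive rows, converted to minutes.
delta_mins = gaps_seconds.max() / 60
print(delta_mins)
```

Note that `dt.seconds` only returns the seconds component of a timedelta and ignores whole days, so this check is meaningful only when no gap exceeds one day; `dt.total_seconds()` would be the safer choice for irregular data.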
??? ??? ?? ???? ??? ???? ???? ?? ??? ?????. ?? ??? ??? ?? ???? ??? ?????.
gdf['year'] = gdf['date'].dt.year
gdf['month'] = gdf['date'].dt.month
gdf['day'] = gdf['date'].dt.day
gdf['hour'] = gdf['date'].dt.hour
gdf['mins'] = gdf['date'].dt.minute
gdf.tail()
?? ???? ???? ??, ?, ?? ?? ?????. ??? ???? ?? ?? ??? ?? ? ???? ??? ? ????.
 | number_sta | lat | lon | height_sta | date | ff | hu | t | year | month | day | hour | mins |
127515791 | 84086001 | 43.811 | 5.146 | 672.0 | 2018-12-31 23:54:00 | 3.7 | 85.0 | 276.95 | 2018 | 12 | 31 | 23 | 54 |
127515792 | 84087001 | 44.145 | 4.861 | 55.0 | 2018-12-31 23:54:00 | 11.4 | 80.0 | 281.05 | 2018 | 12 | 31 | 23 | 54 |
127515793 | 84094001 | 44.289 | 5.131 | 392.0 | 2018-12-31 23:54:00 | 3.6 | 68.0 | 280.05 | 2018 | 12 | 31 | 23 | 54 |
127515794 | 84107002 | 44.041 | 5.493 | 836.0 | 2018-12-31 23:54:00 | 0.6 | 91.0 | 270.85 | 2018 | 12 | 31 | 23 | 54 |
127515795 | 84150001 | 44.337 | 4.905 | 141.0 | 2018-12-31 23:54:00 | 6.7 | 84.0 | 280.45 | 2018 | 12 | 31 | 23 | 54 |
??? ?? ?? ??? ????? ???? ????? ??????? ??? ???.
# Use the cupy.logical_and(...) function to select the data from a specific time range.
import pandas as pd
start_time = pd.Timestamp('2017-02-01T00')
end_time = pd.Timestamp('2018-11-01T00')
station_id = 84086001
in_period = cp.logical_and(gdf['date'] > start_time, gdf['date'] < end_time)
at_station = gdf['number_sta'] == station_id
gdf_period = gdf.loc[cp.logical_and(in_period, at_station)]
gdf_period.shape
(146039, 13)
13?? ??? 146,039?? ?? ??? ??? ???? ????? ???????.
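The selection above combines two time bounds and a station match into one boolean mask. A minimal sketch of the same idea follows, written with pandas and the `&` operator instead of `cupy.logical_and` so that it runs on a CPU. The six-row frame is hypothetical; only the column names, the station ID and the two timestamps come from the code above.

```python
import pandas as pd

# Hypothetical miniature of the frame: two stations, three timestamps.
df = pd.DataFrame({
    "number_sta": [84086001, 84087001, 84086001, 84087001, 84086001, 84087001],
    "date": pd.to_datetime([
        "2017-01-31 23:54", "2017-01-31 23:54",
        "2017-02-01 00:06", "2017-02-01 00:06",
        "2018-11-01 00:00", "2018-11-01 00:00",
    ]),
    "t": [279.0, 280.0, 279.5, 280.5, 281.0, 282.0],
})

start_time = pd.Timestamp("2017-02-01T00")
end_time = pd.Timestamp("2018-11-01T00")
station_id = 84086001

# Both comparisons are strict, so rows exactly at start_time or end_time are excluded.
in_period = (df["date"] > start_time) & (df["date"] < end_time)
at_station = df["number_sta"] == station_id

df_period = df.loc[in_period & at_station]
print(df_period.shape)
```

Naming the two masks separately makes it easier to check each condition on its own before combining them.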
2??. ??? ????
?? ??????? ??????? ??? ???? ??? ?????. ???? 6??? ???????, ? ?? ???? ? ?? ??? ????? ???.
?? ??? ???? ???? ??? ??? ??? ?? ????? ???. 6??? ???? ??? ??? ???? ???? ??????? ?? ? ??? ?? ??? ???? ?????. ? ??? ???? ?? ??? ???? ?????.
## Set "date" as the index so the data can be resampled by time.
gdf_period.set_index("date", inplace=True)
## Now, resample by daylong intervals and check the max data during the resampled period.
## Use .reset_index() to reset the index instead of date.
gdf_day_max = gdf_period.resample('D').max().bfill().reset_index()
gdf_day_max.head()
?? ???? ? ??? ??? ? ????. ?? ???? ??? ??? ??? ???? ?????.
 | date | number_sta | lat | lon | height_sta | ff | hu | t | year | month | day | hour | mins |
0 | 2017-02-01 | 84086001 | 43.81 | 5.15 | 672.0 | 8.1 | 98.0 | 283.05 | 2017 | 2 | 1 | 23 | 54 |
1 | 2017-02-02 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 98.0 | 283.85 | 2017 | 2 | 2 | 23 | 54 |
2 | 2017-02-03 | 84086001 | 43.81 | 5.15 | 672.0 | 10.1 | 99.0 | 281.45 | 2017 | 2 | 3 | 23 | 54 |
3 | 2017-02-04 | 84086001 | 43.81 | 5.15 | 672.0 | 12.5 | 99.0 | 284.35 | 2017 | 2 | 4 | 23 | 54 |
4 | 2017-02-05 | 84086001 | 43.81 | 5.15 | 672.0 | 7.3 | 99.0 | 280.75 | 2017 | 2 | 5 | 23 | 54 |
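To see what the `resample('D').max().bfill()` chain does to a gap in the data, here is a small pandas sketch. The readings are hypothetical, with one day deliberately left without observations. pandas is used in place of cuDF only so the example runs without a GPU.

```python
import pandas as pd

# Hypothetical wind-speed readings; 2017-02-02 has no observations at all.
readings = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-02-01 00:00", "2017-02-01 06:00", "2017-02-01 12:00",
        "2017-02-03 00:00", "2017-02-03 06:00",
    ]),
    "ff": [3.0, 8.1, 5.2, 10.1, 4.4],
}).set_index("date")

# One row per calendar day holding that day's maximum; the empty day becomes NaN,
# and bfill() then copies the next available day's value backwards into it.
daily_max = readings.resample("D").max().bfill().reset_index()
print(daily_max)
```

Back-filling is one of several possible choices for empty bins; whether it is appropriate depends on how the downstream analysis should treat missing days.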
3??. ?? ??? ?? ??
?? ???? ????? ??? ???? ???? ???????. ??? ?? ???? ???? ??? ???? ???? ???? ?? ????.
?? ???? ???? ?? 3? ?? ???? ?????. ???? ? ??? ???? ?????.
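# Specify the rolling window. An offset window such as '3d' needs a datetime index,
# so set "date" as the index again before rolling.
gdf_3d_max = gdf_day_max.set_index('date').rolling('3d', min_periods=1).max()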
gdf_3d_max.reset_index(inplace=True)
gdf_3d_max.head()
?? ???? ???? ??? ???? ???? ?? ??? ?? ??? ???? ??? ? ????.
 | date | number_sta | lat | lon | height_sta | ff | hu | t | year | month | day | hour | mins |
0 | 2017-02-01 | 84086001 | 43.81 | 5.15 | 672.0 | 8.1 | 98.0 | 283.05 | 2017 | 2 | 1 | 23 | 54 |
1 | 2017-02-02 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 98.0 | 283.85 | 2017 | 2 | 2 | 23 | 54 |
2 | 2017-02-03 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 99.0 | 283.85 | 2017 | 2 | 3 | 23 | 54 |
3 | 2017-02-04 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 99.0 | 283.35 | 2017 | 2 | 4 | 23 | 54 |
4 | 2017-02-05 | 84086001 | 43.81 | 5.15 | 672.0 | 12.5 | 99.0 | 283.35 | 2017 | 2 | 5 | 23 | 54 |
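The effect of the three-day window can be reproduced on the `ff` column alone. The sketch below takes the five daily maxima of `ff` from the resampled table in Step 2 and applies the same rolling maximum in pandas (a CPU stand-in for cuDF). The offset is written as `'3D'`, which denotes the same three-day window.

```python
import pandas as pd

# Daily maximum wind speed (ff) for 2017-02-01 .. 2017-02-05, from the Step 2 table.
daily = pd.DataFrame({
    "date": pd.date_range("2017-02-01", periods=5, freq="D"),
    "ff": [8.1, 14.1, 10.1, 12.5, 7.3],
}).set_index("date")

# Each output row is the maximum over the current day and the two days before it;
# min_periods=1 lets the first rows use however many days are available.
rolling_max = daily.rolling("3D", min_periods=1).max().reset_index()
print(rolling_max["ff"].tolist())
```

The result, 8.1, 14.1, 14.1, 14.1, 12.5, matches the `ff` column of the table above.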
? ?????? ??? ??? ??? ???? ??? ??? ?????. ????? ?? ??? ??? ??????, ? ??? ?? ??? ??? ???? ??? ? ????. ????? ???? ??? ?? ??? ?????.
?? ?? ??
Meteonet ?? ??? ???? ?? ???? ???? ?, RAPIDS 23.02? ???? NVIDIA RTX A6000 GPU?? 13?? ?? ??? ??????(?? 1).

Pandas on CPU (Intel Core i7-7800X CPU) | User: 2 min 32 sec Sys: 27.3 sec Total: 3 min |
RAPIDS cuDF on NVIDIA A6000 GPUs | User: 5.33 sec Sys: 8.67 sec Total: 14 sec |
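The 13x figure follows directly from the two totals in the table, as this short calculation shows. The variable names are only illustrative.

```python
# Totals as reported in the table above.
cpu_total_seconds = 3 * 60   # pandas on CPU: "Total: 3 min"
gpu_total_seconds = 14       # cuDF on GPU: "Total: 14 sec"

speedup = cpu_total_seconds / gpu_total_seconds
print(f"{speedup:.1f}x")  # rounds to the 13x quoted in the text
```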
?? ???
??? ??? ?? ??? ?? ?? ??? ??? ??? ?? ?????. RAPIDs cuDF? ???? ??? Pandas ???? ?? ??? ? ??? ???? ????? ???? ??? ??? ? ????.
??? ???? cuDF? ?? ? ??? ????? GitHub? rapidsai-community/notebooks-contrib? ?????. EDA ???????? cuDF? ?? ????? ???? ??? ??: RAPIDS cuDF? ??? ?? ??? ????? ?????.
3? 20??? 23??? ?? NVIDIA GTC 2023? ??? ???? ??? ????? ?????.
??
Meiran Peng, David Taube
? ???? ??? SDK? ???? ?? ???, ?? ???, ?? ??, ??, ?? ??, ???? NVIDIA ??? ???? ??? ??? ??? ??? ? ????. ?? ??? ???? NVIDIA? ?? ????? ???? ? ??? ??? ??? ?????? ???? ??? ??? ???.