
    Accelerated Data Analytics: Speed Up Data Exploration with RAPIDS cuDF

    Reading Time: 6 minutes

    ? ???? ???? ??? ??? ?? ???? ?????:

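    Digital advances in climate modeling, healthcare, finance, and retail are generating unprecedented volumes and types of data. According to IDC, the amount of data will grow from 64 ZB in 2020 to 180 ZB by 2025, which increases the need for data analytics that can turn all of that data into insights.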

    • ??? ? ?? ???? ?? ???? ???, ??, ?? ??? ???? ??? ???? ?? ??? ???? ? ??? ?? ????.
    • ?? ???? ? ??? ???? ??? ? ???? ? ???? ? ?? ????? ???? ??? ??? ?? ?????.
    • ??? ?? ? ???? ???? ????? ??? ?? ? ?? ???? ???? ????.

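    NVIDIA provides the RAPIDS suite of open-source software libraries and APIs so that data scientists can run end-to-end data science and analytics pipelines entirely on GPUs. The suite includes a DataFrame API for the common data preparation tasks in analytics and data science: RAPIDS cuDF.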

    With speedups of up to 40x in typical data analytics workflows, accelerated data analytics saves time and adds opportunities for iteration that your current analysis tools may constrain.

    To illustrate the value of accelerated data analytics, this post walks through a simple exploratory data analysis (EDA) tutorial using RAPIDS cuDF.

    If you currently use Spark 3.0 to process large datasets, see the RAPIDS Accelerator for Spark. If you are ready to get started with RAPIDS now, see the step-by-step guide to building a machine learning application with RAPIDS.

    Why RAPIDS cuDF for EDA?

    ??? ???? ??? ??????? EDA ??? ?? ??? ?? ? ??? ?? Python? ???? ??? ?? ?? ?? ?? ????? ?????? pandas? ?????. ???? ???? ??? ??? ??? ??? ???? ???? ??? ???? ?? ????? ??? ? ??? ??????, ?? EDA? ?? ??? ?????? ?????.

    ??? pandas? ?? ???? ????? ???? ??? ??? 1~2GB? ???? ??? ???? ???? ?? ???? ??? ????. ??? ??? 10~20GB ??? ???? ??, Dask? Apache Spark? ?? ?? ??? ??? ???? ???. ??? ??? ?? ???? ??? ??? ??? ? ? ??? ????.

    For data sizes of 2 to 10 GB, RAPIDS cuDF is the Goldilocks choice.

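    RAPIDS cuDF parallelizes computation across the many cores of the GPU and exposes it through a pandas-like API. Because the cuDF functions mirror the ones you already use in pandas, you can swap pandas for cuDF with little effort. RAPIDS acts like a turbo boost for an existing pandas workflow.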
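
    Because cuDF follows the pandas API, the swap can be as small as changing one import. The sketch below is illustrative and is not taken from the notebook: it assumes a tiny made-up CSV (sample.csv, written to a temporary directory) and uses a hypothetical alias, xdf, that falls back to pandas when cuDF is not installed, so the same calls run on either library.

```python
import os
import tempfile

# Prefer cuDF when a GPU environment is available; otherwise fall back to
# pandas. The alias `xdf` is a hypothetical name used only in this sketch.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# A tiny stand-in for the weather data (values are made up for illustration).
csv_text = (
    "number_sta,date,t\n"
    "14066001,2016-01-01 00:00:00,279.85\n"
    "14066001,2016-01-01 00:06:00,279.75\n"
    "14126001,2016-01-01 00:00:00,278.45\n"
    "14126001,2016-01-01 00:06:00,\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w") as f:
        f.write(csv_text)
    # read_csv, shape, nunique, and isna are spelled the same in both libraries.
    df = xdf.read_csv(path, parse_dates=["date"])

n_rows, n_cols = df.shape
n_stations = df["number_sta"].nunique()
n_missing_t = int(df["t"].isna().sum())
print(n_rows, n_cols, n_stations, n_missing_t)
```

    Keeping the library behind one alias makes it easy to time the same cell on CPU and GPU without touching the analysis code.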

    EDA with RAPIDS cuDF

    ? ?????? EDA ?? ??? ?? ? cuDF? ??? ?? ??? ? ??? ?????.

    ? ????? ??? ??? ?? pandas? ??? ?? ???? cuDF? ??? ? ?? ??? 15? ???? ?? ??????. ? ??? ??? ?? ?????? ??? ? ??? ???? ? ??? ? ? ????.

    ??? ?? ??? ?? ??? ??? RAPIDS GitHub ?????? ?? ?? ???, cuDF? ??? ??? ??? ??? ?????. ?? ?????? ???? ??? ??? 15?? ?????.

    Dataset

    Meteonet is a weather dataset that aggregates readings from weather stations from 2016 to 2018. It is a realistic dataset that includes missing and invalid data.

    Analysis approach

    ? ?????? ??? ??????? ? ?? ???? ?? ? ??? ????? ??? ?????. ? ?? ??? ???, ?? ?? ?? ?? ?? ?? ??? ?? ??? ???? ? ??? ? ?? ??????.

    ?? ??? ?? ???? ???????? ??? ??????:

    • Understand the variables.
    • Identify gaps in the dataset.
    • Analyze the relationships among variables.

    Stage 1. Understand the variables

    First, import the cuDF library and read in the dataset. To download the data and create NW.csv, which contains the 2016, 2017, and 2018 data, see the tutorial notebook on accelerated data analytics with cuDF.

    # Import cuDF and CuPy
    import cudf
    import cupy as cp
    
    # Read the combined 2016-2018 data into the DataFrame used in the cells below
    GS_cudf = cudf.read_csv('./NW.csv')

    ??? ??? ???? ?? ?? ??? ?????. ?? ??????? ??? ???? ??? ???? ? ??? ???. ????? pandas ??? ???? ????.

    Use the .info command, which works the same way as in pandas, to review the overall data structure:

    GS_cudf.info()
    <class 'cudf.core.dataframe.DataFrame'>
    RangeIndex: 22034571 entries, 0 to 22034570
    Data columns (total 12 columns):
     #   Column      Dtype
    ---  ------      -----
     0   number_sta  int64
     1   lat         float64
     2   lon         float64
     3   height_sta  float64
     4   date        datetime64[ns]
     5   dd          float64
     6   ff          float64
     7   precip      float64
     8   hu          float64
     9   td          float64
     10  t           float64
     11  psl         object
    dtypes: datetime64[ns](1), float64(9), int64(1), object(1)
    memory usage: 6.5+ GB

    The output shows 12 variables, along with their names.

    ???? ?? ? ??? ????? ??? ???? ???? ??(?? ?)? ?????.

    # Checking the DataFrame dimensions. Millions of rows by 12 columns.
    GS_cudf.shape
    (65826837, 12)

    ? ? ??? ?? ??? ? ? ??? ? ??? ??? ??? ? ? ????. 65,826,837?? ?? ???? ?? ??? ??? ? ?? ???? ? ????.

    ??, ??? ???? ?? ?? ?? ???? ??? ??? ?? ??? ?????:

    # Display the first five rows of the DataFrame to examine details
    GS_cudf.head()
       number_sta    lat    lon  height_sta        date     dd    ff  precip    hu      td       t       psl
    0    14066001  49.33  -0.43         2.0  2016-01-01  210.0   4.4     0.0  91.0  278.45  279.85      <NA>
    1    14126001  49.15   0.04       125.0  2016-01-01   <NA>  <NA>     0.0  99.0  278.35  278.45      <NA>
    2    14137001  49.18  -0.46        67.0  2016-01-01  220.0   0.6     0.0  92.0  276.45  277.65  102360.0
    3    14216001  48.93  -0.15       155.0  2016-01-01  220.0   1.9     0.0  95.0  278.25  278.95      <NA>
    4    14296001  48.80  -1.03       339.0  2016-01-01   <NA>  <NA>     0.0  <NA>    <NA>  278.35      <NA>

    Table 1. Output results

    ?? ? ?? ??? ??? ??? ? ????. ??? ??? ? ????? ? ?? ???? ???? ????? ?? ?????:

    # How many weather stations are covered in this dataset? 
    # Call nunique() to count the distinct elements along a specified axis.
    
    number_stations = GS_cudf['number_sta'].nunique()
    print("The full dataset is composed of {} unique weather stations.".format(GS_cudf['number_sta'].nunique()))

    You now know that the dataset covers 287 unique weather stations. How often does each station record its data?

    ## Investigate the frequency of one specific station's data
    ## date column is datetime dtype, and the diff() function calculates the delta time 
    ## TimedeltaProperties.seconds can help get the delta seconds between each record, divide by 60 seconds to see the minutes difference.
    delta_mins = GS_cudf['date'].diff().dt.seconds.max()/60
    print(f"The data is recorded every {delta_mins} minutes")

    The data is recorded every 6.0 minutes, so each station produces 10 records per hour.
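
    The comment in the cell above mentions one specific station, while the expression itself takes differences over the whole date column. A minimal sketch of the per-station variant follows, under stated assumptions: the timestamps are made up, the alias xdf is hypothetical, and the code falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Made-up readings from two stations, interleaved as they might be in the file.
df = xdf.DataFrame({
    "number_sta": [14066001, 14126001, 14066001, 14126001, 14066001],
    "date": [
        "2016-01-01 00:00:00", "2016-01-01 00:00:00",
        "2016-01-01 00:06:00", "2016-01-01 00:06:00",
        "2016-01-01 00:12:00",
    ],
})
df["date"] = xdf.to_datetime(df["date"])

# Keep one station, order its readings by time, then measure the gaps.
one_station = df[df["number_sta"] == 14066001].sort_values("date")
delta_mins = float(one_station["date"].diff().dt.seconds.max()) / 60
records_per_hour = 60 / delta_mins
print(delta_mins, records_per_hour)
```

    Filtering first avoids mixing gaps between different stations into the result, which matters when the rows are not sorted by time.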

    ?? ??? ?? ??? ??? ??????:

    • ??? ??
    • ??? ??? ??
    • ??? ??? ???? ??? ?
    • ??? ?? ???? ??

    ??? ? ???? ????? ??? ??? ???? ?? ???? ? ??? ??? ???? ???. ??? ??? ? ???? ? ??? ??? ? ?? ??? ??? ? ??? ??? ??? ????.

    Stage 2. Identify the gaps

    ??? ?? ????? ?? ?? ??? ????. ??? ??? ?? ??? ? ? ?? ??? ? ?? ???? ??? ? ????. ?? ??? ???? ?????, ???? ?? ?? ???? ???? ? ? ?? ?? ??? ?? ??? ? ????.

    ? ??? ??? ???? ????? ??? ???? ??? ???? ?? ??? ?? ?????.

    ??? ???? ??? ????? ?? ??? ?? ?? ??? ?????.

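    The dataset contains records from 271 unique stations, and each station produces 10 records per hour (one every 6 minutes). Assuming no gaps in the data, the theoretical number of records for the year is 271 x 10 x 24 x 365 = 23,739,600. However, the actual number of rows in the DataFrame is only 22,034,571.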

    # Theoretical number of records is... 
    theoretical_nb_records = number_stations * (60 / delta_mins) * 365 * 24 
    actual_nb_of_rows = GS_cudf.shape[0]
    missing_record_ratio = 1 - (actual_nb_of_rows/theoretical_nb_records)
    print("Percentage of missing records of the NW dataset is: {:.1f}%".format(missing_record_ratio * 100))
    print("Theoretical total number of values in dataset is: {:d}".format(int(theoretical_nb_records)))
    
    Percentage of missing records of the NW dataset is: 12.7%

    1? ??? ????? 12.7%? ?? ? 19.8?? ???? ??? ?? ?????.

    36? ? ? 5??? ???? ??? ???? ???? ?? ? ? ????. ?? ??? ??? ?? ?? ??? ?? ??? ??? ?? ??? ?? ??? ?? ??? ???? ????. ??? ??? ?? ?? ?? ?? ? ??? ? ? ???, ???? NA? ???? ????.

    ??? ?? NA ???? ??? ????? ?? ?? ??? NA ???? ??? ?????. ? ??? ??? ?? ???? ?? ???? ??? ?????. ??? ? ????? 2018? ???? ????? ??? ??? ????.

    # Finding which items have NA value(s) during year 2018
    NA_sum = GS_cudf[GS_cudf['date'].dt.year==2018].isna().sum()
    NA_data = NA_sum[NA_sum>0]
    NA_data.index
    
    StringIndex(['dd' 'ff' 'precip' 'hu' 'td' 't' 'psl'], dtype='object')
    NA_data
    dd         8605703
    ff         8598613
    precip     1279127
    hu         8783452
    td         8786154
    t          2893694
    psl       17621180
    dtype: int64

    The output shows that PSL (sea level pressure) has the most missing data: about 80% of its values are absent. Precipitation has the least missing data, about 6%, which suggests that its sensors are robust.
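
    The percentages quoted here can be reproduced from the NA counts printed above. The following sketch uses plain Python and assumes that the counts refer to the 22,034,571-row frame reported by .info; the variable names are illustrative.

```python
# Row count from the .info() output and NA counts from the cell above.
total_rows = 22_034_571
na_counts = {
    "dd": 8_605_703,
    "ff": 8_598_613,
    "precip": 1_279_127,
    "hu": 8_783_452,
    "td": 8_786_154,
    "t": 2_893_694,
    "psl": 17_621_180,
}

# Percentage of missing values per column, most incomplete first.
na_percent = {col: 100 * n / total_rows for col, n in na_counts.items()}
for col, pct in sorted(na_percent.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:>6}: {pct:5.1f}% missing")
```

    Ranking the columns this way makes it easier to decide which variables can be used as they are and which need imputation or should be dropped.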

    ? ? ?? ??? ??? ??? ?? ??? ???, ?? ?? ?? ?? ??? ???? ??? ???? ? ??? ???. ?? ??? ???? ??? ????? ?? ???? ??? ???? ?????.

    ?? ???? ??? ??? ? ?????? ?? ??? ??? ? ? ?? ?? ?? ??? ??? ? ????.

    Stage 3. Analyze the relationships among variables

    ?? ML ??????? ?? ???? ?? ?? ??????? ?????. ?? ?? ??? ????? ?? ??? ??? ??? ? ???? ???? ???.

    ? ???? ?? ????? ??? ???? ?? ?? ?????. ?? 3? ??? ??? ?? ???? ????? ???? ???? ? ???? ?????.

    # Only analyze meteorological columns
    Meteo_series = ['dd', 'ff', 'precip' ,'hu', 'td', 't', 'psl']
    Meteo_df = cudf.DataFrame(GS_cudf,columns=Meteo_series)
    Meteo_corr = Meteo_df.dropna().corr()
    
    # Check the items with correlation value > 0.7 
    Meteo_corr[Meteo_corr>0.7]
              dd    ff  precip    hu           td            t   psl
    dd       1.0  <NA>    <NA>  <NA>         <NA>         <NA>  <NA>
    ff      <NA>   1.0    <NA>  <NA>         <NA>         <NA>  <NA>
    precip  <NA>  <NA>     1.0  <NA>         <NA>         <NA>  <NA>
    hu      <NA>  <NA>    <NA>   1.0         <NA>         <NA>  <NA>
    td      <NA>  <NA>    <NA>  <NA>          1.0  0.840558357  <NA>
    t       <NA>  <NA>    <NA>  <NA>  0.840558357          1.0  <NA>
    psl     <NA>  <NA>    <NA>  <NA>         <NA>         <NA>

    Table 2. Output results

    ???? td(???)? t(??) ???? ???? ????? ????. ?? ??? ?? ??? ?????? ???? ?? ????? ???? ?? ? ?? ?????. ???? ?? ??? ???? ???? ??? ??? ??? ???? ?? ???? ??? ??? ? ????.
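
    A masked correlation matrix still leaves the reader to scan for the surviving cells. The sketch below lists the strongly correlated pairs explicitly. It is an illustration under stated assumptions: strong_pairs is a hypothetical helper, the toy values are made up, and pandas is used so the code runs without a GPU (a cuDF correlation matrix is small and can be converted with .to_pandas() first).

```python
import pandas as pd

def strong_pairs(corr, threshold=0.7):
    """Return (col_a, col_b, r) for each off-diagonal pair with r > threshold."""
    pairs = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        # Only look above the diagonal so each pair is reported once.
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and r > threshold:
                pairs.append((a, b, float(r)))
    return pairs

# Toy data: t and td move together, ff does not follow either of them.
toy = pd.DataFrame({
    "t": [270.0, 275.0, 280.0, 285.0, 290.0],
    "td": [268.0, 272.5, 278.0, 281.0, 288.0],
    "ff": [3.0, 1.0, 4.0, 1.5, 2.0],
})
pairs = strong_pairs(toy.dropna().corr())
print(pairs)
```

    Reporting pairs rather than a masked matrix skips the diagonal and the mirrored duplicates, so only the relationships that need a decision remain.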

    ???? ??? ?? ??? ?? ???? ??? ??? ???? ?? ???? ??? ??????. ? ?? EDA ??? ???, ? ???? ??? ??? ??? ???????? ??? ?? ?? ?? ???? ????? ??? ???? ?????. ????? ?? ????, cuDF? pandas ??? ?? ?????.

    Benchmarking RAPIDS cuDF for EDA

    ?? ?????, ? ???? ?? ???? ?? ?????? ?? ??? ?????. ? ?????? 9.55?? ?? ??? ??????. ??? ??? NVIDIA A6000 GPU?? ???????.

    ?? 1. ??? ??? ??? ?? ???? ??

    ? 3? ? ???? ?? ???? ??? ??? ?? ??????.

    Full Notebook (15x speed-up)
    Platform                                  user          sys       total
    Pandas on CPU (Intel Core i7-7800X CPU)   1 min 15 sec  14.3 sec  1 min 30 sec
    RAPIDS cuDF on NVIDIA A6000               3.92 sec      2.03 sec  5.95 sec

    Table 3. EDA with RAPIDS cuDF on an NVIDIA RTX A6000 GPU achieved a 15x speedup.

    Key takeaways

    ??? ??? ?? ????? ????? ???? ?? ????? ?? ???? ?? ??, ??? ??, ????? ????.

    ??? 15?? ?? ???? ? ?? ??? ????? ?? ??? ??? ??? ? ????. EDA ??? ???? ? 1??? ??? ??, 4? ?? ??? ? ????. ??? ???? ??? ?? ??? ????, ??? ??? ????, ? ??? ??? 1?? ???? ???? ??? ?????, ?? ??? ?? ??? ??? ?????? ? ?? 56?? ????. ????? ??? ?? ??? ? ????.
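
    The 15x figure and the one-hour example can be checked from the wall-clock totals in the timing table above. This is a small arithmetic sketch in plain Python; accelerated_minutes is a hypothetical helper that assumes the speedup applies uniformly to the whole task.

```python
# Wall-clock totals from the timing table, converted to seconds.
pandas_total_s = 1 * 60 + 30   # 1 min 30 sec
cudf_total_s = 5.95

speedup = pandas_total_s / cudf_total_s
print(f"measured speedup: {speedup:.1f}x")

def accelerated_minutes(baseline_minutes, factor):
    """Time a task would take after a uniform speedup of `factor`."""
    return baseline_minutes / factor

# The article rounds the speedup to 15x for its one-hour example.
after = accelerated_minutes(60, 15)
saved = 60 - after
print(f"60 min of EDA -> {after:.0f} min, freeing {saved:.0f} min")
```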

    ??? ??? ???? cuDF? ?? ? ??? ????? ??? ??? cuDF? ??? ??? ??? ??? ??? ???. NVIDIA GTC 2023? ??? ?? ??? ???? ??? ??????.

    ??? ???? cuDF? ???? ??? ??? ???? ??? ??: RAPIDS cuDF? ? ??? ??? ????? ?????.

    Acknowledgments

    Meiran Peng, David Taubenheim, Sheng Luo, and Jay Rodge contributed to this post.

