Types and Missing Data - Julia Data Science

4.5 Types and Missing Data

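As discussed in Section 4.1, CSV.jl does its best to infer the type of each column. This inference does not always work perfectly, however. This section explains why suitable types matter and how to fix wrong data types. To show the types more clearly, we display the text output of each DataFrame rather than a pretty-formatted table. We will work with the following dataset: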

using DataFrames

# Build a small dataset whose date and age columns are stored as plain strings.
function wrong_types()
    id = 1:4
    date = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]
    age = ["adolescent", "adult", "infant", "adult"]
    DataFrame(; id, date, age)
end
wrong_types()

4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     3  01-08-2018  infant
   4 │     4  22-11-2020  adult

Because the date column is stored as strings rather than dates, sort orders it alphabetically instead of chronologically:

sort(wrong_types(), :date)

4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     3  01-08-2018  infant
   2 │     2  03-04-2019  adult
   3 │     4  22-11-2020  adult
   4 │     1  28-01-2018  adolescent
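The ordering above follows from how strings are compared: character by character, so the day-of-month at the start of each string decides the result. A minimal sketch of the difference, using only the standard library Dates module and the same four date strings (the variable names are chosen for illustration):

```julia
using Dates

# The same date strings as in the dataset above (format: dd-mm-yyyy).
strings = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]

# Strings compare character by character, so the leading day digits decide the order.
sorted_strings = sort(strings)

# Parsed as Date values, the comparison is chronological.
fmt = dateformat"dd-mm-yyyy"
sorted_dates = sort(Date.(strings, fmt))

println(sorted_strings)
println(sorted_dates)
```

Here the string sort puts "01-08-2018" first even though 28 January 2018 is the earliest date, while the Date sort puts 2018-01-28 first.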

To fix this, we can use the Dates module from Julia's standard library, which was mentioned in Section 3.5.1:

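using Dates

function fix_date_column(df::DataFrame)
    # Parse "dd-mm-yyyy" strings into Date values.
    strings2dates(dates::Vector) = Date.(dates, dateformat"dd-mm-yyyy")
    dates = strings2dates(df[!, :date])
    # Replace the String column with the parsed Date column.
    df[!, :date] = dates
    df
end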
fix_date_column(wrong_types())

4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        String
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     2  2019-04-03  adult
   3 │     3  2018-08-01  infant
   4 │     4  2020-11-22  adult
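Calling Date on a string that does not match the format throws an error, which may be inconvenient for messy real-world data. One possible approach, shown here only as a sketch, is to use tryparse from the Dates module and turn failures into missing. The helper name parse_date_or_missing and the sample input are hypothetical and not part of the dataset above:

```julia
using Dates

# Hypothetical helper (not from the original text): parse a date string and
# return `missing` instead of throwing when the string does not match `fmt`.
function parse_date_or_missing(s::AbstractString; fmt=dateformat"dd-mm-yyyy")
    parsed = tryparse(Date, s, fmt)   # `nothing` when parsing fails
    return parsed === nothing ? missing : parsed
end

# Made-up input with one malformed entry, for illustration only.
raw = ["28-01-2018", "not a date", "01-08-2018"]
dates = parse_date_or_missing.(raw)
```

The resulting vector holds Date values where parsing succeeded and missing elsewhere, so the malformed rows can be inspected or dropped later.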

Now sorting gives the expected result:

df = fix_date_column(wrong_types())
sort(df, :date)

4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        String
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     3  2018-08-01  infant
   3 │     2  2019-04-03  adult
   4 │     4  2020-11-22  adult

The age column has a similar problem:

sort(wrong_types(), :age)

4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     4  22-11-2020  adult
   4 │     3  01-08-2018  infant

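This is clearly wrong, because an infant is younger than both adolescents and adults. The solution for this problem, and for categorical data in general, is CategoricalArrays.jl: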

using CategoricalArrays

With the CategoricalArrays.jl package, we can give the levels of a categorical variable an order:

function fix_age_column(df)
    levels = ["infant", "adolescent", "adult"]
    ages = categorical(df[!, :age]; levels, ordered=true)
    df[!, :age] = ages
    df
end
fix_age_column(wrong_types())

4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      Cat…
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     3  01-08-2018  infant
   4 │     4  22-11-2020  adult

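NOTE: The argument ordered=true tells the categorical function from CategoricalArrays.jl that the categories are ordered. Without this argument, comparisons such as < or > between values are not possible.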
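The effect of ordered=true can be checked directly. The following sketch builds the same age values twice, once without and once with ordered=true, and tries a < comparison on each. The variable names are illustrative, and the exact error raised in the unordered case may depend on the installed CategoricalArrays.jl version, so the code only records that some error occurred:

```julia
using CategoricalArrays

age_strings = ["adolescent", "adult", "infant", "adult"]
level_order = ["infant", "adolescent", "adult"]

unordered = categorical(age_strings; levels=level_order)
ordered = categorical(age_strings; levels=level_order, ordered=true)

println(isordered(unordered))
println(isordered(ordered))

# With ordered=true, `<` follows the level order: infant < adolescent.
infant_is_younger = ordered[3] < ordered[1]

# Without ordered=true, `<` is expected to throw; record whether it did.
unordered_comparison_failed = try
    unordered[3] < unordered[1]
    false
catch
    true
end
```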

Now the data can be sorted correctly by age:

df = fix_age_column(wrong_types())
sort(df, :age)

4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      Cat…
─────┼───────────────────────────────
   1 │     3  01-08-2018  infant
   2 │     1  28-01-2018  adolescent
   3 │     2  03-04-2019  adult
   4 │     4  22-11-2020  adult

Because we have defined a set of functions, we can now build the corrected data by calling them in turn:

function correct_types()
    df = wrong_types()
    df = fix_date_column(df)
    df = fix_age_column(df)
    df
end
correct_types()

4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        Cat…
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     2  2019-04-03  adult
   3 │     3  2018-08-01  infant
   4 │     4  2020-11-22  adult

Because the ages in the data are ordered (ordered=true), the age categories can be compared correctly:

df = correct_types()
a = df[1, :age]
b = df[2, :age]
a < b

true

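If the element type were String, the comparison would give the wrong result, because strings are compared alphabetically: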

"infant" < "adult"

false
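To see where the two results come from, it can help to look at the position of each value in the level order. The sketch below uses levelcode from CategoricalArrays.jl on the same age values; the variable names are chosen for illustration:

```julia
using CategoricalArrays

level_order = ["infant", "adolescent", "adult"]
ages = categorical(["adolescent", "adult", "infant", "adult"];
                   levels=level_order, ordered=true)

# levelcode gives each value's 1-based position in the level order.
codes = levelcode.(ages)

# Categorical comparison follows the level order: infant comes before adult.
categorical_result = ages[3] < ages[2]

# Plain strings compare alphabetically: "infant" sorts after "adult".
string_result = "infant" < "adult"

println(codes)
println(categorical_result)
println(string_result)
```

With these levels, infant has code 1 and adult has code 3, so the categorical comparison holds, while the string comparison is decided by the first letters "i" and "a".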


CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso, 刘贵欣 (Chinese translation), 田俊 (Chinese review)