索引和选择

import pandas as pd
import numpy as np

概述

pandas 索引和选择的操作主要是以下五种:

操作语法Result
选择列df[col]Series
通过标签选择行df.loc[label]Series
通过整数下标选择行df.iloc[loc]Series
行切片df[5:10]DataFrame
通过布尔向量切片df[bool_vec]DataFrame
df = pd.DataFrame(np.random.randn(4, 5), index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C', 'D', 'E'])
df

ABCDE
a-0.103212-0.719843-0.355287-0.4402571.083105
b1.2150910.3508091.1296270.179248-0.367324
c0.7032860.733214-1.4725111.311112-0.663189
d0.6446370.7393761.270120-1.3539120.187226
df['B']
a   -0.719843
b    0.350809
c    0.733214
d    0.739376
Name: B, dtype: float64
df[['B', 'C']]

BC
a-0.719843-0.355287
b0.3508091.129627
c0.733214-1.472511
d0.7393761.270120
df.loc['d']
A    0.644637
B    0.739376
C    1.270120
D   -1.353912
E    0.187226
Name: d, dtype: float64
df.iloc[0]
A   -0.103212
B   -0.719843
C   -0.355287
D   -0.440257
E    1.083105
Name: a, dtype: float64
df[:-1]

ABCDE
a-0.103212-0.719843-0.355287-0.4402571.083105
b1.2150910.3508091.1296270.179248-0.367324
c0.7032860.733214-1.4725111.311112-0.663189
df[:2] > 0

ABCDE
aFalseFalseFalseFalseTrue
bTrueTrueTrueTrueFalse
df[df < 0] = 0
df

ABCDE
a0.0000000.0000000.0000000.0000001.083105
b1.2150910.3508091.1296270.1792480.000000
c0.7032860.7332140.0000001.3111120.000000
d0.6446370.7393761.2701200.0000000.187226

使用 [] 进行索引

使用 [] 进行索引,实际上就是选择低维切片。对于 Series 对象和 DataFrame 对象,[] 返回的类型不同。

对象类型调用方法返回值类型
Seriesseries[label]标量值
DataFrameframe[colname]对应 colnameSeries
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df

ABCD
2000-01-010.169225-0.158166-0.166857-0.078189
2000-01-020.5847190.6316900.0016750.263198
2000-01-03-0.234786-1.622012-0.705652-0.171561
2000-01-041.038103-1.316517-0.135135-0.411320
2000-01-051.1230700.290033-0.372262-0.340214
2000-01-060.9002630.304163-1.6633220.633026
2000-01-070.7662080.359713-0.4603400.746346
2000-01-080.864615-0.5760920.065375-1.200949
s = df['A']
s
2000-01-01    0.169225
2000-01-02    0.584719
2000-01-03   -0.234786
2000-01-04    1.038103
2000-01-05    1.123070
2000-01-06    0.900263
2000-01-07    0.766208
2000-01-08    0.864615
Freq: D, Name: A, dtype: float64
s[dates[5]]
0.9002627651374207

交换 AB 两列:

df[['B', 'A']] = df[['A', 'B']]
df

ABCD
2000-01-01-0.1581660.169225-0.166857-0.078189
2000-01-020.6316900.5847190.0016750.263198
2000-01-03-1.622012-0.234786-0.705652-0.171561
2000-01-04-1.3165171.038103-0.135135-0.411320
2000-01-050.2900331.123070-0.372262-0.340214
2000-01-060.3041630.900263-1.6633220.633026
2000-01-070.3597130.766208-0.4603400.746346
2000-01-08-0.5760920.8646150.065375-1.200949

注意,如果使用了 loc 或者 ilocpandas 会先对齐所有 axes。这不会修改 df,因为列对齐在赋值之前。

df.loc[:, ['B', 'A']] = df[['A', 'B']]
df

ABCD
2000-01-01-0.1581660.169225-0.166857-0.078189
2000-01-020.6316900.5847190.0016750.263198
2000-01-03-1.622012-0.234786-0.705652-0.171561
2000-01-04-1.3165171.038103-0.135135-0.411320
2000-01-050.2900331.123070-0.372262-0.340214
2000-01-060.3041630.900263-1.6633220.633026
2000-01-070.3597130.766208-0.4603400.746346
2000-01-08-0.5760920.8646150.065375-1.200949

可以强转成 ndarray 数组来实现这一操作:

df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
df

ABCD
2000-01-010.169225-0.158166-0.166857-0.078189
2000-01-020.5847190.6316900.0016750.263198
2000-01-03-0.234786-1.622012-0.705652-0.171561
2000-01-041.038103-1.316517-0.135135-0.411320
2000-01-051.1230700.290033-0.372262-0.340214
2000-01-060.9002630.304163-1.6633220.633026
2000-01-070.7662080.359713-0.4603400.746346
2000-01-080.864615-0.5760920.065375-1.200949

切片

切片的规则和 Python 以及 NumPy 中的切片规则,一样。这里用 [] 运算符说明切片的语义。

s = pd.Series(np.random.randn(7), index=list('abcdefg'))
s
a   -0.381270
b   -0.485434
c   -0.298939
d    0.508133
e    0.961189
f    0.107214
g   -1.452212
dtype: float64
s[:5]
a   -0.381270
b   -0.485434
c   -0.298939
d    0.508133
e    0.961189
dtype: float64
s[::2]
a   -0.381270
c   -0.298939
e    0.961189
g   -1.452212
dtype: float64
s[::-1]
g   -1.452212
f    0.107214
e    0.961189
d    0.508133
c   -0.298939
b   -0.485434
a   -0.381270
dtype: float64
s[:3] = 0
s
a    0.000000
b    0.000000
c    0.000000
d    0.508133
e    0.961189
f    0.107214
g   -1.452212
dtype: float64
df = pd.DataFrame(np.random.randn(3, 4), index=list('abc'), columns=list('ABCD'))
df

ABCD
a0.8719671.3816720.395105-1.459083
b1.4666181.626593-1.0028210.092995
c0.3843460.020456-0.441192-0.476958
df[:2]

ABCD
a0.8719671.3816720.395105-1.459083
b1.4666181.626593-1.0028210.092995
df[::-1]

ABCD
c0.3843460.020456-0.441192-0.476958
b1.4666181.626593-1.0028210.092995
a0.8719671.3816720.395105-1.459083

通过属性索引

s = pd.Series([1, 2, 3], index=list('abc'))
s
a    1
b    2
c    3
dtype: int64
s.b
2
df = pd.DataFrame(np.random.randn(3, 4), index=list('abc'), columns=list('ABCD'))
df

ABCD
a0.147620-1.3361050.9453160.501963
b1.2204050.697981-0.8699250.790545
c0.9358090.5847751.3160570.138111
df.A
a    0.147620
b    1.220405
c    0.935809
Name: A, dtype: float64
df.A = list(range(len(df.index)))
df

ABCD
a0-1.3361050.9453160.501963
b10.697981-0.8699250.790545
c20.5847751.3160570.138111

采用访问属性的方法必须确保该属性存在。如果要创建新的一列,仍然需要通过 []。否则,会出现 UserWanring 的警告。

df['E'] = list(range(len(df.index)))
df

ABCDE
a0-1.3361050.9453160.5019630
b10.697981-0.8699250.7905451
c20.5847751.3160570.1381112

通过标签索引

使用 loc 方法可以使用标签对行进行索引。

s = pd.Series(np.random.randn(6), index=list('abcdef'))
s
a    0.763607
b   -0.538096
c   -0.032027
d    0.312734
e    1.096504
f   -0.857242
dtype: float64
s.loc['b']
-0.5380958810076775

标签支持冒号表达式,进行切片运算。

s.loc['c':]
c   -0.032027
d    0.312734
e    1.096504
f   -0.857242
dtype: float64
df = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))
df

ABCD
a-0.3674012.027401-0.379841-1.462746
b0.0326650.143619-2.0854440.442935
c-1.1172300.9374480.738852-0.411036
d0.2384270.3266190.1827131.151853
e0.4542720.844479-1.220454-1.453612
f-1.706529-0.7332420.400927-0.431352

DataFrame 对象可以行列同时索引。

df.loc[['a', 'b', 'd'], ['A', 'C']]

AC
a-0.367401-0.379841
b0.032665-2.085444
d0.2384270.182713
df.loc['d':, 'A':'C']

ABC
d0.2384270.3266190.182713
e0.4542720.844479-1.220454
f-1.706529-0.7332420.400927

注意:下面的切片不是下标切片,而是标签切片。因此,3:5 表示标签 35 之间的所有标签。35 之间还有一个标签 2,因此返回结果为:

s = pd.Series(list('abcdef'), index=[0, 3, 2, 5, 4, 2])
s.loc[3:5]
3    b
2    c
5    d
dtype: object

通过下标位置索引

使用 iloc 方法可以通过下标位置进行索引。

s = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))
s
0   -0.465348
2   -0.507533
4    0.600791
6   -0.485520
8    0.703295
dtype: float64
s.iloc[:3]
0   -0.465348
2   -0.507533
4    0.600791
dtype: float64
s.iloc[-1]
0.7032949052125851
df = pd.DataFrame(np.random.randn(6, 4),
                   index=list(range(0, 12, 2)),
                   columns=list(range(0, 8, 2)))
df

0246
01.490701-0.5498090.504431-1.027572
2-1.7907520.6871350.674383-1.053664
41.168693-0.1290010.078535-0.770111
60.711808-0.2693571.663142-1.358983
8-0.405399-0.8681190.183488-1.307719
102.0910570.4263310.1200990.599744

DataFrame 对象可以行列同时索引。

df.iloc[:3]

0246
01.490701-0.5498090.504431-1.027572
2-1.7907520.6871350.674383-1.053664
41.168693-0.1290010.078535-0.770111
df.iloc[1:5, 2:4]

46
20.674383-1.053664
40.078535-0.770111
61.663142-1.358983
80.183488-1.307719
df.iloc[[1, 3, 5], [1, 3]]

26
20.687135-1.053664
6-0.269357-1.358983
100.4263310.599744
df.iloc[1:3, :]

0246
2-1.7907520.6871350.674383-1.053664
41.168693-0.1290010.078535-0.770111

可以超出索引范围,但是可能会返回空 DataFrame。

df.iloc[:, 4:]

0
2
4
6
8
10

选择接受可调用对象

[]lociloc 都接受可调用对象进行索引。

df = pd.DataFrame(np.random.randn(6, 4),
                   index=list('abcdef'),
                   columns=list('ABCD'))
df

ABCD
a-0.3638062.1351960.1120950.183855
b0.9640611.5492310.1620251.408502
c0.0337140.4842940.8950771.834204
d-1.6327350.6280791.4972781.194587
e-1.6924140.4942851.618099-1.882802
f-1.5438410.550801-1.429393-0.001107
df.loc[lambda df: df['A'] > 0, :]

ABCD
b0.9640611.5492310.1620251.408502
c0.0337140.4842940.8950771.834204
df.loc[:, lambda df: ['A', 'B']]

AB
a-0.3638062.135196
b0.9640611.549231
c0.0337140.484294
d-1.6327350.628079
e-1.6924140.494285
f-1.5438410.550801
df.iloc[:, lambda df: [0, 1]]

AB
a-0.3638062.135196
b0.9640611.549231
c0.0337140.484294
d-1.6327350.628079
e-1.6924140.494285
f-1.5438410.550801
df[lambda df: df.columns[0]]
a   -0.363806
b    0.964061
c    0.033714
d   -1.632735
e   -1.692414
f   -1.543841
Name: A, dtype: float64
df['A'].loc[lambda s: s > 0]
b    0.964061
c    0.033714
Name: A, dtype: float64

快速访问

由于使用 [] 进行索引必须处理很多情况(单标签访问、切片、布尔索引等),因此它需要一些开销才能确定您的要求。如果您只想访问一个标量值,最快的方法是使用 atiat 方法,它们在所有数据结构上都实现了。

s = pd.Series(np.random.randint(0, 7, size=(7,)), index=list('abcdefg'))
s
a    0
b    6
c    0
d    6
e    6
f    1
g    1
dtype: int32
s.iat[5]
1
s.at['a']
0
df = pd.DataFrame(np.random.randint(0, 7, size=(3, 4)), index=list('abc'), columns=list('ABCD'))
df

ABCD
a1565
b1453
c4310
df.iat[0, 1]
5
df.at['b', 'A']
1

布尔索引

Series 对象的布尔索引和 Python 以及 NumPy 类似。

s = pd.Series(range(-3, 4))
s
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64
s[s > 0]
4    1
5    2
6    3
dtype: int64
s[(s < -1) | (s > 0.5)]
0   -3
1   -2
4    1
5    2
6    3
dtype: int64
s[~(s < 0)]
3    0
4    1
5    2
6    3
dtype: int64
df = pd.DataFrame(np.random.randn(7, 4),
                  index=pd.date_range('2022/02/22', periods=7),
                  columns=list('ABCD'))
df

ABCD
2022-02-22-1.645831-0.6035770.855909-0.172192
2022-02-23-0.006510-0.708135-0.788163-2.029753
2022-02-24-1.0340410.977184-0.176987-0.066881
2022-02-250.7139772.029726-0.4418140.105782
2022-02-26-0.038251-1.624340-0.8826590.655924
2022-02-271.122102-0.0194431.120410-0.599446
2022-02-281.8292950.039461-0.4687061.038191
df[df['A'] > 0]

ABCD
2022-02-250.7139772.029726-0.4418140.105782
2022-02-271.122102-0.0194431.120410-0.599446
2022-02-281.8292950.039461-0.4687061.038191

高级布尔函数

isin(values) 用于判断元素是否包含在 values 中,适用于 SeriesDataFrameIndex 对象。

df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
                  index=['falcon', 'dog'])
df

num_legsnum_wings
falcon22
dog40

isin 函数可以接受 list、dict、Seires 等可迭代对象作为参数。

df.isin([0, 2])

num_legsnum_wings
falconTrueTrue
dogFalseTrue

如果传入的是 dict,则可以分别设置每一列的判断规则。如若不设置,默认得到 False。

df.isin({'num_wings': [0, 3]})

num_legsnum_wings
falconFalseFalse
dogFalseTrue

如果传入的是 Series 或者 DataFrame 对象,那么必须确保每行每列匹配。

other = pd.DataFrame({'num_legs': [8, 3], 'num_wings': [0, 2]},
                     index=['spider', 'falcon'])
df.isin(other)

num_legsnum_wings
falconFalseTrue
dogFalseFalse

查询 num_wings 列中值为 2 或 4 的行。

df[df["num_wings"].isin([2, 4])]

num_legsnum_wings
falcon22

过滤并保留 num_wings 列中值为 4 或 8 的行。

df = df[df["num_legs"].isin([4, 8])]
df

num_legsnum_wings
dog40
Previous
Next