合并与连接

import pandas as pd

concat: 合并多个 DataFrame 和 Series

垂直合并

left = pd.DataFrame({'Name': ['Sam', 'Emma'], 'Age': [14, 15]})
left

NameAge
0Sam14
1Emma15
right = pd.DataFrame({'Name': ['Karen', 'Rahul'], 'Age': [10, 13]})
right

NameAge
0Karen10
1Rahul13

垂直合并即按列合并,一般需要 DataFrame 的 column 完全一致。

pd.concat([left, right])

NameAge
0Sam14
1Emma15
0Karen10
1Rahul13

不过,上面的结果仍然保留了这些 DataFrame 原来的索引。我们可以重置索引:

pd.concat([left, right]).reset_index(drop=True)

NameAge
0Sam14
1Emma15
2Karen10
3Rahul13

水平合并

left = pd.DataFrame({'Name': ['Sam', 'Emma'], 'Age': [14, 15]})
left

NameAge
0Sam14
1Emma15
right = pd.DataFrame({'Math': ['B', 'A+'], 'Science': ['A', 'B+']})
right

MathScience
0BA
1A+B+

设置 axis=1 即可实现水平合并,即按行合并,一般要求 DataFrame 的 index 完全一致。

pd.concat([left, right], axis=1)

NameAgeMathScience
0Sam14BA
1Emma15A+B+

设置合并模式

如果 index 或者 column 不一致,则可以设置连接 join 的方式。

  • outer:类似于关系代数的外连接运算,没有的内容补 NaN 值。该选项为默认值。
  • inner:类似于关系代数的自然连接运算,先求笛卡尔积中对应相等的元组,再去掉其中重复的属性值。

注:与关系代数不同的是,pandas 对 DataFrame 的连接可以在 index 和 column 上都可以进行。

left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
left

ABCD
0A0B0C0D0
1A1B1C1D1
2A2B2C2D2
3A3B3C3D3
right = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)
right

BDF
2B2D2F2
3B3D3F3
6B6D6F6
7B7D7F7
pd.concat([left, right], axis=1, join='outer')

ABCDBDF
0A0B0C0D0NaNNaNNaN
1A1B1C1D1NaNNaNNaN
2A2B2C2D2B2D2F2
3A3B3C3D3B3D3F3
6NaNNaNNaNNaNB6D6F6
7NaNNaNNaNNaNB7D7F7
pd.concat([left, right], axis=1, join='inner')

ABCDBDF
2A2B2C2D2B2D2F2
3A3B3C3D3B3D3F3
pd.concat([left, right], axis=0, join='outer')

ABCDF
0A0B0C0D0NaN
1A1B1C1D1NaN
2A2B2C2D2NaN
3A3B3C3D3NaN
2NaNB2NaND2F2
3NaNB3NaND3F3
6NaNB6NaND6F6
7NaNB7NaND7F7
pd.concat([left, right], axis=0, join='inner')

BD
0B0D0
1B1D1
2B2D2
3B3D3
2B2D2
3B3D3
6B6D6
7B7D7

可以设置忽略 index,index 将按照顺序递增。这样就和关系代数中的连接运算完全一致了。

pd.concat([left, right], ignore_index=True)

ABCDF
0A0B0C0D0NaN
1A1B1C1D1NaN
2A2B2C2D2NaN
3A3B3C3D3NaN
4NaNB2NaND2F2
5NaNB3NaND3F3
6NaNB6NaND6F6
7NaNB7NaND7F7

DataFrame 和 Series 可以直接合并

s = pd.Series(["X0", "X1", "X2", "X3"], name="X")
s
0    X0
1    X1
2    X2
3    X3
Name: X, dtype: object
df = pd.DataFrame({'Name': ['Sam', 'Emma'], 'Age': [14, 15]})
df

NameAge
0Sam14
1Emma15
pd.concat([df, s], axis=1)

NameAgeX
0Sam14.0X0
1Emma15.0X1
2NaNNaNX2
3NaNNaNX3

merge: 功能更强大的数据库连接操作

left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
left

keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3B3
right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
right

keyCD
0K0C0D0
1K1C1D1
2K2C2D2
3K3C3D3
pd.merge(left, right, on='key')

keyABCD
0K0A0B0C0D0
1K1A1B1C1D1
2K2A2B2C2D2
3K3A3B3C3D3

连接方式

merge 可以实现关系代数当中的除自然连接和外连接更强大的连接操作。因此,它的 how 参数支持如下五种运算:

  • outer:外连接
  • inner:自然连接
  • left:左连接
  • right:右连接
  • cross:笛卡尔积

这种运算要求左表和右表有交叉点 intersection 存在,因此连接方式默认值为 inner

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
left

key1key2AB
0K0K0A0B0
1K0K1A1B1
2K1K0A2B2
3K2K1A3B3
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
right

key1key2CD
0K0K0C0D0
1K1K0C1D1
2K1K0C2D2
3K2K0C3D3

这个例子是由两个字段组成主码的表的连接操作。

pd.merge(left, right, on=["key1", "key2"])

key1key2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2

自然连接

pd.merge(left, right, how="inner", on=["key1", "key2"])

key1key2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2

外连接

pd.merge(left, right, how="outer", on=["key1", "key2"])

key1key2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K0NaNNaNC3D3
5K2K1A3B3NaNNaN

左连接

pd.merge(left, right, how="left", on=["key1", "key2"])

key1key2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K1A3B3NaNNaN

右连接

pd.merge(left, right, how="right", on=["key1", "key2"])

key1key2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2
3K2K0NaNNaNC3D3

笛卡尔积

pd.merge(left, right, how="cross")

key1_xkey2_xABkey1_ykey2_yCD
0K0K0A0B0K0K0C0D0
1K0K0A0B0K1K0C1D1
2K0K0A0B0K1K0C2D2
3K0K0A0B0K2K0C3D3
4K0K1A1B1K0K0C0D0
5K0K1A1B1K1K0C1D1
6K0K1A1B1K1K0C2D2
7K0K1A1B1K2K0C3D3
8K1K0A2B2K0K0C0D0
9K1K0A2B2K1K0C1D1
10K1K0A2B2K1K0C2D2
11K1K0A2B2K2K0C3D3
12K2K1A3B3K0K0C0D0
13K2K1A3B3K1K0C1D1
14K2K1A3B3K1K0C2D2
15K2K1A3B3K2K0C3D3

join: DataFrame 对象的强大连接操作

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)
left

AB
K0A0B0
K1A1B1
K2A2B2
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)
right

CD
K0C0D0
K2C2D2
K3C3D3
left.join(right)

ABCD
K0A0B0C0D0
K1A1B1NaNNaN
K2A2B2C2D2
right.join(left)

CDAB
K0C0D0A0B0
K2C2D2A2B2
K3C3D3NaNNaN

DataFrame 的 join 默认值为 left,可以设置为除笛卡尔积外的连接方式。

left.join(right, how='inner')

ABCD
K0A0B0C0D0
K2A2B2C2D2
Previous
Next