pandas的去重方法df.drop_duplicates(subset=keys, keep='first')在按某列去重时,其他列每组只能保留其中一行。如何想要把这一组的所有内容合并在一行,或者进行简单的统计呢?我造了个轮子,以解决这个问题。

merge_row.py可以对指定的列进行去重,并输出指定的其他列合并后的内容,可自定义分隔符。可以对指定的其他列的每组求和,均值,最大值,最小值,中位数等,可以对分组进行去重,可以输出每个分组的数据个数。

输入输出支持csvtsv表格和excel表格,使用.csv.tsv.xlsx后缀加以区分。

使用方法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
usage: merge_row.py [-h] -i INPATH [-o OUTPATH] [-header HEADER] -by BY [-c COLUMNS] [-m METHOD] [-sep SEP] [-delim DELIM] [-count]

optional arguments:
-h, --help show this help message and exit
-i INPATH Input table file. [required]. (default: None)
-o OUTPATH Output file, discard this parameter to print output to stdout. (default: None)
-header HEADER The selected row will be the header, discard this parameter if there is no header, 0 based. (default: None)
-by BY The same values in the selected columns will be merged, 0 based, use ',' to separate. [required]. (default: None)
-c COLUMNS Other columns selected will be shown in the output, discard this parameter to show all columns, 0 based, use ',' to separate. If an illegal parameter is set, no other columns will be appended to
the output table. (default: None)
-m METHOD Specify the operation that should be applied to -c. Valid operations: sum, min, max, mean, median, collapse, distinct, first, last, count, count_distinct. use ',' to separate. (default:
distinct)
-sep SEP Specify a custom delimiter to separate the input file. (default: )
-delim DELIM Specify a custom delimiter for the collapse operations. (default: ,)
-count Add a count column to the output table. (default: False)

在终端执行:

1
./merge_row.py -i $输入 -o $输出 -header $表头 -by $哪几列 -c $其它列 -m $统计方法 -sep $输入文件的分隔符 -delim $合并列的分隔符 -count

-i:输入文件,可以是文本格式文件或xlsx格式的excel表格。配合-sep指定分隔符,-header指定表头。

-o:输出文件,不指定-o就会把输出打印出来。

-header:表头。0表示第0列作为表头,没有表头则不指定该参数。

-by:按哪几列去重,-by 0 表示按第0列去重,-by 0,1,2表示按第0,1,2列去重,用英文逗号分开。

-c:除了-by指定的列,其它需要输出的列。-c 4,5,6输出4,5,6列,不指定-c则输出所有列。指定一个不合法的值则不会输出其它列,如-c false

-m:对应-c中每一列的统计方式,-m distinct,distinct,sum表示对-c 4,5,6第4,5列去重然后合并,第6列计算和。单个值-c distinct表示对c中所有列都采用distinct。默认值distinct。选项有:sum, min, max, mean, median, collapse, distinct, first, last,count, count_distinctcountcount_distinct不统计NA。

-sep: 输入表格的分隔符,默认为\t,对于excel表格不需要指定。

-delim:对于-m中的distinct(去重合并)和collapse(直接合并),采用指定的分隔符连接。默认,

-count:添加该参数,会在输出最后一列之后添加一列count,表示被合并的行数。

代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
#!/usr/bin/env python3
"""
Remove duplicated rows of the table by specified columns and keep the contents of other columns.
Author: liqiming@whu.edu.cn
"""
import sys
import argparse
import pandas as pd


def df_merge_row(df, by=None, columns=None, method="distinct", sep=",", count=False):
if not columns:
return df.drop_duplicates(subset=by, keep='first')

def generate_content(x):
data = {}
for c, m in zip(columns, method):
try:
series = pd.to_numeric(x[c])
if m == "mean":
data[c] = series.mean()
elif m == 'medium':
data[c] = series.median()
except ValueError:
series = x[c]
if m == 'distinct':
data[c] = sep.join(map(str, series.unique()))
elif m == 'collapse':
data[c] = sep.join(map(str, series))
elif m == 'max':
data[c] = series.max()
elif m == 'min':
data[c] = series.min()
elif m == 'sum':
data[c] = series.sum()
elif m == 'first':
data[c] = series.to_list()[0]
elif m == 'last':
data[c] = series.to_list()[-1]
elif m == 'count':
data[c] = series.count()
elif m == 'count_distinct':
data[c] = pd.Series(series.unique()).count()
else:
if c not in data:
data[c] = sep.join(map(str, series.unique()))
if count:
data["count"] = len(x[columns[0]])
return pd.Series(data)

merge_df = df.groupby(by).apply(generate_content).reset_index()
return merge_df


def main(inpath, outpath, header, by, columns=None, method="distinct", sep="\t", delim=",", count=False):
if inpath.endswith(".xlsx"):
df = pd.read_excel(inpath, header=header)
else:
df = pd.read_csv(inpath, header=header, delimiter=sep)
by = by.split(",")
method = method.split(",")
if columns:
by_name = [df.columns[int(i)] for i in by]
columns = columns.split(",")
if len(method) == 1:
method = len(columns) * method
assert len(columns) == len(method)
try:
column_name = [df.columns[int(i)] for i in columns if i not in by]
output_name = [df.columns[int(i)] for i in sorted(by + columns)]
if count:
output_name.append("count")
except Exception:
column_name = None
output_name = by_name
else:
by_name = [df.columns[int(i)] for i in by]
column_name = [i for i in df.columns if i not in by_name]
if len(method) == 1:
method = len(column_name) * method
assert len(column_name) == len(method)
output_name = df.columns.to_list()
if count:
output_name.append("count")

out_df = df_merge_row(df, by_name, column_name, method, delim, count)
out_df = out_df[output_name]

out_header = True if header is not None else False
if outpath:
if outpath.endswith(".csv"):
out_df.to_csv(outpath, index=False, header=out_header, sep=",")
elif outpath.endswith(".xlsx"):
out_df.to_excel(outpath, index=False, header=out_header)
else:
out_df.to_csv(outpath, index=False, header=out_header, sep="\t")
else:
if out_header:
print("\t".join(map(str, out_df.columns)))
for _, row in out_df.iterrows():
print("\t".join(map(str, dict(row).values())))


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description=__doc__,
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("-i", dest="inpath",
type=str, required=True, help="Input table file. [required].")
parser.add_argument("-o", dest="outpath", type=str,
help="Output file, discard this parameter to print output to stdout.")
parser.add_argument("-header", dest="header", type=int,
default=None, help=("The selected row will be the header, "
"discard this parameter if there is no header, 0 based."))
parser.add_argument("-by", dest="by", type=str, required=True,
help="The same values in the selected columns will be merged, 0 based, use ',' to separate. [required].")
parser.add_argument("-c", dest="columns", type=str, default=None,
help=("Other columns selected will be shown in the output, "
"discard this parameter to show all columns, 0 based, use ',' to separate. "
"If an illegal parameter is set, no other columns will be appended to the output table."))
parser.add_argument("-m", dest="method", type=str, default="distinct",
help=("Specify the operation that should be applied to -c. "
"Valid operations: sum, min, max, mean, median, collapse, distinct, first, last, count, count_distinct. "
"use ',' to separate."))
parser.add_argument("-sep", dest="sep", type=str, default="\t",
help="Specify a custom delimiter to separate the input file.")
parser.add_argument("-delim", dest="delim", type=str, default=",",
help="Specify a custom delimiter for the collapse operations.")
parser.add_argument("-count", dest="count", action="store_true",
help="Add a count column to the output table.")
args = parser.parse_args()

main(**args.__dict__)