I am trying to convert 12,000 JSON files containing web event data into a single pandas DataFrame.
The code takes far too long to run.
Any ideas on how to make it more efficient?
An example of a loaded JSON file:
{'$schema': 12,
'amplitude_id': None,
'app': '',
'city': ' ',
'device_carrier': None,
'dma': ' ',
'event_time': '2018-03-12 22:00:01.646000',
'group_properties': {'[Segment] Group': {'': {}}},
'ip_address': ' ',
'os_version': None,
'paying': None,
'platform': 'analytics-ruby',
'processed_time': '2018-03-12 22:00:06.004940',
'server_received_time': '2018-03-12 22:00:02.993000',
'user_creation_time': '2018-01-12 18:57:20.212000',
'user_id': ' ',
'user_properties': {'initial_referrer': '',
'last_name': '',
'organization_id': 2},
'uuid': ' ',
'version_name': None}
Thanks!
import os
import pandas as pd

data = pd.DataFrame()
for filename in os.listdir('path'):
    with open(os.path.join('path', filename), "r") as file:
        file_read1 = pd.read_json(file.read(), lines=True)
    data = data.append(file_read1, ignore_index=True)
Solution
The fastest way to turn the JSON strings into a DataFrame appears to be pd.io.json.json_normalize. Depending on the number of JSONs, it is roughly 15x to more than 500x faster than appending to an existing DataFrame, and it beats pd.concat by a factor of about 13 to 170.
A side effect is that the nested parts of the JSON (group_properties and user_properties) get flattened as well, and the dtypes then have to be set manually.
Runtimes for 12,000 JSONs (disk I/O not included):
append: ~177 s
concat: ~126 s
json_normalize: ~0.7 s
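The flattening side effect mentioned above can be seen in a minimal sketch (using made-up keys modeled on the sample JSON; pd.json_normalize is the modern spelling of pd.io.json.json_normalize):

```python
import pandas as pd

# A nested dict becomes dot-separated columns after normalization.
records = [{'user_id': 'a', 'user_properties': {'organization_id': 2}}]
df = pd.json_normalize(records)
print(list(df.columns))  # ['user_id', 'user_properties.organization_id']
```

So the nested `user_properties` dict no longer survives as a single column; its leaves become top-level columns, which is why the column-by-column comparison against the reference DataFrame is needed afterwards.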
import pandas as pd
import json
import os

data = []
for filename in os.listdir('path'):
    with open(os.path.join('path', filename), 'r') as f:
        data.append(f.read())

# read one JSON and use it as a reference dataframe
df_ref = pd.read_json(data[0], lines=True)

# put the raw JSON strings into a Series and flatten them via json_normalize
df_temp = pd.DataFrame(data)[0]
df = pd.io.json.json_normalize(df_temp.apply(json.loads))

# fix the column dtypes (astype has no inplace argument; reassign instead)
for col, dtype in df_ref.dtypes.to_dict().items():
    if col not in df.columns:
        continue
    df[col] = df[col].astype(dtype)
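As an aside, the per-column dtype loop can be collapsed: DataFrame.astype also accepts a column-to-dtype mapping, so (a sketch with hypothetical data) the restoration can be a single call:

```python
import pandas as pd

# Columns absent from the mapping keep their inferred dtype.
df = pd.DataFrame({'a': ['1', '2'], 'b': [1.0, 2.0]})
df = df.astype({'a': 'int64'})
print(df.dtypes['a'])  # int64
```

In the code above that would be `df.astype({c: t for c, t in df_ref.dtypes.items() if c in df.columns})`.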
Full benchmark code
import pandas as pd
import json
import time
import matplotlib.pyplot as plt
j = {'$schema': 12,
'amplitude_id': None,
'app': '',
'city': ' ',
'device_carrier': None,
'dma': ' ',
'event_time': '2018-03-12 22:00:01.646000',
'group_properties': {'[Segment] Group': {'': {}}},
'ip_address': ' ',
'os_version': None,
'paying': None,
'platform': 'analytics-ruby',
'processed_time': '2018-03-12 22:00:06.004940',
'server_received_time': '2018-03-12 22:00:02.993000',
'user_creation_time': '2018-01-12 18:57:20.212000',
'user_id': ' ',
'user_properties': {'initial_referrer': '',
'last_name': '',
'organization_id': 2},
'uuid': ' ',
'version_name': None}
json_str = json.dumps(j)
def df_append():
    t0 = time.time()
    df = pd.DataFrame()
    for _ in range(n_lines):
        file_read1 = pd.read_json(json_str, lines=True)
        df = df.append(file_read1, ignore_index=True)
    return df, time.time() - t0
def df_concat():
    t0 = time.time()
    data = []
    for _ in range(n_lines):
        file_read1 = pd.read_json(json_str, lines=True)
        data.append(file_read1)
    df = pd.concat(data)
    df.index = list(range(len(df)))
    return df, time.time() - t0
def df_io_json():
    df_ref = pd.read_json(json_str, lines=True)
    t0 = time.time()
    data = []
    for _ in range(n_lines):
        data.append(json_str)
    df = pd.io.json.json_normalize(pd.DataFrame(data)[0].apply(json.loads))
    for col, dtype in df_ref.dtypes.to_dict().items():
        if col not in df.columns:
            continue
        df[col] = df[col].astype(dtype)
    return df, time.time() - t0
n_datapoints = (10, 10**2, 10**3, 12000, 10**4, 10**5)
times = {}
for n_lines in n_datapoints:
    times[n_lines] = [[], [], []]
    for _ in range(3):
        df1, t1 = df_append()
        df2, t2 = df_concat()
        df3, t3 = df_io_json()
        times[n_lines][0].append(t1)
        times[n_lines][1].append(t2)
        times[n_lines][2].append(t3)
        pd.testing.assert_frame_equal(df1, df2)
        pd.testing.assert_frame_equal(df1[df1.columns[0:7]], df3[df3.columns[0:7]])
        pd.testing.assert_frame_equal(df2[df2.columns[8:16]], df3[df3.columns[7:15]])
        pd.testing.assert_frame_equal(df2[df2.columns[17:]], df3[df3.columns[18:]])
    for i in range(3):
        times[n_lines][i] = sum(times[n_lines][i]) / 3
times
x = n_datapoints
fig = plt.figure()
plt.plot(x, [t[0] for t in times.values()], 'o-', label='append')
plt.plot(x, [t[1] for t in times.values()], 'o-', label='concat')
plt.plot(x, [t[2] for t in times.values()], 'o-', label='json_normalize')
plt.xlabel('number of JSONs', fontsize=16)
plt.ylabel('time in seconds', fontsize=18)
plt.yscale('log')
plt.legend()
plt.show()