我正在尝试返回一个字典,该字典按其最近的州中心汇总推文。我正在遍历所有推文,对于每个推文,我正在检查所有状态以查看哪个状态最接近。
有什么更好的方法可以做到这一点?
def group_tweets_by_state(tweets):
"""
The keys of the returned dictionary are state names, and the values are
lists of tweets that appear closer to that state center than any other.
tweets -- a sequence of tweet abstract data types """
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
for state in us_states:
if geo_distance(position, find_state_center(us_states[state]))< min:
min = geo_distance(position, find_state_center(us_states[state]))
result_state = state
if result_state not in tweets_by_state:
tweets_by_state[result_state]= []
tweets_by_state[result_state].append(tweet)
else:
tweets_by_state[result_state].append(tweet)
return tweets_by_state
python大神给出的解决方案
当推文的数量非常大时,巨大的for循环中的每一个小改进都会导致时间复杂度的巨大性能提升,我几乎没有什么可以想到的:
1.只需拨打一次geo_distance()
,特别是在费用高昂的时候
distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
min = distance
而不是
if geo_distance(position, find_state_center(us_states[state]))< min:
min = geo_distance(position, find_state_center(us_states[state]))
2.如果倾向于经常重复职位:
position_closest_state = {} # save known result
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
if position in position_closest_state:
result_state = position_closest_state[position]
else:
for state in us_states:
distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
min = distance
result_state = state
position_closest_state[position] = result_state
因此,假设您有200条不同位置的1000条推文,而us_states
为50,则原始算法将调用geo_distance()
1000 * 50 * 2次,现在可以将其减少为200 * 50 * 1次调用。
3.减少在find_state_center()
上的调用时间
与#2相似,现在每个状态的每个推文都被冗余调用。
state_center_dict = {}
for state in us_states:
state_center_dict[state] = find_state_center(us_states[state])
position_closest_state = {} # save known result
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
if position in position_closest_state:
result_state = position_closest_state[position]
else:
for state in us_states:
distance = geo_distance(position, state_center_dict[state])
if distance < min:
min = distance
result_state = state
position_closest_state[position] = result_state
现在find_state_center()
仅被调用50次(状态数),而不是50 * 1000(推特数),我们又做了一次巨大的改进!
绩效成果摘要
通过#1,我们的性能提高了一倍。 #2我们将其增强(推文数量/职位数量)倍。 #3是三者中最大的,与原始代码相比,我们将时间复杂度降低到仅为(/ tweets数量)的1 /。