O'Reilly's Mining the Social Web (入門 ソーシャルデータ), nothing "introductory" about it, part 3: Chapter 1 finished 2013/08/01 03:10

Finally done with Chapter 1.

Ran Example 1-12 to create snl_search_results.dot,
then generated a PNG (graphic) from that data.
Quite a few nodes show up, which is fun.

$ circo -Tpng -Osnl_search_results snl_search_results.dot
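If you'd rather drive the rendering from Python instead of the shell, here's a minimal sketch (my addition, not from the book). It assumes the Graphviz circo binary is on your PATH and uses -o to name the PNG explicitly:

import subprocess

# Render the DOT file to PNG with Graphviz's circo layout
# (assumes Graphviz is installed and circo is on PATH)
subprocess.check_call(['circo', '-Tpng', '-o', 'snl_search_results.png',
                       'snl_search_results.dot'])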

The Python for Example 1-12:
# --- sample source; the authentication credentials are blanked out
import twitter

# Go to http://twitter.com/apps/new to create an app and get these items
# See https://dev.twitter.com/docs/auth/oauth for more information on Twitter's OAuth implementation

CONSUMER_KEY = '***' # <- replace with your own
CONSUMER_SECRET = '***' # <- replace with your own
OAUTH_TOKEN = '***' # <- replace with your own
OAUTH_TOKEN_SECRET = '***' # <- replace with your own

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(domain='api.twitter.com',
                              api_version='1.1',
                              auth=auth)

# <markdowncell>

# **Example 1-3. Retrieving Twitter trends**

# <codecell>

# With an authenticated twitter_api in existence, you can now use it to query Twitter resources as usual.
# However, the trends resource is cleaned up a bit in v1.1, so requests are a bit simpler than in the latest
# printing. See https://dev.twitter.com/docs/api/1.1/get/trends/place

# The Yahoo! Where On Earth ID for the entire world is 1
WORLD_WOE_ID = 1

# Prefix id with the underscore for query string parameterization.
# Without the underscore, it's appended to the URL itself
world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
print world_trends

# <markdowncell>

# IPython Notebook didn't display the output because the result of the API call was captured as a variable. Here's how you could print a readable version of the response.

# <codecell>

import json
print json.dumps(world_trends, indent=1)

# <markdowncell>

# Now that you're authenticated and understand a bit more about making requests with Twitter's new API, let's look at some examples that involve search requests, which are a bit different with v1.1 of the API.

# <markdowncell>

# **Example 1-4. Paging through Twitter search results**

# <codecell>

# Like all other APIs, search requests now require authentication and have a slightly different request and
# response format. See https://dev.twitter.com/docs/api/1.1/get/search/tweets

q = "SNL"
count = 100

search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# v1.1 of Twitter's API provides a value in the response for the next batch of results that needs to be parsed out
# and passed back in as keyword args if you want to retrieve more than one page. It appears in the 'search_metadata'
# field of the response object and has the following form:'?max_id=313519052523986943&q=NCAA&include_entities=1'
# The tweets themselves are encoded in the 'statuses' field of the response


# Here's how you would grab five more batches of results and collect the statuses as a list
for _ in range(5):
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError, e: # No more results when next_results doesn't exist
        break

    # Create a dictionary from the query string params
    kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']
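
# (Aside, my addition rather than the book's: the standard library can parse
# that query string more robustly than splitting on '&' and '='. Equivalent:
#
#   from urlparse import parse_qsl
#   kwargs = dict(parse_qsl(next_results[1:]))
#
# parse_qsl also handles URL-encoded values.)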

import json
print json.dumps(statuses[0:2], indent=1)


tweets = [ status['text'] for status in statuses ]

print tweets[0]

words = []
for t in tweets:
    words += [ w for w in t.split() ]

# total words
print len(words)

# unique words
print len(set(words))

# lexical diversity
print 1.0*len(set(words))/len(words)

# avg words per tweet
print 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)

import nltk

freq_dist = nltk.FreqDist(words)
print freq_dist.keys()[:50] # 50 most frequent tokens
print freq_dist.keys()[-50:] # 50 least frequent tokens
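
# (Note, my addition: this relies on the older NLTK behavior where
# FreqDist.keys() is sorted by descending frequency. On NLTK 3.x, FreqDist is a
# collections.Counter subclass, so you'd use freq_dist.most_common(50) instead.)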

import re
rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
example_tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
                  "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"]
for t in example_tweets:
    print rt_patterns.findall(t)

import networkx as nx
import re
g = nx.DiGraph()

def get_rt_sources(tweet):
    rt_patterns = re.compile(r'(RT|via)((?:\b\W*@\w+)+)', re.IGNORECASE)
    return [ source.strip()
             for tuple in rt_patterns.findall(tweet)
             for source in tuple
             if source not in ("RT", "via") ]

for status in statuses:
    rt_sources = get_rt_sources(status['text'])
    if not rt_sources: continue
    for rt_source in rt_sources:
        g.add_edge(rt_source, status['user']['screen_name'], {'tweet_id' : status['id']})

print nx.info(g)
print g.edges(data=True)[0]
print len(nx.connected_components(g.to_undirected()))
print sorted(nx.degree(g).values())
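
# (Note, my addition: this is networkx 1.x-era code. On networkx 2.x,
# connected_components() returns a generator and degree() returns a DegreeView,
# so you'd wrap them as len(list(...)) and sorted(dict(g.degree()).values()).)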

#import ex1_11
OUT = "snl_search_results.dot"
try:
    nx.drawing.write_dot(g, OUT)
except ImportError, e:
    dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id'])
           for n1, n2 in g.edges()]
    f = open(OUT, 'w')
    f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
    f.close()
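
A small follow-on of my own (not from the book): since each edge points from the retweeted account to the retweeter, the nodes with the highest out-degree are the most-retweeted users. A sketch, assuming networkx 1.x where out_degree() returns a dict:

# Most-retweeted accounts = highest out-degree nodes (networkx 1.x API)
out_deg = g.out_degree()
for name, d in sorted(out_deg.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print name, d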
