Fetching Packages with PyCharm

PyCharm is a convenient IDE for Python development: adding a package is usually just a keyboard shortcut away, which invokes pip to do the install. This time, however, when adding BeautifulSoup, PyCharm kept trying to install BeautifulSoup 3 through pip3, and every attempt ended in errors. The only way around the problem was to install the package manually.

Once the manual installation succeeded, BeautifulSoup was ready for the work that follows.
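A minimal sketch of the manual install from a terminal. Note that the current Beautiful Soup lives on PyPI under the name beautifulsoup4, while BeautifulSoup is the obsolete version 3 — which is likely why PyCharm's automatic install kept failing:

```shell
# The PyPI package for Beautiful Soup 4 is beautifulsoup4,
# not BeautifulSoup (that name belongs to the obsolete version 3):
pip install beautifulsoup4
```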

Fetching Level Data

Sokoban.info hosts what is currently a very complete collection of level data. The data sits inside a CDATA section, so web scraping is needed to retrieve it:

<script type="text/javascript">
		// <![CDATA[

			var Board			="xxxx#####xxxxxxxxxx!xxxx#   #xxxxxxxxxx!xxxx#$  #xxxxxxxxxx!xx###  $##xxxxxxxxx!xx#  $ $ #xxxxxxxxx!### # ## #xxx######!#   # ## #####  ..#!# $  $          ..#!##### ### #@##  ..#!xxxx#     #########!xxxx#######xxxxxxxx!"				;
			var BoardXMax		= 19			;
			var BoardYMax		= 11			;

			var CollectionId	= 1		;
			var CollectionLevel	= 1				;

			var HighScoreViewing= false	;

			var MoveCount		,
				MoveCountMax	;

		//]]>
</script>

Python is currently the tool best suited to web scraping. Beautiful Soup parses the tags in an HTML document with ease. I consulted articles and discussions on extracting CDATA,

but no matter what I tried, the contents of the CDATA section would not come out. In the end I fell back on find_all with a regular expression to keep only the relevant script block:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
scripts = soup.find_all("script", string=re.compile("var Board"))

With these references, a regular-expression substitution combined with split extracts the level data from the string cleanly.
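The substitution-plus-split step can be sketched on a shortened board string (a minimal, self-contained example; the board here is abbreviated for illustration):

```python
import re

# One line from the page's inline script, with a shortened board string.
line = '\t\t\tvar Board\t\t\t="xx#####!xx#   #!xx#$. #!xx#####!"\t\t\t\t;'

# Drop the 'var Board =' prefix and the opening quote, then the closing
# quote and trailing semicolon; '!' marks the end of each board row.
board = re.sub(r'\s*var Board\s*=\s*"', "", line)
board = re.sub(r'"\s*;\s*$', "", board)
rows = [r for r in board.split("!") if r]  # the string ends with '!', drop the empty tail
print(rows)  # ['xx#####', 'xx#   #', 'xx#$. #', 'xx#####']
```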

The code that fetches the full HTML with requests and then performs the extraction and substitution can be seen in this gist.

The article below explains how to pass params when fetching with requests:

params = {'130_1': ''}
r = requests.get('https://sokoban.info/', params)
print(r.text)
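The query string that requests builds from that params dict can be previewed offline with the standard library; the key follows the site's "collection_level" pattern (130_1 selects level 1 of collection 130), with an empty value:

```python
from urllib.parse import urlencode

# sokoban.info selects a level through the query-string key
# "<collection>_<level>" with an empty value.
params = {"130_1": ""}
url = "https://sokoban.info/?" + urlencode(params)
print(url)  # https://sokoban.info/?130_1=
```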

Fetching Level Data in Bulk

With requests plus BeautifulSoup, fetching a single level is easy enough; fetching all of them takes some planning. First, how many levels are there in total? Then, should they be fetched in batches or one at a time? And with either approach, if the job cannot finish in one run, how should progress be recorded so the next run can pick up where it left off?

The total number of levels appears to be available on every page (level); the information sits inside option tags:

<optgroup label="Thinking Rabbit">
<option value="1" selected="selected">Original &amp; Extra &nbsp; (90)</option>
<option value="2">Boxxle 1 &nbsp; (108)</option>
<option value="3">Boxxle 2 &nbsp; (120)</option>
</optgroup>
<optgroup label="Aymeric du Peloux">
<option value="4">Mini Cosmos &nbsp; (40)</option>
<option value="5">Micro Cosmos &nbsp; (40)</option>
<option value="6">Pico Cosmos &nbsp; (20)</option>
<option value="7">Nabo Cosmos &nbsp; (40)</option>
</optgroup>

Using BeautifulSoup with regular expressions again, along with the reference material below, the code that extracts the data and performs the text substitutions looks like this:

options = soup.find_all("option")
for o in options:
    index = o['value']
    content = o.string
    pattern = r"\([0-9]*\)"
    amount_with_parentheses = re.search(pattern, content).group(0)
    amount = re.sub(r"\(|\)", '', amount_with_parentheses)
    title = re.sub(r"   \([0-9]*\)", '', content)

Note the regular expression for title: it has three leading spaces, but the second one is actually a non-breaking space (nbsp), pasted back directly from PyCharm's console.
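A self-contained sketch of the same parsing using only the standard library, applied to one option tag from the collection list above. Unescaping the HTML entities turns &amp;nbsp; into U+00A0, which Python's \s and str.strip() both treat as whitespace, sidestepping the paste-from-console trick:

```python
import html
import re

# A single option tag from the collection list shown above.
snippet = '<option value="1" selected="selected">Original &amp; Extra &nbsp; (90)</option>'

value = re.search(r'value="(\d+)"', snippet).group(1)
# Strip the tags, then unescape entities: &nbsp; becomes U+00A0,
# which \s and str.strip() both match in Python 3.
text = html.unescape(re.sub(r"</?option[^>]*>", "", snippet))
amount = int(re.search(r"\((\d+)\)", text).group(1))
title = re.sub(r"\s*\(\d+\)\s*$", "", text).strip()
print(value, title, amount)  # 1 Original & Extra 90
```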

With the level metadata in hand, the next step is storing it for later use. Writing files by hand would work; extending that idea leads naturally to a file-backed database, where storage works much like hand-rolled files but reading the data back is more convenient. A search along those lines turns up a ready-made file database for Python: TinyDB. Even with a ready-made database, it still takes some time to learn how to use it.

The data is organized into tables and documents. A table named Overview is created first, and checked for the summary of all the levels, i.e. how many collections there are and how many levels each collection contains. If that information is missing, the collection data is fetched first.

After that, dictionary and list operations (adding and removing entries) determine which levels within each collection still need to be fetched.
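The bookkeeping idea can be sketched in a few lines (the numbers here are hypothetical): build a dict of every level index, then pop the ones already stored, leaving only the pending ones.

```python
# Hypothetical numbers: a collection with 5 levels, of which
# levels 1 and 3 are already stored in the database.
total_levels = 5
pending = {i: {"index": i} for i in range(1, total_levels + 1)}
already_processed = [1, 3]
for idx in already_processed:
    pending.pop(idx, None)  # remove levels we already have
print(sorted(pending))  # [2, 4, 5]
```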

Data types also need converting along the way.

After these adjustments the script, though a little messy, can fetch the levels of every collection and store them for later use. This version is good enough for now: purely for fetching data, no major changes are needed. To keep the program from stalling by fetching too much at once, it is capped at 20 levels per run for the time being, and the I/O is not wrapped in try/except. Once I know Python better and have time to optimize, I may restructure it with async mechanisms and add exception handling.

import requests
import re
from bs4 import BeautifulSoup
from tinydb import TinyDB, where
from datetime import datetime as dt


def use_db():
    print("use db")
    db = TinyDB("data.json")
    overview_table = db.table('Overview')
    no_item_in = overview_table.count(where('extracted') == True) == 0
    stages = []
    if no_item_in:
        print('has no data')
        stages = extract_stage()
        overview_table.insert({
            "dateTime": dt.now().strftime("%Y/%m/%d %H:%M:%S"),
            "extracted": True,
            "stages": stages
        })
    else:
        print('has data, can just proceed')

    can_process_amount = 20
    current_process_amount = 0
    extracted_items = overview_table.all()
    document = extracted_items[0]
    stages = document['stages']
    for stage in stages:
        stage_title = stage['title']
        stage_id = stage['index']
        stage_amount = int(stage['amount'])
        stage_table = db.table(stage_title)
        # Start from every level index, then drop the ones already stored.
        current_stage_to_be_processed = {}
        for i in range(0, stage_amount):
            current_stage_to_be_processed[i + 1] = {
                'index': i + 1
            }
        all_stage_documents = stage_table.all()
        processed_stages = map(lambda x: x['index'], all_stage_documents)
        to_be_popped_later = []
        for p_stage in processed_stages:
            print("processed stage")
            print(p_stage)
            to_be_popped_later.append(p_stage)
        for popping in to_be_popped_later:
            print("popping")
            print(popping)
            current_stage_to_be_processed.pop(popping, None)
        for to_process in current_stage_to_be_processed:
            if current_process_amount < can_process_amount:
                combined = stage_id + '_' + str(to_process)
                params = {combined: ''}
                r = requests.get('https://sokoban.info/', params)
                soup = BeautifulSoup(r.text, 'html.parser')
                board_result = extract_board(soup)
                stage_table.insert({
                    'index': to_process,
                    'board': board_result
                })
                current_process_amount += 1


def extract_stage():
    print("extract stage")
    # Fetch the front page so the option tags are available for parsing.
    r = requests.get('https://sokoban.info/')
    soup = BeautifulSoup(r.text, 'html.parser')
    stages = []
    options = soup.find_all("option")
    for o in options:
        index = o['value']
        content = o.string
        pattern = r"\([0-9]*\)"
        amount_with_parentheses = re.search(pattern, content).group(0)
        amount = re.sub(r"\(|\)", '', amount_with_parentheses)
        title = re.sub(r"   \([0-9]*\)", '', content)
        stages.append({
            'index': index,
            'title': title,
            'amount': int(amount)
        })
    return stages


def extract_board(soup):
    scripts = soup.find_all("script", string=re.compile("var Board"))
    board_result = {}
    rows = []
    for s in scripts:
        content = s.string
        split_lines = content.split('\n')
        for sl in split_lines:
            result = re.match(r"\s*var Board\s*=", sl)
            if result:
                stripped = re.sub(r"\s*var Board\s*=", '', sl)
                remove_space_and_semicolon = re.sub(r"\"\s*;", '', stripped)
                remove_first_quote = re.sub(r"\"", '', remove_space_and_semicolon)
                with_end_char_rows = remove_first_quote.split('!')
                for row in with_end_char_rows:
                    clean_row = re.sub(r"!", '', row)
                    rows.append(clean_row)
    board_result = {
        'rows': rows
    }
    return board_result


if __name__ == '__main__':
    use_db()

By ApprenticeGC