Fetching Packages with PyCharm

PyCharm is a convenient IDE for Python development: adding a package is usually just a keyboard shortcut away, which invokes pip to do the install. This time, however, when adding BeautifulSoup, PyCharm kept trying to install BeautifulSoup 3 through pip3, and every attempt ended in errors. The only way around the problem was to install the package manually.

Once the manual installation succeeded, BeautifulSoup was ready for the work that follows.
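A minimal sketch of the manual install from a terminal. Note that the current Beautiful Soup lives on PyPI under the name beautifulsoup4, while BeautifulSoup is the obsolete version 3 — which is likely why PyCharm's automatic install kept failing:

```shell
# The PyPI package for Beautiful Soup 4 is beautifulsoup4,
# not BeautifulSoup (that name belongs to the obsolete version 3):
pip install beautifulsoup4
```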

Fetching Level Data

Sokoban.info hosts what is currently a very complete collection of level data. The data sits inside a CDATA section, so web scraping is needed to retrieve it:

<script type="text/javascript">
		// <![CDATA[

			var Board			="xxxx#####xxxxxxxxxx!xxxx#   #xxxxxxxxxx!xxxx#$  #xxxxxxxxxx!xx###  $##xxxxxxxxx!xx#  $ $ #xxxxxxxxx!### # ## #xxx######!#   # ## #####  ..#!# $  $          ..#!##### ### #@##  ..#!xxxx#     #########!xxxx#######xxxxxxxx!"				;
			var BoardXMax		= 19			;
			var BoardYMax		= 11			;

			var CollectionId	= 1		;
			var CollectionLevel	= 1				;

			var HighScoreViewing= false	;

			var MoveCount		,
				MoveCountMax	;

		//]]>
</script>

Python is currently the tool best suited to web scraping. Beautiful Soup parses the tags in an HTML document with ease. I consulted articles and discussions on extracting CDATA,

but no matter what I tried, the contents of the CDATA section would not come out. In the end I fell back on find_all with a regular expression to keep only the relevant script block:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
scripts = soup.find_all("script", string=re.compile("var Board"))

With these references, a regular-expression substitution combined with split extracts the level data from the string cleanly.
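The substitution-plus-split step can be sketched on a shortened board string (a minimal, self-contained example; the board here is abbreviated for illustration):

```python
import re

# One line from the page's inline script, with a shortened board string.
line = '\t\t\tvar Board\t\t\t="xx#####!xx#   #!xx#$. #!xx#####!"\t\t\t\t;'

# Drop the 'var Board =' prefix and the opening quote, then the closing
# quote and trailing semicolon; '!' marks the end of each board row.
board = re.sub(r'\s*var Board\s*=\s*"', "", line)
board = re.sub(r'"\s*;\s*$', "", board)
rows = [r for r in board.split("!") if r]  # the string ends with '!', drop the empty tail
print(rows)  # ['xx#####', 'xx#   #', 'xx#$. #', 'xx#####']
```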

The code that fetches the full HTML with requests and then performs the extraction and substitution can be seen in this gist.

The article below explains how to pass params when fetching with requests:

params = {'130_1': ''}
r = requests.get('https://sokoban.info/', params)
print(r.text)
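The query string that requests builds from that params dict can be previewed offline with the standard library; the key follows the site's "collection_level" pattern (130_1 selects level 1 of collection 130), with an empty value:

```python
from urllib.parse import urlencode

# sokoban.info selects a level through the query-string key
# "<collection>_<level>" with an empty value.
params = {"130_1": ""}
url = "https://sokoban.info/?" + urlencode(params)
print(url)  # https://sokoban.info/?130_1=
```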

Fetching Level Data in Bulk

With requests plus BeautifulSoup, fetching a single level is easy enough; fetching all of them takes some planning. First, how many levels are there in total? Then, should they be fetched in batches or one at a time? And with either approach, if the job cannot finish in one run, how should progress be recorded so the next run can pick up where it left off?

The total number of levels appears to be available on every page (level); the information sits inside option tags:

<optgroup label="Thinking Rabbit">
<option value="1" selected="selected">Original &amp; Extra &nbsp; (90)</option>
<option value="2">Boxxle 1 &nbsp; (108)</option>
<option value="3">Boxxle 2 &nbsp; (120)</option>
</optgroup>
<optgroup label="Aymeric du Peloux">
<option value="4">Mini Cosmos &nbsp; (40)</option>
<option value="5">Micro Cosmos &nbsp; (40)</option>
<option value="6">Pico Cosmos &nbsp; (20)</option>
<option value="7">Nabo Cosmos &nbsp; (40)</option>
</optgroup>

Using BeautifulSoup with regular expressions again, along with the reference material below, the code that extracts the data and performs the text substitutions looks like this:

options = soup.find_all("option")
for o in options:
    index = o['value']
    content = o.string
    pattern = r"\([0-9]*\)"
    amount_with_parentheses = re.search(pattern, content).group(0)
    amount = re.sub(r"\(|\)", '', amount_with_parentheses)
    title = re.sub(r"   \([0-9]*\)", '', content)

Note the regular expression for title: it has three leading spaces, but the second one is actually a non-breaking space (nbsp), pasted back directly from PyCharm's console.
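A self-contained sketch of the same parsing using only the standard library, applied to one option tag from the collection list above. Unescaping the HTML entities turns &amp;nbsp; into U+00A0, which Python's \s and str.strip() both treat as whitespace, sidestepping the paste-from-console trick:

```python
import html
import re

# A single option tag from the collection list shown above.
snippet = '<option value="1" selected="selected">Original &amp; Extra &nbsp; (90)</option>'

value = re.search(r'value="(\d+)"', snippet).group(1)
# Strip the tags, then unescape entities: &nbsp; becomes U+00A0,
# which \s and str.strip() both match in Python 3.
text = html.unescape(re.sub(r"</?option[^>]*>", "", snippet))
amount = int(re.search(r"\((\d+)\)", text).group(1))
title = re.sub(r"\s*\(\d+\)\s*$", "", text).strip()
print(value, title, amount)  # 1 Original & Extra 90
```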

With the level metadata in hand, the next step is storing it for later use. Writing files by hand would work; extending that idea leads naturally to a file-backed database, where storage works much like hand-rolled files but reading the data back is more convenient. A search along those lines turns up a ready-made file database for Python: TinyDB. Even with a ready-made database, it still takes some time to learn how to use it.

The data is organized into tables and documents. A table named Overview is created first, and checked for the summary of all the levels, i.e. how many collections there are and how many levels each collection contains. If that information is missing, the collection data is fetched first.

After that, dictionary and list operations (adding and removing entries) determine which levels within each collection still need to be fetched.
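The bookkeeping idea can be sketched in a few lines (the numbers here are hypothetical): build a dict of every level index, then pop the ones already stored, leaving only the pending ones.

```python
# Hypothetical numbers: a collection with 5 levels, of which
# levels 1 and 3 are already stored in the database.
total_levels = 5
pending = {i: {"index": i} for i in range(1, total_levels + 1)}
already_processed = [1, 3]
for idx in already_processed:
    pending.pop(idx, None)  # remove levels we already have
print(sorted(pending))  # [2, 4, 5]
```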

Data types also need converting along the way.

After these adjustments the script, though a little messy, can fetch the levels of every collection and store them for later use. This version is good enough for now: purely for fetching data, no major changes are needed. To keep the program from stalling by fetching too much at once, it is capped at 20 levels per run for the time being, and the I/O is not wrapped in try/except. Once I know Python better and have time to optimize, I may restructure it with async mechanisms and add exception handling.

import requests
import re
from bs4 import BeautifulSoup
from tinydb import TinyDB, where
from datetime import datetime as dt


def use_db():
    print("use db")
    db = TinyDB("data.json")
    overview_table = db.table('Overview')
    no_item_in = overview_table.count(where('extracted') == True) == 0
    stages = []
    if no_item_in:
        print('has no data')
        stages = extract_stage()
        overview_table.insert({
            "dateTime": dt.now().strftime("%Y/%m/%d %H:%M:%S"),
            "extracted": True,
            "stages": stages
        })
    else:
        print('has data, can just proceed')

    can_process_amount = 20
    current_process_amount = 0
    extracted_items = overview_table.all()
    document = extracted_items[0]
    stages = document['stages']
    for stage in stages:
        stage_title = stage['title']
        stage_id = stage['index']
        stage_amount = int(stage['amount'])
        stage_table = db.table(stage_title)
        # Start from every level index, then drop the ones already stored.
        current_stage_to_be_processed = {}
        for i in range(0, stage_amount):
            current_stage_to_be_processed[i + 1] = {
                'index': i + 1
            }
        all_stage_documents = stage_table.all()
        processed_stages = map(lambda x: x['index'], all_stage_documents)
        to_be_popped_later = []
        for p_stage in processed_stages:
            print("processed stage")
            print(p_stage)
            to_be_popped_later.append(p_stage)
        for popping in to_be_popped_later:
            print("popping")
            print(popping)
            current_stage_to_be_processed.pop(popping, None)
        for to_process in current_stage_to_be_processed:
            if current_process_amount < can_process_amount:
                combined = stage_id + '_' + str(to_process)
                params = {combined: ''}
                r = requests.get('https://sokoban.info/', params)
                soup = BeautifulSoup(r.text, 'html.parser')
                board_result = extract_board(soup)
                stage_table.insert({
                    'index': to_process,
                    'board': board_result
                })
                current_process_amount += 1


def extract_stage():
    print("extract stage")
    # Fetch the front page so the option tags are available for parsing.
    r = requests.get('https://sokoban.info/')
    soup = BeautifulSoup(r.text, 'html.parser')
    stages = []
    options = soup.find_all("option")
    for o in options:
        index = o['value']
        content = o.string
        pattern = r"\([0-9]*\)"
        amount_with_parentheses = re.search(pattern, content).group(0)
        amount = re.sub(r"\(|\)", '', amount_with_parentheses)
        title = re.sub(r"   \([0-9]*\)", '', content)
        stages.append({
            'index': index,
            'title': title,
            'amount': int(amount)
        })
    return stages


def extract_board(soup):
    scripts = soup.find_all("script", string=re.compile("var Board"))
    board_result = {}
    rows = []
    for s in scripts:
        content = s.string
        split_lines = content.split('\n')
        for sl in split_lines:
            result = re.match(r"\s*var Board\s*=", sl)
            if result:
                stripped = re.sub(r"\s*var Board\s*=", '', sl)
                remove_space_and_semicolon = re.sub(r"\"\s*;", '', stripped)
                remove_first_quote = re.sub(r"\"", '', remove_space_and_semicolon)
                with_end_char_rows = remove_first_quote.split('!')
                for row in with_end_char_rows:
                    clean_row = re.sub(r"!", '', row)
                    rows.append(clean_row)
    board_result = {
        'rows': rows
    }
    return board_result


if __name__ == '__main__':
    use_db()

By ApprenticeGC