🍷 FineWeb2 Edu Japanese - A High-Quality Educational Japanese Dataset
We have released 🍷 FineWeb2 Edu Japanese: a high-quality educational Japanese dataset.
The content below is a translation of the page linked above.
This dataset filters the Japanese portion of FineWeb2 (376M documents) down to the 120M documents (about 89.3B tokens) judged to be educational content. The following subsets are provided:
- default: about 120M documents, about 89.3B tokens
- sample_10BT: about 10B tokens randomly sampled from default
- small_tokens: only short texts of 512 tokens or fewer
- small_tokens_cleaned: small_tokens with web-specific text noise removed
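As a minimal sketch of how one of these subsets might be loaded, the snippet below uses the Hugging Face `datasets` library in streaming mode. The repository id comes from the citation URL at the end of this page. The split name `train` and the helper name `load_subset` are assumptions made for illustration, not something this page confirms.

```python
# Repository id taken from the citation URL; subset names taken from the list above.
REPO_ID = "hotchpotch/fineweb-2-edu-japanese"
SUBSETS = ("default", "sample_10BT", "small_tokens", "small_tokens_cleaned")


def load_subset(name: str = "sample_10BT", streaming: bool = True):
    """Load one subset of the dataset (hypothetical helper, not part of any library)."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the data lives in a split called "train".
    return load_dataset(REPO_ID, name=name, split="train", streaming=streaming)


# Example usage (requires network access and `pip install datasets`):
# for row in load_subset("small_tokens_cleaned").take(3):
#     print(row["score"], row["token_count"])
```

Streaming avoids downloading a whole subset before the first row can be inspected, which matters for the roughly 89.3B-token default subset.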
Background
FineWeb (English only) was created to deduplicate web data and extract high-quality text. FineWeb-Edu goes further by extracting high-quality educational text, which makes efficient training possible with fewer tokens.
FineWeb2, released in December 2024, is a high-quality multilingual dataset that includes Japanese, but as of February 2025 no "Edu" version selected for educational value has been released. This project therefore created and released the FineWeb2 Edu Japanese dataset.
Filtering for Educational Data
To build this dataset, the Japanese data in FineWeb2 was filtered with fineweb-2-edu-japanese-classifier, a model that judges whether a text is educational. The classifier's scoring was trained on fineweb-2-edu-japanese-scores, a dataset of scores assigned by the DeepSeek API (deepseek-chat). Only texts with a score of 2.5 or higher are included, and each text's score is recorded in the score column.
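The score cut-off described above can be sketched as a plain filter over records. This is only an illustration of the thresholding step under the assumption that each record is a dict with a `score` field. The sample records and the function name `keep_educational` are hypothetical, and the classifier itself is not called here.

```python
from typing import Iterable, Iterator

SCORE_THRESHOLD = 2.5  # from the text: only texts scoring 2.5 or higher are kept


def keep_educational(
    records: Iterable[dict], threshold: float = SCORE_THRESHOLD
) -> Iterator[dict]:
    """Yield only the records whose score is at or above the threshold."""
    for record in records:
        if record["score"] >= threshold:
            yield record


# Hypothetical records for illustration; real scores come from the classifier.
scored = [
    {"text": "an explanation of photosynthesis", "score": 3.1},
    {"text": "a shopping banner", "score": 0.4},
    {"text": "a borderline article", "score": 2.5},
]
kept = list(keep_educational(scored))
```

The comparison is inclusive, so a record scoring exactly 2.5 is retained.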
Token Counts
The token_count column holds the number of tokens in each text, counted with the ModernBERT-Ja-130M tokenizer.
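The sketch below shows how a `token_count` column could be attached to a record and then used to select short texts with the 512-token limit of the small_tokens subset. The tokenizer is passed in as a callable. The whitespace tokenizer used here is only a stand-in so the example runs offline, not the ModernBERT-Ja-130M tokenizer, and whether the published counts include special tokens is not stated on this page.

```python
from typing import Callable, Sequence

MAX_SMALL_TOKENS = 512  # from the text: small_tokens keeps texts of 512 tokens or fewer


def add_token_count(record: dict, tokenize: Callable[[str], Sequence]) -> dict:
    """Return a copy of the record with a token_count column added."""
    return {**record, "token_count": len(tokenize(record["text"]))}


def is_small(record: dict, max_tokens: int = MAX_SMALL_TOKENS) -> bool:
    """True when the record fits within the small_tokens limit (inclusive)."""
    return record["token_count"] <= max_tokens


def whitespace_tokenize(text: str) -> list:
    """Stand-in tokenizer for illustration only; NOT the ModernBERT-Ja-130M tokenizer."""
    return text.split()


short_record = add_token_count({"text": "a short educational text"}, whitespace_tokenize)
long_record = add_token_count({"text": "word " * 600}, whitespace_tokenize)
```

With a real tokenizer, `tokenize` would be replaced by a function that returns that tokenizer's token ids for the text.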
Removing Web-Specific Noise
The Japanese data in FineWeb2 sometimes contains web-specific boilerplate and other unwanted noise. For example, it includes text like the following:
This text is displayed on sites that have not been updated for 90 days or more.
Log in Log out
Besides the text that is really needed, various kinds of noise are sometimes included. This passage is itself one example. Text that should not be there can slip in just like this.
50% off now! Click to view the product at the link
Especially when a text is short, most of it may be noise. We are exploring whether removing such noise lets us extract higher-quality text.
Previous page Next page
To remove this kind of unwanted text, we developed a model, fineweb-2-japanese-text-cleaner. It was trained to detect noise using fineweb-2-japanese-noise-spans, training data created with cyberagent/DeepSeek-R1-Distill-Qwen-32B-Japanese.
The model detects noise spans as follows:
[NOISE]This text is displayed on sites that have not been updated for 90 days or more.[/NOISE]
[NOISE]Log in[/NOISE] [NOISE]Log out[/NOISE]
Besides the text that is really needed, various kinds of noise are sometimes included. This passage is itself one example. Text that should not be there can slip in just like this.
[NOISE]
50% off now! Click to view the product at the link[/NOISE]
Especially when a text is short, most of it may be noise. We are exploring whether removing such noise lets us extract higher-quality text.
[NOISE]Previous page[/NOISE] [NOISE]Next page[/NOISE]
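Given output marked up like the example above, the removal step can be sketched with a regular expression. This is a minimal illustration, assuming noise spans are wrapped in literal `[NOISE]` and `[/NOISE]` markers that are never nested. It is not the cleaner model's own post-processing, and `strip_noise` is a hypothetical helper name.

```python
import re

# Assumption: spans are delimited by literal, non-nested [NOISE]...[/NOISE] markers.
# DOTALL lets a span continue across line breaks, as in the example above.
NOISE_SPAN = re.compile(r"\[NOISE\].*?\[/NOISE\]", re.DOTALL)


def strip_noise(marked_text: str) -> str:
    """Remove every marked noise span, then drop lines left empty."""
    without_noise = NOISE_SPAN.sub("", marked_text)
    lines = [line.strip() for line in without_noise.splitlines()]
    return "\n".join(line for line in lines if line)


sample = (
    "[NOISE]Log in[/NOISE] [NOISE]Log out[/NOISE]\n"
    "Text that is actually needed.\n"
    "[NOISE]\n50% off now! Click to view the product[/NOISE]\n"
    "[NOISE]Previous page[/NOISE] [NOISE]Next page[/NOISE]"
)
cleaned = strip_noise(sample)
```

The non-greedy `.*?` keeps each match inside a single span, so ordinary text between two noise spans on the same line is preserved.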
The small_tokens_cleaned subset in this dataset is the result of additionally applying the fineweb-2-japanese-text-cleaner model to small_tokens and removing the detected noise. The raw noise-detection output from the model is published as fineweb-2-edu-japanese-noise-detect-raw.
Note that noise detection is not perfect, so in some cases parts of legitimate text may have been removed by mistake.
Caveats
No comparison experiments have been run between this dataset, FineWeb2 Edu Japanese, and the original FineWeb2 dataset without Edu filtering. How much difference it makes in actual LLM training is therefore unverified.
The classification of text as educational is also imperfect, so the dataset contains some text that is not educational.
License
Like the original FineWeb2, this dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0. Its use is also subject to the CommonCrawl Terms of Use.
Citation Information
@software{yuichi2025fineweb-2-edu-japanese,
author = {Yuichi Tateno},
title = {FineWeb2 Edu Japanese},
month = feb,
year = 2025,
url = {https://huggingface.co/datasets/hotchpotch/fineweb-2-edu-japanese/}
}